
DERIVING TASK-ORIENTED

DECISION STRUCTURES

FROM DECISION RULES

Ibrahim M Fahmi Imam

MLI 95-7

October 1995

DERIVING TASK-ORIENTED DECISION STRUCTURES FROM DECISION RULES

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at George Mason University

By

Ibrahim M Fahmi Imam

Director: Professor Ryszard S. Michalski

PRC Chaired Professor of Computer Science & Systems Engineering, School of Information Technology and Engineering

George Mason University

Fall Semester 1995 George Mason University Fairfax Virginia 22030

© 1995 Copyright by Ibrahim F. Imam. All rights reserved.

ACKNOWLEDGMENT

I would like to thank Professor Ryszard S. Michalski, PRC Chaired Professor of Computer Science and Engineering and my Dissertation Director, for his support, encouragement, and guidance. I would like to thank my committee members, Professor Larry Kerschberg, Chair of the Department of Information Systems and Software Engineering, Professor David Rine, Professor of Information Technology and Engineering, and Professor David Schum, Professor of Excellence in Information Technology and Engineering, for their encouragement and help with many aspects of my PhD.

I would like to thank Professor Tomasz Arciszewski, Systems Engineering Department, for providing me with application problems; Ronny Kohavi, Stanford University, for discussion and for providing me with some related work on learning decision structures and decision graphs; and Professor George Tecuci, Computer Science Department, for pointing out some related work.

I would like to thank my colleagues: Nabil Al-Kharouf, for reviewing my dissertation; Eric Bloedorn, for reviewing an earlier draft of my dissertation and for the use of his program AQ17-DCI in my experiments; Srinivas Gutta, for providing some applications for my PhD work; Mike Hieb, for reviewing an earlier draft of my dissertation and helping me find relevant articles; Ken Kaufman, for reviewing an earlier draft of my thesis; Mark Maloof, for providing me with script files which made it easier to run AQ15c iteratively; Haleh Vafaie, for collaboration on the application and comparison of different aspects of my work; and Janusz Wnek, for the use of his DIAV program for explaining my results.

I would like to thank Professor Andrew P. Sage, Dean of the School of Information Technology and Engineering, and Professor Kenneth Bumgarner, Dean of Student Services and Associate Vice President of George Mason University, for their support, and Professor Murray W. Black, Associate Dean of the School of Information Technology and Engineering, for guidance on preparing the PhD proposal.

I would like to thank the conference organizers who supported me in attending their conferences and presenting parts of my PhD work. The organizers include Professor Moonis Ali, Professor Frank Anger, Professor Zbigniew Ras, the American Association for Artificial Intelligence, and others. I would also like to thank the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) for organizing my first workshop. Thanks go to Dr. Doug Dankel, Dr. Howard Hamilton, Dr. John Stewman, and Dr. Dan Tamir.

I would also like to thank the many individuals who helped me in any way during my PhD. Those include Dr. Ashraf Abdel-Wahab, Dr. Jerzy Bala, Dr. Alex Brodsky, Dr. Richard Carver, Dr. Kenneth De Jong, John Doulamis, Tom Dybala, Dean Ibrahim Farag, Dr. Ophir Frieder, Csilla Frakes, Dr. Mohamed Habib, Ali Hadjarian, Dr. Hugo De Garis, Zenon Kulpa, Dr. Paul Lehner, Dr. George Michaels, Dr. Eugene Norris, Dean James Palmer, Mitch Potter, Dr. Ahmed Rafea, Jim Ribeiro, Jayshree Sarma, Dr. Arun Sood, Dr. Clifton Sutton, Bradley Utz, Dr. Tibor Vamos, Patricia Zahra, Dr. Shaker Zahra, and Dr. Jianping Zhang.


Dedication

To my mother my brothers and my sister

TABLE OF CONTENTS

TITLE Page

ABSTRACT 1

CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6

CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19

CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule-Learning Systems 25
3.3 Generating Decision Structures From Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structures to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53

CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments With Large Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large Size, Complex, and Noisy Problems: Mushroom Classifications 84
4.8 Experiments With Small Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88

CHAPTER 5
CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96

REFERENCES 98

VITA 102

LIST OF TABLES

No. TITLE Page

2-1 An example of a decision table 9

2-2 A set of training examples used to illustrate the C4.5 system 15

2-3 The frequency of different attribute values for different decision classes 17

2-4 The expected values of the frequency of examples in Table 2-3 17

2-5 Attribute selection criteria and their basic evaluation measure 17

2-6 The contingency tables of Mingers' example 18

2-7 Mingers' results for determining the goodness of split 19

2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria on four problems 19

2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22

3-1 The available tools and the factors that affect the process of testing software 43

3-2 Calculating the disjointness of each attribute 44

3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51

3-4 The data used in Mingers' first experiments 52

3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52

3-6 The possible ranking domains and conditions of use of the AQDT-2 criteria 53

3-7 Comparison between decision structures and decision trees 54

4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62

4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67

4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71

4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73

4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77

4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81

4-7 The set of attributes and their values used in the trains problem 86

4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88

4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89

4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90

LIST OF FIGURES

No. TITLE Page

2-1 An example to illustrate how attributes break rules 8

2-2 A decision tree learned from the decision table in Table 2-1 10

2-3 A decision tree learned using the gain criterion for selecting attributes 15

2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21

3-1 Architecture of the AQDT approach 24

3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27

3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33

3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33

3-5 Decision trees showing the maximum number of non-leaf nodes 41

3-6 Decision rules for selecting the best tool for testing software 43

3-7 A decision structure learned for classifying software testing tools 45

3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46

3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47

3-10 A decision tree learned without the cost attribute 47

3-11 Decision structures learned by AQDT-2 using different criteria 55

3-12 Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples 56

3-13 An example where decision rules are simpler than decision trees 57

4-1 Design of a complete experiment 59

4-2 Decision rules determined by AQ15c from the wind bracing data 61

4-3 A decision tree learned by C4.5 for the wind bracing data 63

4-4 A decision structure learned from AQ15c wind bracing rules 64

4-5 A decision structure that does not contain attribute x1 64

4-6 A decision structure without x1, with candidate decisions assigned to leaves 65

4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65

4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66

4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68

4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69

4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69

4-12 A visualization diagram of the MONK-1 problem 70

4-13 Decision rules learned by AQ15c for the MONK-1 problem 71

4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72

4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72

4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74

4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75

4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75

4-19 A visualization diagram of the MONK-2 problem 76

4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78

4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79

4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79

4-23 A visualization diagram of the MONK-3 problem 80

4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82

4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82

4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83

4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84

4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85

4-29 Decision structures learned by AQDT-2 for different decision-making situations 87

4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88

4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91

4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92

4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93

4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94

4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1 94

DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES

Ibrahim M. Fahmi Imam, Ph.D.

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge-base from the function of using the knowledge-base for decision-making. The first function focuses on learning accurate, consistent, and complete concept descriptions expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that decision structures learned by it usually outperform, in terms of accuracy and average size of the decision structures, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using applications to artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing Democratic and Republican voting records).

CHAPTER 1 INTRODUCTION

1.1 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.

A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation), the branches are assigned possible test outcomes or ranges of outcomes, and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when leaves are assigned single, definite decisions. Thus, the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
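The notion just described can be sketched as a small data type. The class names, the dictionary-based branch encoding, and the toy tree below are illustrative choices of mine, not AQDT-2's internal representation:

```python
# A toy sketch of a decision structure: nodes carry tests, branches carry
# test outcomes (or tuples standing for ranges of outcomes), and leaves
# carry one or more candidate decisions with probabilities.

class Leaf:
    def __init__(self, decisions):
        self.decisions = decisions  # e.g. {"A1": 1.0} or {"A1": 0.7, "A2": 0.3}

class Node:
    def __init__(self, test, branches):
        self.test = test            # attribute (or function/relation) name
        self.branches = branches    # outcome (value or tuple of values) -> subtree

def classify(structure, example):
    """Follow branches until a leaf; a tuple-valued branch key stands for
    a range of outcomes sharing one subtree."""
    node = structure
    while isinstance(node, Node):
        v = example[node.test]
        for outcomes, child in node.branches.items():
            if v == outcomes or (isinstance(outcomes, tuple) and v in outcomes):
                node = child
                break
        else:
            return None             # outcome not covered: undetermined decision
    return node.decisions

# With single-valued branches and single definite decisions, the structure
# is an ordinary decision tree:
tree = Node("x2", {0: Leaf({"A1": 1.0}),
                   1: Leaf({"A2": 1.0}),
                   2: Node("x1", {0: Leaf({"A1": 1.0}),
                                  1: Leaf({"A3": 1.0}),
                                  2: Leaf({"A2": 1.0})})})
print(classify(tree, {"x1": 1, "x2": 2}))   # {'A3': 1.0}
```

Replacing a leaf's single decision with a probability distribution, or a branch value with a tuple of values, generalizes the same traversal from a tree to a structure.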

Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain, and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).


A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root), and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful to either modify the structure so that it does not contain that attribute, or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).
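One simple way to realize this kind of cost-sensitive tailoring is to discount each attribute's selection-criterion score by its measurement cost when choosing a node, and to exclude unmeasurable attributes outright. The score/cost ratio below is a hypothetical discounting scheme for illustration, not the formula AQDT-2 actually uses:

```python
# Hypothetical illustration: rank candidate attributes for a node by a
# quality score discounted by measurement cost, so cheap attributes rise
# toward the root and unmeasurable ones are excluded outright.
def rank_attributes(scores, costs, unavailable=()):
    """scores: attribute -> selection-criterion value (higher is better)
    costs:  attribute -> relative cost of measuring the attribute"""
    usable = {a: s / costs[a] for a, s in scores.items() if a not in unavailable}
    return sorted(usable, key=usable.get, reverse=True)

# In the doctor-patient setting: a blood test may be informative but costly.
scores = {"temperature": 0.6, "blood_test": 0.9, "age": 0.3}
costs = {"temperature": 1.0, "blood_test": 10.0, "age": 1.0}
print(rank_attributes(scores, costs))                            # temperature first
print(rank_attributes(scores, costs, unavailable=("blood_test",)))
```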

A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation, such as a set of decision rules: tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees), which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify to adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.

An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form, and transform it to a decision structure when it is needed for decision-making. This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than by generation from training examples. Thus, this process could be done on line without any delay noticeable to the user. Such virtual decision structures are easy to tailor to any given decision-making situation.

This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it which concerns the decision classes of interest. Thus, such an approach has many potential advantages.

This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15 (Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating "unknown" nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to the inability to evaluate an attribute associated with an intermediate node.


To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments was designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structure learned from them, and comparing decision trees learned by the AQDT-2 system with those learned by C4.5 (Quinlan, 1993), the well-known system for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2 and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design: wind bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and the Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design: wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.

1.2 The Problem Statement

There are many limitations and problems that accompany the use of decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. The method proposed several attribute selection criteria, of increasing power, based on the main criterion, the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree, based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.

Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint; in other words, for any two rules there exists a condition with the same attribute but with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing, in a two-dimensional space, all possible combinations of attribute values, locating on the diagram the condition parts of the given rules, and marking them with the action specified by each rule.
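Definition 2-2 can be checked mechanically. In the sketch below, a rule is encoded as a mapping from attributes to allowed value sets (my own encoding, not the dissertation's); the sample cover is the five-rule minimal cover used in the worked example later in this section:

```python
# A rule is modeled as a dict: attribute -> set of allowed values.
def disjoint(r1, r2):
    """Two rules are logically disjoint if some attribute is constrained
    in both rules to non-overlapping value sets (Definition 2-2)."""
    return any(not (r1[a] & r2[a]) for a in r1.keys() & r2.keys())

def is_disjoint_cover(rules):
    """A cover is disjoint if its rules are pairwise logically disjoint."""
    return all(disjoint(rules[i], rules[j])
               for i in range(len(rules)) for j in range(i + 1, len(rules)))

# The five rules of the minimal cover from the worked example in this section:
cover = [
    {"x2": {0}},                    # part of A1
    {"x1": {0}, "x2": {2}},         # part of A1
    {"x2": {1}},                    # part of A2
    {"x1": {2}, "x2": {2}},         # part of A2
    {"x1": {1}, "x2": {2}},         # A3
]
print(is_disjoint_cover(cover))     # True
```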

Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree must have at least n leaves). Michalski (1978) has shown that if only one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1]; x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1: An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).



The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute to be selected is the one which breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).

Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.

Example: Learn a decision tree from the following decision table.

The minimal cover consists of the following rules:

A1 ⇐ [x2=0] v [x1=0][x2=2]    A2 ⇐ [x2=1] v [x1=2][x2=2]    A3 ⇐ [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of


the decision tree is x2. Three branches are then attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case, x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.


Figure 2-2 A decision tree learned from the decision table in Table 2-1
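The static cost estimate in this example can be reproduced with a few lines of code. The sketch below counts, for each attribute, the rules of the minimal cover that do not pin that attribute to a single value; the rule encoding and the domain sizes assumed for x3 and x4 are my own choices (any multi-valued domains give the same counts):

```python
# A sketch of the MAL (static) cost estimate: an attribute breaks a rule
# when the rule does not restrict it to a single value, so selecting that
# attribute would split the rule across several branches.  Counting the
# broken rules reproduces the evaluations quoted above (2, 0, 5, 5).
def mal(attribute, rules, domain_size):
    broken = 0
    for rule in rules:
        allowed = rule.get(attribute)
        if allowed is None:                     # unconstrained: spans all values
            broken += domain_size[attribute] > 1
        elif len(allowed) > 1:                  # disjunctive condition: spans several
            broken += 1
    return broken

cover = [
    {"x2": {0}}, {"x1": {0}, "x2": {2}},        # A1
    {"x2": {1}}, {"x1": {2}, "x2": {2}},        # A2
    {"x1": {1}, "x2": {2}},                     # A3
]
sizes = {"x1": 3, "x2": 3, "x3": 2, "x4": 2}    # assumed domain sizes
print([mal(a, cover, sizes) for a in ("x1", "x2", "x3", "x4")])  # [2, 0, 5, 5]
```

Here x2 scores 0 because every rule fixes it to one value, so it is chosen as the root, matching the tree in Figure 2-2.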

2.2 Learning Decision Trees from Examples

Decision tree learning is a field concerned with generating a decision tree that classifies a set of examples according to the decision classes they belong to. The essential aspect of any inductive decision tree method is the attribute selection criterion. The attribute selection criterion measures how good the attributes are for discriminating among the given set of decision classes; the best attribute according to the selection criterion is chosen to be assigned to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has been subsequently modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.

Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria for selecting attributes use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory. These criteria measure the information conveyed by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure, and the gain criteria (Quinlan, 1979, 1983), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions for determining whether or not there is a correlation; the attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Han, 1984; Mingers, 1989a).
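As an illustration of the information-based family, the sketch below computes the entropy of a class distribution and the gain of an attribute (the entropy reduction obtained by partitioning the examples on that attribute's values). This is a generic textbook formulation, not C4.5's exact gain-ratio code:

```python
import math
from collections import Counter

# Entropy of a class-label distribution, in bits.
def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# Gain of an attribute = entropy of the whole set minus the weighted
# entropy of the subsets obtained by partitioning on that attribute.
def gain(examples, labels, attribute):
    n = len(labels)
    partitions = {}
    for ex, y in zip(examples, labels):
        partitions.setdefault(ex[attribute], []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

examples = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["+", "+", "-", "-"]          # the class is fully determined by a
print(gain(examples, labels, "a"))     # 1.0 (perfect split)
print(gain(examples, labels, "b"))     # 0.0 (uninformative)
```

The gain ratio discussed below normalizes this gain by the entropy of the attribute's own value distribution, which counteracts the plain gain's bias toward many-valued attributes.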

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial complete decision tree, and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring the probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.

2.2.1 Building Decision Trees Using Information-based Criteria


This section presents a description of the inductive decision tree learning system C4.5, which is considered one of the most stable, accurate, and fastest programs for learning decision trees from examples.

Learning decision trees from examples requires a collection of examples, each represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing a decision tree from a set of cases (Hunt, Marin & Stone, 1966).

The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the Gain Criterion, which uses the frequency of each decision class in the given set of training examples.

Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and partitions the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class; otherwise, the system searches for another attribute to be a node in the tree.

The Gain Criterion: The gain criterion is based on information theory; that is, the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem X1, ..., Xn are given attributes and C1, ..., Ck are decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:


freq(Ci, S) = number of examples in S that belong to Ci    (2-1)

Suppose that |S| is the total number of examples in S; the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|. The information conveyed by the message that a selected example belongs to a given decision class Ci is -log2(freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by

info(S) = - Σ_{i=1..k} (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|) bits    (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.
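Equation 2-2 can be sketched directly in Python; this is a minimal illustrative implementation (the function name is mine), applied to a set with a 9/5 class split as in the worked example later in this section:

```python
import math

def info(class_counts):
    """Expected information (entropy) of a set S, per equation 2-2:
    info(S) = -sum (freq(Ci,S)/|S|) * log2(freq(Ci,S)/|S|)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

# A 14-example set with 9 examples in one class and 5 in the other:
print(round(info([9, 5]), 2))  # 0.94 bits
```

A pure set (all examples in one class) yields 0 bits, and an even two-class split yields exactly 1 bit, matching the intuition that entropy measures classification uncertainty.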

Suppose that we select an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, info_X(T), is the sum over all subsets of the information conveyed by each subset weighted by its probability:

info_X(T) = Σ_{i=1..k} (|Ti| / |T|) info(Ti)    (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by

gain(X) = info(T) - info_X(T)    (2-4)

The attribute selected is the one with the maximum gain value.

The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains an attribute such as a social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T on the attribute X to the potential information generated by dividing T into n subsets.

Following steps similar to those used to obtain the information conveyed by dividing T into n subsets, the expected split information generated by dividing T into n subsets, by analogy to equation 2-2, is

split info(T) = - Σ_{i=1..n} (|Ti| / |T|) log2 (|Ti| / |T|)    (2-5)

The gain ratio is given by

gain ratio(X) = gain(X) / split info(X)    (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.

Example: Consider the following example presented by Quinlan (1993); Table 2-2 shows the set of training examples. First, determine the amount of information gained by selecting the attribute "outlook" to be the root of the decision tree. This attribute divides the training examples into three subsets: "sunny", with five examples, two of which belong to the class "Play"; "overcast", with four examples, all of which belong to the class "Play"; and "rain", with five examples, three of which belong to the class "Play". To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class "Play" and five belong to the class "Don't Play".

info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.94 bits

When using "outlook" to divide the training examples, the information becomes

info_outlook(T) = 5/14 [- 2/5 log2 (2/5) - 3/5 log2 (3/5)]
               + 4/14 [- 4/4 log2 (4/4) - 0/4 log2 (0/4)]
               + 5/14 [- 3/5 log2 (3/5) - 2/5 log2 (2/5)] = 0.694 bits

By substituting in equation 2-4, the gain of information resulting from using the attribute "outlook" to split the training examples equals 0.246. The information gain for "windy" is 0.048. Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for "outlook" is determined as follows:

split info(T) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits

The gain ratio for "outlook" = 0.246 / 1.577 = 0.156
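The worked example can be reproduced programmatically. The following is an illustrative sketch (function names are mine); the per-value class counts for "outlook" are taken from Table 2-2:

```python
import math

def info(counts):
    """Entropy of a class-count vector, per equation 2-2."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_and_ratio(subsets):
    """subsets: class counts per attribute value, e.g. for "outlook"
    [[2, 3], [4, 0], [3, 2]] = (sunny, overcast, rain) x (Play, Don't Play)."""
    total = sum(map(sum, subsets))
    info_T = info([sum(col) for col in zip(*subsets)])        # class totals
    info_X = sum(sum(s) / total * info(s) for s in subsets)   # equation 2-3
    split_info = -sum(sum(s) / total * math.log2(sum(s) / total)
                      for s in subsets)                       # equation 2-5
    gain = info_T - info_X                                    # equation 2-4
    return gain, gain / split_info                            # equation 2-6

g, gr = gain_and_ratio([[2, 3], [4, 0], [3, 2]])  # "outlook"
print(round(g, 3), round(gr, 3))  # 0.247 0.156
```

The computed gain (0.2467) and gain ratio (0.156) agree with the hand calculation above up to rounding, and the counts [[3, 3], [6, 2]] for "windy" yield the 0.048 gain quoted in the text.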


Figure 2-3: A decision tree learned using the gain criterion for selecting attributes

The C4.5 system handles discrete as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals; in other words, for each continuous attribute, C4.5 generates two branches, one where the values of that attribute are greater than the determined threshold, and the other where the value is less than or equal to the threshold.
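C4.5's actual threshold selection includes further refinements, but the basic idea can be sketched as follows: try each midpoint between consecutive sorted values and keep the one whose binary split maximizes information gain (function names and the sample data are mine, chosen only for illustration):

```python
import math

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def best_threshold(values, classes):
    """Try each midpoint between sorted distinct values; return the
    threshold whose <=/> split gives the highest information gain."""
    pairs = sorted(zip(values, classes))
    labels = sorted(set(classes))
    def counts(sub):
        return [sum(1 for c in sub if c == lab) for lab in labels]
    base = info(counts(classes))
    best = (None, -1.0)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        split = (len(left) / len(pairs)) * info(counts(left)) \
              + (len(right) / len(pairs)) * info(counts(right))
        if base - split > best[1]:
            best = (t, base - split)
    return best

# Hypothetical humidity readings with Play / Don't Play labels:
t, g = best_threshold([70, 90, 85, 95, 70, 80],
                      ['P', 'D', 'D', 'D', 'P', 'P'])
print(t)  # 82.5
```

Here the split at 82.5 separates the two classes perfectly, so it achieves the maximum possible gain for this toy data.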

Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees. This ratio is defined as (e + 1) / (n + 2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
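The Laplace ratio is a one-line computation; a small sketch (function name is mine) shows how it tempers the raw error estimate for small leaves:

```python
def laplace_error(e, n):
    """Laplace error ratio (e + 1) / (n + 2): e = misclassified
    examples at a leaf, n = training examples reaching it."""
    return (e + 1) / (n + 2)

# A leaf covering 10 examples with 1 misclassified is estimated at
# about 16.7% error, while a leaf with no examples defaults to 0.5:
print(round(laplace_error(1, 10), 3), laplace_error(0, 0))  # 0.167 0.5
```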

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses the Chi-square statistic to measure the association between two attributes. When building decision trees, the method is implemented such that it determines the association between each attribute and the decision classes; the attribute selected is the one with the greatest association value.

To determine the Chi-square value for an attribute, let aij be the number of examples in decision class number i where the attribute A takes value number j; in other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by

Chi-square(A) = Σ_{i=1..n} Σ_{j=1..m} [ (aij - Eij)² / Eij ]    (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,

Eij = (T_Ci × T_Vj) / T    (2-8)

where T_Ci and T_Vj are the total number of examples belonging to decision class Ci and the total number of examples where the attribute A takes value Vj, respectively, and T is the total number of examples.

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of the different combinations of values between the decision class and both the "outlook" and the "windy" attributes. Table 2-4 shows the expected values (computed from T_Ci and T_Vj) of the frequencies in Table 2-3, for the different attribute values and decision classes.

To determine the association between the decision classes and both the attribute "windy" and the attribute "outlook", the observed Chi-square values are:

Chi-square(Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/5.1] + [(2-2.9)²/2.9]
                         = 0.21 + 0.39 + 0.16 + 0.28 = 1.04

Chi-square(Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2]
                           + [(3-1.8)²/1.8] + [(0-1.4)²/1.4] + [(2-1.8)²/1.8]
                           = 0.45 + 0.75 + 0.01 + 0.80 + 1.40 + 0.02 = 3.43
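Equations 2-7 and 2-8 can be checked with a short script; this sketch (function name is mine) computes the statistic from the class-by-value contingency tables of Table 2-3. It uses exact rather than rounded expected frequencies, so the totals differ slightly from the rounded hand arithmetic above, but the conclusion favoring "outlook" is unchanged:

```python
def chi_square(table):
    """Chi-square association (equation 2-7) for a contingency table:
    rows = decision classes, columns = attribute values.
    Expected frequencies follow equation 2-8: Eij = (row_i * col_j) / T."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    return sum((table[i][j] - row[i] * col[j] / total) ** 2
               / (row[i] * col[j] / total)
               for i in range(len(row)) for j in range(len(col)))

# Rows: Play, Don't Play; columns: attribute values (from Table 2-3).
outlook = [[2, 4, 3],   # sunny, overcast, rain
           [3, 0, 2]]
windy = [[3, 6],        # windy = true, windy = false
         [3, 2]]
print(round(chi_square(outlook), 2), round(chi_square(windy), 2))  # 3.55 0.93
```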


Applying the same method to the other attributes, the results favor the attribute "outlook". Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset.

Table 2-5 shows a summary of these criteria and their basic evaluation function

Table 2-5 Attribute selection criteria and their basic evaluation measure

Info Measure (IM), Gain, G-statistic, and Gain Ratio:
  Entropy(S) = - Σ_{i=1..k} (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)

G-statistic = 2N × IM  (N = number of examples)

Chi-square(A, B) = Σ_{i=1..n} Σ_{j=1..m} [ (aij - Eij)² / Eij ]

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria done by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, G statistic, Gini index of diversity, Marshall correction, and Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X; attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes, and Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, a zero cell adds the maximum association between the two attributes, because the Chi-square contribution of a zero cell is the expected value of that cell.


Now let us examine results from another experiment done by Mingers, in which he used four different data sets to generate decision trees for eleven different criteria. In the final results, he compared the total number of nodes and the total error rate produced by each criterion over all given problems. Table 2-8 shows the final results for five selected criteria only.


Table 2-8: Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four domains

This experiment was performed on four real-world data sets, concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and recognizing LCD display digits. The data was divided randomly, 70% for training and 30% for testing. For more details, see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas of Imam and Michalski (1993b).

In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. It starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, the method builds a new child and repeats the process; otherwise, if the rule R1 does not satisfy the conclusion C0, the method replaces the temporary memory with the new conclusion C1, which is satisfied by the rule R0. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes containing only rules represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees that are used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is "Safe", except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is "Lost", except if x6=1 it is "Safe", except if x7=1 it is "Lost".
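The exception chain just described can be written directly as nested conditionals; the following is a hypothetical sketch of such an evaluator (attribute values are encoded as integers, and the function name is mine):

```python
def classify(example):
    """Evaluate the exception chain read from the EDAG of Figure 2-4:
    "Safe", except [x1=1 & x2=1 & x3=1 & (x4=3 or x5=1)] -> "Lost",
    except [x6=1] -> "Safe", except [x7=1] -> "Lost"."""
    if (example["x1"] == 1 and example["x2"] == 1 and example["x3"] == 1
            and (example["x4"] == 3 or example["x5"] == 1)):
        if example["x6"] == 1:
            return "Lost" if example["x7"] == 1 else "Safe"
        return "Lost"
    return "Safe"

case = {"x1": 1, "x2": 1, "x3": 1, "x4": 3, "x5": 2, "x6": 2, "x7": 2}
print(classify(case))  # Lost
```

Each nested `if` plays the role of one exception arc; falling through an exception returns the enclosing level's default conclusion.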

The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path; in other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once. However, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.


Safe <- [x1=2]
Safe <- [x2=2]
Safe <- [x3=2]
Safe <- [x4=1] & [x5=2]
Safe <- [x4=1] & [x5=3]
Safe <- [x6=1] & [x7=2]
Safe <- [x6=1] & [x7=3]
Safe <- [x4=2] & [x5=2]
Safe <- [x4=2] & [x5=3]

Lost <- [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a combination of that attribute's values. For each subset, the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains the examples where A takes value 0 and the class is C0, or A takes value 1 and the class is C1; the second subset contains the examples where A takes value 0 and the class is C1, or A takes value 1 and the class is C0. The number of nodes of the first level (after the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced exponentially to one.

It is easy for the reader to figure out some major disadvantages of this approach: the average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., strong patterns) or logical relationship in the data; the time used to learn such a decision structure is relatively high compared to systems for learning decision trees from examples; and finally, it could be better to search for an attribute which reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and these two approaches. The EDAG and HOODG systems are unreleased prototype systems. In brief, decision structures produced by the proposed approach are easy to understand, EDAG decision structures are difficult to read, and HOODG decision structures are easy to understand.

CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.

The Learning Task
Given:
- A set of training examples describing the concept to be learned
- A learning goal, which specifies the decision classes to be learned from the training examples
- Background knowledge to control the learning process
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal

The Decision-making Task
Given:
- A set of decision rules in conjunctive form
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.)
- One or more examples that need to be tested under the given decision-making situation
- A set of parameters to control the learning process
Determine:
- A decision structure that suits the given decision-making situation

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of declarative knowledge are that they do not impose any order on the evaluation of the attributes and, due to this lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on-line.

Such "virtual" decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized, and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows an architecture of the proposed methodology.

Figure 3-1: Architecture of the AQDT approach (learning knowledge from the database, and the decision-making process)


It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values; some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem, and the learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system, specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called AQ, starts with a "seed" example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the "star" of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description that covers the largest number of positive examples (to minimize the total number of rules needed) and, with second priority, the one that involves the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few or many examples, and can optimize the description according to a variety of easily-modifiable hypothesis quality criteria.


The learned descriptions are represented as a set of decision rules expressed in an attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski, 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have large flat top". A characteristic description of the tables would also include properties such as "have four legs", "have no back", "have four corners", etc. Discriminant descriptions are usually much shorter than characteristic descriptions.

Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or "covers") of different decision classes. In the IC (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different classes are logically disjoint; DC-mode descriptions are usually more complex, both in the number of rules and the number of conditions. There is also a DL mode (a Decision List mode, also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order: if ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes and selects from them the most promising ones, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the U.S. Congress. Each rule is a conjunction of elementary conditions, and each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute "State" (of the Representative) should take the value "northeast" or "northwest" to satisfy the condition.

R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"

The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record by a Democratic Representative:

Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State from = northeast, State population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler Corp. = not registered.

By expressing the elementary statements in the example as conditions, and linking the conditions by conjunction, the examples can be re-expressed as decision rules. Thus, decision rules and examples formally differ only in the degree of generality.
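Evaluating such a rule against an example can be sketched with a minimal representation (hypothetical, not AQ15's actual data structures): each condition maps an attribute to the set of values it allows, which captures the internal disjunction of conditions like [State = northeast v northwest]:

```python
# A rule is a conjunction of conditions; each condition allows a set of
# values for one attribute (AQ's internal disjunction). For example,
# [State = northeast v northwest] becomes {"State": {"northeast", "northwest"}}.
def matches(rule, example):
    """True if the example satisfies every condition of the rule."""
    return all(example.get(attr) in allowed for attr, allowed in rule.items())

# A fragment of rule R2 from Figure 3-2 (two of its four conditions):
r2_fragment = {
    "Food_stamp_cap": {"no"},
    "State": {"northeast", "northwest"},
}
example = {"Food_stamp_cap": "no", "State": "northeast", "Income": "low"}
print(matches(r2_fragment, example))  # True
```

An example is itself just a maximally specific rule in this representation (every allowed-value set is a singleton), which mirrors the point that rules and examples differ only in generality.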

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski, 1993a, b). It also describes the AQDT-2 method for learning task-oriented decision structures from decision rules; finally, the methodology is illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning, due to their simplicity. Decision trees built this way can be quite efficient, as long as they are used in decision-making situations for which they are optimized and these situations remain relatively stable. Problems arise when these situations significantly change and the assumptions under which the tree was built do not hold anymore. For example, in some situations it may be difficult to determine the value of the attribute assigned to some node; one would like to avoid measuring this attribute and still be able to classify the example, if this is potentially possible (Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also desirable if there is a significant change in the frequency of occurrence of examples from different classes. A restructuring of a decision tree to suit the above requirements is, however, difficult to do. The reason is that a decision tree is a form of decision structure representation that imposes constraints on the evaluation order of the attributes which are not logically necessary.


One problem in developing a method for generating decision structures from decision rules is to design an attribute selection criterion that is based on the properties of the rules rather than of the training examples. A decision rule normally describes a number of possible examples, only some of which are examples that have actually been observed, i.e., training examples. An attribute selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based on counting the numbers of training examples covered by each attribute-value and the frequency of decision classes in the training examples, as is done in learning decision trees from examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision rules constitute a more powerful knowledge representation than decision trees. They can directly represent a description in an arbitrary disjunctive normal form, while decision trees can directly represent only descriptions in disjoint disjunctive normal form, in which all conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary decision rules into a decision tree, one faces an additional problem of handling logically intersecting rules.

The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2 system is based on the earlier work by Michalski (1978), which introduced a general method for generating decision trees from decision rules. The method aimed at producing decision trees with the minimum number of nodes or the minimum cost (where the cost was defined as the total cost of classifying unknown examples, given the cost of measuring individual attributes and the expected probability distribution of examples of different decision classes). More explanation is provided in the following section.

3.3.1 The AQDT-2 attribute selection method

This section describes the AQDT-2 method for building a decision structure from decision rules. The method for building a single-parent decision structure is similar to that used in standard methods of building a decision tree from examples. The major difference is that it assigns tests (attributes) to the nodes using criteria based on the properties of the decision rules (including statistics about the examples covered by each rule, in the case of rules learned from examples), rather than statistics characterizing the frequency of training examples per decision class, per attribute-value, or per conjunction of both. Other differences are that the branches may be assigned an internal disjunction of values (not only a single value, as in a typical decision tree), and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be attributes or names standing for logical or mathematical expressions that involve several attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably (to distinguish between an attribute and a name standing for an expression, the latter is called a "constructed attribute").

At each step, the method chooses from the available set of tests the one that has the highest utility (see below) for the given set of decision rules. This test is assigned to the node. The branches stemming from this node are assigned test values or disjoint groups of values (in the form of a logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each branch is associated with a reduced set of rules, determined by removing conditions in which the selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset indicate the same decision class, a leaf node is created and assigned this decision class. The process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further because some attribute is declared unavailable (infinite cost), then the leaf is assigned a set of candidate decisions with associated probabilities (see Section 4.2).

The test (attribute) utility is a combination of one or more of the following elementary criteria: 1) cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which captures the effectiveness of the test in discriminating among decision rules for different decision classes; 3) importance, which determines the importance of a test in the rules; 4) value distribution, which characterizes the distribution of the test importance over its values; and 5) dominance, which measures the test's presence in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, i.e., the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and decision rule sets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote the sets of values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm, respectively. If the ruleset for some class, say Ci, contains a rule that does not involve test A, then Vi is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by:

D(A, Ci, Cj) = 0, if Vi ⊆ Vj
             = 1, if Vi ⊃ Vj
             = 2, if Vi ∩ Vj ≠ φ  &  Vi ∩ Vj ≠ Vi  &  Vi ∩ Vj ≠ Vj        (3-1)
             = 3, if Vi ∩ Vj = φ

where φ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) might seem to yield an improved criterion; however, it would not clearly distinguish between the two cases (i.e., both situations would receive similar disjointness). The current equation is better because it gives higher scores to attributes that classify different subsets of the two decision classes than to attributes that classify only a subset of one decision class.

Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the sum of the degrees of class disjointness over all decision classes:

Disjointness(A) = Σ_{i=1..m} D(A, Ci),   where   D(A, Ci) = Σ_{j=1..m, j≠i} D(A, Ci, Cj)        (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the one with the smaller number of values is selected.
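To make the criterion concrete, the following is a minimal Python sketch (illustrative only, not the AQDT-2 implementation; names are ours) of equations (3-1) and (3-2), where each class is summarized by the set of values its rules use for the attribute:

```python
def pair_disjointness(vi, vj):
    """Degree of disjointness D(A, Ci, Cj) between two value sets (Eq. 3-1)."""
    vi, vj = set(vi), set(vj)
    if vi <= vj:          # Vi is a subset of (or equal to) Vj
        return 0
    if vi > vj:           # Vi is a proper superset of Vj
        return 1
    if vi & vj:           # non-empty intersection, neither set contained
        return 2
    return 3              # disjoint value sets

def disjointness(value_sets):
    """Disjointness(A): sum of D(A, Ci, Cj) over ordered class pairs (Eq. 3-2).
    value_sets maps each class name to the set of values of A in its rules."""
    classes = list(value_sets)
    return sum(pair_disjointness(value_sets[ci], value_sets[cj])
               for ci in classes for cj in classes if ci != cj)

# Two classes with disjoint value sets reach the maximum 3*m*(m-1) = 6:
print(disjointness({"C1": {1, 2}, "C2": {3, 4}}))  # -> 6
```

Note that for m = 2 the achievable totals per class pair are 0, 1, 4, and 6, matching the values used in the proof of Theorem 2 below.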

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined from the root of the tree to any leaf node in order to reach a decision.

Definition: A decision structure is a one-node-per-level decision structure if at each level there is only one node and zero or more leaves. Such a decision structure can be generated by combining into a single branch all branches whose associated sets of decision rules belong to more than one decision class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes Ci and Cj. There are three cases: 1) one value set is a subset (equivalently, the other is a superset) of the other; 2) the value sets have a non-empty intersection but neither is a subset of the other; 3) the value sets do not intersect. Figure 3-3 shows all possible distributions (the case of identical value sets in both classes is trivial). Assume that branches leading to one subset with the same decision class are combined into one branch. In the first case, there are only two branches: one leads to a leaf node, and the other leads to an intermediate node where another attribute must be selected. The minimum ANT in this case is 5/3. In the second case, three branches are created. Two branches lead to leaf nodes, where all values at each branch belong to only one (and a different) decision class. The third branch leads to an intermediate node where another attribute must be selected to further classify the decision classes. The minimum ANT in this case is 6/4. In the third case, only two branches are generated, each leading to a leaf node with a different decision class; here the minimum ANT is 1.

Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that when more than one attribute-value occurs at branches leading to leaves belonging to one decision class, those branches are combined into one branch in the decision structure. A marked node in the figure means that another attribute is needed to classify the two decision classes; in such cases there will be at least two additional paths.

D(A, Ci) = 0, D(A, Cj) = 1;   D(A, Ci) = 2, D(A, Cj) = 2;   D(A, Ci) = 3, D(A, Cj) = 3

Figure 3-3: Venn diagrams of the possible combinations of attribute values in two decision classes

The average number of tests required for making a decision in each possible case is shown in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks highly the attributes that reduce the average number of tests required for decision-making. The theorem can be proved similarly in the general case.

ANT = 3/2;   ANT = 5/3;   ANT = 1

Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3 (a marked node means at least one more attribute is needed to complete the decision tree)


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute for classifying the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes, A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, there are more decision classes for which D(A, Ci) < D(B, Ci) than classes for which D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence, there are more pairs of decision classes with D(A, Ci, Cj) < D(B, Ci, Cj) than pairs with D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the pairwise disjointness of any attribute are 0, 1, 4, or 6. For all positive values D(B) = 1, 4, or 6, it is clear that attribute B has a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A. Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is a better classifier than A.

Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples covered by the rules involving this test. Decision rules learned by an AQ learning program are accompanied by information on their strength. Rule strength is characterized by the t-weight and the u-weight. The t-weight (total weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the t-weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated with class Ci denoted by ri, the importance score is defined as follows.


Definition 3-3: The importance score IS(Aj) of the test Aj is determined by

IS(Aj) = Σ_{i=1..m} IS(Aj, Ci)        (3-3.1)

where

IS(Aj, Ci) = Σ_{k=1..ri} Rik(Aj)        (3-3.2)

and Rik(Aj), the weight of test Aj in rule Rik of class Ci, is given by

Rik(Aj) = t-weight, if Aj belongs to rule Rik;  0 otherwise        (3-4)

where i = 1, ..., m; k = 1, ..., ri; and j = 1, ..., n.
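Definition 3-3 amounts to a weighted count over the rules. The following sketch (illustrative only; the rule representation is an assumption, not the AQDT-2 format) computes IS for rules stored as (conditions, t-weight) pairs grouped by class:

```python
def importance_score(attribute, rulesets):
    """IS(Aj): sum of the t-weights of all rules whose condition part
    uses the given attribute (Definition 3-3)."""
    return sum(t_weight
               for rules in rulesets.values()        # rules of each class Ci
               for conditions, t_weight in rules
               if attribute in conditions)           # Rik(Aj) = t-weight or 0

# Hypothetical rulesets: conditions map attribute names to value sets.
rulesets = {
    "T1": [({"x1": {2}, "x2": {2}}, 5)],
    "T2": [({"x1": {1, 2}, "x2": {3, 4}}, 3), ({"x3": {1, 2}}, 2)],
}
print(importance_score("x1", rulesets))  # -> 8 (x1 appears in two rules)
print(importance_score("x3", rulesets))  # -> 2
```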

The importance score method has been separately compared, as a feature selection method, with a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method produced equal or higher accuracy on three real-world problems than that reported for the GA method, while selecting fewer attributes. In addition, the IS method was significantly faster than the GA.

Value distribution: The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: The value distribution VD(Aj) of a test Aj is defined by

VD(Aj) = IS(Aj) / vj        (3-5)

where vj is the number of legal values of Aj.

Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore, for computing the dominance, the rules are counted as if they were converted to rules without internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is multiplied out to two rules with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
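The multiplying-out step is a cartesian product over the value sets of a rule's conditions. A small sketch (our representation, not the AQDT-2 source):

```python
from itertools import product

def multiplied_out(conditions):
    """Expand a condition part with internal disjunctions, e.g.
    [x3=1 v 3] & [x4=1] -> [x3=1]&[x4=1] and [x3=3]&[x4=1]."""
    attrs = list(conditions)
    return [dict(zip(attrs, combo))
            for combo in product(*(sorted(conditions[a]) for a in attrs))]

def dominance(attribute, rules):
    """Count multiplied-out rules whose condition part contains the attribute."""
    return sum(len(multiplied_out(conds))
               for conds in rules if attribute in conds)

rules = [{"x3": {1, 3}, "x4": {1}}, {"x2": {2}}]
print(dominance("x3", rules))  # -> 2 (the first rule multiplies out to two)
```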

The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). A LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed as a percentage. The criteria are applied to tests in the order defined by the LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (measured from the top value). The default LEF is

<Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>        (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percentages); their default values are 0. The default value of the cost of each test is 1.

The above LEF ranks attributes in the following way. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the next criterion (importance). If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the next criterion, value distribution (the normalized IS), is used, and then similarly the last criterion (dominance). If there is still a tie, the method selects among the best attributes randomly.
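The filtering behavior of a LEF can be sketched as follows (a simplification under our own representation: scores are precomputed per attribute, higher is better, and a criterion to be minimized, such as cost, would be negated before use):

```python
def lef_select(candidates, criteria):
    """Apply criteria in order; after each, keep only the candidates scoring
    within the tolerance (in percent) of the best score.
    criteria: list of (scores_dict, tolerance_percent) pairs."""
    for scores, tolerance in criteria:
        best = max(scores[a] for a in candidates)
        cutoff = best - abs(best) * tolerance / 100.0
        candidates = [a for a in candidates if scores[a] >= cutoff]
        if len(candidates) == 1:        # a unique winner ends the evaluation
            break
    return candidates  # remaining ties would be broken randomly

# Hypothetical scores for three attributes:
disjointness = {"x1": 11, "x2": 9, "x3": 6}
importance   = {"x1": 10, "x2": 14, "x3": 5}
# With a 20% tolerance on disjointness, x1 and x2 both survive the first
# criterion, and importance then decides:
print(lef_select(["x1", "x2", "x3"], [(disjointness, 20), (importance, 0)]))  # -> ['x2']
```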

If there is a non-uniform frequency distribution of examples of different classes, then the selection criterion uses a modified definition of the disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified into a given class:

Disjointness(A) = Σ_{i=1..m} D(A, Ci) · Frq(Ci)        (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF

<Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>        (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.

3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting, at each step, the best test according to the ranking criteria described above and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is also connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first, in descending order of the number of rules that contain the attribute, and second, in ascending order of the number of the attribute's legal values.
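A dataclass rendering of these structures might look as follows (the field names are illustrative, not taken from the AQDT-2 source):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Rule:
    conditions: Dict[str, Set[int]]   # attribute -> value set (internal disjunction)
    t_weight: int = 1                 # number of examples covered by the rule

@dataclass
class DecisionClass:
    name: str
    frequency: float = 1.0            # expected class frequency
    rules: List[Rule] = field(default_factory=list)

@dataclass
class Attribute:
    name: str
    domain: List[int]                 # legal values
    cost: float = 1.0                 # default evaluation cost is 1
    rule_count: int = 0               # rules whose conditions mention it
```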

The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.

To generate decision structures from rules, the AQDT-2 method prefers either characteristic or discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The AQDT-2 algorithm is as follows.

The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent this attribute.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it a group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing a condition [A = i v ...]; if a branch is assigned values i v j, then associate with it all rules containing a condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with the given branch constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree end in leaf nodes, stop; otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
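The steps above can be sketched in compressed form as follows. This is an illustration, not the AQDT-2 source: a rule is a (class, conditions) pair, and attribute selection is reduced to the disjointness criterion alone for brevity (the full LEF would be used in practice).

```python
def pair_disj(vi, vj):
    if vi <= vj: return 0            # subset (Eq. 3-1)
    if vi > vj:  return 1            # proper superset
    return 2 if vi & vj else 3       # partial overlap / disjoint

def build_tree(rules, domains):
    """rules: list of (class_name, {attribute: value_set}) pairs."""
    classes = sorted({cls for cls, _ in rules})
    if len(classes) == 1:                       # Step 4: leaf node
        return classes[0]
    def disjointness(attr):
        # a rule that does not mention attr contributes the whole domain
        vsets = {c: set().union(*(cond.get(attr, set(domains[attr]))
                                  for cl, cond in rules if cl == c))
                 for c in classes}
        return sum(pair_disj(vsets[a], vsets[b])
                   for a in classes for b in classes if a != b)
    used = [a for a in domains if any(a in cond for _, cond in rules)]
    best = max(used, key=disjointness)          # Step 1: highest-ranked test
    branches = {}
    for value in domains[best]:                 # Step 2: one branch per value
        # Step 3: rules not mentioning `best` go to every branch (consensus
        # law); in matching rules the condition on `best` is removed
        reduced = [(cl, {a: v for a, v in cond.items() if a != best})
                   for cl, cond in rules
                   if best not in cond or value in cond[best]]
        if reduced:
            branches[value] = build_tree(reduced, domains)
    return (best, branches)

# The rules of Figure 3-6 (Section 3.3.3):
rules = [
    ("T1", {"x1": {2}, "x2": {2}}),
    ("T1", {"x1": {3}, "x3": {1, 3}, "x4": {1}}),
    ("T2", {"x1": {1, 2}, "x2": {3, 4}}),
    ("T2", {"x1": {3}, "x3": {1, 2}, "x4": {2}}),
    ("T3", {"x1": {1}, "x2": {1}}),
    ("T3", {"x1": {4}, "x3": {2, 3}, "x4": {3}}),
]
domains = {"x1": [1, 2, 3, 4], "x2": [1, 2, 3, 4], "x3": [1, 2, 3], "x4": [1, 2, 3]}
root, branches = build_tree(rules, domains)
print(root)          # -> x1 (disjointness 11)
print(branches[4])   # -> T3 (all rules with [x1=4] belong to T3)
```

Running this on the example rules reproduces the tree discussed in Section 3.3.3: x1 at the root, a T3 leaf under x1 = 4, and x4 as the test under x1 = 3.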

To select an attribute to become a node of the decision tree (Steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all decision rules and determines information about each attribute: the importance score of the attribute, the number of rules containing it, its disjoint value sets, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF; it evaluates each attribute's disjointness for each decision class against the other decision classes.


To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute-value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (over all decision classes):

r = Σ_{i=1..m} Ri        (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as

Cmpx(Iter1) = O(r · s)

In the second iteration, the disjointness is calculated between the decision classes for all attributes. The complexity of the second iteration is given by

Cmpx(Iter2) = O(n · m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

l = max{m, r}        (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node of the decision tree, say the node complexity NC(AQDT), is given by

NC(AQDT) = O(l · n)

Usually l equals the number of rules associated with the given node. Thus the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The level complexity of the AQDT algorithm, LC(AQDT), is given by

LC(AQDT) < O(l · n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm to be (l · s · o), where o is the number of non-leaf nodes at the given level; in such cases, either (l · o ≤ r) or (l · s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by

LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)

a) per one level        b) per one path

Figure 3-5: Decision trees showing the maximum number of non-leaf nodes

Note also that after an attribute is selected for the root of the decision structure, this attribute and all conditions containing it are removed from the data structure of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch are not tested again.

Since the disjointness criterion selects the attribute that minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels of a decision tree is expected to be less than or equal to the minimum of the number of attributes and the number of rules. Let k be the number of levels in a given decision tree:

k ≤ min{n, r}        (3-10)

There are two cases representing the most complex situations, shown in Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels is a function of the logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by

Complexity(AQDT) = O(l · n · log r)        (3-11)

The other situation occurs when the generated decision tree has the maximum number of levels. The maximum possible number of levels of a decision tree equals one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, it is unlikely to obtain such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In this case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus the level complexity for this decision tree is estimated as

LC(AQDT) = O(l · log n)

The maximum number of levels in such a decision tree is k-1. Thus the complexity of the AQDT algorithm in such cases is given by

Complexity(AQDT) = O(l · k · log n)        (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11), and (3-12), the complexity of the AQDT algorithm is bounded by

Cmplx(AQDT) = O(r · k · log l)        (3-13)

3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT is used in selecting an optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of a tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the selection process

T1 <= [x1=2] & [x2=2]  v  [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4]  v  [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1]  v  [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software

These rules can be interpreted as follows:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phase, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phase, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phase, and you need a semi-automated tool.


Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is treated as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A. The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.

From the rules in Figure 3-6, we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: for each attribute, determine the sets of values that the attribute takes in the individual decision rules, and remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). The value set {1, 2} is removed, as it subsumes {2} and {1}; in this case, branches are assigned the individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}; in this case, branches are assigned the value sets {1}, {2}, and {3, 4}.
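This grouping step can be sketched in a few lines (illustrative only, not the AQDT-2 source): collect the value sets an attribute takes in the individual rules and keep only those that do not strictly contain another listed set.

```python
def branch_value_sets(value_sets):
    """Drop every value set that strictly contains another listed set;
    the survivors become the branch labels for this attribute."""
    sets = [frozenset(v) for v in value_sets]
    kept = {s for s in sets if not any(other < s for other in sets)}
    return sorted(map(sorted, kept))

# x1 in Figure 3-6: {1, 2} subsumes {1} and {2}, so individual values remain
print(branch_value_sets([{2}, {3}, {1, 2}, {1}, {4}]))      # -> [[1], [2], [3], [4]]
# x2: the full domain {1, 2, 3, 4} (from rules without x2) is removed
print(branch_value_sets([{2}, {3, 4}, {1}, {1, 2, 3, 4}]))  # -> [[1], [2], [3, 4]]
```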


Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf T3. Rules containing the other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is complete. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6; it can be used for making decisions on which tools to use for testing a given software system.

Figure 3-7: A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)

Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3, and 4). Rules are represented by collections of cells at the intersections of the rows and columns corresponding to the conditions in the rules. The shaded areas correspond to decision rules; rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31, and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].

Figure 3-8: a) Decision rules; b) Derived decision tree

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools; in other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, in which the type of the tool was ignored. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.


a) Ignoring the supporting metric        b) Ignoring the type of the tool

Figure 3-9: Decision trees learned ignoring the supporting metric and the type of the testing tool

It is clear that the given set of rules depends strongly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest ranked attribute, cannot be measured. The algorithm then selects x4 as the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.


Figure 3-10 A decision tree learned without the cost attribute

3.4 Tailoring Decision Structures to the Decision-Making Situation

Decision structures are among the simplest structures for organizing a decision-making process

A decision structure specifies explicitly the order in which attributes of an object or situation

need to be evaluated in the process of determining a decision A standard way to generate a


decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy, but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they are built not from examples, but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm and a new system, AQDT-2, that transforms decision rules into task-oriented decision structures are presented. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.

Decision-making situations can vary in several respects. In some situations complete information about a data item is available (i.e., values of all attributes are specified); in others the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; 4) providing alternative decisions with an estimate of the likelihood of their correctness when the needed information cannot be provided.


3.4.1 Learning Cost-Dependent Decision Structures

As described in Sec. 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0%. This means that only the least expensive attributes pass to the next step of evaluation, involving other elementary criteria. If an attribute has a high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.
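A minimal sketch of how such a lexicographic evaluation with tolerances can work (the attribute names, costs and scores below are illustrative, and the function is an assumption about the mechanism, not the actual AQDT-2 code):

```python
def lef_select(attributes, criteria):
    """Select an attribute by LEF: apply elementary criteria in order;
    each criterion keeps only the candidates whose score is within its
    tolerance of the best score, and ties pass to the next criterion."""
    candidates = list(attributes)
    for score, tolerance in criteria:
        best = max(score(a) for a in candidates)
        candidates = [a for a in candidates
                      if score(a) >= best - tolerance * abs(best)]
        if len(candidates) == 1:
            break
    return candidates[0]

# Illustrative situation: x1 cannot be measured (infinite cost), so the
# cost criterion (tolerance 0%) removes it; disjointness then decides.
cost = {"x1": float("inf"), "x2": 1.0, "x4": 1.0}
disjointness = {"x1": 9, "x2": 5, "x4": 7}
root = lef_select(["x1", "x2", "x4"],
                  [(lambda a: -cost[a], 0.0),          # cheaper is better
                   (lambda a: disjointness[a], 0.0)])  # then most disjoint
# root == "x4", mirroring the tree built without the cost attribute
```

With a 0% tolerance, only the cheapest attributes survive the first criterion, exactly as described above; a nonzero tolerance would let near-cheapest attributes compete on the later criteria.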

3.4.2 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained, but a decision has to be made, it is useful to know the probability distribution for different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i=1, 2, ..., m, at that node, given that an example to be classified has attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have:

P(Ci | b1, ..., bk) = P(Ci) P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from different classes, we have:

P(Ci) = twi / Σ(j=1..m) twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σ(j=1..m) wj / Σ(j=1..m) twj    (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain:

P(Ci | b1, ..., bk) = wi / Σ(j=1..m) wj    (3-13)
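As a concrete illustration, equation (3-13) amounts to a simple frequency ratio at the node; the sketch below (a hypothetical helper, not the actual AQDT-2 code) computes it:

```python
def node_class_probabilities(w):
    """Estimate P(Ci | b1, ..., bk) from the numbers w[i] of training
    examples of each class Ci that passed the tests leading to the node,
    as in equation (3-13): P(Ci | node) = w[i] / sum(w)."""
    total = sum(w)
    if total == 0:
        return [0.0] * len(w)  # no training example reaches this node
    return [wi / total for wi in w]

# Four classes with w1=31, w2=11, w3=0, w4=5 examples reaching a node:
probs = node_class_probabilities([31, 11, 0, 5])
# probs is approximately [0.66, 0.23, 0.0, 0.11]
```

Note that the per-class totals twi cancel out of the final estimate, which is why only the counts at the node itself are needed.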

A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

3.4.3 Coping with Noise in Training Data

The proposed methodology can be easily extended to handle the problem of learning from noisy training data. This is done based on ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
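The truncation step itself is simple; a minimal sketch (the rule representation is an assumption made for illustration, not the AQDT-2 data structure):

```python
def truncate_rules(rules, noise_level):
    """Remove rules whose t-weight represents `noise_level` (a fraction,
    e.g. 0.10) or less of the training examples of the rule's class.

    Each rule is a (class_label, t_weight) pair, where the t-weight is
    the number of training examples the rule covers."""
    totals = {}
    for label, t in rules:
        totals[label] = totals.get(label, 0) + t
    return [(label, t) for label, t in rules
            if t > noise_level * totals[label]]

# Two classes; the light rules covering 2 of 20 and 1 of 29 examples are
# dropped under an assumed 10% expected noise level.
rules = [("C1", 18), ("C1", 2), ("C2", 28), ("C2", 1)]
kept = truncate_rules(rules, 0.10)
# kept == [("C1", 18), ("C2", 28)]
```

Because the test is per class, a rule is judged against the coverage of its own class only, which matches the description of the threshold above.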


3.5 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes. The problem has ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem, the following are disjoint rules learned by AQ15c from the given data:

Play <:: [outlook = overcast]
Play <:: [outlook = sunny] & [humidity <= 75]
Play <:: [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute to be the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion may evaluate the two attributes. In Mingers' experiment, all criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all other criteria.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.

The table shows that the disjointness criterion outperformed all other criteria in Table 2-6, including the gain ratio criterion, when it was applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion would perform better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria are applied to the learned rules, they provide very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the maximum balanced appearance of its values in different rules. The dominance criterion prefers attributes that exist in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
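To make the disjointness criterion concrete, here is one plausible reading of it as code, assuming a 0-3 score per pair of decision classes (3 when the attribute's value sets in the two rulesets are disjoint, 0 when they are identical); the scale and helper names are assumptions, not the exact AQDT-2 definition:

```python
from itertools import combinations

def pair_score(vi, vj):
    """Score how well an attribute separates two classes, given the sets
    of its values appearing in each class's rules (assumed 0-3 scale)."""
    if not (vi & vj):
        return 3          # disjoint value sets: perfect discrimination
    if vi == vj:
        return 0          # identical value sets: no discrimination
    if vi <= vj or vj <= vi:
        return 1          # one set contains the other
    return 2              # partial overlap

def disjointness(value_sets):
    """Sum of pairwise scores over all pairs of decision classes;
    value_sets maps each class to the set of values of the attribute
    appearing in that class's rules."""
    return sum(pair_score(value_sets[a], value_sets[b])
               for a, b in combinations(value_sets, 2))

# An attribute whose values split cleanly between two classes outranks
# one whose values overlap:
score_x = disjointness({"C1": {1, 2}, "C2": {3, 4}})        # disjoint
score_y = disjointness({"C1": {1, 2, 3}, "C2": {2, 3, 4}})  # overlapping
```

Under this reading, an attribute whose value sets are pairwise disjoint across classes receives the maximum score and would be ranked first by the criterion.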

3.6 Decision Structures vs. Decision Trees

This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.

54

Table 3-7: A comparison between Decision Structures and Decision Trees

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11. Both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes; this decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.


a) Using the Disjointness criterion (P = Positive, N = Negative; no. of nodes: 5)
b) Using the Importance score criterion (P = Positive, N = Negative; no. of nodes: 7, no. of leaves: 9)
Figure 3-11: Decision structures learned by AQDT-2 using different criteria

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the Imam's example, that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example rests on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1=x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "-" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples for the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

a) Training examples; b) The optimal decision tree
Figure 3-12: The Imam's example: a problem where learning decision structures (trees) from rules is better than learning them from examples

AQ15c learned the following rules from this data:

P <:: [x1=1][x2=1] v [x1=2][x2=2]
N <:: [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
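The failure mode can be reproduced with a small sketch (illustrative data with the same structure as the example, not the original 24-example set): for an XOR-style concept with balanced value frequencies, the information gain of both relevant attributes is exactly zero.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Information gain of splitting `examples` (dicts with a 'class'
    key) on attribute `attr`."""
    n = len(examples)
    gain = entropy([e["class"] for e in examples])
    for v in {e[attr] for e in examples}:
        subset = [e["class"] for e in examples if e[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# P if x1 == x2, N otherwise; every (x1, x2) combination equally frequent.
examples = [{"x1": a, "x2": b, "class": "P" if a == b else "N"}
            for a in (1, 2) for b in (1, 2) for _ in range(2)]
# Both relevant attributes look useless to a frequency-based criterion:
# info_gain(examples, "x1") and info_gain(examples, "x2") are both 0.0
```

Each branch of x1 (or x2) still contains an equal mix of P and N examples, so the split reduces no entropy, even though x1 and x2 jointly determine the class completely. Rules, by contrast, capture the x1-x2 interaction directly.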

An example of a problem in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <:: [x1=2] v [x2=2]
N <:: [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem, learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10/9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and .85 for the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined using the new attribute "x1=2 v x2=2", with values 0 for "no" and 1 for "yes".

a) The training data; b) The correct decision tree
Figure 3-13: An example where decision rules are simpler than decision trees

CHAPTER 4 Empirical Analysis and Comparative Study

This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments are applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
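This sampling protocol can be sketched as follows (a hypothetical helper mirroring the procedure, not the scripts actually used in the experiments):

```python
import random

def learning_curve_splits(examples, fractions, n_samples, seed=0):
    """For each relative training size, draw `n_samples` random training
    samples and pair each with its complement as the testing set."""
    rng = random.Random(seed)
    splits = []
    for frac in fractions:
        k = round(frac * len(examples))
        for _ in range(n_samples):
            chosen = set(rng.sample(range(len(examples)), k))
            train = [e for i, e in enumerate(examples) if i in chosen]
            test = [e for i, e in enumerate(examples) if i not in chosen]
            splits.append((frac, train, test))
    return splits

# 9 sizes (10%, ..., 90%) x 100 samples = 900 train/test pairs per
# problem, as described above (335 examples, as in the wind bracing data).
splits = learning_curve_splits(list(range(335)),
                               [i / 10 for i in range(1, 10)], 100)
```

Because each testing set is the exact complement of its training sample, accuracy estimates at different sizes are directly comparable.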


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (best path from top to bottom), in terms of accuracy, time and complexity, were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.

Figure 4-1: Design of a complete experiment


For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed where the learning system AQ17 was used instead of AQ15c. The analysis of some experiments included visualization of the training examples and the concept learned by each learning program (AQ15c, AQDT-2 and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database), 9 different relative sample sizes of training examples are selected (10%, ..., 90%). 100 random samples of each size are drawn from the original data for training; the 100 sets which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).

162 different parametrical experiments per training dataset (18 x 9); 16,200 experiments per sample size (9 samples); 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2); 199,800 experiments per problem (first portion + C4.5 + constructive induction); 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.); 73 days (estimated running time).

The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection following that describes a partial or full experimental analysis of one of the other problems.

4.2 Experiments with Average-Size, Complex and Noise-Free Problems: Wind Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

Decision class C1:
1 [x1=1][x6=1][x2=1 v 2][x3=1 v 2][x4=1 v 3][x5=1 v 2][x7=1 v 3] (t:18, u:18)
2 [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1 v 3][x7=1 v 3 v 4] (t:3, u:3)
3 [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2 v 3] (t:2, u:2)
4 [x1=1][x6=1][x2=2][x3=1 v 2][x4=3][x5=1 v 2][x7=4] (t:2, u:2)
5 [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1 v 2] (t:2, u:2)
6 [x1=1][x3=1][x6=1][x2=2][x4=1 v 3][x7=1 v 3][x5=3] (t:2, u:2)
7 [x1=2][x5=2][x2=1][x6=1][x3=1 v 2][x4=3][x7=4] (t:2, u:2)

Decision class C2:
1 [x1=2..4][x2=1 v 2][x3=1 v 2][x4=3][x5=2 v 3][x6=1][x7=2 v 3] (t:28, u:19)
2 [x1=2..4][x2=2][x3=1 v 2][x4=3][x5=1 v 2][x6=1][x7=3 v 4] (t:17, u:6)
3 [x1=2..4][x2=1 v 2][x3=1 v 2][x4=3][x5=1][x6=1][x7=3 v 4] (t:10, u:4)
4 [x1=1 v 3 v 5][x2=1 v 2][x3=1 v 2][x4=3][x5=3][x6=1][x7=2 v 4] (t:10, u:2)
5 [x1=3..5][x2=1 v 2][x3=1 v 2][x4=3][x5=2 v 3][x6=1][x7=1 v 4] (t:9, u:4)
6 [x1=2][x2=1 v 2][x3=1 v 2][x5=1 v 2 v 3][x4=1][x6=1][x7=1] (t:7, u:6)
7 [x1=3 v 4][x2=2][x3=2][x4=1 v 3][x5=1 v 3][x6=1][x7=1 v 2] (t:6, u:4)
8 [x1=3..5][x2=2][x3=1][x7=1][x4=1 v 2][x5=1 v 2 v 3][x6=1 v 3] (t:5, u:5)
9 [x1=1][x2=1][x6=1][x3=1 v 2][x4=3][x5=1 v 2][x7=4] (t:4, u:4)
10 [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1 v 2][x7=1 v 3] (t:4, u:4)
11 [x1=1 v 2][x2=1][x6=1][x3=1 v 2][x4=1 v 3][x5=3][x7=1 v 4] (t:4, u:2)

Decision class C3:
1 [x1=2..5][x2=1 v 2][x3=1 v 2][x7=1..4][x4=1 v 2][x5=1 v 3][x6=2..4] (t:41, u:32)
2 [x1=1..4][x2=1 v 2][x3=1 v 2][x4=2][x5=2][x6=2 v 3][x7=2 v 4] (t:27, u:20)
3 [x1=1..3][x2=1][x3=1 v 2][x7=1..4][x4=2][x5=1 v 2][x6=2 v 3] (t:19, u:6)
4 [x1=1 v 2 v 4][x2=1 v 2][x3=1 v 2][x4=2][x5=2 v 3][x6=3 v 4][x7=1] (t:13, u:8)
5 [x1=5][x2=2][x4=2][x5=2][x3=1 v 2][x6=3][x7=2 v 4] (t:5, u:5)

Decision class C4:
1 [x1=5][x2=2][x3=2][x4=1 v 3][x5=1][x6=1][x7=1 v 4] (t:4, u:4)
2 [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t:1, u:1)

Figure 4-2: Decision rules determined by AQ15c from the wind bracing data


Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by the values of x6 (in general, it could be groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for each branch until all rules assigned to the branch are of the same class. That class is then assigned to the leaf.
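The loop just described can be outlined in code. The sketch below is a simplification (single-value branches, an attribute-ranking function supplied by the caller, and none of AQDT-2's value grouping, LEF machinery or subsumption removal), with a hypothetical rule representation:

```python
def build_structure(rules, attrs, rank):
    """Recursively build a decision structure (nested dict) from rules.

    rules: list of (class_label, conditions), where conditions maps an
           attribute to the set of values it allows (absent = any value).
    rank:  scoring function rank(attr, rules); the best attribute wins.
    """
    classes = {label for label, _ in rules}
    if len(classes) == 1:            # all remaining rules agree: a leaf
        return next(iter(classes))
    if not attrs:                    # indeterminate leaf (class disjunction)
        return classes
    best = max(attrs, key=lambda a: rank(a, rules))
    values = {v for _, cond in rules for v in cond.get(best, set())}
    branches = {}
    for v in values:
        subset = [(label, cond) for label, cond in rules
                  if v in cond.get(best, {v})]   # absent attribute matches
        branches[v] = build_structure(subset,
                                      [a for a in attrs if a != best], rank)
    return {"attr": best, "branches": branches}

# Three of the testing-tool rules from Section 3 (R11, R21, R31):
rules = [("T1", {"x1": {2}, "x2": {2}}),
         ("T2", {"x1": {1, 2}, "x2": {3, 4}}),
         ("T3", {"x1": {1}, "x2": {1}})]
count = lambda a, rs: sum(a in cond for _, cond in rs)  # a crude rank
tree = build_structure(rules, ["x1", "x2"], count)
```

In AQDT-2 the `rank` function would be the LEF combination of the elementary criteria (disjointness, importance, value distribution, dominance); here a simple occurrence count stands in for it.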

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 were misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 is ended by a leaf, C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated, only for those rules which contain [x6=1] as one of their conjunctions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.

For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.

The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some unclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatched.

Complexity: no. of nodes: 17, no. of leaves: 43
Figure 4-3: A decision tree learned by C4.5 for the wind bracing data

Figure 4-4 shows the decision structure learned with the default settings of the AQDT-2 parameters from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples results in 102 examples matched correctly and 13 examples mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the "?" (indefinite) decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.

Complexity: no. of nodes: 5, no. of leaves: 9
Figure 4-4: A decision structure learned from the AQ15c wind bracing rules

Complexity: no. of nodes: 6, no. of leaves: 8
Figure 4-5: A decision structure that does not contain attribute x1

Figure 4-6 presents the decision structure from Figure 4-5 in which the leaves were assigned candidate decisions with decision class probability estimates. Let us consider the node x2. The example frequencies were w1=31, tw1=45, w2=11, tw2=139, w3=0, tw3=169, and w4=5, tw4=5. Using equation (3-13), the probability estimates for classes C1, C2, C3 and C4 under the node x2 can be approximated as P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11.

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that the rules whose combined t-weight represented 10% or less coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).

Complexity: no. of nodes: 5, no. of leaves: 7
Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves

Complexity: no. of nodes: 3, no. of leaves: 5
Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules covering less than 10% of the training examples per decision class

To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves; the predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves; the predictive accuracy of this decision structure was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram indicates different decision classes with different shades. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4 or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.

("?" means the system cannot produce a decision without the missing attribute.)
Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings of AQ15c (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint or ordered, i.e., decision lists; and three beam search widths: 1, 5 and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and <Chr, Intr, 1>; see Table 4-2) were selected for the experiments with Subsystem II.

These experiments were performed for four learning problems (the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the Wind Bracing problem (Arciszewski et al., 1992)). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of the rules learned by AQ15c from examples, and the predictive accuracy of the decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either of the two programs on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training example set.


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I are fixed and selected parameters of Subsystem II are modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint mode. For each dataset, the result reported from each experiment is calculated as the average of 100 runs on different training data, for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.

(Panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: the relative sample sizes (%) of the training data; curves: AQDT-2 and AQ15c.)
Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem

Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained with the default settings of AQDT-2: the default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that, with the wind bracing data, it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.

Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.

[Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data. Two panels (<Disj, Char> and <Intr, Char>) plot predictive accuracy against the relative sample size (%) of the training data.]

[Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data. Three panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]

4.3 Experiments With Small Size, Simple and Noise-Free Problems: MONK-1

This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no).
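The target concept of MONK-1 is "head-shape equals body-shape, or jacket-color is red" (Thrun, Mitchell & Cheng, 1991). A compact encoding of this concept, assuming integer value codes with x5 = 1 meaning red:

```python
from itertools import product

def monk1_class(x1, x2, x3, x4, x5, x6):
    """Target concept of MONK-1: positive iff head-shape equals body-shape
    or the jacket color is red (encoded here as x5 == 1)."""
    return "Positive" if x1 == x2 or x5 == 1 else "Negative"

# Enumerate the full representation space: 3 * 3 * 2 * 3 * 4 * 2 = 432 examples.
space = list(product(range(1, 4), range(1, 4), range(1, 3),
                     range(1, 4), range(1, 5), range(1, 3)))
positives = sum(monk1_class(*e) == "Positive" for e in space)
print(len(space), positives)  # 432 216
```

The concept covers exactly half of the 432-example space, which is why balanced training samples of positives and negatives are natural for this problem.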


The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus, the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.

[Figure 4-12: A visualization diagram of the MONK-1 problem, plotted over attributes x6, x5, and x4.]

The AQDT-2 program, running in its default mode with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with 41 nodes. For comparison, the C4.5 program for learning decision trees from examples was also applied to this same problem.

Positive rules:                Negative rules:
1. [x5 = 1]                    1. [x1 = 1][x2 = 2, 3][x5 = 2..4]
2. [x1 = 3][x2 = 3]            2. [x1 = 2][x2 = 1, 3][x5 = 2..4]
3. [x1 = 2][x2 = 2]            3. [x1 = 3][x2 = 1, 2][x5 = 2..4]
4. [x1 = 1][x2 = 1]

Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.

The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% of the examples and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These rules were:

Pos <= [x5 = 1] v [x1 = x2]    and    Neg <= [x5 ≠ 1] & [x1 ≠ x2]
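The constructive-induction step can be illustrated as deriving one equality attribute and re-expressing the concept over it; the names below are illustrative, not AQ17-DCI's internals:

```python
def add_equality_attribute(example):
    """Derive a boolean attribute that is 'T' exactly when x1 equals x2,
    in the spirit of the attribute AQ17-DCI constructed for MONK-1.
    Illustrative sketch only; attribute name is hypothetical."""
    extended = dict(example)
    extended["x1_eq_x2"] = "T" if example["x1"] == example["x2"] else "F"
    return extended

# With the derived attribute, the MONK-1 concept reduces to the compact rule
#   Pos <= [x5 = 1] v [x1_eq_x2 = T]
ex = add_equality_attribute({"x1": 2, "x2": 2, "x5": 3})
print(ex["x1_eq_x2"])  # T
```

Because a single test on the derived attribute replaces a set of pairwise tests on x1 and x2, the decision structure built over the extended attribute set can be much smaller, as Figure 4-15-b shows.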

[Table 4-3: A comparison of attribute selection criteria for the MONK-1 problem.]

From these rules, the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15-a).

[Figure 4-14: The decision tree for the MONK-1 problem (No. of nodes: 13, No. of leaves: 28; P = Positive, N = Negative).]

[Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem. a) Compact decision structure for AQ15 rules (No. of nodes: 5, No. of leaves: 7); b) compact decision structure for AQ17 rules (No. of nodes: 2, No. of leaves: 3).]

Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings: two types of decision rules (characteristic or discriminant), three coverage modes (intersecting, disjoint, or ordered, i.e., decision lists), and three widths of the beam search (1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy, <Char, Disj, 10> and <Char, Intr, 1>, were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.

Each value in that table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.
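The evaluation protocol used throughout these experiments — averaging over repeated random splits and testing each learned classifier on the complement of its training set — can be sketched as follows. The function name is a hypothetical stand-in; `learner` is any function mapping a training list to a `predict(example) -> class` function:

```python
import random

def average_accuracy(learner, examples, sample_fraction, runs=100, seed=0):
    """Average predictive accuracy over `runs` random splits, testing on
    the complement of each training set. Sketch of the protocol described
    in the text, not the actual experiment harness."""
    rng = random.Random(seed)
    n_train = max(1, int(len(examples) * sample_fraction))
    total = 0.0
    for _ in range(runs):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        predict = learner(train)
        total += sum(predict(x) == y for x, y in test) / len(test)
    return total / runs
```

Note that the test set shrinks as the training fraction grows, which is exactly why a single misclassification weighs more heavily at large sample sizes (a point the later sections return to).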

[Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem. Four panels (<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>) plot predictive accuracy against the relative sample size (%) of the training data.]

Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained with the default setting of AQDT-2. The default pre-pruning threshold is 3 and the default generalization degree is 10. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3. However, increasing the pre-pruning degree did not improve the predictive accuracy.

[Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data. Two panels (<Disj, Char> and <Intr, Char>) plot predictive accuracy against the relative sample size (%) of the training data for the default and modified pruning/generalization settings.]

Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.

[Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem. Three panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]

4.4 Experiments With Small Size, Complex and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using its original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.

[Figure 4-19: A visualization diagram of the MONK-2 problem, plotted over attributes x6, x5, and x4.]

Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems, <Char, Disj, 10> and <Char, Intr, 1>. They were selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training examples. Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

[Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem. Four panels (<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>) plot predictive accuracy against the relative sample size (%) of the training data.]

Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained with the default setting of AQDT-2. The default pre-pruning threshold is 3 and the default generalization degree is 10. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3. However, increasing the pre-pruning degree did not improve the predictive accuracy.

[Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data. Two panels (<Disj, Char> and <Intr, Char>) plot predictive accuracy against the relative sample size (%) of the training data.]

Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.

[Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem. Three panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]

4.5 Experiments With Small Size, Simple and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded areas and the plus signs in the unshaded areas are considered noisy examples, i.e., examples that are assigned the wrong decision class.

[Figure 4-23: A visualization diagram of the MONK-3 problem.]

Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed, and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3 and the default generalization degree is 10. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.

[Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem. Four panels (<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>) plot predictive accuracy against the relative sample size (%) of the training data.]

[Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data. Two panels plot predictive accuracy against the relative sample size (%) of the training data.]

Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason for the drop in predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

[Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem. Three panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]

Z 12

4.6 Experiments With Large, Complex and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number; 2) Clump Thickness; 3) Uniformity of Cell Size; 4) Uniformity of Cell Shape; 5) Marginal Adhesion; 6) Single Epithelial Cell Size; 7) Bare Nuclei; 8) Bland Chromatin; 9) Normal Nucleoli; and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).

In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.
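Since the sample code number carries no diagnostic information, it would be dropped before learning. A loader for data in this shape might look like the following; the comma-separated layout (sample code number first, class label last) is an assumption about the distribution format, not a documented API:

```python
import csv

def load_breast_cancer(path):
    """Parse one example per row: sample code number first (dropped),
    then nine integer-valued attributes, then the class label.
    Sketch under the assumed file layout described above."""
    examples = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            _code, *attrs, label = row  # discard the sample code number
            examples.append(([int(a) for a in attrs], label))
    return examples
```

Keeping an identifier attribute like the sample code number would let a tree learner split on it perfectly on training data while generalizing not at all, which is why it is excluded here.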

The reason for the drop in predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

[Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem. Three panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]

4.7 Experiments With Large, Complex and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes: 1) cap-shape; 2) cap-surface; 3) cap-color; 4) bruises; 5) odor; 6) gill-attachment; 7) gill-spacing; 8) gill-size; 9) gill-color; 10) stalk-shape; 11) stalk-root; 12) stalk-surface-above-ring; 13) stalk-surface-below-ring; 14) stalk-color-above-ring; 15) stalk-color-below-ring; 16) veil-type; 17) veil-color; 18) ring-number; 19) ring-type; 20) spore-print-color; 21) population; and 22) habitat.

To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.

In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning time is about the same.

The reason for the drop in predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

[Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem. Three panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]

4.8 Experiments With Small, Structured and Noise-Free Problems: East-West Trains

Learning task-oriented decision structures from structural data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes, eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different lengths). To describe the train problem in a format suitable for AQDT-2, a set of eight attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To identify the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
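The two-digit (ij) naming scheme amounts to flattening a variable-length structured description into one attribute-value example. A small sketch of this encoding (function names are illustrative):

```python
def attribute_label(car_position, attribute_number):
    """Build the two-digit label: first digit = car position (1-4),
    second digit = attribute number (1-8). E.g., x32 is attribute 2
    (the car shape) of car 3."""
    return f"x{car_position}{attribute_number}"

def flatten_train(cars):
    """Turn a list of per-car attribute-value lists (two to four cars)
    into a single attribute-value example of varying length."""
    return {attribute_label(i + 1, j + 1): value
            for i, car in enumerate(cars)
            for j, value in enumerate(car)}

# Two cars, each shown here with just two of the eight attributes for brevity.
flat = flatten_train([["short", "closed"], ["long", "open"]])
print(flat)  # {'x11': 'short', 'x12': 'closed', 'x21': 'long', 'x22': 'open'}
```

Trains with fewer cars simply produce examples with fewer attribute-value pairs, which matches AQDT-2's acceptance of variable-length examples.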

[Table 4-7: The set of attributes and their values used in the trains problem; i stands for the car number (1..4).]

Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains. Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more (14/14). In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other six trains correctly using a flexible matching method (Michalski et al., 1986).
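One simplified form of flexible matching scores a rule by the fraction of its conditions an example satisfies, then assigns the class of the best-scoring rule; the actual measure in Michalski et al. (1986) is more elaborate, so treat this only as a sketch:

```python
def flexible_match(rule, example):
    """Degree of match between an example and an AQ-style rule, taken here
    as the fraction of satisfied conditions. `rule` maps each attribute to
    its set of allowed values. Simplified illustration, not the published
    flexible-matching measure."""
    satisfied = sum(1 for attr, allowed in rule.items()
                    if example.get(attr) in allowed)
    return satisfied / len(rule)

rule = {"x5": {1}, "x1": {1, 2}}
print(flexible_match(rule, {"x5": 1, "x1": 3}))  # 0.5: one of two conditions holds
```

This is how a structure built only from third-car attributes can still classify two-car trains: no rule matches strictly, but the rule with the highest partial match decides.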

[Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations. a) Decision structure learned using only descriptions of Car 1 (No. of nodes: 4, No. of leaves: 9); b) decision structure learned using only descriptions of Car 2; c) decision structure learned using only descriptions of Car 3 (No. of leaves: 6).]

4.9 Experiments With Small Size, Simple and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the second half in the other class).
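The default window computation described here can be written out directly; this reflects the formula as stated in the text, not C4.5's source:

```python
import math

def default_window_size(n_examples):
    """C4.5's default initial window, as described in this section:
    the larger of 20% of the examples and twice the square root of
    their number."""
    return max(0.2 * n_examples, 2 * math.sqrt(n_examples))

# For the 216 voting examples: max(43.2, 29.39...) -> about 43 examples.
print(default_window_size(216))
```

Below roughly 100 examples the square-root term dominates, so the window stays comparatively large relative to the data; above that, the 20% term takes over.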


Table 4-8 and Figures 4-30-a and 4-30-b show the results graphically for the Congressional Voting-1984 problem. The results indicate that the decision trees generated by AQDT-2 had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the size of the AQDT-2 trees varied less with changes in the size of the training example set.

[Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.]

[Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2. a) Accuracy of the decision tree as a function of the size of the set of training examples; b) size of the decision tree as a function of the size of the set of training examples.]

4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4-2 to 4-9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from these rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.

Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others, compared with another type of cover), the best cover is determined according to the best width of the beam search and the best rule type.

[Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.]

problems when changing the width of the beam search of the AQl5c system the changes in the

predictive accuracy of decision trees learned by AQD1=-2 were within 2 Disjoint rules were better

than intersected rules for learning decision trees Generally decision trees learned from intersected

rules were slightly bigger than those learned from disjoint rules

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of

heuristics was used to summarize the results. These heuristics are: 1) if the difference between


the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is

considered to be the same; otherwise, one system's predictive accuracy is considered higher or lower; 2) if the

average learning time is within ±0.1 seconds, the learning time is considered the same.
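These two heuristics amount to a tolerance-based comparison. A minimal sketch follows; it is illustrative only, and the `verdict` helper and the metric values in the example are hypothetical, not figures from the reported experiments.

```python
def verdict(aqdt, c45, tolerance, higher_is_better=True):
    """Return which system 'wins' on a metric, treating differences
    inside the tolerance band as a tie ('Same')."""
    if abs(aqdt - c45) <= tolerance:
        return "Same"
    aqdt_wins = aqdt > c45 if higher_is_better else aqdt < c45
    return "AQDT-2" if aqdt_wins else "C4.5"

print(verdict(91.3, 88.7, 2.0))                          # accuracy -> AQDT-2
print(verdict(0.35, 0.30, 0.1, higher_is_better=False))  # time -> Same
```

Accuracy uses the ±2% band with higher values better; learning time uses the ±0.1 s band with lower values better.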

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary

includes comparing the predictive accuracy, the size of the learned decision trees, and the learning

time. The value in each cell refers to the system which performed better (possible values are AQDT-2,

C4.5, and Same). When the two systems produced similar or close results, a letter is associated

with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10 Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better. Same-X means similar performance of both systems: AQDT-2 is better if X=A and C4.5 is better if X=C.

Some conclusions can be drawn from this comparison. When the training data represents a small

portion of the representation space, AQDT-2 produces bigger but more accurate decision trees,

while C4.5 produces smaller but less accurate decision trees. When the training data

represents a very large portion of the representation space, AQDT-2 usually produces smaller

decision trees with better accuracy, except with noisy data. The size of decision trees learned by

C4.5 grows relatively larger as the training data increases. Also, C4.5 works better than AQDT-2

with noisy data. The reasons are that AQDT-2 overgeneralizes the decision rules, while

C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be

much less than that of C4.5; however, on some data sets it takes more time because there are some


situations where there is not enough information to reach a decision, and the program goes into a loop

of testing all attributes. The probabilistic approach for handling this problem has not been implemented yet.

To explain the relationship between the input to and the output from AQDT-2, and to explain some

of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of

diagrammatic visualizations (Wnek and Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2

system. The experiment contains 169 training examples for both the positive and negative decision

classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The

shaded areas represent decision rules of the positive decision class; the white areas represent

non-positive coverage.


Figure 4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2


Figure 4-32 shows the testing results of the decision rules learned by AQ15c. All shaded cells

marked with '•' indicate false positive errors (AQ15c classifies the cell as positive while it should be

negative), and all marked non-shaded cells indicate false negative errors (AQ15c classifies the

cell as negative while it should be positive).


Figure 4-32 A visualization diagram shows the testing errors for the rules in Figure 4-31

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this

diagram, cells with one shading indicate portions of the representation space that were classified as

positive by both AQ15c and AQDT-2. Cells with a second shading are portions of the representation

space that were classified as positive by AQ15c but as negative by AQDT-2. Cells with

a third shading represent portions of the representation space where AQDT-2 overgeneralized decision rules

belonging to the positive decision class. The decision tree shown by this diagram was learned with

default settings (i.e., with a 10% generalization threshold).


This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the

MONK-1 problem, overgeneralizing the concept of the MONK-2 problem reduces the accuracy.

This is shown in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.


Figure 4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem

Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative

errors. Cells with one mark indicate portions of the representation space with false positive errors; cells

with another mark represent portions of the representation space with false negative errors. Comparing

Figures 4-34 and 4-32 shows that more errors occurred because of the overgeneralization.


Figure 4-34 A visualization diagram shows testing errors of the AQDT-2 decision tree

Figure 4-35 A visualization diagram of a decision tree learned by AQDT-2

after reducing the generalization degree to 1%

CHAPTER 5 CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for

efficiently determining a single-parent structure from a set of decision rules. A decision structure is

an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a

given object or situation. Having higher expressive power than the familiar decision tree, a

decision structure is able to represent some decision processes in a much simpler way than a

decision tree.
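The definition above can be made concrete with a small sketch. The code is illustrative only (the `Node` class and the toy attributes x1, x2, x3 are invented, not taken from the thesis); it shows the one property that separates a decision structure from a decision tree: a node may be reached from several parents, so subgraphs can be shared.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    test: Optional[str] = None      # attribute tested at this node (None for a leaf)
    decision: Optional[str] = None  # class label if this is a leaf
    branches: dict = field(default_factory=dict)  # attribute value -> child Node

def classify(node, example):
    """Follow the branch matching the example's value for each tested attribute."""
    while node.decision is None:
        node = node.branches[example[node.test]]
    return node.decision

# The 'approve' leaf has two parents: legal in a decision structure
# (an acyclic graph), impossible in a strict decision tree.
approve = Node(decision="approve")
left = Node(test="x2", branches={"low": approve, "high": Node(decision="reject")})
right = Node(test="x3", branches={"yes": approve, "no": Node(decision="reject")})
root = Node(test="x1", branches={"a": left, "b": right})

print(classify(root, {"x1": "b", "x3": "yes"}))  # -> approve
```

A decision tree would need two copies of the shared leaf; the graph representation stores it once.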

The proposed methodology advocates storing the decision knowledge in the declarative form of

decision rules, which are determined by induction from examples or by an expert. A decision

structure is generated on-line, when it is needed, and in the form most suitable for the given

decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this

methodology: that in order to determine a decision structure from examples it is necessary to go

through two levels of processing, while there exist methods that produce decision trees efficiently

and directly from examples. Putting aside the issue that decision structures are more general than

decision trees, it is argued here that this methodology has many advantages that fully justify it. The

main advantages include: 1) decision structures produced by the method in the experiments

conducted had higher predictive accuracy and were simpler (sometimes significantly so) than

decision trees produced from the same data; 2) decision structures produced from rules can be

easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive

attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in

the declarative form of modular decision rules, the methodology makes it easy to modify decision

knowledge to account for new facts or changing conditions; 4) the process of deriving a decision

structure from a set of rules is very fast and efficient, because the number of rules per class is


usually much smaller than the number of examples per class; and 5) the presented method produces

decision structures whose nodes can be original attributes or constructed attributes that extend the

original knowledge representation (this is due to the application of the constructive induction programs

AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate

decision rules first and then create decision structures from them. In the AQDT-2 method, this first

phase is done by an AQ algorithm-based rule learning method. While past implementations of AQ-based

methods were computationally complex, the most recent implementation is very fast

(Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
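Advantage 2 above, tailoring the structure to a decision-making situation, boils down to biasing attribute selection by measurement cost. The sketch below is a simplified illustration of that idea, not AQDT-2's actual selection criterion; the function name, scores, and costs are all hypothetical.

```python
def rank_attributes(scores, costs, avoid=()):
    """Rank candidate test attributes: drop attributes the user wants to
    avoid entirely, and divide each quality score by its measurement
    cost so expensive attributes sink toward the bottom of the structure."""
    usable = {a: s / costs.get(a, 1.0)
              for a, s in scores.items() if a not in avoid}
    return sorted(usable, key=usable.get, reverse=True)

scores = {"x1": 0.9, "x2": 0.7, "x3": 0.6}   # hypothetical quality scores
costs = {"x1": 10.0, "x2": 1.0, "x3": 1.0}   # x1 is expensive to measure
print(rank_attributes(scores, costs))  # -> ['x2', 'x3', 'x1']
```

Even though x1 has the best raw score, its high measurement cost pushes it below the cheap attributes, so it would be tested only deep in the structure, if at all.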

The current method has a number of limitations, and several issues need to be investigated further.

First of all, there is a need for further testing of the method. Although the experiments conducted so

far have produced more accurate and simpler decision structures than decision trees obtained in a

standard way from the same input data, more experiments are necessary to arrive at conclusive

results. A mathematical analysis of the method has not been performed and is highly desirable.

The current method generates only single-parent decision structures (every node has only one

parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in

which a node can have several parents) will make it more powerful. It will enable the method to

represent much more simply decision processes that are difficult to represent by a decision tree

(e.g., a symmetric logical function). The decision structures produced by the method are usually

more general than the decision rules from which they were created (they may assign decisions to

cases that the rules could not classify). Further research is needed to determine the relationship

between the certainty of decision rules and the certainty of decision structures derived from them.

The AQ-based program allows a user to generate both characteristic and discriminant decision rules

(Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating

decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for

efficiently determining a single-parent structure from a set of decision rules. A major advantage of


the proposed method is that it allows one to efficiently determine a decision structure that is

optimized for any given decision-making situation. For example, when some attribute is difficult

to measure, the method creates a decision structure that shows the situations in which measuring

this attribute can be avoided. The method is quite efficient, and the time for determining a decision

structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to

experiment with different criteria for structure generation in order to obtain the most desirable

structure.

Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be

simpler and have higher predictive accuracy than those obtained in a conventional way, i.e.,

directly from examples. In the experiments involving artificial problems and real-world problems,

AQDT-2-generated decision structures outperformed those generated by the well-known C4.5

decision tree learning program on most problems, both in terms of average predictive accuracy and

average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the

method is independent of them, it could potentially be applied also with other decision rule learning

systems or with decision rules acquired from an expert.

REFERENCES


Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992) Constructive Induction in Structural Design, Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990) Integrated Learning in a Real Domain, Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992) Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System, Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993) AQ17: A Multistrategy Learning System: The Method and User's Guide, Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994) Trading Accuracy for Simplicity in Decision Trees, Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (1987) (Eds.) Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986) Learning Diagnostic Rules from Incomplete and Noisy Data, in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.

Clark, P. and Niblett, T. (1987) Induction in Noisy Domains, in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991) On Estimating Probabilities in Tree Pruning, Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991) The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction, Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994) Exception DAGs as Knowledge Structures, Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984) Experience in the use of an inductive system in knowledge engineering, Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.

Hunt, E., Marin, J. and Stone, P. (1966) Experiments in Induction, New York: Academic Press.

Imam, I.F. and Michalski, R.S. (1993a) Should Decision Trees be Learned from Examples or from Decision Rules?, Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, from the proceedings of the 7th International Symposium on Methodologies for Intelligent Systems, ISMIS-93, Trondheim, Norway, June 15-18, Springer-Verlag.


Imam, I.F. and Michalski, R.S. (1993b) Learning Decision Trees from Decision Rules: A method and initial results from a comparative study, Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. & Zemankova, M. (Eds.), Kluwer Academic Pub., MA.

Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993) Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques, Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington DC, July 11-12.

Imam, I.F. and Vafaie, H. (1994) An Empirical Comparison Between Global and Greedy-like Search for Feature Selection, Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.

Imam, I.F. and Michalski, R.S. (1994) From Facts to Rules to Decisions: An Overview of the FRD-1 System, Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.

Kohavi, R. (1994) Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations, Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi, R. & Li, C. (1995) Oblivious Decision Trees, Graphs, and Top-Down Pruning, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian, O.L. and Wolberg, W.H. (1990) Cancer Diagnosis via Linear Programming, SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994) International East-West Challenge, Oxford University, UK.

Michalski, R.S. (1973) AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition, Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington DC, October 30-November 1.

Michalski, R.S. (1978) Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams, Technical Report No. 898, Urbana: University of Illinois, March.

Michalski, R.S. (1983) A Theory and Methodology of Inductive Learning, Artificial Intelligence, Vol. 20 (pp. 111-116).

Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986) The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains, Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.

Michalski, R.S. (1990) Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.


Michalski, R.S. and Imam, I.F. (1994) Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System, Lecture Notes in Artificial Intelligence, Springer-Verlag, from the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers, J. (1989a) An Empirical Comparison of Selection Measures for Decision-Tree Induction, Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.

Mingers, J. (1989b) An Empirical Comparison of Pruning Methods for Decision-Tree Induction, Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.

Niblett, T. and Bratko, I. (1986) Learning Decision Rules in Noisy Domains, Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.

Quinlan, J.R. (1979) Discovering Rules By Induction from Large Collections of Examples, in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan, J.R. (1983) Learning Efficient Classification Procedures and Their Application to Chess End Games, in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.

Quinlan, J.R. (1986) Induction of Decision Trees, Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.

Quinlan, J.R. (1987) Simplifying Decision Trees, International Journal of Man-Machine Studies, 27 (pp. 221-234).

Quinlan, J.R. (1990) Probabilistic Decision Trees, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Smyth, P., Goodman, R.M. and Higgins, C. (1990) A Hybrid Rule-based/Bayesian Classifier, Proceedings of ECAI 90, Stockholm, August.

Sokal, R. and Rohlf, F. (1981) Biometry, Freeman Pub., San Francisco.

Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.) The MONK's Problems: A Performance Comparison of Different Learning Algorithms, Technical Report, Carnegie Mellon University, October.

Wnek, J. and Michalski, R.S. (1994) Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments, Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in

Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He

received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition.

Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium FLAIRS-95 and the program committee of the Florida Artificial Intelligence Research Symposium FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive

systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.


my experiments; Srinivas Gutta for providing some applications for my Ph.D. work; Mike Heib for

reviewing an earlier draft of my dissertation and helping me find relevant articles; Ken Kaufman

for reviewing an earlier draft of my thesis; Mark Maloof for providing me with script files which

made it easier to iteratively run AQ15c; Halah Vafaie for working with me on the application and

comparison of different aspects of my work; and Janusz Wnek for using his DIAV program for

explaining my results.

I would like to thank Professor Andrew P. Sage, Dean of the School of Information Technology

and Engineering, and Professor Kenneth Bumgarner, Dean of Student Services and Associate Vice

President of George Mason University, for their support, and Professor Murray W. Black,

Associate Dean of the School of Information Technology and Engineering, for guidance on

preparing the Ph.D. proposal.

I would like to thank the conference organizers who supported me to attend their conferences and

present parts of my Ph.D. work. The organizers include Professor Moonis Ali, Professor Frank

Anger, Professor Zbigniew Ras, the American Association for Artificial Intelligence, and others. I

would also like to thank the organizing committee of the Florida Artificial Intelligence Research

Symposium FLAIRS-95 for organizing my first workshop. Thanks go to Dr. Doug Dankel, Dr.

Howard Hamilton, Dr. John Stewman, and Dr. Dan Tamir.

I would also like to thank the many individuals who helped me in any way during my Ph.D. Those

include Dr. Ashraf Abdel-Wahab, Dr. Jerzy Bala, Dr. Alex Brodsky, Dr. Richard Carver, Dr.

Kenneth De Jong, John Doulamis, Tom Dybala, Dean Ibrahim Farag, Dr. Ophir Frieder, Csilla

Frakes, Dr. Mohamed Habib, Ali Hadjarian, Dr. Hugo De Garis, Zenon Kulpa, Dr. Paul Lehner,

Dr. George Michaels, Dr. Eugene Norris, Dean James Palmer, Mitch Potter, Dr. Ahmed Rafea,

Jim Ribeiro, Jayshree Sarma, Dr. Arun Sood, Dr. Clifton Sutton, Bradley Utz, Dr. Tibor Vamos,

Patricia Zahra, Dr. Shaker Zahra, and Dr. Jianping Zhang.


Dedication

To my mother, my brothers, and my sister

TABLE OF CONTENTS

TITLE Page
ABSTRACT 1
CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6
CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19
CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
3.3 Generating Decision Structures From Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structures to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53
CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments With Large Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large Size, Complex, and Noisy Problems: Mushroom classifications 84
4.8 Experiments With Small Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88
CHAPTER 5 CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96
REFERENCES 98
VITA 102

LIST OF TABLES

No TITLE Page

2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria from four problems 19
2-9 Comparing the AQDT approach with EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of the AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using conditions of the AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90

LIST OF FIGURES

No TITLE Page

2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 The Imam's example: where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1 with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram shows the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram shows testing errors of AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1% 94

DERIVING TASK-ORIENTED DECISION STRUCTURES

FROM DECISION RULES

Ibrahim M. Fahmi Imam, Ph.D.

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, to generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was developed by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge base from the function of using the knowledge base for decision-making. The first function focuses on learning an accurate, consistent, and complete concept description expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. Extensive experiments with AQDT-2 show that the decision structures it learns usually outperform, in terms of accuracy and average size of the decision structures, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using artificial problems such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (determining the best wind bracing design for tall buildings), medical diagnosis (learning decision rules for recognizing breast cancer), agricultural diagnosis (learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (characterizing Democratic and Republican voting records).

CHAPTER 1 INTRODUCTION

1.1 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.

A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation); the branches are assigned possible test outcomes or ranges of outcomes; and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when leaves are assigned single, definite decisions. Thus, the problem of generating a decision structure is a generalization of the problem of generating a decision tree.

Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).


A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed that is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root), and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful either to modify the structure so that it does not contain that attribute or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).

A restructuring of a decision structure (or a tree) in order to suit new requirements can be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation such as a set of decision rules: tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees) that differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify to adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.

An attractive solution to these opposite requirements is to acquire and store knowledge in a

declarative form and transform it to a decision structure when it is needed for decision-making


This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generating it from training examples. Thus, this process could be done on line, without any delay noticeable to the user. Such "virtual" decision structures are easy to tailor to any given decision-making situation.

This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus, such an approach has many potential advantages.

This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15 (Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating "unknown" nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to inability to evaluate an attribute associated with an intermediate node.


To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments has been designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structures learned from them, and comparing decision trees learned by the AQDT-2 system with those learned by C4.5 (Quinlan, 1993), the well-known system for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2 and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.

1.2 The Problem Statement

There are many limitations and problems that accompany using decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. The method proposed several attribute selection criteria of increasing power, the nth-order cost estimates (n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms; these terms will be used later in the dissertation.

Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint; in other words, for any two rules there exists a condition with the same attribute but with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover that has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram all the condition parts of the given rules, and marking them with the action specified by each rule.

Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree must have at least n leaves). Michalski (1978) has shown that if only one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1 v 3] & [x3=2 v 3] and the rule [x4=3] & [x1=1 v 2] & [x3=1]; x3 breaks three rules: [x4=2] & [x1=2] & [x3=2 v 3], [x4=1] & [x1=3] & [x3=1 v 3], and [x4=3] & [x1=4] & [x3=2 v 3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1 An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).


The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute selected is the one that breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).

Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.

Example: Learn a decision tree from the following decision table.

The minimal cover consists of the following rules:

A1 <- [x2=0] v [x1=0][x2=2];  A2 <- [x2=1] v [x1=2][x2=2];  A3 <- [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of


the decision tree is x2. Then three branches are attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case, x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.


Figure 2-2 A decision tree learned from the decision table in Table 2-1
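Under this reading of the criterion (an attribute breaks every rule that does not restrict it to a single value), the first-degree cost estimate for the example above can be sketched in a few lines. The dictionary encoding of the rules and the attribute domains are assumptions made for illustration, not part of the original algorithm:

```python
# Sketch of the first-degree (static) cost estimate, MAL.
# A rule is {attribute: set_of_allowed_values}; an absent attribute is
# unrestricted. An attribute breaks a rule when the rule does not fix it
# to a single value, so splitting on it would divide the rule.

def mal(rules, domains):
    """Return {attribute: number of rules broken by that attribute}."""
    cost = {}
    for attr, values in domains.items():
        broken = 0
        for conds in rules:
            allowed = conds.get(attr, values)  # unrestricted -> whole domain
            if len(allowed) > 1:               # not a single value of attr
                broken += 1
        cost[attr] = broken
    return cost

# The minimal cover from the example above (rules for A1, A2, A3):
rules = [
    {"x2": {0}},                # A1 <- [x2=0]
    {"x1": {0}, "x2": {2}},     # A1 <- [x1=0][x2=2]
    {"x2": {1}},                # A2 <- [x2=1]
    {"x1": {2}, "x2": {2}},     # A2 <- [x1=2][x2=2]
    {"x1": {1}, "x2": {2}},     # A3 <- [x1=1][x2=2]
]
domains = {"x1": {0, 1, 2}, "x2": {0, 1, 2}, "x3": {0, 1}, "x4": {0, 1}}

print(mal(rules, domains))  # x2 breaks no rules, so it becomes the root
```

Running the sketch reproduces the evaluations quoted above: 2 for x1, 0 for x2, and 5 each for x3 and x4, which never appear in the rules.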

2.2 Learning Decision Trees from Examples

Decision tree learning is a field concerned with generating a decision tree that classifies a set of examples according to the decision classes they belong to. The essential aspect of any inductive decision tree method is the attribute selection criterion. The attribute selection criterion measures how good the attributes are for discriminating among the given set of decision classes. The best attribute according to the selection criterion is chosen to be assigned to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin, and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has been subsequently modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.

Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria for selecting attributes use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory. These criteria measure the information conveyed


by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure and the gain criteria (Quinlan, 1979, 1983), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions for determining whether or not there is a correlation. The attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial, complete decision tree, followed by tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.

2.2.1 Building Decision Trees Using Information-based Criteria

This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs for learning decision trees from examples.

Learning decision trees from examples requires a collection of examples, each represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing a decision tree from a set of cases (Hunt, Marin & Stone, 1966).

The C4.5 system uses an attribute selection criterion called the gain ratio. This criterion calculates the gain in classification information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the gain criterion, which uses the frequency of each decision class in the given set of training examples.

Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and classifies the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.

The Gain Criterion: The gain criterion is based on information theory: the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem x1, ..., xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:

freq(Ci, S) = number of examples in S belonging to Ci (2-1)

Suppose that |S| is the total number of examples in S; then the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.

The information conveyed by the message that a selected example belongs to a given decision class Ci is therefore -log2(freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by:

info(S) = - Σi=1..k (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|) bits (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples T, info(T) determines the average amount of information needed to identify the class of an example in T.
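As a quick sketch, equation (2-2) can be computed directly from the per-class example counts (plain Python; the counts used below are those of Quinlan's weather data, which appears later in this section):

```python
import math

def info(class_counts):
    """Eq. (2-2): info(S) from the number of examples of each class in S."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

# 14 training examples: 9 of class "Play" and 5 of class "Don't Play".
print(round(info([9, 5]), 2))  # 0.94 bits
```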

Suppose that we selected an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets T1, ..., Tk, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, infoX(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:

infoX(T) = Σi=1..k (|Ti| / |T|) info(Ti) (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by:

gain(X) = info(T) - infoX(T) (2-4)

The attribute selected is the attribute with the maximum gain value.

The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains an attribute such as a social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.

Following steps similar to those used to obtain the information conveyed by dividing T into n subsets, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by:

split info(T) = - Σi=1..n (|Ti| / |T|) log2(|Ti| / |T|) (2-5)

The gain ratio is given by:

gain ratio(X) = gain(X) / split info(X) (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.

Example: Consider the following example presented by Quinlan (1993); Table 2-2 shows the set of training examples.

First, determine the amount of information gained by selecting the attribute "outlook" to be the root of the decision tree. This attribute divides the training examples into three subsets: "sunny", with five examples, two of which belong to the class "Play"; "overcast", with four examples, all of which belong to the class "Play"; and "rain", with five examples, three of which belong to the class "Play". To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class "Play" and five belong to the class "Don't Play".

info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.94 bits

When using "outlook" to divide the training examples, the information becomes:

info_outlook(T) = 5/14 [-2/5 log2 (2/5) - 3/5 log2 (3/5)]
                + 4/14 [-4/4 log2 (4/4) - 0/4 log2 (0/4)]
                + 5/14 [-3/5 log2 (3/5) - 2/5 log2 (2/5)] = 0.694 bits

By substituting in equation 2-4, the gain of information that results from using the attribute "outlook" to split the training examples equals 0.246. The information gain for "windy" is 0.048.

Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for "outlook" is determined as follows:

split info(T) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits

The gain ratio for "outlook" = 0.246 / 1.577 = 0.156
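The whole worked example can be reproduced with a short script. This is only a sketch of equations (2-2) through (2-6); the subset counts are taken from the description of Table 2-2 above:

```python
import math

def entropy(counts):
    """Eq. (2-2): class entropy of a set given its per-class example counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_and_ratio(class_counts, partition):
    """Eqs. (2-3)-(2-6). partition lists per-outcome class counts."""
    n = sum(class_counts)
    info_t = entropy(class_counts)                            # info(T)
    info_x = sum(sum(s) / n * entropy(s) for s in partition)  # infoX(T)
    split = entropy([sum(s) for s in partition])              # split info(T)
    g = info_t - info_x
    return g, g / split

# "outlook" splits the 14 examples (9 Play / 5 Don't Play) into
# sunny 2/3, overcast 4/0, rain 3/2, as in the example above.
gain, ratio = gain_and_ratio([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(gain, 3), round(ratio, 3))
# 0.247 0.156  (the text's 0.246 comes from rounding 0.940 - 0.694)
```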


Figure 2-3 A decision tree learned using the gain criterion for selecting attributes

The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals; in other words, for each continuous attribute, C4.5 generates two branches, one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.

Tree pruning in C4.5 is a process of replacing subtrees that have small classification validity by leaves. The C4.5 system uses the Laplace ratio for estimating the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
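For instance (a minimal sketch; the counts below are invented for illustration, not taken from the dissertation):

```python
def laplace_error(n, e):
    """The Laplace error ratio for a leaf: (e + 1) / (n + 2)."""
    return (e + 1) / (n + 2)

# A leaf reached by 6 training examples, 1 of them misclassified:
print(laplace_error(6, 1))  # 0.25, higher than the raw rate 1/6;
# the estimate penalizes leaves supported by few examples
```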

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses the Chi-square statistic to measure the association between two attributes. When building decision trees, the method is implemented so that it determines the association between each attribute and the decision classes. The attribute selected is the one with the greatest value.

To determine the Chi-square value for an attribute, let aij be the number of examples in class number i for which the attribute A takes value number j; in other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by:

Chi-square(A) = Σi=1..n Σj=1..m [ (aij - Eij)^2 / Eij ] (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,


Eij = (TCi × TVj) / T (2-8)

where TCi and TVj are the total number of examples belonging to decision class Ci and the total number of examples for which attribute A takes value vj, respectively, and T is the total number of examples.

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequency of different combinations of values between the decision class and both the "outlook" and the "windy" attributes. Table 2-4 shows the expected values (computed from TCi and TVj) of the frequencies in Table 2-3 for different attribute values and decision classes.

To determine the association between the decision classes and both the attribute "windy" and the attribute "outlook", the observed Chi-square values are:

Chi-square(Windy, Class) = [(3-3.9)^2/3.9] + [(3-2.1)^2/2.1] + [(6-5.1)^2/5.1] + [(2-2.9)^2/2.9]
= 0.21 + 0.39 + 0.25 + 0.25 = 1.1

Chi-square(Outlook, Class) = [(2-3.2)^2/3.2] + [(4-2.6)^2/2.6] + [(3-3.2)^2/3.2] + [(3-1.8)^2/1.8] + [(0-1.4)^2/1.4] + [(2-1.8)^2/1.8]
= 0.45 + 0.75 + 0.2 + 0.8 + 1.4 + 0.02 = 3.62


Applying the same method to the other attributes, the results favor the attribute "outlook". Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset.
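The computation in equations (2-7) and (2-8) can be sketched directly from a contingency table. The counts below are the frequencies of Table 2-3 as described above; note that the exact scores differ slightly from the figures quoted in the text, which round the expected frequencies to one decimal:

```python
# Sketch of the Chi-square attribute score, eqs. (2-7) and (2-8).
# a[i][j] holds the number of examples of decision class i that take
# attribute value j.

def chi_square(a):
    class_tot = [sum(row) for row in a]          # TCi
    value_tot = [sum(col) for col in zip(*a)]    # TVj
    total = sum(class_tot)                       # T
    score = 0.0
    for i, row in enumerate(a):
        for j, observed in enumerate(row):
            expected = class_tot[i] * value_tot[j] / total  # Eij, eq. (2-8)
            score += (observed - expected) ** 2 / expected  # eq. (2-7)
    return score

# Rows: Play, Don't Play.  Columns: outlook = sunny, overcast, rain.
outlook = [[2, 4, 3], [3, 0, 2]]
# Columns: windy = true, false.
windy = [[3, 6], [3, 2]]

print(round(chi_square(outlook), 2), round(chi_square(windy), 2))
# 3.55 0.93  (close to the 3.62 and 1.1 above; either way "outlook" wins)
```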

Table 2-5 shows a summary of these criteria and their basic evaluation function

Table 2-5 Attribute selection criteria and their basic evaluation measures

Info Measure (IM), Gain, and Gain Ratio:  Entropy(S) = - Σi (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)
G-statistic:  G = 2N × IM  (N = number of examples)
Chi-square:  Chi-square(A, B) = Σi=1..n Σj=1..m [ (aij - Eij)^2 / Eij ]

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria performed by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, the G statistic, the gini index of diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples that may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, a value of zero adds the maximum association between any two attributes, because the Chi-square term of a zero cell is the expected value of that cell.


Now let us demonstrate results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees for eleven different criteria. In the final results, he compared the total number of nodes and the total error rate produced by each criterion over all the given problems. Table 2-8 shows the final results for five selected criteria only.


Table 2-8: Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four problems

This experiment was performed on four real-world data sets. These data are concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and recognizing LCD display digits. The data was divided randomly, 70% for training and 30% for testing. For more details, see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas by Imam and Michalski (1993b).

In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary conclusion with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the decision structure, nodes containing only rules represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees that are used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: It is Safe, except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is Lost, except if x6=1, it is Safe, except if x7=1, it is Lost.

The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path. In other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once. However, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.


Safe <- [x1=2]
Safe <- [x2=2]
Safe <- [x3=2]
Safe <- [x4=1] & [x5=2]
Safe <- [x4=1] & [x5=3]
Safe <- [x6=1] & [x7=2]
Safe <- [x6=1] & [x7=3]
Safe <- [x4=2] & [x5=2]
Safe <- [x4=2] & [x5=3]

Lost <- [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a combination of that attribute's values. For each subset the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1) and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains the examples where A takes value 0 and belong to class C0, or where A takes value 1 and belong to class C1. The second subset is the set of examples where A takes value 0 and belong to class C1, or where A takes value 1 and belong to class C0. The number of nodes of the first level (above the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced exponentially to one.

It is easy for the reader to figure out some major disadvantages of such an approach. The average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., strong patterns) or logical relationship in the data. The time needed to learn such a decision structure is very high compared to systems for learning decision trees from examples. Finally, it could be better to search for an attribute that reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and those two approaches. The EDAG and HOODG systems are unreleased prototype systems.

Table 2-9 (readability row): AQDT decision structures are easy to understand; EDAG structures are difficult to read; HOODG structures are easy to understand.

CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.

The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description, in a declarative form of knowledge (decision rules), that satisfies the learning goal.

The Decision-making Task
Given:
- A set of decision rules in conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of



declarative knowledge are that they do not impose any order on the evaluation of the attributes and that, due to the lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on-line.

Such virtual decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows an architecture of the proposed methodology.

Figure 3-1: Architecture of the AQDT approach (learning knowledge from a database, and the decision-making process)


It is assumed that the database is not static, but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the discovered knowledge. Each decision-making situation is defined by a set of attribute-values. Some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system, specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called AQ, starts with a seed example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the star of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description that covers the largest number of positive examples (to minimize the total number of rules needed) and, with second priority, that involves the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few examples or with many examples, and can optimize the description according to a variety of easily-modifiable hypothesis quality criteria.
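The covering loop just described can be sketched roughly as follows (a heavily simplified stand-in, not the AQ15 implementation: real star generation is far more elaborate, and the single-condition "star" used here is only illustrative; all data is made up):

```python
# A highly simplified sketch of the AQ covering loop (not the real STAR
# generation); rules map attributes to sets of allowed values.
def covers(rule, example):
    return all(example[a] in vals for a, vals in rule.items())

def star(seed, negatives, attributes):
    """Maximally general single-condition rules that cover `seed` and
    exclude every negative example (a crude stand-in for a star)."""
    candidates = []
    for a in attributes:
        rule = {a: {seed[a]}}
        if not any(covers(rule, n) for n in negatives):
            candidates.append(rule)
    return candidates

def aq_cover(positives, negatives, attributes):
    rules, uncovered = [], list(positives)
    while uncovered:
        seed = uncovered[0]
        cands = star(seed, negatives, attributes)
        if not cands:                     # fall back to the fully specific rule
            cands = [{a: {seed[a]} for a in attributes}]
        # default criterion: prefer the rule covering most positive examples
        best = max(cands, key=lambda r: sum(covers(r, p) for p in uncovered))
        rules.append(best)
        uncovered = [p for p in uncovered if not covers(best, p)]
    return rules

pos = [{"x1": 1, "x2": 0}, {"x1": 1, "x2": 1}]
neg = [{"x1": 0, "x2": 1}]
print(aq_cover(pos, neg, ["x1", "x2"]))   # [{'x1': {1}}]
```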


The learned descriptions are represented in the form of a set of decision rules expressed in an attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski, 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, i.e., stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given

concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables have a large flat top. A characteristic description of the tables would also include properties such as have four legs, have no back, have four corners, etc. Discriminant descriptions are usually much shorter than

characteristic descriptions

Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different classes are logically disjoint. The DC mode descriptions are usually more complex, both in the

number of rules and the number of conditions. There is also a DL mode (a Decision List mode, also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order. If ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes and selects from them the most promising ones, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the US Congress.

Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute State (of the Representative) should take the value northeast or northwest to satisfy the condition.

R1: [Gas_con_ban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] & [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept Voting pattern of Democratic Representatives

The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record by a Democratic representative:

Draft registration=no, Ban of aid to Nicaragua=no, Cut expenditure on MX missiles=yes, Federal subsidy to nuclear power stations=yes, Subsidy to national parks in Alaska=yes, Fair housing bill=yes, Limit on PAC contributions=yes, Limit on food stamp program=no, Federal help to education=no, State From=northeast, State Population=large, Occupation=unknown, Cut in social security spending=no, Federal help to Chrysler corp.=not registered

By expressing the elementary statements in the example as conditions and linking the conditions by conjunction, the example can be re-expressed as a decision rule. Thus, decision rules and examples formally differ only in the degree of generality.
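This equivalence can be made concrete with a small sketch (not from the dissertation; the rule encoding and the attribute names, loosely following Figure 3-2, are illustrative):

```python
# Sketch (not from the dissertation) of evaluating a VL1-style condition
# with internal disjunction; an example is just a maximally specific rule.
def satisfies(example, rule):
    """A rule is a dict: attribute -> set of admissible values."""
    return all(example.get(a) in vals for a, vals in rule.items())

# An R2-like rule: [Draft = yes v not_registered] & [State = northeast v northwest]
r2 = {"Draft": {"yes", "not_registered"},
      "State": {"northeast", "northwest"}}

# An example re-expressed as a (fully specific) conjunction of conditions.
record = {"Draft": "no", "State": "northeast", "Occupation": "unknown"}
print(satisfies(record, r2))   # False: the Draft condition fails
```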

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski, 1993a, b). Also included is a description of the AQDT-2 method for learning task-oriented decision structures from decision rules; finally, the methodology is illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning due to their simplicity. Decision trees built this way can be quite efficient, as long as they are used in decision-making situations for which they are optimized and these situations remain relatively stable. Problems arise when these situations change significantly and the assumptions under which the tree was built no longer hold. For example, in some situations it may be difficult to determine the value of the attribute assigned to some node. One would like to avoid measuring this attribute and still be able to classify the example, if this is potentially possible (Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also desirable if there is a significant change in the frequency of occurrence of examples from different classes. A restructuring of a decision tree to suit the above requirements is, however, difficult to do. The reason for this is that decision trees are a form of decision structure representation that imposes constraints on the evaluation order of the attributes which are not logically necessary.


One problem in developing a method for generating decision structures from decision rules is to design an attribute selection criterion that is based on the properties of the rules rather than of the training examples. A decision rule normally describes a number of possible examples, only some of which are examples that have actually been observed, i.e., training examples. An attribute selection criterion is needed that analyzes the role of each attribute in the rules. It cannot be based on counting the numbers of training examples covered by each attribute-value, or on the frequency of decision classes in the training examples, as is done in learning decision trees from examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision rules constitute a more powerful knowledge representation than decision trees. They can directly represent a description in an arbitrary disjunctive normal form, while decision trees can directly represent only descriptions in the disjoint disjunctive normal form. In such descriptions, all conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary decision rules into a decision tree, one faces an additional problem of handling logically intersecting rules.

The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2 system is based on the earlier work by Michalski (1978), which introduced a general method for generating decision trees from decision rules. The method aimed at producing decision trees with the minimum number of nodes or the minimum cost (where the cost was defined as the total cost of classifying unknown examples, given the cost of measuring individual attributes and the expected probability distribution of examples of different decision classes). More explanations are provided in the following section.

3.3.1 The AQDT-2 Attribute Selection Method

This section describes the AQDT-2 method for building a decision structure from decision rules. The method for building a single-parent decision structure is similar to that used in standard


methods of building a decision tree from examples. The major difference is that it assigns tests (attributes) to the nodes using criteria based on the properties of the decision rules (this includes statistics about the examples covered by each rule, in the case of learning rules from examples), rather than statistics characterizing the frequency of training examples per decision class, per attribute-value, or per conjunctions of both. Other differences are that the branches may be assigned an internal disjunction of values (not only a single value, as in a typical decision tree), and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be attributes, or names standing for logical or mathematical expressions that involve several attributes or variables. In the following, we use the terms test and attribute interchangeably (to distinguish between an attribute and a name standing for an expression, the latter is called a constructed attribute).

At each step, the method chooses from the available set of tests the test that has the highest utility (see below) for the given set of decision rules. This test is assigned to the node. The branches stemming from this node are assigned test values or disjoint groups of values (in the form of a logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each branch is associated with a reduced set of rules, determined by removing the conditions in which the selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset indicate the same decision class, a leaf node is created and assigned this decision class. The process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of candidate decisions with associated probabilities (see Sec. 4.2).
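The loop just described can be sketched as follows (a minimal sketch, not the AQDT-2 implementation: the dict-based rule encoding is an assumption, and the utility function passed in is a crude stand-in for the combined criteria the system actually uses):

```python
# Minimal sketch (not AQDT-2 itself) of building a decision structure from
# rules: pick the best test, branch on its values, reduce the rulesets.
def build(rules, attributes, utility):
    """rules: list of (condition_dict, decision_class); a condition_dict
    maps attribute -> set of admissible values (internal disjunction)."""
    classes = {cls for _, cls in rules}
    if len(classes) <= 1 or not attributes:   # leaf: one decision class left
        return min(classes) if classes else None
    best = max(attributes, key=lambda a: utility(a, rules))
    domain = set().union(*(r.get(best, set()) for r, _ in rules)) or {None}
    node = {"test": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for v in domain:
        # keep rules consistent with branch value v; drop the used condition
        reduced = [({a: s for a, s in r.items() if a != best}, cls)
                   for r, cls in rules if best not in r or v in r[best]]
        node["branches"][v] = build(reduced, remaining, utility)
    return node

rules = [({"x1": {1}}, "Safe"), ({"x1": {2}, "x2": {1}}, "Lost")]
# Stand-in utility: count how many rules mention the attribute.
tree = build(rules, ["x1", "x2"],
             utility=lambda a, rs: sum(a in r for r, _ in rs))
print(tree)   # a node testing x1 with 'Safe' and 'Lost' leaves
```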

The test (attribute) utility is a combination of one or more of the following elementary criteria: 1) cost, which indicates the cost of using the attribute for making a decision; 2) disjointness, which captures the effectiveness of the test in discriminating among decision rules for different decision classes; 3) importance, which determines the importance of a test in the rules; 4) value distribution, which characterizes the distribution of the test importance over its values; and 5) dominance, which measures the test's presence in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, i.e., the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and decision rule sets for these classes have been determined. Given a test A, let V1, V2, ..., Vm denote the sets of values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm, respectively. If the ruleset for some class, say Ct, contains a rule that does not involve test A, then Vt is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by:

D(A, Ci, Cj) =
  0, if Vi ⊆ Vj
  1, if Vi ⊃ Vj                                                (3-1)
  2, if Vi ∩ Vj ≠ φ, and Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj
  3, if Vi ∩ Vj = φ

where φ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) may seem to give an improved criterion; however, it would not clearly distinguish between the two cases (i.e., both situations would receive a similar disjointness). The current equation is better because it gives higher scores to attributes that classify different subsets of the two decision classes than to attributes that classify only a subset of one decision class.

Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the sum of the degrees of class disjointness over all decision classes:

Disjointness(A) = Σ (i=1..m) D(A, Ci),  where  D(A, Ci) = Σ (j=1..m, j≠i) D(A, Ci, Cj)    (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the test selected is the one with the smaller number of values.
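Equations (3-1) and (3-2) can be sketched directly (illustrative code, not the dissertation's; rulesets are summarized here by the value sets Vi, taken as the domain when some rule omits the test):

```python
# Sketch of the disjointness criterion (equations 3-1 and 3-2).
def degree(vi, vj):
    """D(A, Ci, Cj) as in equation (3-1)."""
    if vi <= vj:                 # Vi a subset of (or equal to) Vj
        return 0
    if vi > vj:                  # Vi a proper superset of Vj
        return 1
    if vi & vj:                  # intersecting, neither contains the other
        return 2
    return 3                     # disjoint value sets

def disjointness(value_sets):
    """Disjointness(A): sum of D(A, Ci, Cj) over all ordered pairs i != j."""
    return sum(degree(vi, vj)
               for i, vi in enumerate(value_sets)
               for j, vj in enumerate(value_sets) if i != j)

# Two classes whose rules use disjoint value sets of A: the maximum score.
print(disjointness([{1, 2}, {3}]))   # 3 + 3 = 6 = 3m(m-1) for m = 2
# Vi a proper subset of Vj: the weakest non-equal case.
print(disjointness([{1}, {1, 2}]))   # 0 + 1 = 1
```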

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined, from the root of the tree to any leaf node, in order to reach a decision.

Definition: A decision structure is a one-node-per-level decision structure if at each level there is only one node and zero or more leaves.

Such a decision structure can be generated by combining into one branch all branches whose associated sets of decision rules belong to more than one decision class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci and Cj. There are three cases: 1) subset (equivalently, superset); 2) non-empty intersection, but not a subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the same set of values in both classes is a trivial one). Assume that branches leading to one subset with the same decision class are combined into one branch. In the first case there will be only two branches: the first leads to a leaf node, and the other leads to an intermediate node where another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three branches should be created. Two branches lead to leaf nodes, where all the values at each branch belong to only one (and a different) decision class. The third branch leads to an intermediate node where another attribute should be selected to further classify the decision classes. The minimum ANT in this case is 6/4. In the third case, only two branches will be generated, each leading to a leaf node with a different decision class. In this case the minimum ANT is 1.

Figure 3-4 shows decision trees equivalent to selecting the corresponding attribute. Note that in the case of having more than one attribute-value on branches leading to leaves belonging to one decision class, these branches are combined into one branch in the decision structure. The symbol '1' means that an attribute is needed to classify the two decision classes. In such cases there will be at least two additional paths.

D(A, Ci) = 0, D(A, Cj) = 1      D(A, Ci) = 2, D(A, Cj) = 2      D(A, Ci) = 3, D(A, Cj) = 3

Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes

The average number of tests required for making a decision in each possible case is determined in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks highly the attributes that reduce the average number of tests required for decision-making. The theorem can be proved similarly in the general case.

ANT = 5/3      ANT = 6/4      ANT = 1

('1' means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes, A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, this means that there are more decision classes where D(A, Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj) than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute are 0, 1, 4, or 6. For all positive values of D(B) = 1, 4, or 6, it is clear that attribute B should have a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A. Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is a better classifier than A.

Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples that are covered by the rules involving this test. Decision rules learned by an AQ learning program are accompanied by information on their strength. Rule strength is characterized by a t-weight and a u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the t-weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated with class Ci denoted by ri, the importance score is defined as follows:


Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

IS(Aj) = Σ (i=1..m) IS(Aj, Ci)    (3-3.1)

where

IS(Aj, Ci) = Σ (k=1..ri) Rik(Aj)    (3-3.2)

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by:

Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0 otherwise    (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
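A sketch of the importance score computation (illustrative, not the dissertation's code; it assumes rules are stored together with their t-weights, and the attribute names are made up):

```python
# Sketch of the importance score (equations 3-3.1 to 3-4): IS(Aj) sums the
# t-weights of all rules whose condition part mentions Aj.
def importance_scores(rulesets):
    """`rulesets`: class -> list of (condition_dict, t_weight) pairs."""
    scores = {}
    for rules in rulesets.values():
        for condition, t_weight in rules:
            for attr in condition:            # each test in the condition part
                scores[attr] = scores.get(attr, 0) + t_weight
    return scores

# Made-up rulesets with t-weights (examples covered per rule).
rulesets = {
    "P": [({"outlook": {"overcast"}}, 4),
          ({"outlook": {"rain"}, "windy": {"f"}}, 3)],
    "N": [({"outlook": {"sunny"}, "humidity": {"high"}}, 3)],
}
print(importance_scores(rulesets))
# {'outlook': 10, 'windy': 3, 'humidity': 3}
```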

The importance score method has been separately compared, as a feature selection method, with a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method produced equal or higher accuracy on three real-world problems than that reported for the GA method, while selecting fewer attributes. In addition, the IS method was significantly faster than the GA method.

Value distribution. The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: The value distribution VD(Aj) of a test Aj is defined by:

    VD(Aj) = IS(Aj) / vj                                 (3-5)

where vj is the number of legal values of Aj.
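Definitions 3-3 and 3-4 can be sketched in code as follows. The rule encoding (a condition dictionary paired with a t-weight) and the function names are illustrative assumptions, not AQDT-2's actual data structures:

```python
# Illustrative sketch of Definitions 3-3 and 3-4 (importance score and
# value distribution). Each rule is a (conditions, t_weight) pair, where
# conditions maps an attribute name to the set of values it allows.

def importance_score(rules_by_class, attr):
    # IS(Aj): sum, over all classes, of the t-weights of rules mentioning Aj.
    return sum(t for rules in rules_by_class.values()
                 for conds, t in rules if attr in conds)

def value_distribution(rules_by_class, attr, num_legal_values):
    # VD(Aj) = IS(Aj) / vj, preferring attributes with fewer legal values.
    return importance_score(rules_by_class, attr) / num_legal_values

# A toy ruleset: class "P" has one rule covering 8 examples, class "N" two.
rules = {
    "P": [({"x1": {1}, "x2": {2}}, 8)],
    "N": [({"x1": {2}}, 5), ({"x2": {1, 3}}, 4)],
}
print(importance_score(rules, "x1"))       # 8 + 5 = 13
print(value_distribution(rules, "x2", 3))  # (8 + 4) / 3 = 4.0
```

Note that VD is simply IS normalized by the number of legal values, which is why the text later refers to it as the "normalized IS" criterion.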

Dominance. The fourth elementary criterion, dominance, prefers tests that appear in large numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore, for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3]&[x4=1] is multiplied out to two rules with condition parts [x3=1]&[x4=1] and [x3=3]&[x4=1].
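The "multiplying out" step can be sketched as a Cartesian product over the disjoined value sets; the dictionary encoding of a condition part below is an assumption made for illustration:

```python
from itertools import product

# Sketch of the "multiplying out" step used by the dominance criterion:
# a condition part with internal disjunction, e.g. [x3=1 v 3]&[x4=1],
# is expanded into elementary rules without internal disjunction.

def multiply_out(cond_part):
    # cond_part: dict mapping attribute -> tuple of disjoined values.
    attrs = sorted(cond_part)
    return [dict(zip(attrs, values))
            for values in product(*(cond_part[a] for a in attrs))]

expanded = multiply_out({"x3": (1, 3), "x4": (1,)})
print(expanded)  # [{'x3': 1, 'x4': 1}, {'x3': 3, 'x4': 1}]
```

The dominance of an attribute would then be obtained by counting its occurrences over all such elementary rules.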

The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed as a percentage. The criteria are applied to tests in the order defined by LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (from the top value). The default LEF is:

    <Cost, t1; Disjointness, t2; Importance, t3; Value distr., t4; Dominance, t5>   (3-6)

where t1, t2, t3, t4, and t5 are tolerance thresholds (in percentages); their default values are 0. The default value of the cost of each test is 1.

The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the next (Importance) criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the third criterion (value distribution, i.e., the normalized IS) is used, and then, similarly, the fourth criterion (dominance). If there is still a tie, the method selects the best attribute randomly.
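The lexicographic filtering described above can be sketched as follows. The encoding of criteria as (score function, tolerance, maximize-flag) triples is a hypothetical simplification of LEF, and for brevity the final random tie-break is replaced by a deterministic one:

```python
# Sketch of LEF ranking with tolerances. Each criterion is a
# (score_fn, tolerance, maximize) triple, applied in order; only the
# attributes scoring within tolerance of the best score survive to the
# next criterion.

def lef_select(attributes, criteria):
    candidates = list(attributes)
    for score_fn, tol, maximize in criteria:
        scores = {a: score_fn(a) for a in candidates}
        best = max(scores.values()) if maximize else min(scores.values())
        margin = abs(best) * tol
        if maximize:
            candidates = [a for a in candidates if scores[a] >= best - margin]
        else:
            candidates = [a for a in candidates if scores[a] <= best + margin]
        if len(candidates) == 1:
            break
    return candidates[0]  # ties after all criteria: pick the first

# Default LEF order: cost (minimized) first, then disjointness (maximized).
cost = {"x1": 1, "x2": 1, "x3": 2}.__getitem__
disjointness = {"x1": 11, "x2": 9, "x3": 12}.__getitem__
print(lef_select(["x1", "x2", "x3"], [(cost, 0.0, False),
                                      (disjointness, 0.0, True)]))  # x1
```

In the example, x3 has the highest disjointness but is eliminated by the cost criterion (tolerance 0), so the disjointness criterion only decides between x1 and x2.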

If there is a non-uniform frequency distribution of examples over the different classes, then the selection criterion uses a modified definition of the disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified to a given class:

    Disjointness(A) = Σi=1..m D(A, Ci) · Frq(Ci)         (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF:

    <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>   (3-8)

where the Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.

3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting, at each step, the best test according to the ranking criteria described above and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is also connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, the number of legal values, a list of the values, the number of rules that contain that attribute, and the values that attribute takes in each rule. The attributes are arranged in an array in lexicographic order: first, in descending order of the number of rules that contain the attribute, and second, in ascending order of the number of the attribute's legal values.

The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.

To generate decision structures from rules, the AQDT-2 method prefers disjoint rule descriptions, either characteristic or discriminant (given by an expert or learned by a system). Disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The AQDT-2 algorithm is as follows:

The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent this attribute.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it the group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing the condition [A = i v ...]. If a branch is assigned values i v j, then associate with it all rules containing the condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with the given branch constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree have leaf nodes, stop. Otherwise, repeat steps 1 to 4 for each branch that has no leaf.
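The four steps above can be sketched as a short recursion. This is a simplified illustration, not the AQDT-2 implementation: it runs in standard mode only, and the full LEF ranking is replaced by a stand-in that counts rule occurrences:

```python
# Simplified sketch of Steps 1-4 (standard mode). Rules are
# (class, {attr: set_of_values}) pairs; the attribute choice below is a
# stand-in for the full LEF ranking.

def build_tree(rules, domains):
    classes = {c for c, _ in rules}
    if len(classes) == 1:                      # Step 4: leaf node
        return classes.pop()
    # Step 1: pick the attribute occurring in the most rules (stand-in LEF).
    attr = max(domains, key=lambda a: sum(1 for _, cd in rules if a in cd))
    node = {}
    rest = {a: d for a, d in domains.items() if a != attr}
    for value in domains[attr]:                # Step 2: one branch per value
        group = []                             # Step 3: associate rules
        for cls, conds in rules:
            if attr not in conds:              # consensus law: rule applies
                group.append((cls, conds))
            elif value in conds[attr]:         # condition satisfied: drop it
                group.append((cls, {a: v for a, v in conds.items() if a != attr}))
        node[value] = build_tree(group, rest) if group else None
    return (attr, node)

rules = [("T1", {"x1": {2}}), ("T2", {"x1": {1}, "x2": {3}}),
         ("T3", {"x1": {1}, "x2": {1}})]
tree = build_tree(rules, {"x1": [1, 2], "x2": [1, 3]})
print(tree)  # ('x1', {1: ('x2', {1: 'T3', 3: 'T2'}), 2: 'T1'})
```

In the toy run, x1 occurs in all three rules and is selected first; the branch for x1=2 immediately becomes a T1 leaf, while the branch for x1=1 requires a further test on x2.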

To select an attribute to be a node in the decision tree (steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses through all decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF function; it evaluates each attribute's disjointness for each decision class against the other decision classes.


To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

    r = Σi=1..m Ri   (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as:

    Cmpx(Iter1) = O(r · s)

In the second iteration, the disjointness is calculated between the decision classes for all attributes. The complexity of the second iteration can be given by:

    Cmpx(Iter2) = O(n · m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

    l = max{m, r}                                        (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node of the decision tree, the node complexity NC(AQDT), is given by:

    NC(AQDT) = O(l · n)

Usually, l equals the number of rules associated with the given node. Thus, the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The level complexity of the AQDT algorithm, LC(AQDT), can be given by:

    LC(AQDT) < O(l · n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (⌊r/2⌋). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree. However, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm to be (l · s · o), where o is the number of non-leaf nodes at the given level; in such cases, either (l · o ≤ r) or (l · s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by:

    LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)

Figure 3-5: Decision trees showing the maximum number of non-leaf nodes: (a) per one level; (b) per one path

Note also that after selecting an attribute to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structures of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch will not be tested again.

Since the disjointness criterion selects the attribute that minimizes the average number of tests, ANT, the AQDT algorithm generates decision trees with the least number of levels. The number of levels of a decision tree is supposed to be less than or equal to the minimum of the number of attributes and the number of rules. Consider k as the number of levels in a given decision tree:

    k ≤ min{n, r}                                        (3-10)

Two cases represent the most complex situations: Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels will be a function of the logarithm of the number of rules. In such a case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by:

    Complexity(AQDT) = O(l · n · log r)                  (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels of a decision tree equals one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, it is unlikely to obtain such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In such a case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus, the level complexity of this decision tree is estimated as:

    LC(AQDT) = O(l · log n)

The maximum number of levels in such a decision tree is k−1. Thus, the complexity of the AQDT algorithm in such cases is given by:

    Complexity(AQDT) = O(l · k · log n)                  (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11), and (3-12), the complexity of the AQDT algorithm is determined by:

    Cmplx(AQDT) = O(r · k · log l)                       (3-13)

3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT-2 is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the process

T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software

These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phase, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phase, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phase, and you need a semi-automated tool.
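Before building any decision structure, the rules of Figure 3-6 can be evaluated directly on a fully specified case; the following sketch does exactly that (the list-of-conditions encoding and the function name are illustrative assumptions):

```python
# The rules of Figure 3-6, evaluated directly on a fully specified case.
# Encoding is illustrative: each rule is a list of (attr, allowed_values).

RULES = {
    "T1": [[("x1", {2}), ("x2", {2})],
           [("x1", {3}), ("x3", {1, 3}), ("x4", {1})]],
    "T2": [[("x1", {1, 2}), ("x2", {3, 4})],
           [("x1", {3}), ("x3", {1, 2}), ("x4", {2})]],
    "T3": [[("x1", {1}), ("x2", {1})],
           [("x1", {4}), ("x3", {2, 3}), ("x4", {3})]],
}

def matching_tools(case):
    # A rule fires when every one of its conditions is satisfied by the case.
    return sorted(tool for tool, rules in RULES.items()
                  if any(all(case[a] in vals for a, vals in r) for r in rules))

# Average cost (x1=2), requirement metric (x2=2): Rule 1 recommends T1.
print(matching_tools({"x1": 2, "x2": 2, "x3": 1, "x4": 1}))  # ['T1']
```

The decision structure derived in the following pages organizes exactly this matching process into a fixed order of attribute evaluations.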


Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.

From the rules in Figure 3-6, we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: 1) determine, for each attribute, the sets of values that the attribute takes in individual decision rules, and 2) remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). Value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2}, and {3, 4}.
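The grouping step just described can be sketched in a few lines; the function name and the representation of value sets as Python sets are assumptions for illustration:

```python
# Sketch of the compact-mode grouping step: collect the value sets an
# attribute takes in individual rules, then drop any set that subsumes
# (properly contains) another collected set.

def branch_value_sets(value_sets):
    unique = [set(s) for s in {frozenset(s) for s in value_sets}]
    # Keep a set only if no other collected set is a proper subset of it.
    return sorted((s for s in unique
                   if not any(o < s for o in unique)),
                  key=sorted)

# x1 takes {2}, {3}, {1,2}, {1}, {4} in the rules of Figure 3-6:
print(branch_value_sets([{2}, {3}, {1, 2}, {1}, {4}]))
# {1,2} subsumes {1} and {2}, so branches get the individual values.

# x2 takes {1}, {2}, {3,4}, {1,2,3,4}:
print(branch_value_sets([{1}, {2}, {3, 4}, {1, 2, 3, 4}]))
# branches are assigned {1}, {2}, and {3,4}.
```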


Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 is ended by a leaf T3. Rules containing the other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools to use for testing a given software system.

Figure 3-7: A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)

Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3, and 4). Rules are represented by collections of cells in the intersection of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules. Rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31, and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].

Figure 3-8: Diagrammatic visualization of (a) the decision rules and (b) the derived decision tree

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2, and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was removed from the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.


Figure 3-9: Decision trees learned ignoring (a) the supporting metric and (b) the type of the testing tool

It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10: A decision tree learned without the cost attribute

3.4 Tailoring Decision Structures to the Decision-making Situation

Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of the different decisions. This chapter presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm and a new system, AQDT-2, that transforms decision rules into task-oriented decision structures, are presented. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.

Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.


3.4.1 Learning Cost-Dependent Decision Structures

As described above, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test's cost is the first criterion and its tolerance is 0%. This means that only the least expensive attributes pass to the next step of evaluation involving the other elementary criteria. If an attribute has a high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.

3.4.2 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained, but a decision has to be made, it is useful to know the probability distribution over the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequencies at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have:

    P(Ci | b1, ..., bk) = P(Ci) · P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequencies of training examples from the different classes, we have:

    P(Ci) = twi / Σj=1..m twj                            (3-10)

    P(b1, ..., bk | Ci) = wi / twi                       (3-11)

    P(b1, ..., bk) = Σj=1..m wj / Σj=1..m twj            (3-12)

By substituting (3-10), (3-11), and (3-12) in (3-9), we obtain:

    P(Ci | b1, ..., bk) = wi / Σj=1..m wj                (3-13)
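The final formula is simple enough to state as a one-line computation; the sketch below assumes the per-class example counts at the node are available as a dictionary:

```python
# The class-probability estimate at a node: the probabilities reduce to
# the normalized counts of training examples of each class that passed
# the tests leading to the node (wi / sum of wj).

def class_distribution(examples_at_node):
    # examples_at_node: {class_name: number of its examples at this node}
    total = sum(examples_at_node.values())
    return {c: w / total for c, w in examples_at_node.items()}

# Hypothetical node reached by 6 examples of C1 and 2 of C2:
dist = class_distribution({"C1": 6, "C2": 2})
print(dist)  # {'C1': 0.75, 'C2': 0.25}
```

With these estimates, the most probable decision at the node is simply the class with the largest count.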

A related method for handling the unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

3.4.3 Coping with Noise in Training Data

The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision-tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision-tree pruning, which can only prune attributes within a subtree and thus cannot freely choose the attributes to prune). Examples are presented in Section 4.
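The truncation step itself is a simple filter over the ruleset; the following sketch assumes the rule encoding used earlier (condition dictionary plus t-weight) and an illustrative threshold:

```python
# Sketch of t-weight-based rule truncation for noisy data: rules whose
# t-weight falls below a threshold (reflecting the expected noise level)
# are removed before the decision structure is built.

def truncate_rules(rules_by_class, min_t_weight):
    # rules_by_class: {class: [(conditions, t_weight), ...]}
    return {cls: [(conds, t) for conds, t in rules if t >= min_t_weight]
            for cls, rules in rules_by_class.items()}

noisy = {"P": [({"x1": {1}}, 40), ({"x2": {3}}, 1)],   # t=1: likely noise
         "N": [({"x1": {2}}, 35)]}
clean = truncate_rules(noisy, min_t_weight=2)
print(clean)  # the t=1 rule of class P is dropped
```

Because the filter operates on whole rules before any tree is built, it can discard a weak rule wherever its attributes would have appeared, which is the flexibility that subtree pruning lacks.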


3.5 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes, and it contains ambiguous examples (i.e., examples belonging to more than one decision class).

For the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity <= 75]
Play <= [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute for the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y); however, the gain ratio criterion gave X a higher score than all the other criteria did.

Table 3-5 shows the performance of the AQDT-2 attribute selection criteria on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.

The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when it was applied to both the original examples and the rules learned from those examples. It is clear that neither the importance score nor the value distribution criterion can perform well when evaluating the training examples directly. This is because these two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria are applied to the learned rules, they provide very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute with the most balanced appearance of its values in different rules. The dominance criterion prefers attributes that occur in large numbers of elementary rules. Table 3-6 shows the possible LEF rank of each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.

3.6 Decision Structures vs. Decision Trees

This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex, and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.

Table 3-7: Comparison between decision structures and decision trees

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11. Both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes; this decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.

Figure 3-10: Decision structures learned by AQDT-2 using different criteria: a) using the disjointness criterion (5 nodes); b) using the importance score criterion (7 nodes, 9 leaves). P = Positive, N = Negative.

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the Imam's example, that represents a class of problems in which the information gain criteria used by decision tree learning programs do not work properly. The basic idea behind this example is that information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class P and "-" means the example belongs to class N. Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

Figure 3-12: The Imam's example, where learning decision structures (trees) from rules is better than learning them from examples: a) training examples; b) the optimal decision tree.

AQ15c learned the following rules from this data:

P <= [x1=1][x2=1] v [x1=2][x2=2]
N <= [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
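To make the effect concrete, the following sketch computes the information gain of x1 on a small constructed stand-in for this class of problems (not the actual figure data): with examples balanced over the (x1, x2) combinations for the concept "P if x1 = x2", the gain of x1 alone is zero.

```python
from collections import Counter
from math import log2

# A constructed stand-in illustrating the point above: for the concept
# "P if x1 = x2, N otherwise", with examples balanced over the (x1, x2)
# combinations, the information gain of x1 taken alone is 0.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    base = entropy([e["cls"] for e in examples])
    n = len(examples)
    rem = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e["cls"] for e in examples if e[attr] == v]
        rem += len(subset) / n * entropy(subset)
    return base - rem

examples = [{"x1": a, "x2": b, "cls": "P" if a == b else "N"}
            for a in (1, 2) for b in (1, 2) for _ in range(6)]

print(info_gain(examples, "x1"))  # 0.0: each value of x1 still splits P and N 6/6
```

Rules produced by a covering learner, in contrast, mention x1 and x2 explicitly, so a criterion that works on rules rather than on example frequencies can still pick them out.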

An example of a problem in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <= [x1=2] v [x2=2]
N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10:9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and .85 for the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined using a new attribute expressing whether x1=2 or x2=2 holds, with values 0 for "no" and 1 for "yes".

Figure 3-13: An example where decision rules are simpler than decision trees: a) the training data; b) the correct decision tree.

CHAPTER 4 Empirical Analysis and Comparative Study

This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracing for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
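The sampling protocol above can be sketched as follows. The toy dataset, the reduced list of sizes, and the number of repetitions are placeholders (the thesis uses 10% through 90% with 100 samples each); only the structure of the protocol is the point.

```python
import random

# A sketch of the evaluation protocol described above: for each relative size,
# draw a random training sample and test on its complement. Sizes, repetition
# count, and the dataset are illustrative placeholders.

def learning_curve_splits(examples, sizes=(0.1, 0.5, 0.9), repeats=3, seed=0):
    rng = random.Random(seed)
    splits = []
    for frac in sizes:
        k = max(1, int(frac * len(examples)))
        for _ in range(repeats):
            train = rng.sample(examples, k)
            test = [e for e in examples if e not in train]  # complement set
            splits.append((frac, train, test))
    return splits

data = list(range(20))
for frac, train, test in learning_curve_splits(data):
    # every split partitions the data into a training sample and its complement
    assert sorted(train + test) == data
```

Testing on the complement (rather than on a fixed held-out set) is what makes the reported accuracies comparable across the different training sizes.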


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushrooms, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (the best path from top to bottom) in terms of accuracy, time, and complexity were used as default settings for experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.

Figure 4-1: Design of a complete experiment.


For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed where the learning system AQ17 was used instead of AQ15c. Analysis of some experiments included visualization of the training examples and the concept learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database), 9 different relative sample sizes of training examples are selected (10%, ..., 90%). 100 random samples of each size are drawn from the original data for training; the examples which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).

162 parametrical experiments per training dataset (18 x 9); 16,200 experiments per sample size; 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2); 199,800 experiments per problem (first portion + C4.5 + Constructive Induction); 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.); 73 days (estimated running time).

The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection following that describes a partial or full experimental analysis of one of the other problems.

4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision


structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.
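The disjointness evaluation can be sketched as follows. This is a hedged reading of the criterion, not the exact thesis formula: the 0-3 pairwise scale and the rule encoding are assumptions introduced here.

```python
# A hedged sketch of the disjointness evaluation described above: for each
# class, collect the set of values of the attribute appearing in that class's
# rules (a rule that omits the attribute implicitly allows every legal value),
# then score how separated the per-class value sets are. The 0-3 pairwise
# scale is one plausible reading, not the exact thesis formula.

def value_set(ruleset, attr, domain):
    vals = set()
    for rule in ruleset:  # rule: dict mapping attribute -> set of allowed values
        vals |= rule.get(attr, domain)  # missing condition = whole domain
    return vals

def disjointness(rulesets, attr, domain):
    sets = [value_set(rs, attr, domain) for rs in rulesets]
    score = 0
    for i, vi in enumerate(sets):
        for j, vj in enumerate(sets):
            if i == j:
                continue
            if not (vi & vj):            score += 3  # value sets are disjoint
            elif vi == vj:               score += 0  # identical value sets
            elif vi <= vj or vj <= vi:   score += 1  # one set subsumes the other
            else:                        score += 2  # partial overlap
    return score

pos = [{"x1": {1, 2}}]            # toy rules for two classes, x1 in {1, 2, 3, 4}
neg = [{"x1": {3, 4}}]
print(disjointness([pos, neg], "x1", {1, 2, 3, 4}))  # 6: maximally discriminating
```

An attribute whose value sets coincide across classes scores 0 and is useless as a test; one whose value sets are pairwise disjoint classifies in a single test.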

Decision class C1:
1. [x1=1][x6=1][x2=1..2][x3=1..2][x4=1..3][x5=1..2][x7=1..3]  (t:18, u:18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1..3][x7=1,3,4]  (t:3, u:3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2..3]  (t:2, u:2)
4. [x1=1][x6=1][x2=2][x3=1..2][x4=3][x5=1..2][x7=4]  (t:2, u:2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1..2]  (t:2, u:2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1..3][x7=1..3][x5=3]  (t:2, u:2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1..2][x4=3][x7=4]  (t:2, u:2)

Decision class C2:
1. [x1=2..4][x2=1..2][x3=1..2][x4=3][x5=2..3][x6=1][x7=2..3]  (t:28, u:19)
2. [x1=2..4][x2=2][x3=1..2][x4=3][x5=1..2][x6=1][x7=3..4]  (t:17, u:6)
3. [x1=2..4][x2=1..2][x3=1..2][x4=3][x5=1][x6=1][x7=3..4]  (t:10, u:4)
4. [x1=1,3,5][x2=1..2][x3=1..2][x4=3][x5=3][x6=1][x7=2..4]  (t:10, u:2)
5. [x1=3..5][x2=1..2][x3=1..2][x4=3][x5=2..3][x6=1][x7=1..4]  (t:9, u:4)
6. [x1=2][x2=1..2][x3=1..2][x5=1..3][x4=1][x6=1][x7=1]  (t:7, u:6)
7. [x1=3..4][x2=2][x3=2][x4=1..3][x5=1,3][x6=1][x7=1..2]  (t:6, u:4)
8. [x1=3..5][x2=2][x3=1][x7=1][x4=1..2][x5=1..3][x6=1,3]  (t:5, u:5)
9. [x1=1][x2=1][x6=1][x3=1..2][x4=3][x5=1..2][x7=4]  (t:4, u:4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1..2][x7=1..3]  (t:4, u:4)
11. [x1=1..2][x2=1][x6=1][x3=1..2][x4=1..3][x5=3][x7=1..4]  (t:4, u:2)

Decision class C3:
1. [x1=2..5][x2=1..2][x3=1..2][x7=1..4][x4=1..2][x5=1..3][x6=2..4]  (t:41, u:32)
2. [x1=1..4][x2=1..2][x3=1..2][x4=2][x5=2][x6=2..3][x7=2..4]  (t:27, u:20)
3. [x1=1..3][x2=1][x3=1..2][x7=1..4][x4=2][x5=1..2][x6=2..3]  (t:19, u:6)
4. [x1=1,2,4][x2=1..2][x3=1..2][x4=2][x5=2..3][x6=3..4][x7=1]  (t:13, u:8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1..2][x6=3][x7=2..4]  (t:5, u:5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1..3][x5=1][x6=1][x7=1..4]  (t:4, u:4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3]  (t:1, u:1)

Figure 4-2: Decision rules determined by AQ15c from the wind bracing data.


Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by values of x6 (in general, they could be groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for a branch until all rules assigned to the branch are of the same class. That class is then assigned to the leaf.

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated only for those rules which contain [x6=1] as one of their conjunctions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.
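The recursive loop just described (select an attribute, split the rules by its values, recurse until each branch is single-class) can be sketched as follows. The attribute-choice step here (most frequently used attribute) is a stand-in for the full criteria machinery, and the rule encoding is an assumption.

```python
# A simplified sketch of the loop described above: pick an attribute, split the
# rules by its values, and recurse until every rule in a branch belongs to one
# class. The selection criterion here is a deliberate stand-in for the LEF.

def build(rules, domains):
    """rules: list of (decision_class, {attribute: set_of_allowed_values})."""
    if not rules:
        return None  # no rule matches this branch
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:
        return classes.pop()  # leaf: a single decision class remains
    # stand-in selection criterion: the attribute used by the most rules
    counts = {}
    for _, conds in rules:
        for a in conds:
            counts[a] = counts.get(a, 0) + 1
    attr = max(counts, key=counts.get)
    node = {}
    for v in domains[attr]:
        # keep rules compatible with attr = v (a missing condition allows any value)
        branch = [(c, {a: s for a, s in conds.items() if a != attr})
                  for c, conds in rules if v in conds.get(attr, domains[attr])]
        node[(attr, v)] = build(branch, domains)
    return node

rules = [("P", {"x1": {1}}),
         ("N", {"x1": {2}, "x2": {1}}),
         ("N", {"x1": {2}, "x2": {2}})]
print(build(rules, {"x1": {1, 2}, "x2": {1, 2}}))  # {('x1', 1): 'P', ('x1', 2): 'N'}
```

Note that the recursion operates on rules rather than examples, which is why re-evaluating the criteria at each node stays cheap.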

For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples and because it is widely available.

The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some misclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were misclassified.
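The windowing loop described above can be sketched schematically. The learner here is a simple one-dimensional threshold ("stump") rule standing in for real tree induction, and the dataset, function names, and window-growth policy are assumptions; only the train/test/grow cycle mirrors the description.

```python
import random

# A schematic sketch of windowing: train on a window, test the trial model
# against the remaining examples, add misclassified ones, repeat. A threshold
# stump stands in for the actual tree learner.

def stump_learner(window):
    # choose the threshold minimizing errors on the window: predict "B" for x >= t
    xs = sorted({x for x, _ in window})
    best_t, best_err = None, None
    for t in xs + [max(xs) + 1]:
        err = sum(("B" if x >= t else "A") != y for x, y in window)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return lambda x: "B" if x >= best_t else "A"

def window_train(examples, init_size, max_rounds=20):
    random.seed(0)  # fixed seed: the initial window is a random sample
    window = random.sample(examples, init_size)
    for _ in range(max_rounds):
        model = stump_learner(window)
        # examples the trial model gets wrong and that are not yet in the window
        misses = [(x, y) for x, y in examples
                  if model(x) != y and (x, y) not in window]
        if not misses:
            return model  # consistent with all training examples
        window += misses[: max(1, len(misses) // 2)]  # grow the window
    return model

data = [(i, "A") for i in range(8)] + [(i, "B") for i in range(8, 10)]
model = window_train(data, init_size=3)
print(model(0), model(9))  # A B
```

The appeal of windowing is that the final tree is induced from a subset of the data while still being checked for consistency against all of it; the cost, as the experiments here show, is sensitivity to the window size.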

Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves).

Figure 4-4 shows a decision structure learned, under the default settings of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision structure was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the "?" (indefinite) decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distribution.

Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves).

Figure 4-5: A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves).

Figure 4-6 presents a decision structure from Figure 4-5 in which leaves were assigned candidate decisions with decision class probability estimates. Let us consider node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (11), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be approximated as P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11.
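The quoted estimates can be reproduced under one reading of equation (11), namely that each class probability is its example frequency w_i normalized by the sum of the frequencies over all classes; this normalization is an assumption inferred from the numbers, not a quote of the equation itself.

```python
# A sketch reproducing the node-x2 estimates above, assuming equation (11)
# normalizes each example frequency w_i by the sum over all classes.

def leaf_probabilities(w):
    total = sum(w.values())
    return {cls: round(wi / total, 2) for cls, wi in w.items()}

w = {"C1": 31, "C2": 11, "C3": 0, "C4": 5}  # example frequencies at node x2
print(leaf_probabilities(w))  # {'C1': 0.66, 'C2': 0.23, 'C3': 0.0, 'C4': 0.11}
```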

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that rules whose combined t-weight represented 10% or less of the coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
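The truncation step can be sketched as follows. The rule representation (label, t-weight pairs) is illustrative, and the strict-inequality cutoff is one reading of "10% or less ... were removed".

```python
# A hedged sketch of rule truncation under an assumed noise level: within each
# class, drop rules whose t-weight is at most the given fraction of the
# class's total t-weight.

def truncate_rules(rules_by_class, threshold=0.10):
    kept = {}
    for cls, rules in rules_by_class.items():   # rules: list of (rule, t_weight)
        total = sum(t for _, t in rules)
        kept[cls] = [(r, t) for r, t in rules if t > threshold * total]
    return kept

rules = {"C1": [("r1", 18), ("r2", 3), ("r3", 2)],   # class totals: 23 and 5
         "C4": [("r1", 4), ("r2", 1)]}
print(truncate_rules(rules))
# C1 keeps r1 and r2 (t > 2.3); r3 is pruned. C4 keeps both rules (t > 0.5).
```

Because low-t-weight rules typically cover rare (possibly noisy) examples, this pre-pruning trades a small amount of accuracy for a markedly simpler structure, as the 88% vs. 89% comparison above illustrates.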

Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves).

Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules covering less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves).

To demonstrate changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the costs of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost; AQDT-2 generated a decision structure with four nodes and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost; AQDT-2 learned a decision structure with five nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram indicates different decision classes with different shades. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.

Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data (white means the system cannot produce a decision without the missing attribute).

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>; see Table 4-2) were selected for experiments with Subsystem II.

These experiments were performed for four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of the rules learned by AQ15c from examples and the predictive accuracy of the decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with the testing examples that form the complement of the training example set.


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each dataset, the results reported from each experiment are calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.

Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem.

Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means the predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For


each dataset, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.

Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data.

Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data.

4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1

This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no).


The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c for the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.

Figure 4-12: A visualization diagram of the MONK-1 problem.

The AQDT-2 program, running in its default mode with the optimality criterion set to minimize the number of nodes (i.e., with the disjointness criterion ranked first), produced a decision tree with 41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem.

Positive rules:
1. [x5 = 1]
2. [x1 = 3][x2 = 3]
3. [x1 = 2][x2 = 2]
4. [x1 = 1][x2 = 1]

Negative rules:
1. [x1 = 1][x2 = 2 v 3][x5 = 2..4]
2. [x1 = 2][x2 = 1 v 3][x5 = 2..4]
3. [x1 = 3][x2 = 1 v 2][x5 = 2..4]

Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.

The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% of the examples and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These rules were:

Pos <= [x5=1] v [x1=x2]    and    Neg <= [x5≠1] & [x1≠x2]

Table 4-3: Evaluation of the attribute selection criteria for the MONK-1 problem.

From these rules, the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15-a).

Figure 4-14: The decision tree for the MONK-1 problem generated by AQDT-2 (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative).

Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: a) a compact decision structure from the AQ15 rules (5 nodes, 7 leaves); b) a compact decision structure from the AQ17 rules (2 nodes, 3 leaves).

Experiments with Subsystem I: As mentioned earlier, the initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>) were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by


AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.

Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.

Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem.

Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each dataset, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means the predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree


is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.

Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data.

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each dataset, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.

Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem.


4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using its original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training examples (62 positive and 62 negative). These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.
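For reference, the published MONK-2 target concept is "exactly two of the six attributes take their first value," which is why the concept resists a compact DNF description over the original attributes. A minimal sketch (the dictionary encoding below is an assumption for illustration, not the dissertation's representation):

```python
# Sketch of the standard MONK-2 target concept: an example is positive
# iff exactly two of the six attributes take their first value.

def monk2_concept(example):
    """example: dict mapping attribute name -> 1-based value index."""
    first_valued = sum(1 for v in example.values() if v == 1)
    return first_valued == 2

# A positive example: only x1 and x3 take their first value.
ex = {"x1": 1, "x2": 2, "x3": 1, "x4": 3, "x5": 4, "x6": 2}
```

Because membership depends on a count across all attributes rather than on any fixed conjunction of attribute values, no small set of DNF rules over the original attributes captures the class cleanly.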

Figure 4-19: A visualization diagram of the MONK-2 problem.

Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>). They were selected for the experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules.

Each value in that table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs is tested with a testing set that represents the complement of the training examples.


Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem (panels: <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: the relative sample sizes (%) of the training data).

Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the pre-pruning threshold for the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
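As a rough illustration of what a coverage-based pre-pruning threshold of this kind does (the rule representation, names, and exact semantics below are assumptions for illustration, not AQDT-2's actual implementation), rules whose relative coverage falls below the threshold might simply be discarded before the decision structure is built:

```python
# Hypothetical sketch of pre-pruning decision rules by relative coverage.
# A rule covering fewer than `threshold` (a fraction) of the training
# examples is dropped; AQDT-2's real mechanism may differ in detail.

def prune_rules(rules, total_examples, threshold=0.03):
    """Keep only rules covering at least `threshold` of the examples."""
    return [r for r in rules if r["covered"] / total_examples >= threshold]

rules = [
    {"name": "r1", "covered": 40},  # covers 40% of 100 examples -> kept
    {"name": "r2", "covered": 2},   # covers 2%, below 3% default -> pruned
]
kept = prune_rules(rules, total_examples=100)
```

Raising the threshold trades away rare-case coverage for a simpler structure, which is why, on MONK-2, increasing it did not help.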

Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data (panels: <Disj, Char> and <Intr, Char>; x-axis: the relative sample sizes (%) of the training data).

Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.

Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem (predictive accuracy, tree complexity, and learning time as functions of the relative size of training examples (%)).

4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples, i.e., examples that are assigned the wrong decision class.

Figure 4-23: A visualization diagram of the MONK-3 problem.

Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.

Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>).

Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data (x-axis: the relative sample sizes (%) of the training data).

Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
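The arithmetic behind this effect can be sketched directly (the 100-example total below is a hypothetical count chosen to make the numbers round):

```python
# A single misclassified example yields very different error rates
# depending on how much of the data is held out for testing.
# Assume 100 total examples for illustration.

def error_rate(errors, test_fraction, total=100):
    """Error rate when `errors` mistakes occur on `test_fraction` of the data."""
    return errors / (test_fraction * total)

big_test = error_rate(1, 0.9)    # train on 10%, test on 90 examples
small_test = error_rate(1, 0.1)  # train on 90%, test on 10 examples
```

Here one error against the 90-example test set contributes about 1.1% to the error rate, while the same single error against the 10-example test set contributes 10%, so small fluctuations at large sample sizes are magnified.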

Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem (predictive accuracy, tree complexity, and learning time as functions of the relative size of training examples (%)).

4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3) Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).

In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem (predictive accuracy, tree complexity, and learning time as functions of the relative size of training examples (%)).

4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes: 1) Cap-shape, 2) Cap-surface, 3) Cap-color, 4) Bruises, 5) Odor, 6) Gill-attachment, 7) Gill-spacing, 8) Gill-size, 9) Gill-color, 10) Stalk-shape, 11) Stalk-root, 12) Stalk-surface-above-ring, 13) Stalk-surface-below-ring, 14) Stalk-color-above-ring, 15) Stalk-color-below-ring, 16) Veil-type, 17) Veil-color, 18) Ring-number, 19) Ring-type, 20) Spore-print-color, 21) Population, and 22) Habitat.

To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.

In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning times are about the same.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem (predictive accuracy, tree complexity, and learning time as functions of the relative size of training examples (%)).

4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains

Learning Task-Oriented Decision Structures from Structural Data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes: eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different lengths). To describe the train problem in a format suitable for AQDT-2, a set of eight attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To encode the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
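The two-digit (ij) coding described above can be sketched as follows; the per-car attribute values below are invented for illustration (Table 4-7 gives the actual attribute set), and the function is an assumption about the encoding, not the dissertation's code:

```python
# Hypothetical sketch of flattening a structured train description into
# x<i><j> attribute-value pairs: i is the car position, j the attribute
# number within that car.

def flatten_train(cars):
    """cars: list of per-car attribute-value lists (up to 8 values each)."""
    example = {}
    for i, car in enumerate(cars, start=1):       # i = car position
        for j, value in enumerate(car, start=1):  # j = attribute number
            example[f"x{i}{j}"] = value
    return example

# A two-car train, each car described by three (invented) attribute codes.
train = flatten_train([[2, 1, 3], [1, 2, 2]])
# train["x22"] is attribute 2 of car 2; a key like "x32" would appear
# only if the train had a third car, so examples vary in length.
```

This makes each train a single variable-length example, which matches AQDT-2's ability to accept examples with different numbers of attribute-value pairs.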

Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1-4).

Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only the attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains. Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were assigned lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).

Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: a) decision structure learned using only descriptions of Car 1; b) decision structure learned using only descriptions of Car 2; c) decision structure learned using only descriptions of Car 3.

4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of examples (216 examples in total; half of the examples were in one class and the other half in the other class).
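The default window rule quoted above can be written out explicitly (a sketch of the stated formula only, not C4.5's source code):

```python
# C4.5's default initial window, as described above: the maximum of 20%
# of the examples and twice the square root of the number of examples.
import math

def default_window(n_examples):
    return max(0.2 * n_examples, 2 * math.sqrt(n_examples))

w = default_window(216)  # the Congressional Voting data set size
```

For the 216-example data set the 20% term dominates (43.2 versus about 29.4), so the default window starts well below the 100% window used in the second option.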


Table 4-8 and Figures 4-30a and 4-30b show the results graphically for the Congressional Voting-1984 problem. The results indicate that AQDT-2 generated decision trees that had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change in the size of the training example set was smaller.

Table 4-8: A tabular summary of the predictive accuracy of the decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.

9 95 I 8 8

~ III 7IPa 94 0

6f Iie 93 S 5 lt

Col

i 492

3

91 2

5 10 15 20 25 30 35 40 45 50 55 60 5 10 15 20 25 30 35 40 45 50 55 60 Relathe slzeofthe tralDlng eumples (i) Relative size oItbe training eumples (i)

a) Accuracy of the decision tree as a function b) Size of the decision tree as a function of of the size of the set of training examples the size of the set of the training examples

Figurrt 4-30 Comparing decision trees for the Congressional bting-84 data learned by C4S amp AQf1f-2

4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from these rules. This section also includes some examples of describing different decision-making situations and of the task-oriented decision structures learned for each situation.

Table 4-9 shows the best parameter settings for learning decision rules for the different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others), the best cover is determined according to the best width of the beam search and the best rule type.

Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.

It was clear that AQDT-2 works better with characteristic rules than with discriminant ones. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracies of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, it is considered higher or lower; 2) if the average learning times are within ±0.1 seconds, the learning time is considered the same.

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparisons of the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system that performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system that performs better; Same-X means similar performance, where AQDT-2 has the advantage if X=A and C4.5 has the advantage if X=C.

Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, while C4.5 produces smaller but less accurate ones. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of the decision trees learned by C4.5 grows relatively quickly as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons are that AQDT-2 over-generalizes the decision rules and that C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be much less than that of C4.5. However, on some data sets it takes more time, because in some situations there is not enough information to reach a decision and the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem has not been implemented yet.

To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class; the white areas represent non-positive coverage.

Figure 4-31: A visualization diagram of the decision rules learned by AQ15c for the MONK-2 problem.

Figure 4-32 shows the testing results of the decision rules learned by AQ15c. All marked shaded cells indicate false positive errors (AQ15c classifies the cell as positive when it should be negative), and all marked non-shaded cells indicate false negative errors (AQ15c classifies the cell as negative when it should be positive).

Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31.

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, cells with one shading indicate portions of the representation space that were classified as positive by both AQ15c and AQDT-2; cells with a second shading are portions that were classified as positive by AQ15c but as negative by AQDT-2; and cells with a third shading represent portions where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).


This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept in the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors produced by the AQDT-2 decision tree.

Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem.

Figure 4-34 is similar to Figure 4-33, but illustrates the false positive and false negative errors. Marked cells of one kind indicate portions of the representation space with false positive errors; marked cells of the other kind represent portions with false negative errors. Comparing Figures 4-34 and 4-32, more errors occurred because of the over-generalization.


Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree.

Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%.

CHAPTER 5 CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.
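As a rough illustration of this difference (the node representation below is an assumption for illustration, not AQDT-2's internal format), a decision structure lets two branches share one subtest node, where a single-parent tree would have to duplicate that subtree under each branch:

```python
# Hypothetical sketch: in a decision structure (a DAG), a node may have
# several parents, so a common subtest is stored once and shared.

class Node:
    def __init__(self, test, children=None, decision=None):
        self.test = test                 # attribute to test; None at a leaf
        self.children = children or {}   # attribute value -> child Node
        self.decision = decision         # class label at a leaf

leaf_pos = Node(test=None, decision="positive")
shared = Node(test="x2", children={"yes": leaf_pos})

# Both outcomes of the x1 test reuse the same subtest node.
root = Node(test="x1", children={"a": shared, "b": shared})
```

In a tree the `shared` node would be copied under both branches; the shared-node form is what makes symmetric functions, for example, much more compact as structures than as trees.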

The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated on-line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: that in order to determine a decision structure from examples it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is

usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.

The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than the decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful; it will enable the method to represent much more simply those decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of


the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules in the cases investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.

Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and to have higher predictive accuracy than those obtained in the conventional way, i.e., directly from examples. In the experiments involving artificial and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of them, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.

REFERENCES


Arciszewski, T., Bloedorn, E., Michalski, R.S., Mustafa, M. and Wnek, J. (1992). "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990). "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992). "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993). "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994). "Trading Accuracy for Simplicity in Decision Trees," Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (Eds.) (1987). Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986). "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and Regression Trees, Wadsworth Int. Group, Belmont, California.

Clark, P. and Niblett, T. (1987). "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991). "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991). "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994). "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984). "Experience in the Use of an Inductive System in Knowledge Engineering," in M. Bramer (Ed.), Research and Developments in Expert Systems, Cambridge University Press, Cambridge.

Hunt, E., Marin, J. and Stone, P. (1966). Experiments in Induction, Academic Press, New York.

Imam, I.F. and Michalski, R.S. (1993a). "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.

Imam, I.F. and Michalski, R.S. (1993b). "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. and Zemankova, M. (Eds.), Kluwer Academic Publishers, MA.

Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993). "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.

Imam, I.F. and Vafaie, H. (1994). "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.

Imam, I.F. and Michalski, R.S. (1994). "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, Seattle, Washington, July.

Kohavi, R. (1994). "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi, R. and Li, C. (1995). "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian, O.L. and Wolberg, W.H. (1990). "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994). International East-West Challenge, Oxford University, UK.

Michalski, R.S. (1973). "AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.

Michalski, R.S. (1978). "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, University of Illinois, Urbana, March.

Michalski, R.S. (1983). "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).

Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986). "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.

Michalski, R.S. (1990). "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, Morgan Kaufmann Publishers, San Mateo, CA (pp. 63-111), June.

Michalski, R.S. and Imam, I.F. (1994). "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer-Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers, J. (1989a). "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.

Mingers, J. (1989b). "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.

Niblett, T. and Bratko, I. (1986). "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton, Cambridge University Press, Cambridge.

Quinlan, J.R. (1979). "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan, J.R. (1983). "Learning Efficient Classification Procedures and Their Application to Chess End Games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, Los Altos.

Quinlan, J.R. (1986). "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.

Quinlan, J.R. (1987). "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).

Quinlan, J.R. (1990). "Probabilistic Decision Trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, Morgan Kaufmann Publishers, San Mateo, CA (pp. 63-111), June.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Smyth, P., Goodman, R.M. and Higgins, C. (1990). "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI-90, Stockholm, August.

Sokal, R. and Rohlf, F. (1981). Biometry, Freeman Pub., San Francisco.

Thrun, S.B., Mitchell, T. and Cheng, J. (Eds.) (1991). "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.

Wnek, J. and Michalski, R.S. (1994). "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence: two solutions obtained by that program ranked second and third in one competition, and two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agent and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.


my experiments; Srinivas Gutta, for providing some applications for my Ph.D. work; Mike Hieb, for reviewing an earlier draft of my dissertation and helping me find relevant articles; Ken Kaufman, for reviewing an earlier draft of my thesis; Mark Maloof, for providing me with script files which made it easier to iteratively run AQ15c; Halah Vafaie, for working with me on the application and comparison of different aspects of my work; and Janusz Wnek, for using his DIAV program for explaining my results.

I would like to thank Professor Andrew P. Sage, Dean of the School of Information Technology and Engineering, and Professor Kenneth Bumgarner, Dean of Student Services and Associate Vice President of George Mason University, for their support, and Professor Murray W. Black, Associate Dean of the School of Information Technology and Engineering, for guidance on preparing the Ph.D. proposal.

I would like to thank the conference organizers who supported me to attend their conferences and present parts of my Ph.D. work. The organizers include Professor Moonis Ali, Professor Frank Anger, Professor Zbigniew Ras, the American Association for Artificial Intelligence, and others. I would like also to thank the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) for organizing my first workshop. Thanks go to Dr. Doug Dankel, Dr. Howard Hamilton, Dr. John Stewman, and Dr. Dan Tamir.

I would like also to thank the many individuals who helped me in any way during my Ph.D. Those include Dr. Ashraf Abdel-Wahab, Dr. Jerzy Bala, Dr. Alex Brodsky, Dr. Richard Carver, Dr. Kenneth De Jong, John Doulamis, Tom Dybala, Dean Ibrahim Farag, Dr. Ophir Frieder, Csilla Frakes, Dr. Mohamed Habib, Ali Hadjarian, Dr. Hugo De Garis, Zenon Kulpa, Dr. Paul Lehner, Dr. George Michaels, Dr. Eugene Norris, Dean James Palmer, Mitch Potter, Dr. Ahmed Rafea, Jim Ribeiro, Jayshree Sarma, Dr. Arun Sood, Dr. Clifton Sutton, Bradley Utz, Dr. Tibor Vamos, Patricia Zahra, Dr. Shaker Zahra, and Dr. Jianping Zhang.


Dedication

To my mother my brothers and my sister

TABLE OF CONTENTS

TITLE Page

ABSTRACT 1

CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6

CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19

CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
3.3 Generating Decision Structures From Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structures to a Decision-making Situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with Noise in Training Data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53

CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments With Large-Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large-Size, Complex, and Noisy Problems: Mushroom Classification 84
4.8 Experiments With Small-Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88

CHAPTER 5 CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96

REFERENCES 98

VITA 102

LIST OF TABLES

No. TITLE Page

2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria on four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using conditions of AQDT-2 criteria 53
3-7 Comparison between decision structures and decision trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90

LIST OF FIGURES

No. TITLE Page

2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1 94

DERIVING TASK-ORIENTED DECISION STRUCTURES

FROM DECISION RULES

Ibrahim M. Fahmi Imam, Ph.D.

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge base from the function of using the knowledge base for decision-making. The first function focuses on learning an accurate, consistent, and complete concept description expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that decision structures learned by it usually outperform, in terms of accuracy and average size of the decision structures, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using applications to artificial problems such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing Democratic and Republican voting records).

CHAPTER 1 INTRODUCTION

1.1 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.

A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation), the branches are assigned possible test outcomes or ranges of outcomes, and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when leaves are assigned single, definite decisions. Thus, the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
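The definition above can be captured in a minimal sketch (my own formulation for illustration, not code from the dissertation): a node holds a test and maps each outcome to a child, a leaf holds a decision, and letting two parents point at the same child node is exactly what makes the graph a structure rather than a tree.

```python
# A decision structure as nodes (test + outcome branches) and leaves.

class Leaf:
    def __init__(self, decisions):
        # a single decision, or a dict of candidate decisions with probabilities
        self.decisions = decisions

class Node:
    def __init__(self, test, branches):
        self.test, self.branches = test, branches  # outcome -> Node or Leaf

    def decide(self, case):
        child = self.branches[self.test(case)]
        return child.decisions if isinstance(child, Leaf) else child.decide(case)

# "shared" has two parents below, so this graph is not a tree:
shared = Node(lambda c: c["x2"], {0: Leaf("A"), 1: Leaf("B")})
root = Node(lambda c: c["x1"],
            {0: shared,
             1: Node(lambda c: c["x3"], {0: shared, 1: Leaf("C")})})

print(root.decide({"x1": 0, "x2": 0}))           # -> A
print(root.decide({"x1": 1, "x3": 0, "x2": 1}))  # -> B, via the shared node
```

Restricting every child to one parent and every leaf to one definite decision collapses this representation back to an ordinary decision tree.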

Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain and the gain ratio (Quinlan, 1979, 83, 86), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
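Two of the criteria named above can be shown concretely with their standard formulas (a generic sketch on a toy split, not the cited systems' code): the information gain of a candidate split, and the gini index of a class distribution.

```python
# Standard attribute selection measures on a toy training set.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(labels, groups):
    # entropy before the split minus the weighted entropy of the subsets
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

labels = ["+", "+", "-", "-"]
split = [["+", "+"], ["-", "-"]]   # a perfect binary split on some attribute
print(info_gain(labels, split))    # -> 1.0 (maximal for two equiprobable classes)
print(gini(labels))                # -> 0.5
```

Tree builders such as ID3/C4.5 evaluate every available attribute this way at each node and branch on the highest-scoring one.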


A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root), and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful to either modify the structure so that it does not contain that attribute, or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).

A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation, such as a set of decision rules. Tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees), which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify to adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.

An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form, and transform it to a decision structure when it is needed for decision-making. This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generation from training examples. Thus, this process could be done on line, without any delay noticeable to the user. Such "virtual" decision structures are easy to tailor to any given decision-making situation.

This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus, such an approach has many potential advantages.

This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned either by the rule learning system AQ15 (Michalski et al., 1986) or by the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating "unknown" nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to inability to evaluate an attribute associated with an intermediate node.
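The overall rules-to-structure loop behind these features can be caricatured in a few lines (a deliberately simplified sketch: the real AQDT-2 attribute selection criteria, such as attribute disjointness, are described in Chapter 3; here the "score" is merely how many rules test the attribute): pick the best attribute, branch on its values, specialize the surviving rules down each branch, and stop when one decision class remains.

```python
# Greedy construction of a (tree-shaped) decision structure from rules.
# rules: list of (class, {attribute: value}) pairs.

def build(rules):
    classes = {cls for cls, _ in rules}
    if len(classes) <= 1:
        return classes.pop() if classes else None   # definite (or empty) leaf
    attrs = {a for _, cond in rules for a in cond}
    if not attrs:
        return classes                              # ambiguous leaf: candidates
    # toy criterion: the attribute mentioned by the most rules
    best = max(attrs, key=lambda a: sum(a in cond for _, cond in rules))
    node = {}
    for v in {cond[best] for _, cond in rules if best in cond}:
        # rules not testing `best` follow every branch; others must match v
        branch = [(cls, {a: w for a, w in cond.items() if a != best})
                  for cls, cond in rules if cond.get(best, v) == v]
        node[(best, v)] = build(branch)
    return node

rules = [("A", {"x1": 1, "x2": 0}), ("A", {"x1": 1, "x2": 1}), ("B", {"x1": 0})]
tree = build(rules)
print(tree)  # branches on x1; both x1=1 rules collapse to a single "A" leaf
```

Because each rule generalizes many examples, this recursion runs over far fewer items than an example-based builder would, which is the efficiency argument made above.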


To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments has been designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structures learned from them, and comparing decision trees learned by the AQDT-2 system with those learned by C4.5 (Quinlan, 1993), the well-known system for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2 and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design: wind bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design: wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data, or with problems that have many rules covering very few examples.

1.2 The Problem Statement

There are many limitations and problems accompanying the use of decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. That work proposed several attribute selection criteria of increasing power, instances of the main criterion, the order cost estimate (the nth order cost estimate, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.

Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint; in other words, for any two rules there exists a condition with the same attribute but with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram the condition parts of the given rules, and marking them with the action specified by each rule.

Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree must have at least n leaves). Michalski (1978) has shown that if even one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree) there will have to be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1]; x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1 An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).


The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute selected is the one which breaks smaller rules (rules that cover fewer examples, i.e., more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).

Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.

Example: Learn a decision tree from the following decision table.

The minimal cover consists of the following rules:

A1 <- [x2=0] v [x1=0][x2=2]    A2 <- [x2=1] v [x1=2][x2=2]    A3 <- [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of the decision tree is x2. Three branches are then attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.


Figure 2-2 A decision tree learned from the decision table in Table 2-1
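The MAL computation in this example can be sketched in a few lines of Python. This is a hedged reading of the definitions above, not code from the dissertation: an attribute is taken to break a rule unless the rule fixes that attribute to exactly one value, and the rule encoding is illustrative.

```python
# MAL (Minimizing Added Leaves) sketch: an attribute "breaks" a rule
# unless the rule pins that attribute to a single value (an absent
# attribute, or one with an internal disjunction of several values,
# forces the rule to be represented by more than one leaf).

def mal(attribute, rules):
    """Count how many rules in a cover the attribute would break."""
    broken = 0
    for rule in rules:  # a rule maps attribute -> set of allowed values
        values = rule.get(attribute)
        if values is None or len(values) > 1:
            broken += 1
    return broken

# Minimal cover from the worked example:
# A1 <- [x2=0] v [x1=0][x2=2], A2 <- [x2=1] v [x1=2][x2=2], A3 <- [x1=1][x2=2]
cover = [
    {"x2": {0}},             # A1, first rule
    {"x1": {0}, "x2": {2}},  # A1, second rule
    {"x2": {1}},             # A2, first rule
    {"x1": {2}, "x2": {2}},  # A2, second rule
    {"x1": {1}, "x2": {2}},  # A3
]

scores = {a: mal(a, cover) for a in ("x1", "x2", "x3", "x4")}
print(scores)  # x2 has the lowest MAL value, so it becomes the root
```

Under this reading the sketch reproduces the evaluations quoted above: 2 for x1, 0 for x2, and 5 each for x3 and x4, which never appear in the cover.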

2.2 Learning Decision Trees from Examples

Decision tree learning is a field concerned with generating a decision tree that classifies a set of examples according to the decision classes they belong to. The essential aspect of any inductive decision tree method is the attribute selection criterion, which measures how good the attributes are for discriminating among the given set of decision classes. The best attribute according to the selection criterion is chosen to be assigned to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin, and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has been subsequently modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.

Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory; they measure the information conveyed by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure and the gain criterion (Quinlan, 1979, 1983), the Gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes, using statistical distributions to determine whether or not there is a correlation. The attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial complete decision tree and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees, even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993), an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.

2.2.1 Building Decision Trees Using Information-based Criteria


This section presents a description of the inductive decision tree learning system C4.5, considered to be one of the most stable, accurate, and fastest programs for learning decision trees from examples.

Learning decision trees from examples requires a collection of examples, each represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing decision trees from a set of cases (Hunt, Marin, & Stone, 1966).

The C4.5 system uses an attribute selection criterion called the gain ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the gain criterion, which uses the frequency of each decision class in the given set of training examples.

Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and partitions the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.

The Gain Criterion: The gain criterion is based on information theory; that is, the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem x1, ..., xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:

13

freq(Ci, S) = number of examples in S that belong to Ci    (2-1)

Suppose that |S| is the total number of examples in S; the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.

The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2(freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by:

info(S) = - Σi=1..k (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|) bits    (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.

Suppose that we selected an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, infoX(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:

infoX(T) = Σi=1..k (|Ti| / |T|) info(Ti)    (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by:

gain(X) = info(T) - infoX(T)    (2-4)

The attribute selected is the one with the maximum gain value.

The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains an attribute such as a social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.

Following steps similar to those used to obtain the information conveyed by dividing T into n subsets, and by analogy to equation 2-2, the expected information generated by dividing T into n subsets is determined by:

split info(T) = - Σi=1..n (|Ti| / |T|) log2(|Ti| / |T|)    (2-5)

The gain ratio is given by:

gain ratio(X) = gain(X) / split info(X)    (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.

Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the set of training examples.

First, determine the amount of information gained by selecting the attribute "outlook" to be the root of the decision tree. This attribute divides the training examples into three subsets: "sunny", with five examples, two of which belong to the class "Play"; "overcast", with four examples, all of which belong to the class "Play"; and "rain", with five examples, three of which belong to the class "Play". To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class "Play" and five belong to the class "Don't Play".

info(T) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94 bits

When using "outlook" to divide the training examples, the information becomes:

info_outlook(T) = 5/14 (-2/5 log2(2/5) - 3/5 log2(3/5))
                + 4/14 (-4/4 log2(4/4) - 0/4 log2(0/4))
                + 5/14 (-3/5 log2(3/5) - 2/5 log2(2/5)) = 0.694 bits

By substituting in equation 2-4, the gain of information resulting from using the attribute "outlook" to split the training examples equals 0.246. The information gain for "windy" is 0.048.

Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for "outlook" is determined as follows:

split info(T) = - 5/14 log2(5/14) - 4/14 log2(4/14) - 5/14 log2(5/14) = 1.577 bits

The gain ratio for "outlook" = 0.246 / 1.577 = 0.156


Figure 2-3: A decision tree learned using the gain criterion for selecting attributes
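The worked computation above can be checked numerically. The sketch below assumes only the per-value class counts from Table 2-2 (9 "Play" and 5 "Don't Play" examples overall); the function names are illustrative, not from C4.5.

```python
# Verify the gain and gain-ratio figures for "outlook" (eqs. 2-2 to 2-6).
from math import log2

def info(counts):
    """Entropy (eq. 2-2) of a list of class frequencies."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_x(partition):
    """Expected information after splitting (eq. 2-3)."""
    total = sum(sum(p) for p in partition)
    return sum(sum(p) / total * info(p) for p in partition)

def split_info(partition):
    """Potential information of the split itself (eq. 2-5)."""
    total = sum(sum(p) for p in partition)
    return -sum(sum(p) / total * log2(sum(p) / total) for p in partition)

whole = [9, 5]                      # Play / Don't Play
outlook = [[2, 3], [4, 0], [3, 2]]  # sunny / overcast / rain

gain = info(whole) - info_x(outlook)   # eq. 2-4, ~0.246 bits
ratio = gain / split_info(outlook)     # eq. 2-6, ~0.156
print(round(info(whole), 2), round(ratio, 3))  # prints 0.94 0.156
```

The unrounded values (0.940 bits of entropy, gain ~0.247, ratio ~0.156) agree with the hand calculation above, which rounds intermediate results.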

The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the value of the attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.

Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by leaves. The C4.5 system uses the Laplace ratio for estimating the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses Chi-square statistics to measure the association between two attributes. When building decision trees, the method is implemented such that it determines the association between each attribute and the decision classes. The attribute selected is the one with the greatest value.

To determine the Chi-square value for an attribute, let aij be the number of examples in decision class number i where the attribute A takes value number j. In other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by:

Chi-square(A) = Σi=1..n Σj=1..m (aij - Eij)² / Eij    (2-7)

where n is the number of decision classes and m is the number of values of the given attribute, and

Eij = (TCi × TVj) / T    (2-8)

where TCi and TVj are the total number of examples belonging to decision class Ci and the total number of examples where the attribute A takes value vj, respectively, and T is the total number of examples.

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of different combinations of values between the decision class and both the "Outlook" and the "Windy" attributes. Table 2-4 shows the expected values Eij, computed from the totals TCi and TVj, for the frequencies in Table 2-3 of different attribute values and decision classes.

To determine the association between the decision classes and the attributes "Windy" and "Outlook", the observed Chi-square values are:

Chi-square(Windy, Class) = (3-3.9)²/3.9 + (3-2.1)²/2.1 + (6-5.1)²/5.1 + (2-2.9)²/2.9
  ≈ 0.21 + 0.39 + 0.16 + 0.28 = 1.04

Chi-square(Outlook, Class) = (2-3.2)²/3.2 + (4-2.6)²/2.6 + (3-3.2)²/3.2 + (3-1.8)²/1.8 + (0-1.4)²/1.4 + (2-1.8)²/1.8
  ≈ 0.45 + 0.75 + 0.01 + 0.80 + 1.40 + 0.02 = 3.43


Applying the same method to the other attributes, the results favor the attribute "Outlook". Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets and the same process is repeated on each subset.

Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5: Attribute selection criteria and their basic evaluation measures

Info Measure (IM), Gain, G-statistic, and Gain Ratio:
  Entropy(S) = - Σi (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)
  G-statistic = 2N × IM    (N = number of examples)
Chi-square:
  Chi-square(A, B) = Σi=1..n Σj=1..m (aij - Eij)² / Eij

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly reviews the analysis of different selection criteria carried out by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, the G-statistic, the Gini index of diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain Ratio criterion gave the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, a zero cell adds the maximum association between any two attributes, because the Chi-square value of a zero cell is the expected value of that cell.


Now let us examine results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees for eleven different criteria. In the final results, he compared the total number of nodes and the total error rate provided by each criterion over all given problems. Table 2-8 shows the final results for five selected criteria only.


Table 2-8: Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four problems

This experiment was performed on four real-world data sets, concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details, see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas of Imam and Michalski (1993b).

In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary conclusion with a new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes containing only rules represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees that are used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is "Safe", except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is "Lost", except if x6=1 it is "Safe", except if x7=1 it is "Lost".

The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path. In other words, on each path from the root to the leaves of the decision structure an attribute may occur as a node at most once; however, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.


Safe <- [x1=2]                Safe <- [x4=1] & [x5=2]       Safe <- [x6=1] & [x7=2]
Safe <- [x2=2]                Safe <- [x4=1] & [x5=3]       Safe <- [x6=1] & [x7=3]
Safe <- [x3=2]                Safe <- [x4=2] & [x5=2]       Safe <- [x4=2] & [x5=3]

Lost <- [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a distinct combination of that attribute's values and decision classes. For each subset the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains examples where A takes value 0 and belong to class C0, or take value 1 and belong to class C1. The second subset is the set of examples where A takes value 0 and belong to class C1, or take value 1 and belong to class C0. The number of nodes at the first level (after the leaf nodes) is expected to be at most k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced to one.
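One bottom-up partitioning step of the kind described in this example can be sketched as follows. This is an illustrative reading, not Kohavi's implementation: each combination of the remaining attributes induces a mapping from the selected attribute's values to classes, and rows with the same mapping fall into the same subset.

```python
# Group examples by the function that the chosen attribute's values
# induce on the decision classes; one group per distinct function.
from collections import defaultdict

def partition_by(attribute, examples):
    """examples: list of (attribute_values_dict, decision_class)."""
    # For each combination of the *other* attributes, collect the
    # mapping from this attribute's values to classes.
    rows = defaultdict(dict)
    for values, cls in examples:
        rest = tuple(sorted((a, v) for a, v in values.items() if a != attribute))
        rows[rest][values[attribute]] = cls
    # Group the rows by that induced mapping: each group is one subset.
    groups = defaultdict(list)
    for rest, mapping in rows.items():
        groups[tuple(sorted(mapping.items()))].append(rest)
    return groups

examples = [
    ({"A": 0, "B": 0}, "C0"), ({"A": 1, "B": 0}, "C1"),  # B=0: A=0->C0, A=1->C1
    ({"A": 0, "B": 1}, "C1"), ({"A": 1, "B": 1}, "C0"),  # B=1: the reverse
]
groups = partition_by("A", examples)
print(len(groups))  # two distinct mappings -> two subsets, as in the text
```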

It is easy for the reader to identify some major disadvantages of this approach. The average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., strong patterns) or logical relationship in the data. The time needed to learn such a decision structure is relatively high compared to systems for learning decision trees from examples. Finally, it could be better to search for an attribute which reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system, which can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and those two approaches. The EDAG and HOODG systems are unreleased prototype systems.

[Table 2-9, partially legible row: decision structures are easy to understand (AQDT-2); difficult to read (EDAG); easy to understand (HOODG)]

CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.

The Learning Task
Given:
 - A set of training examples describing the concept to be learned.
 - A learning goal, which specifies the decision classes to be learned from the training examples.
 - Background knowledge to control the learning process.
Determine:
 - A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given:
 - A set of decision rules in conjunctive form.
 - A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
 - One or more examples that need to be tested under the given decision-making situation.
 - A set of parameters to control the learning process.
Determine:
 - A decision structure that suits the given decision-making situation.

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of declarative knowledge are that they do not impose any order on the evaluation of the attributes, and that, due to this lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to different decision-making tasks (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on-line.

Such virtual decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows the architecture of the proposed methodology.


Figure 3-1: Architecture of the AQDT approach


It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values; some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem, and the learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system,
specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of
this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using
the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology,
called AQ, starts with a "seed" example of a given decision class and generates a set of the
most general conjunctive descriptions of the seed (alternative decision rules for the seed
example). Such a set is called the "star" of the seed example. The algorithm selects from the star
a description that optimizes a criterion reflecting the needs of the problem domain. If the
criterion is not defined, the program uses a default criterion that selects the description that
covers the largest number of positive examples (to minimize the total number of rules needed)
and, with the second priority, that involves the smallest number of attributes (to minimize the
number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is
selected from the uncovered examples, and the process continues until a complete class description
is generated. The algorithm can work with few examples or with many examples, and can
optimize the description according to a variety of easily-modifiable hypothesis quality criteria.


The learned descriptions are represented in the form of a set of decision rules expressed in an
attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski, 1973). A
distinctive feature of this representation is that it employs, in addition to standard logic operators,
the internal disjunction operator (a disjunction of values of the same attribute in a condition) and
the range operator (to express conditions involving a range of discrete or continuous values).
These operators help to simplify rules involving multi-valued discrete attributes; the second
operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept
descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic
description states properties that are true for all objects in the concept. The simplest
characteristic concept description is in the form of a single conjunctive rule (in general, it can be
a set of such rules). The most desirable is the maximal characteristic description, that is, a rule
with the longest condition part, i.e., stating as many common properties of objects of the given
class as can be determined. A discriminant description states properties that discriminate a given
concept from a fixed set of other concepts. The most desirable is the minimal discriminant
description, that is, a rule with the shortest condition part. For example, to distinguish a given
set of tables from a set of chairs, one may only need to indicate that tables "have large flat top."
A characteristic description of the tables would also include properties such as "have four legs,"
"have no back," "have four corners," etc. Discriminant descriptions are usually much shorter than
characteristic descriptions.

Another option provided in AQ15 controls the relationship among the generated descriptions
(rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode,
rulesets of different classes may logically intersect over areas of the description space in which
there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different
classes are logically disjoint. The DC mode descriptions are usually more complex, both in the
number of rules and in the number of conditions. There is also a DL mode (a Decision List mode,
also called VL mode, for variable-valued logic mode), in which the program generates rulesets
that are linearly ordered. To assign a decision to an example using such rulesets, the program
evaluates them in order. If ruleset i is satisfied by the example, then the decision is made;
otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes,
rulesets can be evaluated in any order.

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven
Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for
generating additional attributes. These attributes are various logical or mathematical
combinations of the original attributes. The program generates a large number of potential new
attributes and selects from them those most promising, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is
shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form
expression) describes a voting record of Democratic Representatives in the U.S. Congress.
Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational
statement. For example, the condition [State = northeast v northwest] states that the attribute
State (of the Representative) should take the value northeast or northwest to satisfy the
condition.

R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"

The above rules were generated from examples of the voting records. For illustration, below is
an example of a voting record by a Democratic representative:

Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes,
Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes,
Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no,
Federal help to education = no, State From = northeast, State Population = large,
Occupation = unknown, Cut in social security spending = no,
Federal help to Chrysler corp. = not registered

By expressing the elementary statements in the example as conditions and linking the conditions by
conjunction, the example can be re-expressed as a decision rule. Thus, decision rules and
examples formally differ only in the degree of generality.
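To make this correspondence concrete, here is a minimal Python sketch (an illustration, not the dissertation's implementation; the attribute names and value spellings are hypothetical). A rule is modeled as a mapping from attributes to sets of allowed values (internal disjunction), and an example is simply a maximally specific rule with one value per attribute:

```python
# Hypothetical sketch: a VL1-style condition [A = v1 v v2] becomes an entry
# mapping attribute A to the value set {v1, v2}; attributes absent from a
# rule are unconstrained.

def rule_covers(rule, example):
    """True if every condition of the rule is satisfied by the example."""
    return all(example.get(attr) in allowed for attr, allowed in rule.items())

# Rule R3 of Figure 3-2, with hypothetical value spellings:
r3 = {"Chrysler": {"yes", "not registered"}, "Income": {"low"}}

# An example is a maximally specific rule: one value per attribute.
voter = {"Chrysler": "not registered", "Income": "low", "Draft": "no"}

print(rule_covers(r3, voter))                        # True
print(rule_covers(r3, {**voter, "Income": "high"}))  # False
```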

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision
rules (Imam & Michalski, 1993a,b). Also, a description of the AQDT-2 method for learning
task-oriented decision structures from decision rules is included, and finally the methodology is
illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning
due to their simplicity. Decision trees built this way can be quite efficient, as long as they are
used in decision-making situations for which they are optimized and these situations remain
relatively stable. Problems arise when these situations significantly change and the assumptions
under which the tree was built do not hold anymore. For example, in some situations it may be
difficult to determine the value of the attribute assigned to some node. One would like to avoid
measuring this attribute and still be able to classify the example, if this is potentially possible
(Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure
the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also
desirable if there is a significant change in the frequency of occurrence of examples from
different classes. A restructuring of a decision tree to suit the above requirements is, however,
difficult to do. The reason for this is that decision trees are a form of decision structure
representation that imposes constraints on the evaluation order of the attributes which are not
logically necessary.


One problem in developing a method for generating decision structures from decision rules is to
design an attribute selection criterion that is based on the properties of the rules rather than of
the training examples. A decision rule normally describes a number of possible examples; only
some of them are examples that have actually been observed, i.e., training examples. An attribute
selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based
on counting the numbers of training examples covered by each attribute-value and the
frequency of decision classes in the training examples, as is done in learning decision trees from
examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision
rules constitute a more powerful knowledge representation than decision trees. They can directly
represent a description in an arbitrary disjunctive normal form, while decision trees can directly
represent only descriptions in the disjoint disjunctive normal form. In such descriptions, all
conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary
decision rules into a decision tree, one faces the additional problem of handling logically
intersecting rules.

The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2
system is based on the earlier work by Michalski (1978), which introduced a general method for
generating decision trees from decision rules. The method aimed at producing decision trees
with the minimum number of nodes or the minimum cost (where the cost was defined as the
total cost of classifying unknown examples, given the cost of measuring individual attributes and
the expected probability distribution of examples of different decision classes). More
explanations are provided in the following section.

3.3.1 The AQDT-2 attribute selection method

This section describes the AQDT-2 method for building a decision structure from decision rules.
The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples. The major difference is that it assigns tests
(attributes) to the nodes using criteria based on the properties of the decision rules (this includes
statistics about the examples covered by each rule, in the case of learning rules from examples),
rather than statistics characterizing the frequency of training examples per decision class, per
attribute-value, or per conjunctions of both. Other differences are that the branches may be
assigned an internal disjunction of values (not only a single value, as in a typical decision tree),
and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be
attributes or names standing for logical or mathematical expressions that involve several
attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably
(to distinguish between an attribute and a name standing for an expression, the latter is called a
"constructed attribute").

At each step, the method chooses the test from the available set of tests that has the highest utility
(see below) for the given set of decision rules. This test is assigned to the node. The branches
stemming from this node are assigned test values or disjoint groups of values (in the form of a
logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each
branch is associated with a reduced set of rules determined by removing conditions in which the
selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset
indicate the same decision class, a leaf node is created and assigned this decision class. The
process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further
because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of
candidate decisions with associated probabilities (see Sec. 4.2).

The test (attribute) utility is a combination of one or more of the following elementary criteria: 1)
cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which
captures the effectiveness of the test in discriminating among decision rules for different decision
classes; 3) importance, which determines the importance of a test in the rules; 4) value
distribution, which characterizes the distribution of the test importance over its set of values; and 5)
dominance, which measures the test presence in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, i.e., the
disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and
decision rulesets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote the
sets of the values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm,
respectively. If the ruleset for some class, say Ci, contains a rule that does not involve test A, then
Vi is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is
the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for
Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj
is defined by

    D(A, Ci, Cj) = 0,  if Vi ⊆ Vj
                 = 1,  if Vi ⊃ Vj                                              (3-1)
                 = 2,  if Vi ∩ Vj ≠ ∅ and Vi ∩ Vj ≠ Vi and Vi ∩ Vj ≠ Vj
                 = 3,  if Vi ∩ Vj = ∅

where ∅ denotes the empty set. Note that exchanging the second and third conditions of equation
(3-1) may seem to give an improved criterion; however, it would not clearly distinguish between the
two cases (i.e., for both situations the disjointness would be similar). The current equation is better
because it gives higher scores to attributes that classify different subsets of the two decision
classes than to attributes that classify only a subset of one decision class.

Definition 3-2: The disjointness of test A, for evaluating a given set of decision rules, is the
sum of the degrees of class disjointness of each decision class:


    Disjointness(A) = Σ (i=1..m) D(A, Ci),   where   D(A, Ci) = Σ (j=1..m, j≠i) D(A, Ci, Cj)     (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are
all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test
values. If two tests have the same disjointness value, the attribute to be selected is the one with
the smaller number of values.
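The disjointness computation can be sketched in Python as follows. This is an illustration of Definitions 3-1 and 3-2 (with the frequency-weighted variant of equation (3-7) as an option), not the AQDT-2 code; each class is summarized here only by its value set Vi for the attribute:

```python
# Sketch: vi, vj are the sets of values of attribute A appearing in the
# rulesets of two classes (the full domain if some rule omits A).

def pair_disjointness(vi, vj):
    """Degree of disjointness D(A, Ci, Cj) per equation (3-1)."""
    if vi <= vj:      # Vi subset of (or equal to) Vj
        return 0
    if vi > vj:       # Vi proper superset of Vj
        return 1
    if vi & vj:       # partial overlap
        return 2
    return 3          # disjoint value sets

def disjointness(value_sets, freq=None):
    """Disjointness(A): sum of class disjointness values, optionally
    weighted by expected class frequencies as in equation (3-7)."""
    total = 0
    for i, vi in enumerate(value_sets):
        d_class = sum(pair_disjointness(vi, vj)
                      for j, vj in enumerate(value_sets) if j != i)
        total += d_class * (freq[i] if freq else 1)
    return total

# Two classes whose rules use disjoint values of A: maximum 3m(m-1) = 6.
print(disjointness([{1, 2}, {3, 4}]))  # 6
print(disjointness([{1, 2}, {1, 2}]))  # 0
```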

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree
is defined as the average number of tests (attributes) to be examined, from the root of the tree to
any leaf node, in order to reach a decision.
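A small sketch of how ANT could be computed for a tree represented as nested dictionaries (a hypothetical representation for illustration, not the system's data structure):

```python
# Sketch: a tree is {attribute: {branch_value: subtree_or_leaf}}; a leaf is
# a decision-class string. ANT averages depth over all root-to-leaf paths.

def leaf_depths(tree, depth=0):
    if isinstance(tree, str):         # leaf: a decision class
        return [depth]
    (attr, branches), = tree.items()  # exactly one test per node
    depths = []
    for subtree in branches.values():
        depths.extend(leaf_depths(subtree, depth + 1))
    return depths

def ant(tree):
    depths = leaf_depths(tree)
    return sum(depths) / len(depths)

# The "subset" case of Figure 3-4: one branch is a leaf, the other needs a
# second test, so ANT = (1 + 2 + 2) / 3 = 5/3.
tree = {"A": {"1": "C1", "2": {"B": {"x": "C1", "y": "C2"}}}}
print(ant(tree))  # 1.6666666666666667 (= 5/3)
```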

Definition: A decision structure is a one-node-per-level decision structure if at each level there is
only one node and zero or more leaves. Such a decision structure can be generated by combining
together all branches whose associated sets of decision rules belong to more than one decision
class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two
decision classes. The disjointness criterion ranks first the attributes that add the minimum number
of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci
and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not
subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the
same set of values in both classes is a trivial one). Assume that branches leading to one subset
with the same decision class are combined into one branch. In the first case, there will be two
branches only; the first leads to a leaf node, and the other leads to an intermediate node where
another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three
branches should be created. Two branches lead to leaf nodes, where all values at each branch
belong to only one, and a different, decision class. The third branch leads to an intermediate node
where another attribute should be selected that further classifies the decision classes. The minimum
ANT in this case is 6/4. In the third case, only two branches will be generated, where each leads
to a leaf node with a different decision class. In this case, the minimum ANT is 1.

Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that,
in the case of having more than one attribute-value on branches leading to leaves belonging to one
decision class, these branches are combined into one branch in the decision structure. The symbol
"1" means that an attribute is needed to classify the two decision classes; in such cases there will
be at least two additional paths.

D(A, Ci) = 0, D(A, Cj) = 1   |   D(A, Ci) = 2, D(A, Cj) = 2   |   D(A, Ci) = 3, D(A, Cj) = 3

Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes

The average number of tests required for making a decision in each possible case is shown in
Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks
highly the attributes that reduce the average number of tests required for decision-making. The
theorem can be proved in the general case.

ANT = 3/2        ANT = 5/3        ANT = 1

("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify
the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes,
A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship
between any two decision classes, this means that there are more decision classes where D(A,
Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness
for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj)
than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci,
Cj), then B classifies the decision classes better than A.

For each pair of decision classes, Ci and Cj, the possible values of the disjointness of any attribute
are 0, 1, 4, or 6. For all positive values of D(B) = 1, 4, or 6, it is clear that attribute B should have
a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) <
D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.
Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is
a better classifier than A.

Importance: The second elementary criterion, the importance of a test, is based on the
importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained
rules, each test is assigned a score that represents the total number of training examples that
are covered by the rules involving this test. Decision rules learned by an AQ learning program
are accompanied by information on their strength. Rule strength is characterized by the t-weight
and the u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of
that class covered by the rule. The importance score of a test is the aggregation of the
t-weights of all rules that contain that test in their condition part. Given a set of decision rules for
m decision classes, C1, ..., Cm, involving n tests, A1, ..., An, and with the number of rules associated
with class Ci denoted by ri, the importance score is defined as follows:


Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by

    IS(Aj) = Σ (i=1..m) IS(Aj, Ci)                                  (3-3.1)

where

    IS(Aj, Ci) = Σ (k=1..ri) Rik(Aj)                                (3-3.2)

and Rik(Aj), the weight of test Aj in rule Rik of class Ci, is given by

    Rik(Aj) = t-weight of Rik,  if Aj belongs to rule Rik
            = 0,                otherwise                           (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
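The aggregation in Definition 3-3 can be sketched in a few lines of Python (an illustration, not the system's code; rules are assumed to carry their AQ-reported t-weights):

```python
# Sketch: each rule is (condition_dict, t_weight), where condition_dict maps
# attribute names to allowed value sets. IS(A) sums the t-weights of every
# rule whose condition part mentions A, across all classes.

def importance_scores(rulesets):
    """rulesets: {class_name: [(condition_dict, t_weight), ...]}.
    Returns {attribute: importance score}."""
    scores = {}
    for rules in rulesets.values():
        for condition, t_weight in rules:
            for attr in condition:
                scores[attr] = scores.get(attr, 0) + t_weight
    return scores

rulesets = {
    "C1": [({"x1": {1}, "x2": {0}}, 10), ({"x1": {2}}, 5)],
    "C2": [({"x2": {1}}, 8)],
}
print(importance_scores(rulesets))  # {'x1': 15, 'x2': 18}
```

The value distribution of Definition 3-4 would then be obtained by dividing each score by the attribute's number of legal values.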

The importance score method has been separately compared, as a feature selection method, with
a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method
produced an equal or higher accuracy on three real-world problems than that reported by the
GA method, while selecting fewer attributes. In addition, the IS method was significantly faster
than the GA.

Value distribution: The third elementary criterion, value distribution, concerns the number of
legal values of tests. Given two tests with equal importance scores, this criterion prefers the test
with the smaller number of legal values. Experiments have shown that this criterion is especially
useful when using discriminant decision rules.

Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by

    VD(Aj) = IS(Aj) / vj                                            (3-5)

where vj is the number of legal values of Aj.

Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large
numbers of rules, as this indicates their high relevance for discriminating among the rulesets of
the given decision classes. Since some conditions in the rules have values linked by internal
disjunction, counting such rules directly would not properly reflect their relevance. Therefore,
for computing the dominance, the rules are counted as if they were converted to rules that do not
have internal disjunction. Such a conversion is done by multiplying out the condition parts of
the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is
multiplied out to two rules, with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
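This multiplying-out step can be sketched as follows (an illustrative Python fragment, not the dissertation's implementation):

```python
# Sketch: a condition part is {attr: set_of_values}; internal disjunctions
# are multiplied out into all single-valued conjunctions they abbreviate,
# and dominance counts expanded rules mentioning each attribute.
import itertools

def expand(condition):
    """Multiply out internal disjunctions: [x3=1 v 3]&[x4=1] yields the two
    single-valued conjunctions [x3=1]&[x4=1] and [x3=3]&[x4=1]."""
    attrs = sorted(condition)
    combos = itertools.product(*(sorted(condition[a]) for a in attrs))
    return [dict(zip(attrs, combo)) for combo in combos]

def dominance(rules):
    """Count expanded rules that mention each attribute."""
    counts = {}
    for condition in rules:
        for expanded in expand(condition):
            for attr in expanded:
                counts[attr] = counts.get(attr, 0) + 1
    return counts

rule = {"x3": {1, 3}, "x4": {1}}
print(len(expand(rule)))   # 2, as in the text's example
print(dominance([rule]))   # {'x3': 2, 'x4': 2}
```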

The above criteria are combined into one general test measure using the lexicographic
evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of
the above elementary criteria, each associated with a "tolerance threshold" in percentage. The
criteria are applied to tests in the order defined by LEF. A test passes to the next criterion only if
it scores on the previous criterion within the range defined by the tolerance (from the top value).
The default LEF is

    <Cost, t1; Disjointness, t2; Importance, t3; Value distr., t4; Dominance, t5>     (3-6)

where t1, t2, t3, t4, and t5 are tolerance thresholds (in percentages); their default values are 0%.
The default value of the cost of each test is 1.

The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost.
If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two
or more attributes still share the same top score, or their scores differ by less than the assumed
tolerance threshold t2, the method evaluates these attributes using the importance
criterion. If again two or more attributes share the same top score, or their scores differ by less than
the tolerance threshold t3, then the value distribution criterion is used, and then, similarly, the
fourth criterion (dominance). If there is still a tie, the method selects the best attribute
randomly.
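A lexicographic evaluation with tolerances can be sketched as follows. This is a simplified illustration, not the AQDT-2 code: all scores here are taken as to-be-maximized and non-negative (a cost criterion would be negated or inverted first), and the final tie is broken by taking the first survivor rather than randomly:

```python
# Sketch of LEF: apply criteria in order; after each criterion, keep only
# the attributes whose score is within the tolerance of the best score.

def lef_select(candidates, tolerances):
    """candidates: {attribute: [score per criterion]} (higher is better);
    tolerances: fraction per criterion (0.0 means exact ties only)."""
    survivors = list(candidates)
    for i, tol in enumerate(tolerances):
        best = max(candidates[a][i] for a in survivors)
        survivors = [a for a in survivors
                     if candidates[a][i] >= best * (1 - tol)]
        if len(survivors) == 1:
            break
    return survivors[0]

# Disjointness first, then importance, both with zero tolerance:
scores = {"A": [6, 10], "B": [6, 25], "C": [4, 99]}
# Zero tolerance on disjointness keeps A and B; importance then picks B.
print(lef_select(scores, [0.0, 0.0]))  # 'B'
```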

If there is a non-uniform frequency distribution of examples of different classes, then the
selection criterion uses a modified definition of the disjointness. Namely, the previously defined
disjointness for each class is multiplied by the frequency of the class occurrence. The class
occurrence is the expected number of future examples that are to be classified into a given class:

    Disjointness(A) = Σ (i=1..m) D(A, Ci) · Frq(Ci)                 (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, and it is assumed to be given
by the user. The attribute ranking criterion in this case is defined by the LEF

    <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>     (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other
elementary criteria are treated the same way as in the default version.

3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively
selecting, at each step, the best test according to the ranking criteria described above and
assigning it to a new node. The process stops when the algorithm creates terminal branches that
are assigned decision classes. To facilitate this process, the system creates a special data
structure for each concept description (ruleset). This structure has fields such as the number of
rules, the number of decision classes, and the number of attributes present in the rules. A set of
pointers connects this data structure to a set of data structures, each representing one decision class.
The decision class structure contains fields with information on the number of rules belonging to
that class, the frequency of the decision class, etc. It is also connected to a set of data structures
representing the decision rules within each decision class. The system independently creates a set
of data structures, each corresponding to one attribute. Each attribute description contains the
attribute's name, domain type, number of legal values, a list of the values, the number of
rules that contain that attribute, and the values of that attribute in each rule. The attributes are
arranged in an array in a lexicographic order: first, in descending order of the number of rules
that contain that attribute, and second, in ascending order of the number of the attribute's
legal values.

The system can work in two modes. In the standard mode, the system generates standard
decision trees, in which each branch has a specific attribute-value assigned. In the compact
mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this
leads to simpler structures. For example, if a node assigned attribute A has a branch marked by
values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The
program creates "or" branches on the basis of the analysis of the value sets Vi while computing
the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or
mathematical combinations of the original attributes. To produce decision structures with
derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The
AQ17 rules may contain conditions involving attributes constructed by the program, rather than
those originally given.

To generate decision structures from rules, the AQDT-2 method prefers either characteristic or
discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules
are more suitable for building decision structures. Assume that the description of each class is in
the form of a ruleset and that this set is the initial "ruleset context." The AQDT-2 algorithm is
as follows.

The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking
measure. Select the highest-ranked attribute; let A represent this attribute.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and
assign to it the attribute A. In standard mode, create as many branches from the node as
there are legal values of the attribute A, and assign these values to the branches. In
compact mode (decision structures), create as many branches as there are disjoint value
sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it a group of rules from the ruleset context that contain a
condition satisfied by the value(s) assigned to this branch. For example, if a branch is
assigned value i of attribute A, then associate with it all rules containing the condition [A =
i v ...]. If a branch is assigned values i v j, then associate with it all rules containing the
condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in
the ruleset context that do not contain attribute A, add these rules to all rule groups
associated with the branches stemming from the node assigned attribute A. (This step is
justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming
that a and b are the only legal values of y.) All rules associated with the given branch
constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf
node and assign to it that class. If all branches of the tree have leaf nodes, stop;
otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
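The four steps above can be sketched as a recursive procedure. This is a simplified illustration, not the AQDT-2 implementation: attributes are ranked by disjointness alone (the full method combines the elementary criteria via LEF), branches carry single values rather than disjoint value sets, and probabilistic leaves and empty ruleset contexts are not handled:

```python
# Sketch: rules are {attr: set_of_values} condition dicts grouped by class.

def value_set(attr, rules, domain):
    """Values of attr appearing in a ruleset; a rule without attr is
    unconstrained and contributes the full domain."""
    vs = set()
    for r in rules:
        vs |= set(r.get(attr, domain))
    return vs

def pair_disj(vi, vj):
    if vi <= vj: return 0
    if vi > vj:  return 1
    return 2 if vi & vj else 3

def disj(attr, rulesets, domains):
    sets = [value_set(attr, rs, domains[attr])
            for rs in rulesets.values() if rs]
    return sum(pair_disj(vi, vj)
               for i, vi in enumerate(sets)
               for j, vj in enumerate(sets) if i != j)

def build(rulesets, domains):
    live = {c for c, rs in rulesets.items() if rs}
    if len(live) == 1:                               # Step 4: leaf node
        return live.pop()
    # Step 1: already-split attributes score 0 and are never picked here.
    attr = max(domains, key=lambda a: disj(a, rulesets, domains))
    branches = {}
    for value in sorted(domains[attr]):              # Steps 2-3
        reduced = {c: [{a: v for a, v in r.items() if a != attr}
                       for r in rs if attr not in r or value in r[attr]]
                   for c, rs in rulesets.items()}
        branches[value] = build(reduced, domains)
    return {attr: branches}

rulesets = {"C1": [{"x": {0}}], "C2": [{"x": {1}, "y": {0}}]}
domains = {"x": {0, 1}, "y": {0, 1}}
print(build(rulesets, domains))  # {'x': {0: 'C1', 1: 'C2'}}
```

Note how the list comprehension in Step 3 keeps rules that do not mention the selected attribute in every branch, mirroring the consensus-law justification above.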

To select an attribute to be a node in the decision tree (Steps 1 and 2 of the algorithm), the
algorithm performs two independent iterations. In the first iteration, it parses through all decision
rules and determines information about each attribute. This information includes the importance
score of each attribute, the number of rules containing a given attribute, the disjoint value sets of
each attribute, and the attribute values used in describing each decision class. The second
iteration is performed only if the disjointness criterion is ranked first in the LEF. The
second iteration evaluates each attribute's disjointness for each decision class against the other
decision classes.


To determine the complexity of this process, suppose that the maximum number of conditions
formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is
the total number of decision rules (in all decision classes):

    r = Σ (i=1..m) Ri     (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be
determined as

    Cmpx(Iter1) = O(r · s)

In the second iteration, the disjointness is calculated between the decision classes for all
attributes. The complexity of the second iteration can be given by

    Cmpx(Iter2) = O(n · m)

Assume that, at each node, l is the maximum of the number of decision classes to be classified at
this node and the number of rules associated with this node:

    l = max {m, r}                                                  (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm
for building one node in the decision tree, say the Node Complexity NC(AQDT), is given by

    NC(AQDT) = O(l · n)

Usually l equals the number of rules associated with the given node. Thus, the AQDT
complexity for building one node is a function of the number of attributes multiplied by the
number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The
Level Complexity of the AQDT algorithm, LC(AQDT), can be given by

    LC(AQDT) < O(l · n)

which is less than the complexity of generating the root of the decision tree. To explain this,
consider that the maximum possible number of non-leaf nodes at one level is half the number of
the initial decision rules (i.e., ⌊r/2⌋). Figure 3-5-a shows an example of such a situation, where at
the lowest level each node classifies only two rules, each belonging to a different decision class.
This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level
will be twice or more the number of non-leaf nodes at the previous level. Consider the
level complexity of the AQDT algorithm to be (l · s · o), where o is the number of non-leaf
nodes at the given level; in such cases, either (l · o ≤ r) or (l · s < r). In Figure 3-5-a, the
complexity of the AQDT algorithm at any lower level is given by

    LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)

Figure 3-5 Decision trees showing the maximum number of non-leaf nodes: a) per one level; b) per one path

Note also that, after selecting an attribute to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structure of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch will not be tested again.

Since the disjointness criterion selects the attribute which minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels per decision tree is supposed to be less than or equal to the minimum of both the number of attributes and the number of rules. Consider k as the number of levels in a given decision tree:

k ≤ min{n, r}    (3-10)

There are two cases representing the most complex situations: Figure 3-5-a and 3-5-b. In the first case, where the decision rules were divided evenly, the number of levels will be a function of the logarithm of the number of rules. In such a case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by

Complexity(AQDT) = O(l · n · log r)    (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels per decision tree equals one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, it is not likely to get such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In such a case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus the level complexity of this decision tree is estimated as

LC(AQDT) = O(l · log n)

The maximum number of levels in such a decision tree is k−1. Thus the complexity of the AQDT algorithm in such cases is given by

Complexity(AQDT) = O(l · k · log n)    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT algorithm is determined by

Cmplx(AQDT) = O(r · k · log l)    (3-13)

3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1 The available tools and the factors that affect the process

T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6 Decision rules for selecting the best tool for testing software

These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phases, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phases, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very-high cost for testing, either in the requirement or the system usage phases, and you need a semi-automated tool.
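For concreteness, the rules in Figure 3-6 can be encoded and matched programmatically. The following sketch (hypothetical Python, not part of AQDT-2 or AQ15c) represents each rule as a mapping from an attribute to its set of allowed values, and returns every class with at least one matching rule:

```python
# Hypothetical encoding of the rules in Figure 3-6.
# A rule maps each tested attribute to its set of allowed values;
# a class is described by a disjunction (list) of such rules.
RULES = {
    "T1": [{"x1": {2}, "x2": {2}},
           {"x1": {3}, "x3": {1, 3}, "x4": {1}}],
    "T2": [{"x1": {1, 2}, "x2": {3, 4}},
           {"x1": {3}, "x3": {1, 2}, "x4": {2}}],
    "T3": [{"x1": {1}, "x2": {1}},
           {"x1": {4}, "x3": {2, 3}, "x4": {3}}],
}

def matches(rule, example):
    """A rule matches when every one of its conditions is satisfied."""
    return all(example[attr] in values for attr, values in rule.items())

def classify(example):
    """Return every class with at least one matching rule."""
    return [cls for cls, rules in RULES.items()
            if any(matches(r, example) for r in rules)]

# Average cost (x1=2), requirement metric (x2=2) -> Rule 1 -> T1
print(classify({"x1": 2, "x2": 2, "x3": 1, "x4": 1}))  # ['T1']
```

An example matched by no rule yields an empty list here; a decision structure derived from the rules, by contrast, assigns a decision (or a disjunction of decisions) to every path.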


Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.

From the rules in Figure 3-6, we can also determine disjoint groupings of attribute-values used for the compact mode of the algorithm. This is done as follows: 1) determine, for each attribute, the sets of values that the attribute takes in individual decision rules, and remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). Value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case, branches are assigned value sets {1}, {2}, and {3, 4}.
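The value-set grouping step just described can be sketched as follows (illustrative Python; `branch_value_sets` is a hypothetical helper, not AQDT-2 code):

```python
def branch_value_sets(value_sets):
    """Drop any value set that subsumes (is a strict superset of)
    another one; the surviving sets label the branches of the node."""
    keep = []
    for vs in value_sets:
        if not any(other < vs for other in value_sets):  # strict subset test
            keep.append(vs)
    unique = []                     # remove duplicates, preserving order
    for vs in keep:
        if vs not in unique:
            unique.append(vs)
    return unique

# x1 in the rules of Figure 3-6 takes value sets {2}, {3}, {1,2}, {1}, {4};
# {1,2} is dropped because it subsumes {2} and {1}:
print(branch_value_sets([{2}, {3}, {1, 2}, {1}, {4}]))

# x2 takes value sets {1}, {2}, {3,4}, {1,2,3,4}; the full domain is dropped:
print(branch_value_sets([{1}, {2}, {3, 4}, {1, 2, 3, 4}]))
```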


Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each one corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 is ended by a leaf T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools can be used for testing a given piece of software.
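The overall node-expansion process can be sketched generically. In the code below (a hypothetical simplification, not the AQDT-2 implementation), `rank` stands in for the LEF with its elementary criteria, and a rule that does not mention the selected attribute is treated as compatible with every branch:

```python
def build_tree(rules, attrs, rank):
    """rules: list of (class_name, {attr: set_of_values}) pairs.
    rank: scores an attribute over a rule subset (a stand-in for LEF).
    Returns a class name (leaf) or an (attribute, {value: subtree}) node."""
    classes = sorted({cls for cls, _ in rules})
    if len(classes) == 1:
        return classes[0]                   # all rules in one class: a leaf
    if not attrs:
        return " v ".join(classes)          # indeterminate leaf
    best = max(attrs, key=lambda a: rank(a, rules))
    remaining = [a for a in attrs if a != best]
    values = set().union(*(conds.get(best, set()) for _, conds in rules))
    branches = {}
    for v in sorted(values):
        # a rule not mentioning `best` is compatible with every branch
        subset = [(c, conds) for c, conds in rules
                  if best not in conds or v in conds[best]]
        branches[v] = build_tree(subset, remaining, rank)
    return (best, branches)

# A small two-attribute concept (P iff x1 = x2) with a toy
# attribute-frequency criterion in place of the real LEF:
rules = [("P", {"x1": {1}, "x2": {1}}), ("P", {"x1": {2}, "x2": {2}}),
         ("N", {"x1": {1}, "x2": {2}}), ("N", {"x1": {2}, "x2": {1}})]
freq = lambda a, rs: sum(a in conds for _, conds in rs)
tree = build_tree(rules, ["x1", "x2", "x3", "x4"], freq)
print(tree)
```

On this toy input the sketch tests x1 at the root and x2 below it, reaching a single-class leaf on every path.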

Figure 3-7 A decision structure learned for classifying software testing tools (root: x1; complexity: 4 nodes, 7 leaves)

Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3 and 4). Rules are represented by collections of cells in the intersection of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules. Rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, collections of cells corresponding to some of the initial rules are marked R11, R21, R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].

Figure 3-8 a) Decision rules; b) Derived decision tree

Comparing the diagrams in Figure 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to know which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.


Figure 3-9 Decision trees learned ignoring the support metric and the type of the testing tool: a) Ignoring the supporting metric; b) Ignoring the type of the tool

It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined as the highest-ranked attribute, cannot be measured. The algorithm selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10 A decision tree learned without the cost attribute

3.4 Tailoring Decision Structures to the Decision-making Situation

Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This thesis presents an approach to building such task-oriented decision structures which advocates that they are built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning programs, or are specified by an expert. An efficient algorithm was developed and implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.

Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.

3.4.1 Learning Cost-Dependent Decision Structures

As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation involving other elementary criteria. If an attribute has high cost or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.
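The cost-with-tolerance filtering can be sketched as follows (hypothetical code; in AQDT-2 the surviving attributes would then be ranked by the remaining elementary criteria):

```python
def lef_filter(scores, tolerance):
    """Keep candidate attributes whose score is within `tolerance`
    of the best. For the cost criterion lower is better, so with
    tolerance 0 only the cheapest attributes survive."""
    best = min(scores.values())
    return {a: s for a, s in scores.items() if s <= best + tolerance}

# x1 is impossible to measure, so it is given infinite cost:
costs = {"x1": float("inf"), "x2": 5.0, "x3": 1.0, "x4": 1.0}
print(lef_filter(costs, tolerance=0))   # only x3 and x4 pass
```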

3.4.2 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution for different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have

P(Ci | b1, ..., bk) = P(Ci) · P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from different classes, we have

P(Ci) = twi / Σj=1..m twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σj=1..m wj / Σj=1..m twj    (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain

P(Ci | b1, ..., bk) = wi / Σj=1..m wj    (3-13)
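Formula (3-13) reduces to the relative class frequencies of the training examples that reached the node, which is straightforward to compute; a minimal sketch:

```python
def class_distribution(w):
    """w[Ci]: number of training examples of class Ci that passed the
    tests leading to the node. Implements formula (3-13):
    P(Ci | b1,...,bk) = wi / sum_j wj."""
    total = sum(w.values())
    return {cls: wi / total for cls, wi in w.items()}

# e.g., 6 examples of C1 and 2 of C2 reached this node:
dist = class_distribution({"C1": 6, "C2": 2})
print(dist)   # C1 is the most probable decision at this node
```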

A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

3.4.3 Coping with noise in training data

The proposed methodology can be easily extended to handle the problem of learning from noisy training data. This is done based on ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
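The truncation step itself is a simple filter on rule t-weights. The sketch below assumes a hypothetical rule representation in which each rule carries the t-weight reported by AQ15c:

```python
def truncate_rules(rules, threshold):
    """Drop rules whose t-weight (number of training examples the rule
    covers) falls below the expected noise level; the survivors are
    then passed to the decision-structure builder."""
    return [r for r in rules if r["t"] >= threshold]

# Hypothetical rules annotated with t-weights in the style of AQ15c output
rules = [{"class": "C1", "conds": "[x1=1][x6=1]", "t": 18},
         {"class": "C1", "conds": "[x1=5][x2=2]", "t": 2},
         {"class": "C4", "conds": "[x1=5][x3=1]", "t": 1}]
print(truncate_rules(rules, threshold=2))   # the t=1 rule is removed
```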


3.5 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes. The best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes. The problem has ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem, the following are disjoint rules learned by AQ15c from the given data:

Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity <= 75]
Play <= [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute to be the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion may evaluate the two attributes. In Mingers' experiment, all criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all other criteria.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.

The table shows that the disjointness criterion outperformed all other criteria in Table 2-6, including the gain ratio criterion, when applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criteria would perform better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria were applied to the learned rules, they provided very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the maximum balanced appearance of its values in different rules. The dominance criterion prefers attributes that exist in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
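As an illustration of the disjointness criterion, the sketch below assumes a 0-3 grading of how the value sets of two classes relate (equal sets score 0, inclusion 1, partial overlap 2, disjoint sets 3), summed over all ordered class pairs; the exact scoring used by AQDT-2 is defined earlier in this chapter, so treat this as an approximation:

```python
def overlap_score(vs1, vs2):
    """Grade the relation between two value sets: 3 if disjoint,
    2 if they partially intersect, 1 if one contains the other,
    0 if equal (assumed grading, not the AQDT-2 definition)."""
    if vs1 == vs2:
        return 0
    if vs1 <= vs2 or vs2 <= vs1:
        return 1
    if vs1 & vs2:
        return 2
    return 3

def attribute_disjointness(class_values, attr):
    """Sum the pairwise grades of the attribute's per-class value sets
    over all ordered pairs of distinct decision classes."""
    classes = list(class_values)
    return sum(overlap_score(class_values[c1][attr], class_values[c2][attr])
               for c1 in classes for c2 in classes if c1 != c2)

# Toy data: the value sets each class's rules use for attributes a and b
cv = {"C1": {"a": {1}, "b": {1, 2}},
      "C2": {"a": {2}, "b": {2, 3}},
      "C3": {"a": {3}, "b": {1, 2, 3}}}
print(attribute_disjointness(cv, "a"))  # all pairs disjoint: 6 pairs * 3 = 18
print(attribute_disjointness(cv, "b"))  # overlapping value sets score lower
```

Attribute a, whose value sets perfectly separate the classes, scores highest and would be preferred for the root.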

3.6 Decision Structures vs. Decision Trees

This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.


Table 3-7 A comparison between Decision Structures and Decision Trees

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11. Both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes. This decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.


Figure 3-11 Decision structures learned by AQDT-2 using different criteria: a) using the disjointness criterion (root x5; 5 nodes); b) using the importance score criterion (root x1; 7 nodes, 9 leaves). P = Positive, N = Negative

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the Imam's example, that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example is based on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "−" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples per value of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

Figure 3-12 The Imam's example: a problem where learning decision structures (trees) from rules is better than learning them from examples. a) Training examples; b) The optimal decision tree

AQ15c learned the following rules from this data:

P <= [x1=1][x2=1] v [x1=2][x2=2]
N <= [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
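The effect described above can be verified numerically: for balanced data drawn from the concept "P iff x1 = x2", splitting on x1 (or x2) alone leaves the class entropy unchanged, so its information gain is zero. A sketch with synthetic balanced data (not the original 24 training examples):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Information gain of splitting `examples` (dicts with a 'class'
    key) on `attr`: base entropy minus weighted partition entropy."""
    base = entropy([e["class"] for e in examples])
    n = len(examples)
    for v in {e[attr] for e in examples}:
        part = [e["class"] for e in examples if e[attr] == v]
        base -= len(part) / n * entropy(part)
    return base

# Balanced synthetic data for the concept "P iff x1 = x2":
data = [{"x1": a, "x2": b, "class": "P" if a == b else "N"}
        for a in (1, 2) for b in (1, 2) for _ in range(3)]
print(info_gain(data, "x1"))  # 0.0 -- x1 alone looks uninformative
```

Learning from the AQ15c rules sidesteps this, because the rules already make the x1-x2 dependency explicit.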

An example of a problem in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <= [x1=2] v [x2=2]
N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10 to 9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and 8.5 for the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using the new attribute "x1=2 v x2=2" with values 0 for "no" and 1 for "yes".

Figure 3-13 An example where decision rules are simpler than decision trees: a) The training data; b) The correct decision tree

CHAPTER 4 Empirical Analysis and Comparative Study

This chapter presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the system's parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This chapter also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
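The sampling protocol just described can be sketched as scaffolding around unspecified learn-and-test routines (hypothetical code; `evaluate` would train a learner on `train` and return its accuracy on `test`):

```python
import random

def learning_curve(examples, evaluate, trials=100, seed=0):
    """For each relative training size 10%..90%, draw `trials` random
    train/test splits and average the score returned by `evaluate`.
    (Assumes examples are distinct, so the complement test works.)"""
    rng = random.Random(seed)
    curve = {}
    for pct in range(10, 100, 10):
        k = round(len(examples) * pct / 100)
        scores = []
        for _ in range(trials):
            train = rng.sample(examples, k)
            test = [e for e in examples if e not in train]  # complementary set
            scores.append(evaluate(train, test))
        curve[pct] = sum(scores) / len(scores)
    return curve

# Dummy evaluate: reports the training fraction instead of a real accuracy
examples = list(range(20))
curve = learning_curve(examples,
                       lambda tr, te: len(tr) / (len(tr) + len(te)),
                       trials=5)
print(curve[10], curve[90])
```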


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushrooms, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (best path from top to bottom), in terms of accuracy, time and complexity, were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.

Figure 4-1 Design of a complete experiment


For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. Analysis of some experiments included visualization of the training examples and the concepts learned by each learning program (AQ15c, AQDT-2 and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database): 9 different relative sample sizes of training examples are selected (10%, ..., 90%); 100 random samples of each size are drawn from the original data for training; the 100 sample complements which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).

162 different parametrical experiments per training dataset (18 · 9); 16,200 experiments per sample size (9 sample sizes); 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2); 199,800 experiments per problem (first portion + C4.5 + constructive induction); 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.); 73 days (estimated running time).

The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection after that will describe a partial or full experimental analysis of one of the other problems.

4.2 Experiments With Average-Size, Complex and Noise-Free Problems: Wind Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "Values" lists values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all legal values of A.

Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1,3] (t:18, u:18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t:3, u:3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t:2, u:2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t:2, u:2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t:2, u:2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t:2, u:2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t:2, u:2)

Decision class C2:
1. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t:28, u:19)
2. [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t:17, u:6)
3. [x1=2,4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t:10, u:4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t:10, u:2)
5. [x1=3,5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t:9, u:4)
6. [x1=2][x2=1,2][x3=1,2][x5=1,2,3][x4=1][x6=1][x7=1] (t:7, u:6)
7. [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t:6, u:4)
8. [x1=3,5][x2=2][x3=1][x7=1][x4=1,2][x5=1,2,3][x6=1,3] (t:5, u:5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t:4, u:4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1,3] (t:4, u:4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t:4, u:2)

Decision class C3:
1. [x1=2..5][x2=1,2][x3=1,2][x7=1..4][x4=1,2][x5=1,3][x6=2..4] (t:41, u:32)
2. [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2,4] (t:27, u:20)
3. [x1=1,3][x2=1][x3=1,2][x7=1..4][x4=2][x5=1,2][x6=2,3] (t:19, u:6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t:13, u:8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2,4] (t:5, u:5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1,4] (t:4, u:4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t:1, u:1)

Figure 4-2 Decision rules determined by AQ15c from the wind bracing data


Assuming the default LEF, attribute x6 was chosen for the root (its disjointness score is the single highest, and all other attributes fall beyond the tolerance threshold, so no other attributes are considered). Branches stemming from the root are marked by the values of x6 (in general, they could be marked by groups of values) according to the way these values occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). Each branch is assigned the subset of the rules containing its value. The process repeats for each branch until all rules assigned to that branch are of the same class; that class is then assigned to the leaf.
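The loop described above can be summarized in a short sketch. This is not the actual AQDT-2 code: the rule representation and the attribute selector (a simple frequency count standing in for the LEF ranking by disjointness, importance, and so on) are illustrative assumptions.

```python
# A minimal sketch (not the actual AQDT-2 implementation) of the loop
# described above: select an attribute for the current rule set, branch
# on its values, and recurse until all rules in a branch share a class.

from collections import Counter

def most_frequent_attr(rules):
    # Stand-in for the LEF ranking: pick the attribute that appears in
    # the most rules (AQDT-2 ranks by disjointness, importance, etc.).
    counts = Counter(a for conds, _ in rules for a in conds)
    return counts.most_common(1)[0][0]

def build_structure(rules, select_attribute=most_frequent_attr):
    """rules: list of (conditions, class) pairs; conditions maps an
    attribute to the set of values its selector allows."""
    classes = {cls for _, cls in rules}
    if len(classes) == 1:
        return classes.pop()                 # leaf: a single class remains
    if not any(conds for conds, _ in rules):
        return classes                       # ambiguous: candidate class set
    attr = select_attribute(rules)
    values = set().union(*(c[attr] for c, _ in rules if attr in c))
    branches = {}
    for v in values:
        # keep rules whose condition on `attr` covers v (rules that do
        # not mention `attr` cover every value), dropping `attr` itself
        subset = [({a: s for a, s in c.items() if a != attr}, cls)
                  for c, cls in rules if attr not in c or v in c[attr]]
        branches[v] = build_structure(subset, select_attribute)
    return (attr, branches)
```

For instance, a branch whose rules all belong to C3 immediately becomes a C3 leaf, while a mixed branch triggers another attribute selection over only the rules assigned to it, mirroring the recalculation of the criteria described in the text.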

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples; the prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 were misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a C3 leaf. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated only for those rules which contain [x6=1] as one of their conditions. In this example, x1 has the highest importance score, so it was selected as the next node in the structure. This process is repeated for each subset of rules until the decision structure is completed.

For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.

The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some unclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatched.
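The windowing procedure described above can be sketched schematically. This is a simplified outline under stated assumptions, not C4.5's actual internals: `learn`, `classify`, and the `Example` record are illustrative stand-ins.

```python
# A schematic sketch of the windowing loop described above: learn a
# trial tree from a window, test it on all examples, grow the window
# with the misclassified ones, and retry.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    x: int
    label: int

def window_learn(examples, learn, classify, window_size, max_trials=10,
                 rng=random):
    window = rng.sample(examples, min(window_size, len(examples)))
    best = None
    for _ in range(max_trials):
        tree = learn(window)
        misses = [e for e in examples if classify(tree, e) != e.label]
        if not misses:
            return tree                  # consistent with all examples
        if best is None or len(misses) < best[0]:
            best = (len(misses), tree)
        window = window + misses         # grow the window and retry
    return best[1]                       # fall back to the best trial tree
```

The loop terminates either with a tree consistent with every training example or, after the trial budget is exhausted, with the best tree found so far.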

Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves; root: x6).

Figure 4-4 shows a decision structure learned, with the default settings of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision structure was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the indefinite "?" decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.

Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves).

Figure 4-5: A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves).

Figure 4-6 presents the decision structure from Figure 4-5 in which leaves were assigned candidate decisions with decision class probability estimates. Let us consider node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (11), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be approximated as P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11.
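These leaf probabilities can be reproduced by normalizing the example frequencies. Reading equation (11) as a simple normalization of the w_i counts at the node is an assumption here, but it matches the reported numbers exactly:

```python
# A sketch assuming equation (11) normalizes the example frequencies
# w_i of the classes present at the node: P(C_i) = w_i / sum_j w_j.
def leaf_probabilities(w):
    total = sum(w.values())
    return {cls: round(wi / total, 2) for cls, wi in w.items()}

# Example frequencies at node x2 from the text (w1..w4):
print(leaf_probabilities({"C1": 31, "C2": 11, "C3": 0, "C4": 5}))
# -> {'C1': 0.66, 'C2': 0.23, 'C3': 0.0, 'C4': 0.11}
```

Note that the tw_i totals (the class sizes over the whole training set) do not enter this particular normalization.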

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that the rules whose combined t-weight represented 10% or less coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
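The rule truncation described above can be sketched in a few lines. The rule representation and the example counts below are illustrative, not taken from AQDT-2's output format:

```python
# A minimal sketch of the pre-pruning described above: drop any rule
# whose t-weight is 10% or less of the training examples of its class.
def truncate_rules(rules, class_sizes, threshold=0.10):
    """rules: list of (decision_class, t_weight) pairs;
    class_sizes: decision_class -> number of training examples."""
    return [(cls, t) for cls, t in rules
            if t / class_sizes[cls] > threshold]

# Illustrative counts: with 31 training examples in C1, a rule of
# t-weight 2 (about 6.5%) is removed, while t-weight 18 (58%) is kept:
print(truncate_rules([("C1", 18), ("C1", 2), ("C4", 1)],
                     {"C1": 31, "C4": 5}))
# -> [('C1', 18), ('C4', 1)]
```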

Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves).

Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves).

To demonstrate changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost; AQDT-2 generated a decision structure with four nodes and six leaves, whose predictive accuracy was 86.1%. In the second decision-making situation, x1 was given a high cost; AQDT-2 learned a decision structure with five nodes and seven leaves, whose predictive accuracy was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram uses different shades for different decision classes. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.

Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data ("?" means the system cannot produce a decision without the missing attribute).

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>; see Table 4-2) were selected for the experiments with Subsystem II.

These experiments were performed for four learning problems (the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the wind bracing problem (Arciszewski et al., 1992)). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training example set.
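The evaluation protocol described above (train on a random sample, test on its complement, average over runs) can be sketched as follows; `learn` and `classify` are placeholders for whichever learner is being evaluated:

```python
# A sketch of the protocol described above: each run trains on a
# randomly selected subset of the given size and tests on its
# complement; the reported value is the mean accuracy over all runs.
import random

def evaluate(examples, train_size, learn, classify, runs=100, seed=0):
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        train = rng.sample(examples, train_size)
        train_set = set(train)
        test = [e for e in examples if e not in train_set]
        model = learn(train)
        correct = sum(classify(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / runs
```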


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 for different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the result reported from each experiment is calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
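One plausible reading of the generalization degree, consistent with the ratio just described but an assumption on my part, is a stopping test: a node is turned into a leaf when the examples disagreeing with the dominant class fall within the allowed degree.

```python
# A hedged sketch (assumed interpretation) of the generalization-degree
# test: stop expanding a node when the fraction of examples not covered
# by the dominant class is at most the degree (e.g., 10%).
def should_stop(class_counts, degree=0.10):
    total = sum(class_counts.values())
    minority = total - max(class_counts.values())
    return minority / total <= degree

print(should_stop({"C1": 96, "C2": 4}))    # 4% minority  -> True
print(should_stop({"C1": 80, "C2": 20}))   # 20% minority -> False
```

Under this reading, lowering the degree from 10% to 3% forces nodes to be purer before they become leaves, which matches the accuracy changes reported below for the wind bracing data.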

Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; axes: predictive accuracy vs. relative sample size (%) of the training data).

Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means the predictive accuracy obtained with the default settings of AQDT-2: a pre-pruning threshold of 3% and a generalization degree of 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.

Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data (axes: predictive accuracy vs. relative sample size (%) of the training data).

Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data (panels: predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%)).

4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1

This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no).
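The six attribute domains listed above define the example space; the target concept (head-shape equals body-shape, or jacket-color is red, i.e., x5 = 1, as the learned rules later in this section confirm) can be enumerated directly. Values are coded as integers in listed order, an assumption of this sketch:

```python
# Enumerating the MONK-1 example space from the attribute domains
# above; the target is (x1 == x2) or (x5 == 1, i.e., red jacket).
from itertools import product

domain_sizes = [3, 3, 2, 3, 4, 2]          # x1 .. x6
space = list(product(*(range(1, d + 1) for d in domain_sizes)))
positive = [e for e in space if e[0] == e[1] or e[4] == 1]

print(len(space))     # -> 432 possible examples
print(len(positive))  # -> 216 positive examples
```

This confirms the 432-example space size used when the text states that the 124 training examples constitute 29% of all possible examples.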


The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.

Figure 4-12: A visualization diagram of the MONK-1 problem (axes labeled by attributes x1 through x6).

The AQDT-2 program, running in its default mode with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion ranked first), produced a decision tree with 41 nodes. For comparison, the C4.5 program for learning decision trees from examples was also applied to this same problem.

Positive rules:                  Negative rules:
1. [x5 = 1]                      1. [x1 = 1][x2 = 2,3][x5 = 2..4]
2. [x1 = 3][x2 = 3]              2. [x1 = 2][x2 = 1,3][x5 = 2..4]
3. [x1 = 2][x2 = 2]              3. [x1 = 3][x2 = 1,2][x5 = 2..4]
4. [x1 = 1][x2 = 1]

Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.

The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These rules were:

Pos <= [x5=1] v [x1=x2]   and   Neg <= [x5≠1] & [x1≠x2]
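These two rules can be transcribed directly into code (attribute values coded as integers, with x5 = 1 denoting a red jacket, an encoding assumed from the attribute listing earlier in this section):

```python
# A direct transcription of the two rules above:
# Pos <= [x5=1] v [x1=x2]   and   Neg <= [x5!=1] & [x1!=x2]
def classify_monk1(x1, x2, x5):
    return "Positive" if x5 == 1 or x1 == x2 else "Negative"

print(classify_monk1(x1=3, x2=3, x5=4))  # -> Positive (x1 == x2)
print(classify_monk1(x1=1, x2=2, x5=2))  # -> Negative
```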

Table 4-3: The evaluation of the attribute selection criteria for the MONK-1 problem.

From these rules the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15-a).

Figure 4-14: The decision tree for the MONK-1 problem generated by AQDT-1 (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative).

Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: (a) from the AQ15 rules (5 nodes, 7 leaves); (b) from the AQ17 rules (2 nodes, 3 leaves). P = Positive, N = Negative.

Experiments with Subsystem I: As mentioned earlier, the initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>) were selected for the experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.

Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.

Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; axes: predictive accuracy vs. relative sample size (%) of the training data).

Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means the predictive accuracy obtained with the default settings of AQDT-2: a pre-pruning threshold of 3% and a generalization degree of 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.

Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data (panels: <Disj, Char> and <Intr, Char>; axes: predictive accuracy vs. relative sample size (%) of the training data).

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.

Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem (panels: predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%)).

4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot easily be described as a DNF expression using its original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no). The original problem was to learn a concept from 169 training examples, which constitute 40% of all possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.
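The MONK-2 target concept, as defined in the original benchmark (Thrun, Mitchell & Cheng, 1991), is that exactly two of the six attributes take their first value; stating it here as the concept behind Figure 4-19 is an assumption based on that benchmark definition. A short sketch shows why such a condition resists a compact DNF form over the original attributes:

```python
# The MONK-2 benchmark target: exactly two of the six attributes take
# their first value. Counting conditions of this kind cannot be written
# as a short DNF over the original attributes, hence "non-DNF-type".
from itertools import product

domain_sizes = [3, 3, 2, 3, 4, 2]          # x1 .. x6

def monk2_positive(example):
    return sum(v == 1 for v in example) == 2

space = product(*(range(1, d + 1) for d in domain_sizes))
print(sum(monk2_positive(e) for e in space))  # -> 142 positive examples
```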

Figure 4-19: A visualization diagram of the MONK-2 problem (axes labeled by attributes x4, x5, and x6).

Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>); they were selected for the experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training examples. Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; axes: predictive accuracy vs. relative sample size (%) of the training data).

Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means the predictive accuracy obtained with the default settings: a pre-pruning threshold of 3% and a generalization degree of 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.

Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data (panels: <Disj, Char> and <Intr, Char>; axes: predictive accuracy vs. relative sample size (%) of the training data).

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.

Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem (panels: predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%)).

4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples, i.e., examples that were assigned the wrong decision class.

Figure 4-23: A visualization diagram of the MONK-3 problem.

Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.


Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.


Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; axes: predictive accuracy vs. relative sample size (%) of the training data).

Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data (axes: predictive accuracy vs. relative sample size (%) of the training data).

Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem (panels: predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%)).

4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number; 2) Clump Thickness; 3) Uniformity of Cell Size; 4) Uniformity of Cell Shape; 5) Marginal Adhesion; 6) Single Epithelial Cell Size; 7) Bare Nuclei; 8) Bland Chromatin; 9) Normal Nucleoli; and 10) Mitoses. All attributes except the sample code number have a domain of ten values (they were scaled).

In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare the decision trees learned by AQDT-2 and C4.5. All the results reported here are based on the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem (panels: predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%)).

4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example is described by 22 attributes: 1) cap-shape; 2) cap-surface; 3) cap-color; 4) bruises; 5) odor; 6) gill-attachment; 7) gill-spacing; 8) gill-size; 9) gill-color; 10) stalk-shape; 11) stalk-root; 12) stalk-surface-above-ring; 13) stalk-surface-below-ring; 14) stalk-color-above-ring; 15) stalk-color-below-ring; 16) veil-type; 17) veil-color; 18) ring-number; 19) ring-type; 20) spore-print-color; 21) population; and 22) habitat.

To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment compared the decision trees learned by AQDT-2 with those learned by C4.5. All results reported here are averages over 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a brief summary of these experiments.

In this problem, C4.5 produced better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce them. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning times are about the same.

The reason there is a drop in predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.11% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.


Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 on the mushroom problem.

4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains

Learning task-oriented decision structures from structural data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes, eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six attributes and the load of the car by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To describe the train problem in a format suitable for AQDT-2, a set of eight attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To encode the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, in the attribute name x32, the number 3 refers to the third car and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 labels the attribute describing the shape of the third car.
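The positional encoding described above can be sketched in a few lines (the attribute values shown are hypothetical; the actual AQDT-2 input format differs):

```python
def encode_train(cars):
    """Flatten a structured train description into attribute-value
    pairs named x<i><j>, where i is the car position (1-4) and j
    is the attribute number (1-8)."""
    example = {}
    for i, car in enumerate(cars, start=1):       # car position
        for j, value in enumerate(car, start=1):  # attribute number
            example[f"x{i}{j}"] = value
    return example

# A hypothetical two-car train; each car has 8 attribute values,
# with attribute 2 the car shape and attribute 7 the load shape.
train = [
    ("long", "rectangle", "not_double", "none", 3, 2, "circle", 1),
    ("short", "rectangle", "not_double", "peaked", 2, 1, "triangle", 1),
]
ex = encode_train(train)
print(ex["x12"])  # rectangle: shape of the first car
print(ex["x27"])  # triangle: load shape of the second car
```

Because trains have different numbers of cars, the resulting examples have different lengths, which matches the variable-length examples AQDT-2 accepts.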

Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1-4).

Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 of the 20 trains. The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains; it correctly classified 18 of the trains. Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only; it correctly classifies each of the 14 trains with three or more cars. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).
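Flexible matching, used above to classify the six shorter trains, can be sketched as picking the class whose rule matches the example to the highest degree (a simplified sketch: the actual measure of Michalski et al., 1986 is more elaborate, and the rules shown here are hypothetical):

```python
def match_degree(rule, example):
    """Fraction of the rule's conditions satisfied by the example."""
    satisfied = sum(example.get(a) == v for a, v in rule.items())
    return satisfied / len(rule)

def flexible_classify(rules_by_class, example):
    """Pick the class whose best rule matches to the highest degree,
    even when no rule matches the example exactly."""
    return max(rules_by_class,
               key=lambda c: max(match_degree(r, example)
                                 for r in rules_by_class[c]))

# Hypothetical one-rule-per-class rule sets over train attributes.
rules = {
    "eastbound": [{"x31": "short", "x34": "peaked"}],
    "westbound": [{"x31": "long", "x34": "none"}],
}
print(flexible_classify(rules, {"x31": "short", "x34": "peaked"}))  # eastbound
```

The point is that an example falling outside every rule's exact coverage still receives the decision of the closest rule, which is how trains with fewer than three cars can be handled by structures built from third-car attributes.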

a) Decision structure learned using only descriptions of Car 1. b) Decision structure learned using only descriptions of Car 2. c) Decision structure learned using only descriptions of Car 3.

Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations.

4.9 Experiments With Small, Simple, and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the training sets were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the other half in the other class).
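The default window size described above can be computed directly; a small sketch of the stated rule (this mirrors the description in the text, not C4.5's source code):

```python
import math

def default_window(n_examples):
    """Default C4.5 initial window: the larger of 20% of the
    training examples and twice the square root of their number."""
    return max(0.2 * n_examples, 2 * math.sqrt(n_examples))

# With 216 training examples, 20% (43.2) exceeds 2*sqrt(216) (~29.4),
# so the default initial window holds about 43 examples.
print(round(default_window(216)))  # 43
```

For small training sets the square-root term dominates, while for larger ones the 20% term does; the 100% option simply uses all training examples in a single window.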


Table 4-8 and Figures 4-30a and 4-30b show the results graphically for the Congressional Voting-1984 problem. The results indicate that the decision trees generated by AQDT-2 had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation in the size of AQDT-2's trees with changes in the size of the training set was smaller.

Table 4-8: A tabular summary of the predictive accuracy of the decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.


a) Accuracy of the decision tree as a function of the size of the set of training examples. b) Size of the decision tree as a function of the size of the set of training examples.

Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2.

4.10 Analysis of the Results

This section presents an analysis of the results reported in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to

illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from those rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.

Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some beam-search widths or with a certain rule type and lower with others), the best cover is determined according to the best width of the beam search and the best rule type.
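The first heuristic can be sketched as a small selection routine (the accuracy values are hypothetical; the 2% threshold is the one stated above):

```python
def prefer_beam_width(results, threshold=2.0):
    """Pick the smallest beam width whose predictive accuracy is
    within `threshold` percentage points of the best observed one.
    `results` maps beam width -> predictive accuracy (%)."""
    best = max(results.values())
    return min(w for w, acc in results.items() if best - acc < threshold)

# Hypothetical accuracies for beam widths 1, 5, and 10:
# width 1 lags by 2.5 points, so width 5 is the smallest acceptable one.
print(prefer_beam_width({1: 92.5, 5: 94.8, 10: 95.0}))  # 5
```

This captures the idea that a wider beam must buy a clear accuracy gain (at least 2%) to justify its extra cost.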

Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.

It was clear that AQDT-2 works better with characteristic rules than with discriminant rules. In most problems, when the width of the beam search of the AQ15c system was changed, the changes in the predictive accuracy of the decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results: 1) if the difference between the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is considered the same; otherwise, it is considered higher or lower; 2) if the average learning times are within ±0.1 seconds of each other, the learning time is considered the same.
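These heuristics amount to a tolerance comparison; a sketch (hypothetical averages; the ±2% tolerance is the one stated above, and larger values are assumed better):

```python
def compare(metric_a, metric_b, tolerance):
    """Summarize two averaged metrics: 'Same' when the difference is
    within the tolerance, otherwise the name of the system whose
    (larger-is-better) value wins."""
    if abs(metric_a - metric_b) <= tolerance:
        return "Same"
    return "AQDT-2" if metric_a > metric_b else "C4.5"

# Hypothetical average predictive accuracies (%):
print(compare(94.1, 92.8, tolerance=2.0))  # Same   (within +/- 2%)
print(compare(96.5, 91.0, tolerance=2.0))  # AQDT-2 (clearly higher)
```

For learning time, where smaller is better, the same routine would be applied with the arguments interpreted accordingly and a ±0.1 s tolerance.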

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparisons of the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system that performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system that performed better; Same-X means similar performance of both systems, with AQDT-2 having the advantage if X=A and C4.5 if X=C.

Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, while C4.5 produces smaller but less accurate ones. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of the decision trees learned by C4.5 grows relatively faster as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons are that AQDT-2 over-generalizes the decision rules, while C4.5 uses a window for learning decision trees. The learning time of AQDT-2 should be much less than that of C4.5; however, on some data sets it takes more time, because in situations where there is not enough information to reach a decision the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not yet implemented.

To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class; the white areas represent non-positive coverage.


Figure 4-31: A visualization diagram of the decision rules learned by AQ15c for the MONK-2 problem.


Figure 4-32 shows the testing results of the decision rules learned by AQ15c. The marked shaded cells indicate false positive errors (AQ15c classifies the cell as positive when it should be negative), and the marked non-shaded cells indicate false negative errors (AQ15c classifies the cell as negative when it should be positive).


Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31.

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, one shading indicates portions of the representation space classified as positive by both AQ15c and AQDT-2; another marks portions classified as positive by AQ15c but as negative by AQDT-2; and a third represents portions where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).


This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors made by the AQDT-2 decision tree.


Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem.

Figure 4-34 is similar to Figure 4-33, but illustrates the false positive and false negative errors. One marking indicates portions of the representation space with false positive errors; another represents portions with false negative errors. Comparing Figures 4-34 and 4-32 shows that more errors occurred because of the over-generalization.


Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree.

Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%.

CHAPTER 5: CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree can.
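The extra expressive power comes from sharing subgraphs. A minimal illustrative sketch (not the AQDT-2 implementation; attribute names follow the trains example, and the values are hypothetical):

```python
# Each internal node is (attribute_to_test, {value: child});
# a leaf is a plain decision string. Because nodes are ordinary
# objects, two branches may point to the SAME child, giving a
# single-rooted acyclic graph rather than a strict tree.
leaf_east, leaf_west = "eastbound", "westbound"

shared = ("x17", {1: leaf_east, 2: leaf_west})  # subgraph reused below

structure = ("x12", {
    "rectangle": shared,   # both branches share one test node,
    "ellipse": shared,     # which a tree would have to duplicate
    "u_shaped": leaf_west,
})

def classify(node, example):
    """Follow tests until a leaf (a plain string) is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

print(classify(structure, {"x12": "ellipse", "x17": 1}))  # eastbound
```

In a decision tree the shared subgraph under "rectangle" and "ellipse" would appear twice; the decision structure stores and evaluates it once.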

The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated on line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: to determine a decision structure from examples it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can easily be tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.

The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful; it will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of the decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules was negligible in the cases we investigated. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.

Another advantage of the AQDT-2 method is that the decision structures obtained this way tend to be simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for generating decision rules. Since the method is independent of the rule learning step, it could potentially be applied with other decision rule learning systems or with decision rules acquired from an expert.

REFERENCES

Arciszewski, T., Bloedorn, E., Michalski, R.S., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.

Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," in M. Bramer (Ed.), Research and Developments in Expert Systems, Cambridge: Cambridge University Press.

Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.

Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" in J. Komorowski and Z.W. Ras (Eds.), Lecture Notes in Artificial Intelligence (689), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.


Imam, I.F. and Michalski, R.S. (1993b), "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, L. Kerschberg, Z. Ras and M. Zemankova (Eds.), Kluwer Academic Publishers, MA.

Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.

Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.

Imam, I.F. and Michalski, R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.

Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi, R. and Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.

Michalski, R.S. (1973), "AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.

Michalski, R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.

Michalski, R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).

Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.

Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer-Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.

Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.

Niblett, T. and Bratko, I. (1986), "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.

Quinlan, J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan, J.R. (1983), "Learning Efficient Classification Procedures and Their Application to Chess End Games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.

Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.

Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).

Quinlan, J.R. (1990), "Probabilistic Decision Trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI 90, Stockholm, August.

Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.

Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.

Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference, or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition on Machine and Human Intelligence (organized by Oxford University): two solutions obtained by that program ranked second and third in one competition, and two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium FLAIRS-95 and on the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.

I would like to thank my colleagues: Nabil Al-Kharouf, for reviewing my dissertation; Eric Bloedorn, for reviewing an earlier draft of my dissertation and for the use of his program AQ17-DCI in

my experiments Srinivas Gutta for providing some application for my PhD work Mike Heib for

reviewing an earlier draft of my dissertation and helping me find relevant articles Ken Kaufman

for reviewing an earlier draft of my thesis Mark Maloof for providing me with script files which

made it easier to iteratively run AQ1Sc Halah Vafaie for working with her on application and

comparison of different aspects of my work and Janusz Wnek for using his DIAV program for

explaining my results

I would like to thank Professor Andrew P. Sage, Dean of the School of Information Technology and Engineering, and Professor Kenneth Bumgarner, Dean of Student Services and Associate Vice President of George Mason University, for their support, and Professor Murray W. Black, Associate Dean of the School of Information Technology and Engineering, for guidance on preparing the PhD proposal.

I would like to thank the conference organizers who supported me in attending their conferences and presenting parts of my PhD work. The organizers include Professor Moonis Ali, Professor Frank Anger, Professor Zbigniew Ras, the American Association for Artificial Intelligence, and others. I would also like to thank the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) for organizing my first workshop. Thanks go to Dr. Doug Dankel, Dr. Howard Hamilton, Dr. John Stewman and Dr. Dan Tamir.

I would also like to thank the many individuals who helped me in any way during my PhD. Those include Dr. Ashraf Abdel-Wahab, Dr. Jerzy Bala, Dr. Alex Brodsky, Dr. Richard Carver, Dr. Kenneth De Jong, John Doulamis, Tom Dybala, Dean Ibrahim Farag, Dr. Ophir Frieder, Csilla Frakes, Dr. Mohamed Habib, Ali Hadjarian, Dr. Hugo De Garis, Zenon Kulpa, Dr. Paul Lehner, Dr. George Michaels, Dr. Eugene Norris, Dean James Palmer, Mitch Potter, Dr. Ahmed Rafea, Jim Ribeiro, Jayshree Sarma, Dr. Arun Sood, Dr. Clifton Sutton, Bradley Utz, Dr. Tibor Vamos, Patricia Zahra, Dr. Shaker Zahra and Dr. Jianping Zhang.


Dedication

To my mother my brothers and my sister

TABLE OF CONTENTS

TITLE Page

ABSTRACT 1

CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6

CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19

CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
3.3 Generating Decision Structures From Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structures to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53

CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments With Large-Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large-Size, Complex, and Noisy Problems: Mushroom Classifications 84
4.8 Experiments With Small-Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88

CHAPTER 5: CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96

REFERENCES 98

VITA 102

LIST OF TABLES

No. TITLE Page

2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria on four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of software testing 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using conditions of AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90

LIST OF FIGURES

No. TITLE Page

2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem 91
4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem 93
4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1 94

DERIVING TASK-ORIENTED DECISION STRUCTURES

FROM DECISION RULES

Ibrahim M. Fahmi Imam, PhD

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge-base from the function of using the knowledge-base for decision-making. The first function focuses on learning accurate, consistent and complete concept descriptions expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that the decision structures it learns usually outperform, in terms of accuracy and average size, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using applications to artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (determining the best wind bracing design for tall buildings), medical diagnosis (learning decision rules for recognizing breast cancer), agricultural diagnosis (learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (characterizing Democratic and Republican voting records).

CHAPTER 1 INTRODUCTION

11 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.

A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation), the branches are assigned possible test outcomes or ranges of outcomes, and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when leaves are assigned single definite decisions. Thus the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
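To make the definition above concrete, the following sketch (an illustration only, not part of the dissertation's systems; all names are hypothetical) represents a decision structure as nested dictionaries and classifies an object by following the branch whose value set contains each test outcome:

```python
# A minimal decision-structure evaluator: internal nodes hold a test,
# branches may cover *sets* of outcomes, and leaves hold a decision.

def classify(node, example):
    """Follow branches whose value sets contain the example's test value."""
    if "decision" in node:            # leaf: a definite decision
        return node["decision"]
    value = example[node["test"]]     # evaluate the test (here, an attribute)
    for values, child in node["branches"]:
        if value in values:           # one branch may group several outcomes
            return classify(child, example)
    return "undetermined"             # no branch matches

# A toy structure deciding among designs A, B, C from attributes x1, x2.
structure = {
    "test": "x1",
    "branches": [
        ({0, 1}, {"decision": "A"}),          # a range of outcomes on one branch
        ({2}, {"test": "x2",
               "branches": [({0}, {"decision": "B"}),
                            ({1, 2}, {"decision": "C"})]}),
    ],
}
print(classify(structure, {"x1": 2, "x2": 1}))  # -> C
```

Restricting each branch to a single value and each leaf to a single decision recovers an ordinary decision tree, which is the sense in which the structure generalizes the tree.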

Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain and the gain ratio (Quinlan, 1979, 83, 86), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).


A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful either to modify the structure so that it does not contain that attribute or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).

A restructuring of a decision structure (or a tree) in order to suit new requirements can be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation, such as a set of decision rules. Tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees), which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify to adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus one needs a decision structure.

An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form and transform it to a decision structure when it is needed for decision-making. This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generation from training examples. Thus, this process could be done on line, without any delay noticeable to the user. Such "virtual" decision structures are easy to tailor to any given decision-making situation.

This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus, such an approach has many potential advantages.

This dissertation presents a new system called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15 (Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating "unknown" nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to inability to evaluate an attribute associated with an intermediate node.


To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments has been designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structure learned from them, and comparing decision trees learned by the AQDT-2 system with those learned by C4.5 (Quinlan, 1993), the well-known system for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2 and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.

1.2 The Problem Statement

There are many limitations and problems that accompany the use of decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. The method proposed several attribute selection criteria of increasing power, based on the main criterion, the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.

Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint. In other words, for any two rules there exists a condition with the same attribute but with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram the condition parts of the given rules, and marking them with the action specified by each rule.
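Definitions 2-1 to 2-3 can be illustrated with a small sketch (hypothetical code, not Michalski's formulation): encode each rule as a map from attribute name to its set of admissible values; a cover is then disjoint when every pair of rules conflicts on at least one shared attribute.

```python
# Checking the disjointness of a cover (Definition 2-2): two rules are
# logically disjoint if some attribute appears in both with value sets
# that do not intersect.

def rules_disjoint(r1, r2):
    """True if some shared attribute has non-overlapping values."""
    return any(not (r1[a] & r2[a]) for a in r1.keys() & r2.keys())

def cover_is_disjoint(cover):
    """True if all rules in the cover are pairwise logically disjoint."""
    return all(rules_disjoint(a, b)
               for i, a in enumerate(cover) for b in cover[i + 1:])

# A hypothetical encoding of the five rules from the worked example in
# this section (class labels omitted):
cover = [{"x2": {0}},              # A1, first rule
         {"x1": {0}, "x2": {2}},   # A1, second rule
         {"x2": {1}},              # A2, first rule
         {"x1": {2}, "x2": {2}},   # A2, second rule
         {"x1": {1}, "x2": {2}}]   # A3
print(cover_is_disjoint(cover))    # -> True
```

Every pair of these rules conflicts on x1 or x2, so the cover is disjoint; it is also minimal in the sense of Definition 2-3 for that example.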

Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has shown that if only one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two example sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1]; x3 breaks three rules: [x4=2] & [x1=2] & [x3=2v3], [x4=1] & [x1=3] & [x3=1v3], and [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1 An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).

The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the

estimated number of additional nodes in the decision tree being generated over a hypothetical

minimal decision tree When there is a tie between two attributes the attribute to be selected is

the one which breaks smaller rules (rules that cover fewer examples or more specialized

rules) AQDT-2 uses an approximate version of this criterion (the attribute dominance)
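The static (first-degree) cost estimate can be sketched in a few lines. This is an assumed reading of the criterion, consistent with the worked example in this section: an attribute breaks a rule when the rule admits more than one of the attribute's values, so splitting on that attribute would scatter the rule over several branches.

```python
# Static MAL cost: count the rules broken by an attribute.

def mal_cost(attribute, rules, domains):
    """Number of rules broken by `attribute` (first-degree cost estimate)."""
    broken = 0
    for rule in rules:
        # A missing condition admits the attribute's whole domain.
        admitted = rule.get(attribute, domains[attribute])
        if len(admitted) > 1:      # rule spans several branches: broken
            broken += 1
    return broken

# The rules of the worked example; the x3/x4 domains are hypothetical
# (any domain with two or more values gives the same counts).
domains = {"x1": {0, 1, 2}, "x2": {0, 1, 2}, "x3": {0, 1}, "x4": {0, 1}}
rules = [{"x2": {0}}, {"x1": {0}, "x2": {2}},     # class A1
         {"x2": {1}}, {"x1": {2}, "x2": {2}},     # class A2
         {"x1": {1}, "x2": {2}}]                  # class A3
print(mal_cost("x1", rules, domains))  # -> 2 (breaks [x2=0] and [x2=1])
print(mal_cost("x2", rules, domains))  # -> 0, so x2 is chosen as the root
print(mal_cost("x3", rules, domains))  # -> 5 (x3 is unconstrained everywhere)
```

The counts 2, 0, 5 match the MAL evaluations reported for x1, x2 and x3 in the example below.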

Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.

Example: Learn a decision tree from the following decision table (Table 2-1).

The minimal cover consists of the following rules:

A1 <= [x2=0] v [x1=0][x2=2]
A2 <= [x2=1] v [x1=2][x2=2]
A3 <= [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of the decision tree is x2. Three branches are then attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case, x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.

Figure 2-2: A decision tree learned from the decision table in Table 2-1

2.2 Learning Decision Trees from Examples

Decision tree learning is a field concerned with generating a decision tree that classifies a set of examples according to the decision classes they belong to. The essential aspect of any inductive decision tree method is the attribute selection criterion. The attribute selection criterion measures how good the attributes are for discriminating among the given set of decision classes. The best attribute according to the selection criterion is chosen to be assigned to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has been subsequently modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.

Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria for selecting attributes use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory. These criteria measure the information conveyed by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure and the gain criteria (Quinlan, 1979, 83), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions for determining whether or not there is a correlation. The attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Han, 1984; Mingers, 1989a).

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial, complete decision tree, and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring the probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993). C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.

2.2.1 Building Decision Trees Using Information-based Criteria

This section presents a description of the inductive decision tree learning system C4S The

C4S learning system is considered to be one of the most stable accurate and fastest program

for learning decision trees from examples

Learning decision trees from examples requires a collection of examples Each example is

represented by a fixed number of attribute-value pairs C45 (Quinlan 1993) is a learning

program that induces classification decision trees from a set of given examples The C4S

learning system is descended from the learning system ID3 (Quinlan 1979) which is based on

Hunts method for constructing decision tree from a set ofcases (Hunt Marin amp Stone 1966)

The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3 called the Gain Criterion, which uses the frequency of each decision class in the given set of training examples.

Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and classifies the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.

The Gain Criterion: The gain criterion is based on information theory; that is, the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem X1, ..., Xn are the given attributes and C1, ..., Cm are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:


freq(Ci, S) = number of examples in S that belong to Ci (2-1)

Suppose that |S| is the total number of examples in S; the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|. The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by −log2 (freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by

info(S) = − Σi (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|) bits (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.

Suppose that we selected an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, info_X(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:

info_X(T) = Σi=1..k (|Ti| / |T|) info(Ti) (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by

gain(X) = info(T) − info_X(T) (2-4)

The attribute to be selected is the one with the maximum gain value.
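The gain computation of equations 2-1 through 2-4 can be sketched in a few lines of Python. This is an illustrative sketch only; the class counts used are those of Quinlan's weather data discussed in the example that follows.

```python
import math

def info(class_counts):
    """Entropy of a set (equation 2-2): minus the sum of p_i * log2(p_i)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

# Quinlan's weather data: 14 examples, 9 "Play" and 5 "Don't Play".
info_T = info([9, 5])

# "outlook" splits T into sunny (2 Play, 3 Don't), overcast (4, 0)
# and rain (3, 2); equation 2-3 weights each subset's entropy by its size.
subsets = [[2, 3], [4, 0], [3, 2]]
total = sum(sum(s) for s in subsets)
info_X = sum((sum(s) / total) * info(s) for s in subsets)

gain = info_T - info_X  # equation 2-4
```

Running this reproduces the values derived by hand below: info(T) is about 0.94 bits, info_X(T) about 0.694 bits, and the gain about 0.246.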

The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contain an attribute such as a social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information that is gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.

Following steps similar to those above, and by analogy to equation 2-2, the expected information generated by dividing T into n subsets is determined by

split info(T) = − Σi=1..n (|Ti| / |T|) log2 (|Ti| / |T|) (2-5)

The gain ratio is given by

gain ratio(X) = gain(X) / split info(X) (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.

Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the set of training examples. First, determine the amount of information gained by selecting the attribute "outlook" to be the root of the decision tree. This attribute divides the training examples into three subsets: "sunny", with five examples, two of which belong to the class "Play"; "overcast", with four examples, all of which belong to the class "Play"; and "rain", with five examples, three of which belong to the class "Play". To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class "Play" and five belong to the class "Don't Play".

info(T) = − (9/14) log2 (9/14) − (5/14) log2 (5/14) = 0.94 bits

When using "outlook" to divide the training examples, the information becomes


info_outlook(T) = (5/14) [− (2/5) log2 (2/5) − (3/5) log2 (3/5)]
 + (4/14) [− (4/4) log2 (4/4) − (0/4) log2 (0/4)]
 + (5/14) [− (3/5) log2 (3/5) − (2/5) log2 (2/5)] = 0.694 bits

By substituting in equation 2-4, the gain of information resulting from using the attribute "outlook" to split the training examples is equal to 0.246. The information gain for "windy" is 0.048.

Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for "outlook" is determined as follows:

split info(T) = − (5/14) log2 (5/14) − (4/14) log2 (4/14) − (5/14) log2 (5/14) = 1.577 bits

The gain ratio for "outlook" = 0.246 / 1.577 = 0.156.
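The split-information and gain-ratio computation above can be checked with a short sketch (illustrative only; the gain value 0.246 is taken from the text):

```python
import math

def split_info(subset_sizes):
    """Equation 2-5: potential information from dividing T into subsets."""
    total = sum(subset_sizes)
    return -sum((n / total) * math.log2(n / total) for n in subset_sizes)

# "outlook" divides the 14 training examples into subsets of size 5, 4 and 5.
si = split_info([5, 4, 5])     # about 1.577 bits
gain_ratio = 0.246 / si        # equation 2-6, about 0.156
```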


Figure 2-3: A decision tree learned using the gain criterion for selecting attributes

The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.
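One common way to pick such a threshold, sketched below, is to try the midpoints between consecutive distinct sorted values and keep the binary split with the highest information gain. C4.5's actual threshold selection has additional refinements, and the temperature readings and labels here are hypothetical:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Try each midpoint between consecutive distinct sorted values and
    keep the binary split (<= t versus > t) with the highest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        gain = (base
                - (len(left) / len(pairs)) * entropy(left)
                - (len(right) / len(pairs)) * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical temperature readings with yes/no class labels:
t, g = best_threshold([64, 65, 68, 69, 70, 71],
                      ['y', 'n', 'y', 'y', 'y', 'n'])
```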

Tree pruning in C4.5 is a process of replacing subtrees of low classification validity by leaves. The C4.5 system uses the Laplace ratio for estimating the error rate of different subtrees. This ratio is defined as (e + 1) / (n + 2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
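The Laplace ratio from the text is a one-line function; the leaf counts in the comment are hypothetical:

```python
def laplace_error(n, e):
    """Estimated error rate at a leaf: (e + 1) / (n + 2), where n is the
    number of training examples at the leaf and e is the number of them
    the leaf misclassifies."""
    return (e + 1) / (n + 2)

# A leaf covering 10 examples and misclassifying 1 of them is assigned
# an estimated error rate of 2/12; note that an empty leaf gets 1/2
# rather than an undefined 0/0.
```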

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses Chi-square statistics to measure the association between two attributes. When building decision trees, the method is implemented such that it determines the association between each attribute and the decision classes. The attribute to be selected is the one with the greatest association value.

To determine the Chi-square value for an attribute, let a_ij be the number of examples in decision class number i where the attribute A takes value number j. In other words, a_ij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by

Chi-square(A) = Σi=1..n Σj=1..m [ (a_ij − E_ij)² / E_ij ] (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,

E_ij = (T_Ci × T_Vj) / T (2-8)

where T_Ci and T_Vj are the total number of examples belonging to decision class Ci and the total number of examples where the attribute A takes value vj, respectively, and T is the total number of examples.

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of different combinations of values between the decision classes and both the "outlook" and the "windy" attributes. Table 2-4 shows the expected values, computed from T_Ci and T_Vj, of the frequencies in Table 2-3 for the different attribute values and decision classes.

To determine the association value between the decision classes and both the attribute "windy" and the attribute "outlook", the observed Chi-square values are

Chi-square(windy, Class) = [(3 − 3.9)²/3.9] + [(3 − 2.1)²/2.1] + [(6 − 5.1)²/3.2] + [(2 − 2.9)²/3.2]
 = 0.21 + 0.39 + 0.25 + 0.25 = 1.1

Chi-square(outlook, Class) = [(2 − 3.2)²/3.2] + [(4 − 2.6)²/2.6] + [(3 − 3.2)²/3.2] + [(3 − 1.8)²/1.8] + [(0 − 1.4)²/1.4] + [(2 − 1.8)²/1.8]
 = 0.45 + 0.75 + 0.2 + 0.8 + 1.4 + 0.02 = 3.62
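Equations 2-7 and 2-8 can be sketched directly from a contingency table. This sketch computes exact expected frequencies rather than the rounded ones used in the hand calculation, so its totals differ slightly from the figures above; the ranking of the attributes is the same:

```python
def chi_square(table):
    """Chi-square association between an attribute and the decision
    classes (equations 2-7 and 2-8). table[i][j] is the number of
    examples of class i where the attribute takes its j-th value."""
    class_totals = [sum(row) for row in table]            # T_Ci
    value_totals = [sum(col) for col in zip(*table)]      # T_Vj
    total = sum(class_totals)                             # T
    chi = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = class_totals[i] * value_totals[j] / total
            chi += (observed - expected) ** 2 / expected
    return chi

# Contingency tables from Quinlan's weather data:
windy = [[3, 6],       # Play:  windy=true, windy=false
         [3, 2]]       # Don't: windy=true, windy=false
outlook = [[2, 4, 3],  # Play:  sunny, overcast, rain
           [3, 0, 2]]  # Don't: sunny, overcast, rain
```

With exact expected values, chi_square(windy) is about 0.93 and chi_square(outlook) about 3.55, so "outlook" is still the preferred attribute.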


Applying the same method to the other attributes, the results favor the attribute "outlook". Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset. Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5: Attribute selection criteria and their basic evaluation measures

Info Measure (IM): Entropy(S) = − Σi (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)
Gain, G-statistic and Gain Ratio: G-statistic = 2N × IM (N = number of examples)
Chi-square: Chi-square(A, B) = Σi=1..n Σj=1..m [ (a_ij − E_ij)² / E_ij ]

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria that was done by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, G statistic, Gini index of diversity, Marshall correction, and Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of using the Chi-square criterion, the value zero adds the maximum association between any two attributes, because the Chi-square value of a zero cell is the expected value of this cell.


Now let us demonstrate results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees for eleven different criteria. In the final results, he compared the total number of nodes and the total error rate provided by each criterion over all given problems. Table 2-8 shows the final results for five selected criteria only.


Table 2-8: Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four problems

This experiment was performed on four real-world data sets, concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details, see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas by Imam and Michalski (1993b).

In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both, to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the decision structure, nodes containing only rules represent conditions common to all of their children.

The main disadvantages of this approach are that it requires discriminant rules to build such a decision structure, and that such a structure is more complex than the traditional decision trees that are used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is "Safe", except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, in which case it is "Lost", except if x6=1, when it is "Safe", except if x7=1, when it is "Lost".
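The reading above is a chain of defaults and exceptions, which translates directly into nested conditionals. The sketch below is only a transcription of that paraphrase, not Gaines's EDAG algorithm:

```python
def classify(x):
    """Evaluate the exception structure of Figure 2-4 for an example x
    (a dict mapping attribute names to values). Each nested test
    overrides the conclusion established by the level above it."""
    decision = "Safe"                                  # root conclusion
    if x["x1"] == 1 and x["x2"] == 1 and x["x3"] == 1 \
            and (x["x4"] == 3 or x["x5"] == 1):
        decision = "Lost"                              # first exception
        if x["x6"] == 1:
            decision = "Safe"                          # exception to the exception
            if x["x7"] == 1:
                decision = "Lost"                      # innermost exception
    return decision
```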

The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path. In other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once; however, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.


Safe <= [x1=2]
Safe <= [x2=2]
Safe <= [x3=2]
Safe <= [x4=1] & [x5=2]
Safe <= [x4=1] & [x5=3]
Safe <= [x6=1] & [x7=2]
Safe <= [x6=1] & [x7=3]
Safe <= [x4=2] & [x5=2]
Safe <= [x4=2] & [x5=3]

Lost <= [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a combination of that attribute's values. For each subset, the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains the examples where A takes value 0 and belong to class C0, or take value 1 and belong to class C1; the second subset contains the examples where A takes value 0 and belong to class C1, or take value 1 and belong to class C0. The number of nodes of the first level (after the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced exponentially to one.
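The grouping step described above can be sketched as follows: after removing the selected attribute, projected examples that induce the same "attribute value to class" function share a node at the new level. This is a minimal sketch of that one step, not Kohavi's full HOODG algorithm, and the example data are hypothetical:

```python
def level_partition(examples, attr):
    """One bottom-up step: remove `attr` and group the projected
    examples by the function 'value of attr -> class' that their
    remaining attribute-values induce. Examples are (dict, class)
    pairs; projections inducing the same function share a node."""
    functions = {}
    for ex, cls in examples:
        rest = frozenset((a, v) for a, v in ex.items() if a != attr)
        functions.setdefault(rest, {})[ex[attr]] = cls

    groups = {}
    for rest, mapping in functions.items():
        key = tuple(sorted(mapping.items()))
        groups.setdefault(key, []).append(rest)
    return groups

# Two binary attributes, class = XOR(A, B): splitting on A yields the
# two subsets described in the text, {A=0&C0 or A=1&C1} and
# {A=0&C1 or A=1&C0}.
examples = [({"A": 0, "B": 0}, "C0"), ({"A": 1, "B": 0}, "C1"),
            ({"A": 0, "B": 1}, "C1"), ({"A": 1, "B": 1}, "C0")]
groups = level_partition(examples, "A")
```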

It is easy for the reader to figure out some major disadvantages of this approach. The average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., strong patterns) or logical relationship in the data. The time used to learn such a decision structure is relatively high compared to systems for learning decision trees from examples. Finally, it could be better to search for an attribute which reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection, which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and those two approaches. The EDAG and HOODG systems are unreleased prototype systems.

(Table 2-9 fragment) Readability: task-oriented decision structures are easy to understand; EDAGs are difficult to read; HOODGs are easy to understand.

CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed, when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.

The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given:
- A set of decision rules in conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of declarative knowledge are that they do not impose any order on the evaluation of the attributes and that, due to this lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done online.

Such "virtual" decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized, and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows an architecture of the proposed methodology.

Figure 3-1: Architecture of the AQDT approach (learning knowledge from the database, and the decision-making process)


It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values; some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system, specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called AQ, starts with a "seed" example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the "star" of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description that covers the largest number of positive examples (to minimize the total number of rules needed) and, with second priority, the one that involves the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few or many examples, and can optimize the description according to a variety of easily-modifiable hypothesis quality criteria.


The learned descriptions are represented in the form of a set of decision rules expressed in an attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski, 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, i.e., one stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have large flat top". A characteristic description of the tables would also include properties such as "have four legs", "have no back", "have four corners", etc. Discriminant descriptions are usually much shorter than characteristic descriptions.

Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different classes are logically disjoint. The DC mode descriptions are usually more complex, both in the number of rules and the number of conditions. There is also a DL mode (a Decision List mode, also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order: if ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.
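The DL-mode evaluation order can be sketched as a simple decision-list interpreter. The rule representation here (each ruleset as a class name paired with rules, each rule a list of attribute/admissible-values conditions) is an assumed encoding for illustration, not AQ15's internal format:

```python
def classify_decision_list(rulesets, example):
    """Evaluate rulesets in their linear order (DL mode): the first
    ruleset with a rule satisfied by the example yields the decision.
    Each ruleset is (class, rules); a rule is a list of
    (attribute, set-of-admissible-values) conditions."""
    def satisfied(rule, ex):
        return all(ex.get(attr) in values for attr, values in rule)

    for decision, rules in rulesets:
        if any(satisfied(rule, example) for rule in rules):
            return decision
    return None  # no ruleset matched

rulesets = [("Class1", [[("x", {1, 2})]]),
            ("Class2", [[("y", {0})]])]
```

Note that an example satisfying both rulesets is assigned Class1, because in DL mode the order of evaluation matters.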

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes and selects from them the most promising ones, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the US Congress. Each rule is a conjunction of elementary conditions, and each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute State (of the Representative) should take the value "northeast" or "northwest" to satisfy the condition.

R1: [Gas_con_ban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] & [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"

The above rules were generated from examples of voting records. For illustration, below is an example of a voting record of a Democratic representative:

Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State from = northeast, State population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler corp. = not registered.

By expressing elementary statements in the example as conditions, and linking the conditions by conjunction, examples can be re-expressed as decision rules. Thus, decision rules and examples formally differ only in the degree of generality.
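Matching an example against a VL1-style condition with the internal disjunction and range operators described above can be sketched as follows. The rule encoding (a set of admissible values for internal disjunction, a (low, high) tuple for a range) and the sample data are illustrative assumptions:

```python
def satisfies(example, rule):
    """Check whether an example satisfies a VL1-style rule: each
    condition lists the admissible values of one attribute (internal
    disjunction) or an inclusive (low, high) range (range operator)."""
    for attr, admissible in rule:
        value = example.get(attr)
        if isinstance(admissible, tuple):       # range operator
            low, high = admissible
            if not (low <= value <= high):
                return False
        elif value not in admissible:           # internal disjunction
            return False
    return True

# [State = northeast v northwest] & [Population = 3..10]
rule = [("State", {"northeast", "northwest"}), ("Population", (3, 10))]
```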

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski, 1993a, b). Also included is a description of the AQDT-2 method for learning task-oriented decision structures from decision rules; finally, the methodology is illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning, due to their simplicity. Decision trees built this way can be quite efficient, as long as they are used in decision-making situations for which they are optimized and these situations remain relatively stable. Problems arise when these situations significantly change and the assumptions under which the tree was built do not hold anymore. For example, in some situations it may be difficult to determine the value of the attribute assigned to some node. One would like to avoid measuring this attribute and still be able to classify the example, if this is potentially possible (Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also desirable if there is a significant change in the frequency of occurrence of examples from different classes. A restructuring of a decision tree to suit the above requirements is, however, difficult to do. The reason for this is that decision trees are a form of decision structure representation that imposes constraints on the evaluation order of the attributes which are not logically necessary.


One problem in developing a method for generating decision structures from decision rules is to design an attribute selection criterion that is based on the properties of the rules rather than of the training examples. A decision rule normally describes a number of possible examples, only some of which are examples that have actually been observed, i.e., training examples. An attribute selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based on counting the numbers of training examples covered by each attribute-value and the frequency of decision classes in the training examples, as is done in learning decision trees from examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision rules constitute a more powerful knowledge representation than decision trees. They can directly represent a description in an arbitrary disjunctive normal form, while decision trees can directly represent only descriptions in disjoint disjunctive normal form, in which all conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary decision rules into a decision tree, one faces the additional problem of handling logically intersecting rules.

The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2 system is based on earlier work by Michalski (1978), which introduced a general method for generating decision trees from decision rules. The method aimed at producing decision trees with the minimum number of nodes or the minimum cost (where the cost was defined as the total cost of classifying unknown examples, given the cost of measuring individual attributes and the expected probability distribution of examples of different decision classes). More explanation is provided in the following section.

3.3.1 The AQDT-2 Attribute Selection Method

This section describes the AQDT-2 method for building a decision structure from decision rules. The method for building a single-parent decision structure is similar to that used in standard methods of building a decision tree from examples. The major difference is that it assigns tests (attributes) to the nodes using criteria based on the properties of the decision rules (this includes statistics about the examples covered by each rule, in the case of learning rules from examples), rather than statistics characterizing the frequency of training examples per decision class, per attribute-value, or per conjunction of both. Other differences are that the branches may be assigned an internal disjunction of values (not only a single value, as in a typical decision tree), and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be attributes, or names standing for logical or mathematical expressions that involve several attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably (to distinguish between an attribute and a name standing for an expression, the latter is called a "constructed attribute").

At each step, the method chooses from the available set of tests the test that has the highest utility (see below) for the given set of decision rules. This test is assigned to the node. The branches stemming from this node are assigned test values or disjoint groups of values (in the form of a logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each branch is associated with a reduced set of rules, determined by removing conditions in which the selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset indicate the same decision class, a leaf node is created and assigned this decision class. The process continues until all nodes are leaf nodes. If it is not possible to reduce the rule set further because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of candidate decisions with associated probabilities (see Section 4.2).

The test (attribute) utility is a combination of one or more of the following elementary criteria: 1) cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which captures the effectiveness of the test in discriminating among decision rules for different decision classes; 3) importance, which determines the importance of a test in the rules; 4) value distribution, which characterizes the distribution of the test importance over its values; and 5) dominance, which measures the test's presence in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of its class disjointness values, i.e., the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and decision rule sets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote the sets of values (outcomes) of A that are present in the rule sets for classes C1, C2, ..., Cm, respectively. If the ruleset for some class, say Ck, contains a rule that does not involve test A, then Vk is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by:

  D(A, Ci, Cj) =
    0,  if Vi ⊇ Vj
    1,  if Vi ⊂ Vj                                                  (3-1)
    2,  if Vi ∩ Vj ≠ φ, and Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj
    3,  if Vi ∩ Vj = φ

where φ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) may seem to yield an improved criterion; however, it would not clearly distinguish between the two cases (i.e., both situations would receive the same disjointness). The current equation is better because it gives higher scores to attributes that separate different subsets of the two decision classes than to attributes that separate only a subset of one decision class.

Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the sum of the degrees of class disjointness of each decision class:

  Disjointness(A) = Σ (i=1..m) D(A, Ci),  where  D(A, Ci) = Σ (j=1..m, j≠i) D(A, Ci, Cj)    (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m−1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the attribute to be selected is the one with the smaller number of values.
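The piecewise definition above can be sketched directly in code. The following is a minimal illustration (not the AQDT-2 implementation), in which each class is represented only by the set of values the attribute takes in that class's rules:

```python
# A minimal sketch of the disjointness criterion (equations 3-1 and 3-2).
# Each class is represented only by the value set V_i that the attribute
# takes in that class's rules; the representation is illustrative.

def degree_of_disjointness(vi, vj):
    """D(A, Ci, Cj) between the attribute's value sets for classes Ci and Cj."""
    vi, vj = set(vi), set(vj)
    if vi >= vj:               # Vi is a superset of (or equal to) Vj
        return 0
    if vi < vj:                # Vi is a proper subset of Vj
        return 1
    if vi & vj:                # the sets overlap but neither contains the other
        return 2
    return 3                   # the sets are disjoint

def disjointness(value_sets):
    """Disjointness(A): sum of D(A, Ci, Cj) over all ordered pairs of classes."""
    return sum(degree_of_disjointness(vi, vj)
               for i, vi in enumerate(value_sets)
               for j, vj in enumerate(value_sets) if i != j)
```

For two classes with disjoint value sets, e.g. {1} and {2}, the total is 3 + 3 = 6, which is the maximum 3m(m−1) for m = 2.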

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined from the root of the tree to any leaf node in order to reach a decision.

Definition: A decision structure is a one-node-per-level decision structure if at each level there is only one node and zero or more leaves. Such a decision structure can be generated by combining all branches whose associated sets of decision rules belong to more than one decision class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci and Cj. There are three cases: 1) subset (equivalently, superset); 2) non-empty intersection, but not a subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the same set of values in both classes is a trivial one). Assume that branches leading to one subset with the same decision class are combined into one branch. In the first case, there will be only two branches: the first leads to a leaf node, and the other leads to an intermediate node where another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three branches are created. Two branches lead to leaf nodes, where all values at each branch belong to only one (and a different) decision class; the third branch leads to an intermediate node where another attribute should be selected to further classify the decision classes. The minimum ANT in this case is 6/4. In the third case, only two branches are generated, each leading to a leaf node with a different decision class. In this case, the minimum ANT is 1.

Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that when more than one attribute-value labels branches leading to leaves of the same decision class, those branches are combined into one branch in the decision structure. The symbol "?" means that an attribute is needed to classify the two decision classes; in such cases, there will be at least two additional paths.

D(A, Ci) = 0, D(A, Cj) = 1;   D(A, Ci) = 2, D(A, Cj) = 2;   D(A, Ci) = 3, D(A, Cj) = 3
Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes

The average number of tests required for making a decision in each possible case is determined in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks highly the attributes that reduce the average number of tests required for decision-making. The theorem can be proved in the general case.

ANT = 5/3          ANT = 3/2          ANT = 1
A "?" means that at least one attribute is needed to complete the decision tree.
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify the decision classes.

Proof: Suppose that the number of decision classes is m. Assume also that there are two attributes, A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, this means that there are more decision classes where D(A, Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj) than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute are 0, 1, 4, or 6. For all positive values D(B) = 1, 4, or 6, it is clear that attribute B should have a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A. Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is a better classifier than A.

Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples covered by the rules involving this test. Decision rules learned by an AQ learning program are accompanied by information on their strength. Rule strength is characterized by the t-weight and the u-weight. The t-weight (total weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the t-weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes, C1, ..., Cm, involving n tests, A1, ..., An, with the number of rules associated with class Ci denoted by ri, the importance score is defined as follows:


Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

  IS(Aj) = Σ (i=1..m) IS(Aj, Ci)                                    (3-3.1)

where

  IS(Aj, Ci) = Σ (k=1..ri) Rik(Aj)                                  (3-3.2)

and Rik(Aj), the weight of test Aj in rule Rik of class Ci, is given by:

  Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0, otherwise    (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
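Under the definitions above, computing IS(Aj) requires knowing, for each rule, only its t-weight and the attributes appearing in its condition part. A minimal sketch with illustrative rule data (the representation is an assumption, not AQ's rule format):

```python
# A minimal sketch of the importance score of Definition 3-3. Each rule is
# modeled as a hypothetical (t_weight, attributes) pair, since only a rule's
# t-weight and the attributes in its condition part matter for IS.

def importance_score(attribute, rulesets):
    """IS(Aj): sum of the t-weights of all rules, in all classes, that use Aj."""
    return sum(t_weight
               for rules in rulesets            # one ruleset per decision class
               for t_weight, attrs in rules
               if attribute in attrs)

# Illustrative rulesets for two decision classes.
rules_c1 = [(10, {"x1", "x2"}), (5, {"x1", "x3"})]
rules_c2 = [(8, {"x2"})]
```

With this data, IS(x1) = 10 + 5 = 15 and IS(x2) = 10 + 8 = 18.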

The importance score method has been separately compared, as a feature selection method, with a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method produced equal or higher accuracy on three real-world problems than that reported for the GA method, while selecting fewer attributes. In addition, the IS method was significantly faster than the GA.

Value distribution: The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by:

  VD(Aj) = IS(Aj) / vj                                              (3-5)

where vj is the number of legal values of Aj.

Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore, for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3]&[x4=1] is multiplied out to two rules with condition parts [x3=1]&[x4=1] and [x3=3]&[x4=1].
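The multiplying-out step can be sketched as a Cartesian product over the value lists of a condition part; the dict representation below is an illustrative assumption:

```python
# A sketch of "multiplying out" internal disjunction for the dominance count.
# A condition part is modeled as a dict from attribute name to its list of
# admitted values, e.g. [x3=1 v 3]&[x4=1] becomes {"x3": [1, 3], "x4": [1]}.

from itertools import product

def multiply_out(condition):
    """Expand a condition part with internal disjunction into single-value parts."""
    attrs = sorted(condition)
    return [dict(zip(attrs, combo))
            for combo in product(*(condition[a] for a in attrs))]
```

Applied to the example above, multiply_out({"x3": [1, 3], "x4": [1]}) yields the two condition parts [x3=1]&[x4=1] and [x3=3]&[x4=1].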

The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). A LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed as a percentage. The criteria are applied to tests in the order defined by the LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (measured from the top value). The default LEF is:

  <Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>    (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percentages); their default values are 0%. The default value of the cost of each test is 1.

The above LEF ranks attributes as follows. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the next (importance) criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the next criterion (value distribution, i.e., the normalized IS) is used, and then, similarly, the last criterion (dominance). If there is still a tie, the method selects the best attribute randomly.
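The LEF selection scheme described above can be sketched as follows; the triple-based representation of criteria and all names are illustrative assumptions, not the AQDT-2 interface:

```python
# A sketch of lexicographic evaluation with tolerances (LEF). Each criterion
# is a (score, tolerance_percent, maximize) triple; only candidates scoring
# within the tolerance of the best value pass to the next criterion.

def lef_select(candidates, criteria):
    for score, tol, maximize in criteria:
        scores = {c: score(c) for c in candidates}
        best = max(scores.values()) if maximize else min(scores.values())
        margin = abs(best) * tol / 100.0
        candidates = [c for c in candidates if abs(scores[c] - best) <= margin]
        if len(candidates) == 1:
            break
    return candidates[0]       # remaining ties are broken arbitrarily

# Cost first (minimized, tolerance 0%), then disjointness (maximized).
costs = {"x1": 1, "x2": 1, "x3": 5}
disj = {"x1": 6, "x2": 4, "x3": 9}
criteria = [(costs.get, 0, False), (disj.get, 0, True)]
```

Here x3 has the highest disjointness but is eliminated by the cost criterion; x1 and x2 pass, and disjointness then selects x1.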

If there is a non-uniform frequency distribution of examples of different classes, then the selection criterion uses a modified definition of the disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified into a given class:

  Disjointness(A) = Σ (i=1..m) D(A, Ci) · Frq(Ci)                   (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF:

  <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.

3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting, at each step, the best test according to the ranking criteria described above and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. A decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is, in turn, connected to a set of data structures representing the decision rules within that decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first, in descending order of the number of rules that contain the attribute, and second, in ascending order of the number of the attribute's legal values.

The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program, rather than only those originally given.

To generate decision structures from rules, the AQDT-2 method prefers either characteristic or discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset and that this set is the initial "ruleset context". The AQDT-2 algorithm is:

The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent it.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it the group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing the condition [A = i v ...]; if a branch is assigned values i v j, then associate with it all rules containing the condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] ≡ [x=1]&[y=a] v [x=1]&[y=b], assuming that a and b are the only legal values of y.) All rules associated with a given branch constitute the ruleset context for that branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree have leaf nodes, stop; otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
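Steps 1 through 4 can be sketched as a recursive procedure. The sketch below is a deliberate simplification (standard mode only, with the full LEF reduced to the disjointness criterion); the rule representation and all names are illustrative assumptions, not AQDT-2's own:

```python
# A simplified sketch of Steps 1-4 in standard mode. A rule is a
# (decision_class, conditions) pair, where conditions maps an attribute to
# the set of values it admits.

def build_tree(rules, domains):
    classes = {c for c, _ in rules}
    if len(classes) == 1:                        # Step 4: single class -> leaf
        return classes.pop()

    def disjointness(attr):
        # Value set V_i of attr for each class; a rule not mentioning attr
        # contributes the whole domain of attr (consensus law of Step 3).
        vsets = [set().union(*(cond.get(attr, domains[attr])
                               for c2, cond in rules if c2 == c))
                 for c in classes]
        return sum(3 if not (a & b) else
                   2 if not (a <= b or b <= a) else
                   1 if a < b else 0
                   for a in vsets for b in vsets if a is not b)

    attr = max(domains, key=disjointness)        # Step 1: best attribute
    rest = {a: d for a, d in domains.items() if a != attr}
    # Steps 2-3: one branch per legal value; a rule follows a branch if its
    # condition on attr admits that value (or it has no condition on attr),
    # and the tested condition is removed from the rule.
    return {(attr, v): build_tree(
                [(c, {a: s for a, s in cond.items() if a != attr})
                 for c, cond in rules if v in cond.get(attr, domains[attr])],
                rest)
            for v in domains[attr]}
```

For instance, two rules [x1=1] => A and [x1=2] => B over the domains x1, x2 ∈ {1, 2} yield a one-level tree that tests x1 and assigns leaves A and B.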

To select an attribute to become a node of the decision tree (Steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF; it evaluates each attribute's disjointness for each decision class against the other decision classes.


To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

  r = Σ (i=1..m) Ri    (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as:

  Cmpx(Itr1) = O(r·s)

In the second iteration, the disjointness is calculated between the decision classes for all attributes. The complexity of the second iteration is given by:

  Cmpx(Itr2) = O(n·m)

Assume that, at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

  l = max{m, r}                                                     (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node of the decision tree, called the Node Complexity NC(AQDT), is given by:

  NC(AQDT) = O(l·n)

Usually, l equals the number of rules associated with the given node. Thus, the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The Level Complexity of the AQDT algorithm, LC(AQDT), is given by:

  LC(AQDT) < O(l·n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (⌊r/2⌋). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be at least twice the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm to be (l·s·q), where q is the number of non-leaf nodes at the given level; in such cases, either (l·q ≤ r) or (l·s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by:

  LC(AQDT) = O(2·s·r/2) < O(n·l) = NC(AQDT)

Figure 3-5: Decision trees showing the maximum number of non-leaf nodes: a) per one level; b) per one path

Note also that after an attribute is selected to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structures of the algorithm. Also, once a leaf node is generated, the rules belonging to the corresponding branch are not tested again.

Since the disjointness criterion selects the attribute that minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels of a decision tree is at most the minimum of the number of attributes and the number of rules. Let k be the number of levels in a given decision tree:

  k ≤ min{n, r}                                                     (3-10)

Two cases represent the most complex situations: Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels is a function of the logarithm of the number of rules. In such a case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by:

  Complexity(AQDT) = O(l·n·log r)                                   (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels of a decision tree equals one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, it is unlikely to obtain such a decision tree, because it has the maximum average number of tests (ANT) that can be obtained from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In such a case, any disjoint decision rule has a maximum length that is at most the floor of the logarithm of the number of attributes. Thus, the level complexity of this decision tree is estimated as:

  LC(AQDT) = O(l·log n)

The maximum number of levels in such a decision tree is k−1. Thus, the complexity of the AQDT algorithm in such cases is given by:

  Complexity(AQDT) = O(l·k·log n)                                   (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT algorithm is bounded by:

  Cmplx(AQDT) = O(r·k·log l)                                        (3-13)

3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT-2 can be used to select the optimal set of resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1); 2) checklist (T2); and 3) par_simul (T3). Also assume that there are four factors that affect the selection of a tool: 1) the cost of using the tool (x1); 2) the metrics that support the tool (x2); 3) the best phase for applying the tool (x3); and 4) the type of the tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the selection process

T1 <= [x1=2]&[x2=2] v [x1=3]&[x3=1 v 3]&[x4=1]
T2 <= [x1=1 v 2]&[x2=3 v 4] v [x1=3]&[x3=1 v 2]&[x4=2]
T3 <= [x1=1]&[x2=1] v [x1=4]&[x3=2 v 3]&[x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software

These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phase, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phase, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phase, and you need a semi-automated tool.


Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is treated as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume that the tolerances for each elementary criterion equal 0%.

From the rules in Figure 3-6, we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: 1) determine, for each attribute, the sets of values that the attribute takes in the individual decision rules, and remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1} and {4} (Figure 3-6). The value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned the individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4} and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2} and {3, 4}.
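The subsumption-removal step can be sketched as follows, reproducing the x1 and x2 groupings described above; the set-based representation is an illustrative assumption:

```python
# A sketch of forming branch value groups for the compact mode: collect the
# value sets an attribute takes in the individual decision rules, then drop
# every set that subsumes (strictly contains) another collected set.

def branch_value_sets(value_sets):
    sets = [frozenset(s) for s in value_sets]
    kept = {s for s in sets if not any(t < s for t in sets)}
    return sorted(kept, key=sorted)              # deterministic order
```

For x1, the input sets {2}, {3}, {1, 2}, {1}, {4} reduce to the individual values {1}, {2}, {3}, {4}; for x2, the sets {2}, {3, 4}, {1}, {1, 2, 3, 4} reduce to {1}, {2} and {3, 4}.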


Attribute x1 ranks highest (it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 is ended by a leaf T3. Rules containing the other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used for making decisions on which tools can be used for testing given software.

Figure 3-7: A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)

Figure 3-8-a shows a diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each cell of a diagram in Figure 3-8 represents one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3 and 4). Rules are represented by collections of cells at the intersections of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules; rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2]&[x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2]&[x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1]&[x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4]&[x3=2 v 3]&[x4=3].

Figure 3-8: a) Decision rules; b) Derived decision tree

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools; in other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4, the recommended tool is T1, and for the value 2 of x4, the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.

Figure 3-9: Decision trees learned: a) ignoring the supporting metric; b) ignoring the type of the tool

It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selects x4 as the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10: A decision tree learned without the cost attribute

3.4 Tailoring Decision Structures to the Decision-Making Situation

Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990); as such, they may have the same predictive accuracy, but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This chapter presents an approach to building such task-oriented decision structures, which advocates that they are built not from examples, but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm, implemented in a new system, AQDT-2, transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings); in these experiments, AQDT-2 outperformed all the other programs applied to the same data.

Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.


3.4.1 Learning Cost-Dependent Decision Structures

As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) when developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0%. This means that only the least expensive attributes pass to the next step of evaluation, which involves the other elementary criteria. If an attribute has a high cost or is impossible to measure (has an infinite cost), LEF chooses another, cheaper attribute if possible.
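The LEF filtering just described can be sketched as follows; the criterion functions, tolerance values, and scores below are illustrative assumptions, not AQDT-2's actual implementation.

```python
def lef_select(attributes, criteria):
    """Select an attribute by lexicographic evaluation with tolerances (LEF).

    `criteria` is an ordered list of (score_fn, tolerance, maximize) tuples.
    At each stage, only candidates within `tolerance` (a fraction of the
    best score) survive to the next criterion.  A hypothetical sketch.
    """
    candidates = list(attributes)
    for score_fn, tolerance, maximize in criteria:
        scores = {a: score_fn(a) for a in candidates}
        best = max(scores.values()) if maximize else min(scores.values())
        span = abs(best) * tolerance
        if maximize:
            candidates = [a for a in candidates if scores[a] >= best - span]
        else:
            candidates = [a for a in candidates if scores[a] <= best + span]
        if len(candidates) == 1:
            break  # a single winner: no need to apply further criteria
    return candidates[0]

# Cost is the first criterion (minimized, 0% tolerance), so only the
# cheapest attributes reach the second (maximized) criterion.
cost = {"x1": 5.0, "x2": 1.0, "x3": 1.0, "x4": 9.0}
disjointness = {"x1": 12, "x2": 7, "x3": 10, "x4": 14}
chosen = lef_select(
    ["x1", "x2", "x3", "x4"],
    [(cost.get, 0.0, False), (disjointness.get, 0.0, True)],
)
print(chosen)  # x3: the cheapest attributes are x2 and x3; x3 wins on disjointness
```

Note that x4, despite having the highest disjointness, is eliminated in the first stage because of its cost.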

3.4.2 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision in some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution over the candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequencies at the given node. Consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that the example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayes formula, we have

P(Ci | b1, ..., bk) = P(Ci) P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has the attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequencies of training examples from the different classes, we have


P(Ci) = twi / Σ_{j=1..m} twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σ_{j=1..m} wj / Σ_{j=1..m} twj    (3-12)

By substituting (3-10), (3-11), and (3-12) in (3-9), we obtain

P(Ci | b1, ..., bk) = wi / Σ_{j=1..m} wj    (3-13)
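As a quick check of (3-13): when (3-10)-(3-12) are substituted into (3-9), the class totals twi cancel, leaving only the per-node counts wi. The counts below are hypothetical.

```python
def class_probabilities(w):
    """Estimate P(Ci | b1, ..., bk) at a node from the per-class counts
    w[i] of training examples that passed the tests leading to the node,
    per equation (3-13): P(Ci | b1, ..., bk) = wi / sum_j wj.
    Note that the class totals twi from (3-10)-(3-12) have cancelled out."""
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical counts for three classes at some node.
print(class_probabilities([6, 3, 1]))  # [0.6, 0.3, 0.1]
```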

A related method for handling the unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

3.4.3 Coping with Noise in Training Data

The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to produce more accurate decision structures than decision tree pruning, because truncation decisions are based solely on the importance of the given rule or condition for decision-making, regardless of the evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose the attributes to prune). Examples are presented in Section 4.
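A minimal sketch of t-weight-based rule truncation; the rule encoding below is an illustrative assumption, not AQDT-2's internal representation.

```python
def truncate_rules(rules, noise_level=0.10):
    """Remove rules whose t-weight covers no more than `noise_level`
    of the training examples of their decision class.
    `rules` is a list of dicts with keys 'class', 'conditions', and 't'
    (the t-weight: the number of training examples the rule covers)."""
    per_class_total = {}
    for r in rules:
        per_class_total[r["class"]] = per_class_total.get(r["class"], 0) + r["t"]
    return [r for r in rules
            if r["t"] / per_class_total[r["class"]] > noise_level]

rules = [
    {"class": "C1", "conditions": "[x1=1]", "t": 18},
    {"class": "C1", "conditions": "[x1=3][x2=1]", "t": 2},  # 2/20 = 10% -> removed
    {"class": "C2", "conditions": "[x4=3]", "t": 28},
]
print([r["conditions"] for r in truncate_rules(rules)])  # ['[x1=1]', '[x4=3]']
```

Because the ratio is taken per class, a light rule in a small class can survive while an equally light rule in a large class is truncated.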


3.5 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection presents an analysis of the attribute selection criteria used in AQDT-2. The analysis uses two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion ranks that attribute first. The first problem was introduced by Quinlan in 1993. It has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes, and contains ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem, the following disjoint rules were learned by AQ15c from the given data:

Play <:: [outlook = overcast]
Play <:: [outlook = sunny] & [humidity <= 75]
Play <:: [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here differs from the representation used by Mingers. The attribute X is better than the attribute Y for building the decision tree. The ambiguity in the data increases the difficulty of selecting the correct attribute for the root of the tree. The goal of this experiment is first to select the correct attribute and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y); the gain ratio criterion, however, gave X a higher score than all the other criteria did.

Table 3-5 shows the performance of the AQDT-2 attribute selection criteria on Mingers' first problem. The criteria were tested when applied both to the examples and to the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for a class C, all ambiguous examples belonging to C are considered to be examples that belong to C only.

The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when applied both to the original examples and to the rules learned from those examples. Neither the importance score nor the value distribution criterion can be expected to perform well when evaluating the training examples directly, because these two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When applied to the learned rules, however, they produce very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute that has the most balanced appearance of its values across different rules. The dominance criterion prefers attributes that occur in a large number of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and the situations in which each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
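Two of these criteria can be sketched directly from the descriptions above; the rule encoding (a dict of conditions plus a t-weight) is an illustrative assumption, not AQDT-2's actual representation, and the exact scoring formulas in AQDT-2 may differ.

```python
def importance(rules, attr):
    """Sum of the t-weights (covered-example counts) of the rules that
    mention `attr`: the attribute appearing in rules covering the most
    examples scores highest."""
    return sum(r["t"] for r in rules if attr in r["conditions"])

def dominance(rules, attr):
    """Count of rules whose conditions refer to `attr`, approximating the
    number of elementary rules in which the attribute occurs."""
    return sum(1 for r in rules if attr in r["conditions"])

# Hypothetical rules in the spirit of Figure 4-2.
rules = [
    {"conditions": {"x1": 1, "x6": 1}, "t": 18},
    {"conditions": {"x1": 3, "x2": 1}, "t": 3},
    {"conditions": {"x6": 1}, "t": 28},
]
print(importance(rules, "x6"), importance(rules, "x1"))  # 46 21
print(dominance(rules, "x6"), dominance(rules, "x1"))    # 2 2
```

Here x6 and x1 tie on dominance, but importance breaks the tie in favor of x6 because its rules cover more examples.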

3.6 Decision Structures vs. Decision Trees

This subsection compares the decision structures proposed in this thesis with traditional decision trees. Even though systems for learning decision structures may be more complex and may take more time to generate decision structures from examples, decision structures have many other advantages. Table 3-7 shows a comparison between the two approaches.


Table 3-7 A comparison between decision structures and decision trees

Another important point is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11, which are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes and is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes (a total of 16), but it is equivalent to a decision tree with 37 nodes.


[Figure: a) a decision structure built using the disjointness criterion, rooted at x5 (5 nodes); b) a decision structure built using the importance score criterion, rooted at x1 (7 nodes, 9 leaves). P = Positive, N = Negative.]

Figure 3- Decision structures learned by AQDT-2 using different criteria

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called "Imam's example," that represents a class of problems on which the information-gain criteria of decision tree learning programs do not work properly. The idea behind this example is that information-based criteria depend on the frequency of the training examples per decision class and on the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1 = x2, and N otherwise.

The training examples are shown in Figure 3-12-a, where "+" means the example belongs to class P and "-" means it belongs to class N. Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples for each value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 differ. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

[Figure: a) the training examples; b) the optimal decision tree.]

Figure 3-12 Imam's example: a problem for which learning decision structures (trees) from rules is better than learning them from examples

AQ15c learned the following rules from this data

P <:: [x1=1][x2=1] v [x1=2][x2=2]
N <:: [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
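The weakness that Imam's example exploits can be reproduced with a small information-gain computation: for a balanced x1 = x2 concept, the gain of both x1 and x2 is exactly zero, while a skewed but irrelevant attribute such as x3 can score higher. The dataset below is a hypothetical miniature in the spirit of Figure 3-12, not the exact training set.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Entropy reduction obtained by splitting on `attr`."""
    gain = entropy(labels)
    for v in {e[attr] for e in examples}:
        sub = [l for e, l in zip(examples, labels) if e[attr] == v]
        gain -= len(sub) / len(labels) * entropy(sub)
    return gain

# Balanced x1 = x2 concept; x3 is irrelevant but its values are skewed.
examples = [
    {"x1": 1, "x2": 1, "x3": 1}, {"x1": 1, "x2": 1, "x3": 1},
    {"x1": 2, "x2": 2, "x3": 1}, {"x1": 2, "x2": 2, "x3": 2},
    {"x1": 1, "x2": 2, "x3": 2}, {"x1": 1, "x2": 2, "x3": 2},
    {"x1": 2, "x2": 1, "x3": 2}, {"x1": 2, "x2": 1, "x3": 1},
]
labels = ["P", "P", "P", "P", "N", "N", "N", "N"]

print(info_gain(examples, labels, "x1"))            # 0.0
print(info_gain(examples, labels, "x2"))            # 0.0
print(round(info_gain(examples, labels, "x3"), 3))  # 0.189
```

A greedy information-gain learner would therefore pick the irrelevant x3 first, whereas the rules learned by AQ15c expose x1 and x2 directly.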

Figure 3-13-a shows an example of a problem for which decision trees may not be an efficient way to represent knowledge; Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <:: [x1=2] v [x2=2]
N <:: [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of such measures are: 1) comparing the number of nodes in the decision tree to the number of examples (109); 2) comparing the average number of tests required to make a decision (13n for the decision tree and 85 for decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When a decision tree is learned from rules produced by constructive induction, a tree with only three nodes can be obtained, using the new attribute "x1|x2 = 2" (i.e., whether x1 = 2 or x2 = 2), with values 0 for "no" and 1 for "yes".
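That constructive-induction step can be sketched as follows; the reading of the derived attribute as "x1 = 2 or x2 = 2" is an interpretation of the condition named above, consistent with the rules shown for this problem.

```python
# Concept of Figure 3-13: P when x1 = 2 or x2 = 2, N otherwise.
def derived(e):
    """New binary attribute: 1 ('yes') if x1 = 2 or x2 = 2, else 0 ('no')."""
    return 1 if e["x1"] == 2 or e["x2"] == 2 else 0

examples = [({"x1": a, "x2": b}, "P" if a == 2 or b == 2 else "N")
            for a in (1, 2, 3) for b in (1, 2, 3)]

# The single derived attribute separates the classes perfectly, so the
# tree needs only one internal node and two leaves (three nodes in all).
assert all((derived(e) == 1) == (label == "P") for e, label in examples)
print(sorted({(derived(e), label) for e, label in examples}))  # [(0, 'N'), (1, 'P')]
```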

[Figure: a) the training data; b) the correct decision tree.]

Figure 3-13 An example where decision rules are simpler than decision trees

CHAPTER 4 Empirical Analysis and Comparative Study

This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. The section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments used the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot easily be described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracing in tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their predictive accuracy.
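The sampling protocol just described can be sketched as follows; the function name, parameters, and seed are illustrative assumptions.

```python
import random

def learning_curve_splits(examples,
                          fractions=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
                          samples_per_size=100, seed=0):
    """Yield (fraction, train, test) splits: for each relative size, draw
    `samples_per_size` random training samples without replacement; the
    complement of each sample is its test set."""
    rng = random.Random(seed)
    for frac in fractions:
        k = round(frac * len(examples))
        for _ in range(samples_per_size):
            train = rng.sample(examples, k)
            test = [e for e in examples if e not in train]
            yield frac, train, test

# Small demonstration: 40 items, two sizes, two samples per size.
data = list(range(40))
splits = list(learning_curve_splits(data, fractions=(0.1, 0.5), samples_per_size=2))
print(len(splits))                 # 4
frac, train, test = splits[0]
print(len(train), len(test))       # 4 36
```

With the nine fractions and 100 samples per size used in the dissertation, this yields the 900 train/test pairs per problem mentioned below.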


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushrooms, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 describes the planned experiments on the first set of problems. The best settings (the best path from top to bottom), in terms of accuracy, time, and complexity, were used as the default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets and different sizes of training examples.

Figure 4-1 Design of a complete experiment


For each of these experiments, the testing examples were the complement of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. The analysis of some experiments included visualization of the training examples and of the concepts learned by each learning program (AQ15c, AQDT-2, and C4.5). The different decision structures learned for different decision-making situations were also visualized, as were different but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database), 9 different relative sample sizes of training examples were selected (10%, ..., 90%); 100 random samples of each size were drawn from the original data for training, and the 100 complementary sets remaining after drawing the training data were used for testing (900 samples for training and their 900 complementary samples for testing).

162 different parametrical experiments per training dataset (18 x 9); 16,200 experiments per sample size (9 samples); 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2); 199,800 experiments per problem (first portion + C4.5 + constructive induction); 999,000 intermediate processes (e.g., changing formats, refining data, storing results, etc.); 73 days (estimated running time).

The following subsection presents a complete experimental analysis of the wind bracing problem. Each subsequent subsection describes a partial or full experimental analysis of one of the other problems.

4.2 Experiments with Average-Size, Complex, and Noise-Free Problems: Wind Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision


structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

Decision class C1:
1. [x1=1][x6=1][x2=1..2][x3=1..2][x4=1..3][x5=1..2][x7=1..3] (t: 18, u: 18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1..3][x7=1,3,4] (t: 3, u: 3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2..3] (t: 2, u: 2)
4. [x1=1][x6=1][x2=2][x3=1..2][x4=3][x5=1..2][x7=4] (t: 2, u: 2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1..2] (t: 2, u: 2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1..3][x7=1..3][x5=3] (t: 2, u: 2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1..2][x4=3][x7=4] (t: 2, u: 2)

Decision class C2:
1. [x1=2..4][x2=1..2][x3=1..2][x4=3][x5=2..3][x6=1][x7=2..3] (t: 28, u: 19)
2. [x1=2..4][x2=2][x3=1..2][x4=3][x5=1..2][x6=1][x7=3..4] (t: 17, u: 6)
3. [x1=2..4][x2=1..2][x3=1..2][x4=3][x5=1][x6=1][x7=3..4] (t: 10, u: 4)
4. [x1=1,3,5][x2=1..2][x3=1..2][x4=3][x5=3][x6=1][x7=2..4] (t: 10, u: 2)
5. [x1=3..5][x2=1..2][x3=1..2][x4=3][x5=2..3][x6=1][x7=1..4] (t: 9, u: 4)
6. [x1=2][x2=1..2][x3=1..2][x5=1..3][x4=1][x6=1][x7=1] (t: 7, u: 6)
7. [x1=3..4][x2=2][x3=2][x4=1..3][x5=1..3][x6=1][x7=1..2] (t: 6, u: 4)
8. [x1=3..5][x2=2][x3=1][x7=1][x4=1..2][x5=1..3][x6=1..3] (t: 5, u: 5)
9. [x1=1][x2=1][x6=1][x3=1..2][x4=3][x5=1..2][x7=4] (t: 4, u: 4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1..2][x7=1..3] (t: 4, u: 4)
11. [x1=1..2][x2=1][x6=1][x3=1..2][x4=1..3][x5=3][x7=1..4] (t: 4, u: 2)

Decision class C3:
1. [x1=2..5][x2=1..2][x3=1..2][x7=1..4][x4=1..2][x5=1..3][x6=2..4] (t: 41, u: 32)
2. [x1=1..4][x2=1..2][x3=1..2][x4=2][x5=2][x6=2..3][x7=2..4] (t: 27, u: 20)
3. [x1=1..3][x2=1][x3=1..2][x7=1..4][x4=2][x5=1..2][x6=2..3] (t: 19, u: 6)
4. [x1=1,2,4][x2=1..2][x3=1..2][x4=2][x5=2..3][x6=3..4][x7=1] (t: 13, u: 8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1..2][x6=3][x7=2..4] (t: 5, u: 5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1..3][x5=1][x6=1][x7=1..4] (t: 4, u: 4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t: 1, u: 1)

Figure 4-2 Decision rules determined by AQ15c from the wind bracing data


Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes were considered). The branches stemming from the root are marked by values of x6 (in general, they could be groups of values) according to the way those values occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). Each branch is assigned the subset of the rules containing its value. The process repeats for a branch until all rules assigned to the branch are of the same class; that class is then assigned to the leaf.
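The recursive construction just described can be sketched as follows. The rule encoding (each condition maps an attribute to the set of values it admits, and an absent attribute admits every value) and the trivial selection function are illustrative assumptions standing in for AQDT-2's LEF-based attribute selection.

```python
def build_structure(rules, attrs, select, domains):
    """Recursively build a decision structure from decision rules.
    `rules`: list of (conditions, cls), where conditions maps an attribute
    to the set of values it admits (an absent attribute admits all values).
    `select`: picks the next attribute from the candidate list."""
    classes = {cls for _, cls in rules}
    if len(classes) == 1:          # all rules agree: make a leaf
        return classes.pop()
    attr = select(rules, attrs)
    node = {}
    for value in domains[attr]:
        # Keep only the rules consistent with this branch's value.
        subset = [(c, cls) for c, cls in rules
                  if value in c.get(attr, domains[attr])]
        if subset:
            node[(attr, value)] = build_structure(
                subset, [a for a in attrs if a != attr], select, domains)
    return node

# Toy ruleset in the spirit of Figure 4-2 (hypothetical values).
domains = {"x6": {1, 2}, "x1": {1, 2}}
rules = [({"x6": {2}}, "C3"),
         ({"x6": {1}, "x1": {1}}, "C1"),
         ({"x6": {1}, "x1": {2}}, "C2")]
first = lambda rules, attrs: attrs[0]   # stand-in for the LEF selection
print(build_structure(rules, ["x6", "x1"], first, domains))
# {('x6', 1): {('x1', 1): 'C1', ('x1', 2): 'C2'}, ('x6', 2): 'C3'}
```

The branch for x6 = 2 terminates immediately because all rules containing that value belong to one class, mirroring the behavior described for [x6=4] below.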

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 were misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated, but only for the rules that contain [x6=1] as one of their conditions. In this example, x1 has the highest importance score, so it was selected as the next node in the structure. This process is repeated for each subset of rules until the decision structure is complete.

For comparison, the program C4.5 for learning decision trees from examples was also applied to the same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.

The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some of the misclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatched.

[Figure: a decision tree rooted at x6. Complexity: 17 nodes, 43 leaves.]

Figure 4-3 A decision tree learned by C4.5 for the wind bracing data

Figure 4-4 shows a decision structure learned, under the default settings of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. The marked leaves represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision structure was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly and 14 incorrectly, and 30 examples were assigned the indefinite decision. Such leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.

[Figure: a decision structure. Complexity: 5 nodes, 9 leaves.]

Figure 4-4 A decision structure learned from the AQ15c wind bracing rules

[Figure: a decision structure rooted at x2. Complexity: 6 nodes, 8 leaves.]

Figure 4-5 A decision structure that does not contain attribute x1

Figure 4-6 presents the decision structure from Figure 4-5 in which the leaves were assigned candidate decisions with decision class probability estimates. Consider the node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (3-13), the probability estimates for classes C1, C2, C3, and C4 under node x2 are approximately P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11.
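These estimates follow directly from equation (3-13), in which only the per-node counts wi appear; a minimal check:

```python
# Example counts at node x2, as given in the text.
w = {"C1": 31, "C2": 11, "C3": 0, "C4": 5}
total = sum(w.values())                        # 47
probs = {c: round(wi / total, 2) for c, wi in w.items()}
print(probs)  # {'C1': 0.66, 'C2': 0.23, 'C3': 0.0, 'C4': 0.11}
```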

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that rules whose combined t-weight represented 10% or less of the coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).

[Figure: a decision structure with candidate decisions and probability estimates at the leaves (e.g., C1: .66, C2: .23, C4: .11). Complexity: 5 nodes, 7 leaves.]

Figure 4-6 A decision structure without x1, with candidate decisions assigned to leaves

[Figure: a decision structure. Complexity: 3 nodes, 5 leaves.]

Figure 4-7 A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples of a decision class

To demonstrate how the concept description learned by AQDT-2 changes with different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the costs of different attributes. Starting with the decision structure in Figure 4-4: in the first situation, x5 was given a high cost, and AQDT-2 generated a decision structure with four nodes and six leaves, whose predictive accuracy was 86.1%. In the second decision-making situation, x1 was given a high cost, and AQDT-2 learned a decision structure with five nodes and seven leaves, whose predictive accuracy was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes that were used in building the initial decision trees. The visualization diagram uses different shades for different decision classes. Another shade marks cells that require another attribute (e.g., x3, x4, or x7) to classify the testing examples correctly. White cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute; in such cases, multiple decisions can be provided with their appropriate probabilities.

Figure 4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>; see Table 4-2) were selected for the experiments with Subsystem II.

These experiments were performed on four learning problems (the three MONK problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of the rules learned by AQ15c from examples and the predictive accuracy of the decision structures learned by AQDT-2 from those decision rules. Each value in the table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with the testing examples that form the complement of the training example set.


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in the intersecting or disjoint modes. For each dataset, the result reported for each experiment is the average over 100 runs with different training data, for 9 different sample sizes. The parameters changed in this experiment were the threshold for pre-pruning the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is a threshold on the ratio between the numbers of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.

[Figure: four plots of predictive accuracy (%) vs. relative training sample size (%) for AQDT-2 and AQ15c, under the settings <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>.]

Figure 4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem

Figure 4-10 shows the changes in the predictive accuracy of the decision structures learned by AQDT-2 under different parameter settings. The "default" curve shows the predictive accuracy obtained with the default settings of AQDT-2: a pre-pruning threshold of 3 and a generalization degree of 10. The results show that, for the wind bracing data, it is better to reduce the generalization degree to 3; changing the pre-pruning threshold, however, did not improve the predictive accuracy.

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and by the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are averages over 100 runs. For


each dataset, we report the predictive accuracy, the complexity of the learned decision trees, and the learning time. Figure 4-11 shows a simple summary of these experiments.

[Figure: two plots of predictive accuracy (%) vs. relative training sample size (%) for the <Disj, Char> and <Intr, Char> settings, comparing the default, pre-pruning, and generalization parameter settings.]

Figure 4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data

[Figure: three plots vs. relative training sample size (%): predictive accuracy, complexity, and learning time for AQDT-2, AQ15c, and C4.5 on the wind bracing data.]

Figure 4-11 A comparison of AQ15c and AQDT-2 against C4.5 on the wind bracing data

4.3 Experiments with Small-Size, Simple, and Noise-Free Problems: MONK

This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no).
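The MONK attribute space is small enough to enumerate exhaustively. The following sketch is not part of the dissertation; the numeric attribute encoding and the standard MONK-1 target definition (jacket-color is red, or head-shape equals body-shape) are assumptions. It confirms the figures used below: the space contains 432 examples, so 124 training examples cover roughly 29% of it.

```python
from itertools import product

# Hypothetical 1-based encoding of the six MONK attributes, following the
# domain sizes given in the text: x1, x2 in {1..3}, x3 in {1, 2},
# x4 in {1..3}, x5 in {1..4}, x6 in {1, 2}.
DOMAINS = {"x1": 3, "x2": 3, "x3": 2, "x4": 3, "x5": 4, "x6": 2}

def all_examples():
    """Enumerate the full representation space of the MONK problems."""
    names = list(DOMAINS)
    for values in product(*(range(1, DOMAINS[n] + 1) for n in names)):
        yield dict(zip(names, values))

def monk1_concept(ex):
    """Standard MONK-1 target: (head-shape = body-shape) or (jacket-color = red)."""
    return ex["x1"] == ex["x2"] or ex["x5"] == 1

space = list(all_examples())
positives = [ex for ex in space if monk1_concept(ex)]
print(len(space), len(positives))   # 432 examples, half of them positive
```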


The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus, the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained by the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.

Figure 4-12. A visualization diagram of the MONK-1 problem.

The AQDT-2 program, running in its default mode and with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with 41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem.

Positive rules:
1. [x5 = 1]
2. [x1 = 3][x2 = 3]
3. [x1 = 2][x2 = 2]
4. [x1 = 1][x2 = 1]

Negative rules:
1. [x1 = 1][x2 = 2, 3][x5 = 2..4]
2. [x1 = 2][x2 = 1, 3][x5 = 2..4]
3. [x1 = 3][x2 = 1, 2][x5 = 2..4]

Figure 4-13. Decision rules learned by AQ15c for the MONK-1 problem.
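As a consistency check (a sketch added here, not part of the dissertation; it assumes the standard MONK-1 target definition), the four positive rules learned by AQ15c can be compared against the target concept over all 432 examples; their disjunction turns out to cover exactly the target concept.

```python
from itertools import product

def examples():
    """Enumerate all 432 examples of the MONK representation space."""
    domains = [("x1", 3), ("x2", 3), ("x3", 2), ("x4", 3), ("x5", 4), ("x6", 2)]
    names = [n for n, _ in domains]
    for vals in product(*(range(1, d + 1) for _, d in domains)):
        yield dict(zip(names, vals))

def positive_rules(ex):
    """The four positive rules of Figure 4-13, written as a disjunction."""
    return (ex["x5"] == 1
            or (ex["x1"] == 3 and ex["x2"] == 3)
            or (ex["x1"] == 2 and ex["x2"] == 2)
            or (ex["x1"] == 1 and ex["x2"] == 1))

def target(ex):
    # Standard MONK-1 concept: head-shape = body-shape, or jacket-color = red
    return ex["x1"] == ex["x2"] or ex["x5"] == 1

assert all(positive_rules(ex) == target(ex) for ex in examples())
print("the positive rules cover exactly the target concept")
```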

The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% of the examples and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These rules were:

Pos <= [x5 = 1] v [x1 = x2]    and    Neg <= [x5 ≠ 1] & [x1 ≠ x2]
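The effect of the constructed attribute can be sketched as follows (illustrative code, not from the dissertation; it assumes the standard MONK-1 definitions): replacing the four x1/x2 conjunctions with a single test on t = (x1 = x2) yields the two-node structure of Figure 4-15-b, which still reproduces the target concept exactly.

```python
from itertools import product

def examples():
    # Domain sizes of x1..x6 as given in the text
    for vals in product(range(1, 4), range(1, 4), range(1, 3),
                        range(1, 4), range(1, 5), range(1, 3)):
        yield vals

def compact_structure(x1, x2, x3, x4, x5, x6):
    """Figure 4-15-b: test x5, then the constructed attribute t = (x1 = x2)."""
    if x5 == 1:
        return "Positive"
    return "Positive" if x1 == x2 else "Negative"

def target(x1, x2, x3, x4, x5, x6):
    # MONK-1 concept: jacket-color = red, or head-shape = body-shape
    return "Positive" if (x5 == 1 or x1 == x2) else "Negative"

assert all(compact_structure(*e) == target(*e) for e in examples())
print("compact structure reproduces the target on all 432 examples")
```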

Table 4-3. Evaluation of the AQDT-2 attribute selection criteria for the MONK-1 problem.

From these rules, the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15-a).

Figure 4-14. The decision tree for the MONK-1 problem generated by AQDT-2 (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative).

a) Compact decision structure for the AQ15 rules (5 nodes, 7 leaves); b) compact decision structure for the AQ17 rules (2 nodes, 3 leaves).
Figure 4-15. Compact decision structures generated by AQDT-2 for the MONK-1 problem.

Experiments with Subsystem I. As was mentioned earlier, the initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings: two types of decision rules (characteristic or discriminant), three coverage modes (intersecting, disjoint, or ordered, i.e., decision lists), and three widths of the beam search (1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy, <Ch, Dij, 10> and <Ch, Int, 1>, were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.
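The 18 settings arise as the cross-product of the three parameters; a small sketch (the parameter names are illustrative, not AQ15c's actual flags):

```python
from itertools import product

# Illustrative names for the three AQ15c parameters varied in the experiments
rule_types = ["characteristic", "discriminant"]
cover_modes = ["intersecting", "disjoint", "ordered"]   # ordered = decision lists
beam_widths = [1, 5, 10]

settings = list(product(rule_types, cover_modes, beam_widths))
print(len(settings))   # 2 x 3 x 3 = 18 parameter combinations
```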

Each value in that table is an average value of predictive accuracy over 100 runs of either one of the two programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in the predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.

Figure 4-16. The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem.

Experiments with Subsystem II. The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed, and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average over 100 runs on different training data for 9 different sample sizes. The parameters to be changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve means the predictive accuracy obtained with the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree


is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
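The notion of a pre-pruning threshold can be sketched as follows. This is a hypothetical illustration, not AQDT-2's actual criterion: rules whose training-example coverage falls below a percentage threshold are discarded before the decision structure is built.

```python
def prune_rules(rules, total_examples, threshold_pct=3.0):
    """Pre-prune decision rules before tree generation: discard any rule whose
    coverage (number of training examples it matches) falls below a percentage
    threshold of the training set. The 3% default mirrors the text; the exact
    criterion used by AQDT-2 may differ."""
    kept = []
    for rule in rules:
        coverage_pct = 100.0 * rule["covered"] / total_examples
        if coverage_pct >= threshold_pct:
            kept.append(rule)
    return kept

# Hypothetical rules: r2 covers under 3% of a 124-example training set
rules = [{"name": "r1", "covered": 60},
         {"name": "r2", "covered": 2}]
print([r["name"] for r in prune_rules(rules, total_examples=124)])  # ['r1']
```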

Figure 4-17. Analyzing different parameter settings of AQDT-2 with the MONK-1 data (predictive accuracy vs. the relative sample size of the training data).

Comparative Study. This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.

Figure 4-18. Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem.

4.4 Experiments With Small Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using its original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.

Figure 4-19. A visualization diagram of the MONK-2 problem.

Experiments with Subsystem I. The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems, <Ch, Dij, 10> and <Ch, Int, 1>. They were selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average value of predictive accuracy over 100 runs of either one of the two programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set.

Figure 4-20 shows a diagram illustrating the difference in the predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Figure 4-20. The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem.

Experiments with Subsystem II. Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average over 100 runs on different training data for 9 different sample sizes. The parameters to be changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve means the predictive accuracy obtained with the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.

Figure 4-21. Analyzing different parameter settings of AQDT-2 with the MONK-2 data (predictive accuracy vs. the relative sample size of the training data).

Comparative Study. This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.

Figure 4-22. Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem.

4.5 Experiments With Small Size, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples. Noisy examples are examples that are assigned the wrong decision class.

Figure 4-23. A visualization diagram of the MONK-3 problem.

Experiments with Subsystem I. Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average value of predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in the predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.

Experiments with Subsystem II. In these experiments, the parameters of Subsystem I (the learning process) were fixed and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average over 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.

Figure 4-24. The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem.

Figure 4-25. Analyzing different parameter settings of AQDT-2 with the MONK-3 data.

Comparative Study. Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent


a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
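The arithmetic behind this remark can be made explicit (the pool size below is illustrative; the actual test-set sizes depend on the data set): the weight of a single misclassification is inversely proportional to the size of the test set.

```python
def single_error_rate(test_set_size):
    """Error rate (in %) contributed by one misclassified test example."""
    return 100.0 / test_set_size

# Example: a pool of 100 examples split into training and testing complements.
# One error weighs very differently depending on how large the test set is.
print(round(single_error_rate(90), 2))   # testing on 90 examples -> 1.11
print(round(single_error_rate(10), 2))   # testing on 10 examples -> 10.0
```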

Figure 4-26. Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem.

4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number; 2) Clump Thickness; 3) Uniformity of Cell Size; 4) Uniformity of Cell Shape; 5) Marginal Adhesion; 6) Single Epithelial Cell Size; 7) Bare Nuclei; 8) Bland Chromatin; 9) Normal Nucleoli; and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled). In this experiment, the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error


rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-27. Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem.

4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes. These attributes are: 1) cap-shape; 2) cap-surface; 3) cap-color; 4) bruises; 5) odor; 6) gill-attachment; 7) gill-spacing; 8) gill-size; 9) gill-color; 10) stalk-shape; 11) stalk-root; 12) stalk-surface-above-ring; 13) stalk-surface-below-ring; 14) stalk-color-above-ring; 15) stalk-color-below-ring; 16) veil-type; 17) veil-color; 18) ring-number; 19) ring-type; 20) spore-print-color; 21) population; and 22) habitat.

To perform the experiment, the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.

In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning time is about the same.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-28. Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem.

4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains

Learning task-oriented decision structures from structural data. This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes, eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To describe the train problem in a format suitable for AQDT-2, a set of eight attributes was generated such that they could completely describe any car in the train; see Table 4-7. Each train was described by one example of varying length. To recognize the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
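The encoding scheme can be sketched as follows (the per-car attribute values are placeholders, not the actual train data): a variable-length train becomes a flat set of x<i><j> attribute-value pairs.

```python
def flatten_train(cars):
    """Encode a train (a list of per-car attribute lists) as attribute-value
    pairs named x<i><j>, where i is the car position (1-4) and j the attribute
    number (1-8), as in Table 4-7."""
    encoded = {}
    for i, car in enumerate(cars, start=1):
        for j, value in enumerate(car, start=1):
            encoded[f"x{i}{j}"] = value
    return encoded

# A two-car train: each car described by 8 attribute values (placeholders)
train = [[2, 1, 3, 1, 2, 1, 1, 2], [1, 2, 2, 2, 1, 3, 2, 1]]
enc = flatten_train(train)
print(enc["x22"])   # attribute 2 (the car shape) of the second car -> 2
```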

Table 4-7. The set of attributes and their values used in the trains problem (i stands for the car number, 1-4).

Decision-making situations. In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains. Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other six trains correctly using a flexible matching method (Michalski et al., 1986).
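The idea of flexible matching can be sketched as follows. This is a much-simplified stand-in for the method of Michalski et al. (1986), and the rules and attribute names are hypothetical: an example is assigned the class of the rule it matches to the highest degree, so even a train that strictly matches no rule (e.g., a two-car train tested on third-car attributes) still receives a decision.

```python
def match_degree(example, rule):
    """Fraction of a rule's conditions satisfied by the example -- a
    simplified form of flexible matching."""
    conds = rule["conditions"]
    satisfied = sum(1 for attr, allowed in conds.items()
                    if example.get(attr) in allowed)
    return satisfied / len(conds)

def classify_flexibly(example, rules):
    """Assign the class of the best-matching rule, even when no rule is
    satisfied strictly; ties are resolved by rule order here."""
    return max(rules, key=lambda r: match_degree(example, r))["cls"]

# Hypothetical rules over third-car attributes
rules = [{"cls": "eastbound", "conditions": {"x31": {1}, "x34": {2, 3}}},
         {"cls": "westbound", "conditions": {"x31": {2}, "x34": {1}}}]

print(classify_flexibly({"x31": 1, "x34": 2}, rules))   # eastbound (full match)
print(classify_flexibly({"x31": 2, "x34": 1}, rules))   # westbound
```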

a) Decision structure learned using only descriptions of Car 1 (4 nodes, 9 leaves); b) decision structure learned using only descriptions of Car 2; c) decision structure learned using only descriptions of Car 3 (6 leaves).
Figure 4-29. Decision structures learned by AQDT-2 for different decision-making situations.

4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the other half in the other class).
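The default window formula described above can be written out directly (a sketch; the function name is ours, and C4.5's internal rounding may differ):

```python
import math

def default_window(n_examples):
    """C4.5-style initial window size as described in the text: the maximum
    of 20% of the examples and twice the square root of their number."""
    return max(0.20 * n_examples, 2.0 * math.sqrt(n_examples))

# For the 216-example voting data: 20% of 216 = 43.2 beats 2*sqrt(216) = 29.4
print(round(default_window(216), 1))   # 43.2
```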

Table 4-8 and Figures 4-30-a and 4-30-b show the results graphically for the Congressional Voting-1984 problem. The results indicate that the decision trees generated by AQDT-2 had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change of the size of the training example set was smaller.

Table 4-8. A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.

a) Accuracy of the decision tree as a function of the size of the set of training examples; b) size of the decision tree as a function of the size of the set of training examples.
Figure 4-30. Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2.

4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to


illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from these rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.

Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others), the best cover is determined according to the best width of the beam search and the best rule type.
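The first tie-breaking heuristic can be sketched as a small function (illustrative code; the function and parameter names are ours, not part of the dissertation):

```python
def better_beam_width(acc_small, acc_large, small_width, large_width, tol=2.0):
    """Heuristic from the text: if the accuracy difference between two beam
    widths is under 2%, prefer the smaller width; otherwise keep the more
    accurate one."""
    if abs(acc_small - acc_large) < tol:
        return small_width
    return small_width if acc_small > acc_large else large_width

print(better_beam_width(94.1, 95.3, 1, 10))   # 1 (difference under 2%)
```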

Table 4-9. Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.

It was clear that AQDT-2 works better with characteristic rules than with discriminant rules. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracies of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, the predictive accuracy is considered high or low; 2) if the average learning times are within ±0.1 seconds, the learning time is considered the same.

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparing the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10. Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same/X means similar performance, where AQDT-2 is slightly better if X=A and C4.5 is slightly better if X=C.

Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but accurate decision trees, while C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of decision trees learned by C4.5 grows relatively faster as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules and that C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be much less than that of C4.5. However, on some data sets it takes more time, because in some situations there is not enough information to reach a decision and the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not yet implemented.

To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparison between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class; the white areas represent non-positive coverage.

Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem.

Figure 4-32 shows the testing results of the decision rules learned by AQ15c. Shaded cells marked with a dot indicate false positive errors (AQ15c classifies the cell as positive when it should be negative), and non-shaded cells marked with a dot indicate false negative errors (AQ15c classifies the cell as negative when it should be positive).

Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31.

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, one shading pattern indicates portions of the representation space that were classified as positive by both AQ15c and AQDT-2; another indicates portions that were classified as positive by AQ15c but as negative by AQDT-2; and a third represents portions where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with default settings (i.e., with a 10% generalization threshold).

This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.

Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem.

Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative errors: one marking indicates portions of the representation space with false positive errors, and another marks portions with false negative errors. Comparing Figures 4-34 and 4-32 shows that more errors occurred because of the over-generalization.

Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree.

Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%.

CHAPTER 5 CONCLUSIONS

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.

The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated online, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: in order to determine a decision structure from examples, it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.

The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than the decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.
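Tailoring a structure to a decision-making situation amounts to re-ranking the candidate attributes before they are assigned to nodes. The sketch below shows one common way to do this: divide a task-independent quality score by the attribute's measurement cost, and push explicitly avoided attributes to the bottom. This is an illustrative heuristic under assumed names (`rank_attributes`, the scores, and the costs are made up), not the exact AQDT-2 formula.

```python
def rank_attributes(scores, costs, avoid=()):
    """Order attributes for node selection, penalizing measurement cost.

    `scores` maps attribute -> task-independent quality (e.g. a
    disjointness or gain value); `costs` maps attribute -> measurement
    cost; attributes listed in `avoid` sink to the bottom of the ranking.
    """
    def key(attr):
        # Sort avoided attributes last, then by descending cost-adjusted score.
        return (attr in avoid, -scores[attr] / costs.get(attr, 1.0))
    return sorted(scores, key=key)

scores = {"x1": 0.9, "x2": 0.7, "x3": 0.6}
costs = {"x1": 10.0, "x2": 1.0, "x3": 1.0}   # x1 is expensive to measure
print(rank_attributes(scores, costs))               # ['x2', 'x3', 'x1']
print(rank_attributes(scores, costs, avoid=("x2",)))  # ['x3', 'x1', 'x2']
```

Note how the expensive attribute x1, despite the highest raw score, ends up deepest in the ranking, which mirrors the thesis's goal of keeping hard-to-measure attributes in the lowest parts of the structure.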

Another advantage of the AQDT-2 method is that the decision structures obtained this way tend to be simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of the rule learning step, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.

REFERENCES


Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.

Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.

Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.

Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer Verlag.

Imam, I.F. and Michalski, R.S. (1993b), "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. and Zemankova, M. (Eds.), Kluwer Academic Pub., MA.

Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.

Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.

Imam, I.F. and Michalski, R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.

Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi, R. and Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), "International East-West Challenge," Oxford University, UK.

Michalski, R.S. (1973), "AQVAL/1 - Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.

Michalski, R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.

Michalski, R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).

Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.

Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.

Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.

Niblett, T. and Bratko, I. (1986), "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.

Quinlan, J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan, J.R. (1983), "Learning Efficient Classification Procedures and Their Application to Chess End Games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.

Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.

Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).

Quinlan, J.R. (1990), "Probabilistic Decision Trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI-90, Stockholm, August.

Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.

Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.

Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence: two solutions obtained by the program ranked second and third in one competition, and two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 and MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.


Associate Dean of the School of Information Technology and Engineering, for guidance on preparing the Ph.D. proposal.

I would like to thank the conference organizers who supported me to attend their conferences and present parts of my Ph.D. work. The organizers include Professor Moonis Ali, Professor Frank Anger, Professor Zbigniew Ras, the American Association for Artificial Intelligence, and others. I would also like to thank the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) for organizing my first workshop. Thanks go to Dr. Doug Dankel, Dr. Howard Hamilton, Dr. John Stewman, and Dr. Dan Tamir.

I would also like to thank the many individuals who helped me in any way during my Ph.D. These include Dr. Ashraf Abdel-Wahab, Dr. Jerzy Bala, Dr. Alex Brodsky, Dr. Richard Carver, Dr. Kenneth De Jong, John Doulamis, Tom Dybala, Dean Ibrahim Farag, Dr. Ophir Frieder, Csilla Frakes, Dr. Mohamed Habib, Ali Hadjarian, Dr. Hugo De Garis, Zenon Kulpa, Dr. Paul Lehner, Dr. George Michaels, Dr. Eugene Norris, Dean James Palmer, Mitch Potter, Dr. Ahmed Rafea, Jim Ribeiro, Jayshree Sarma, Dr. Arun Sood, Dr. Clifton Sutton, Bradley Utz, Dr. Tibor Vamos, Patricia Zahra, Dr. Shaker Zahra, and Dr. Jianping Zhang.


Dedication

To my mother, my brothers, and my sister.

TABLE OF CONTENTS

TITLE Page

ABSTRACT 1

CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6

CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19

CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
3.3 Generating Decision Structures From Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structures to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53

CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments With Large Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large Size, Complex, and Noisy Problems: Mushroom Classifications 84
4.8 Experiments With Small Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88

CHAPTER 5 CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96

REFERENCES 98

VITA 102

LIST OF TABLES

No. TITLE Page

2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria on four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing a software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using conditions of AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90

LIST OF FIGURES

No. TITLE Page

2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing a software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison of AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 and AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem 91
4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem 93
4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1% 94

DERIVING TASK-ORIENTED DECISION STRUCTURES

FROM DECISION RULES

Ibrahim M. Fahmi Imam, Ph.D.

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge-base from the function of using the knowledge-base for decision-making. The first function focuses on learning accurate, consistent, and complete concept descriptions expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that decision structures learned by it usually outperform, in terms of accuracy and average size of the decision structures, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using applications to artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing democratic and republican voting records).

CHAPTER 1 INTRODUCTION

1.1 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.

A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation); the branches are assigned possible test outcomes or ranges of outcomes; and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when leaves are assigned single, definite decisions. Thus, the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
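The node/branch/leaf organization just described can be sketched as a small data structure. This is an illustrative sketch only (the class and function names are hypothetical, not AQDT-2's implementation); it shows tests as arbitrary functions of attributes, branches labeled by single outcomes or by ranges of outcomes, and leaves holding definite, probabilistic, or undetermined decisions:

```python
class Node:
    """An internal node of a decision structure: a test plus outcome branches."""
    def __init__(self, test, branches):
        self.test = test          # function mapping an object to a test outcome
        self.branches = branches  # dict: outcome (or tuple of outcomes) -> Node or Leaf

class Leaf:
    """A leaf: a single decision, or candidate decisions with probabilities."""
    def __init__(self, decisions):
        self.decisions = decisions  # e.g. {"A": 1.0} or {"A": 0.7, "B": 0.3}

def classify(node, obj):
    """Follow matching branches until a leaf is reached."""
    while isinstance(node, Node):
        outcome = node.test(obj)
        for label, child in node.branches.items():
            # a branch may carry a single outcome or a range (tuple) of outcomes
            if outcome == label or (isinstance(label, tuple) and outcome in label):
                node = child
                break
        else:
            return {"undetermined": 1.0}   # no branch covers this outcome
    return node.decisions

# A structure that tests x1 at the root and, on one branch, a function of x2 and x3:
tree = Node(lambda o: o["x1"], {
    0: Leaf({"A": 1.0}),
    (1, 2): Node(lambda o: o["x2"] + o["x3"], {   # test = function of attributes
        0: Leaf({"A": 1.0}),
        (1, 2): Leaf({"A": 0.2, "B": 0.8}),       # candidate decisions with probabilities
    }),
})
```

When every test is a single attribute, every branch a single value, and every leaf a single class, this degenerates to an ordinary decision tree, matching the reduction described above.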

Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).


A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed, which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root), and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful either to modify the structure so that it does not contain that attribute, or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).

A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation, such as a set of decision rules: tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees), which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify to adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.

An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form, and to transform it into a decision structure when it is needed for decision-making.


This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generation from training examples. Thus, this process could be done online, without any delay noticeable to the user. Such virtual decision structures are easy to tailor to any given decision-making situation.

This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus, such an approach has many potential advantages.

This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15 (Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating "unknown" nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to the inability to evaluate an attribute associated with an intermediate node.


To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments was designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structure learned from them, and comparing decision trees learned by the AQDT-2 system with those learned from examples by C4.5 (Quinlan, 1993), the well-known system for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2, and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and the Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures: MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.

1.2 The Problem Statement

There are many limitations and problems accompanying the use of decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. The method proposed several attribute selection criteria. These criteria are of increasing power of the main criterion, the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree, based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.

Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint; in other words, for any two rules there exists a condition with the same attribute but with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram all the condition parts of the given rules, and marking them with the action specified by each rule.

Michalski's algorithm requires the construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree must have at least n leaves). Michalski (1978) has shown that if only one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL introduced in (Michalski, 1978) prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1]; x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1 An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).


The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute selected is the one which breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).

Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. The DMAL criterion ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.

Example: Learn a decision tree from the following decision table.

The minimal cover consists of the following rules:

A1 <:: [x2=0] v [x1=0][x2=2]    A2 <:: [x2=1] v [x1=2][x2=2]    A3 <:: [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it will add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of the decision tree is x2. Then three branches are attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case, x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.


Figure 2-2 A decision tree learned from the decision table in Table 2-1

2.2 Learning Decision Trees from Examples

Decision tree learning is a field concerned with generating a decision tree that classifies a set of examples according to the decision classes they belong to. The essential aspect of any inductive decision tree method is the attribute selection criterion. The attribute selection criterion measures how good the attributes are for discriminating among the given set of decision classes. The best attribute according to the selection criterion is chosen to be assigned to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin, and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has been subsequently modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.

Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria for selecting attributes use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory. These criteria measure the information conveyed by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure, and the gain criteria (Quinlan, 1979, 1983), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions for determining whether or not there is a correlation. The attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial complete decision tree and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity, but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring the probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.

2.2.1 Building Decision Trees Using Information-based Criteria


This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs for learning decision trees from examples.

Learning decision trees from examples requires a collection of examples. Each example is represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing decision trees from a set of cases (Hunt, Marin & Stone, 1966).

The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the Gain Criterion. The Gain Criterion uses the frequency of each decision class in the given set of training examples.

Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values, and partitions the set of examples according to these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class; otherwise, the system searches for another attribute to be a node in the tree.
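The select-partition-recurse loop just described can be sketched as follows. This is a minimal illustration of the divide-and-conquer scheme, not C4.5 itself (no pruning, no continuous attributes, no handling of unknown values); the function names are hypothetical, and any attribute selection criterion can be plugged in as `select`:

```python
from collections import Counter

def build_tree(examples, attributes, select):
    """Recursive divide-and-conquer tree construction (Hunt's scheme).
    `examples` is a list of (attribute-value dict, class) pairs;
    `select` is an attribute selection criterion, e.g. gain or gain ratio."""
    classes = Counter(cls for _, cls in examples)
    if len(classes) == 1 or not attributes:      # pure node, or no tests left:
        return classes.most_common(1)[0][0]      # make a leaf with the majority class
    best = select(examples, attributes)          # the criterion picks the node's attribute
    remaining = [a for a in attributes if a != best]
    subtree = {}
    for v in {ex[best] for ex, _ in examples}:   # one branch per observed value
        subset = [(ex, c) for ex, c in examples if ex[best] == v]
        subtree[v] = build_tree(subset, remaining, select)
    return (best, subtree)

# A trivial criterion (always take the first attribute), just to exercise the skeleton:
data = [({"x": 0, "y": 0}, "A"), ({"x": 0, "y": 1}, "A"), ({"x": 1, "y": 0}, "B")]
print(build_tree(data, ["x", "y"], lambda ex, attrs: attrs[0]))
```

Replacing the trivial `select` with a gain or gain-ratio computation yields the ID3/C4.5 family of algorithms described in the following paragraphs.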

The Gain Criterion. The gain criterion is based on information theory: the information conveyed by a message depends on its probability, and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem X1, ..., Xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:


freq(Ci, S) = number of examples in S belonging to Ci        (2-1)

Suppose that |S| is the total number of examples in S; then the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.

The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by −log2(freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by:

info(S) = − Σi=1..k (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|) bits        (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples T, info(T) determines the average amount of information needed to identify the class of an example in T.

Suppose that we select an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets T1, ..., Tk, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, infoX(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:

infoX(T) = Σi=1..k (|Ti| / |T|) info(Ti)        (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by:

gain(X) = info(T) − infoX(T)        (2-4)

The attribute selected is the attribute with the maximum gain value.
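Equations 2-2 through 2-4 translate directly into code. The sketch below uses hypothetical function names; examples are (attribute-value dictionary, class) pairs as in the text:

```python
import math
from collections import Counter

def info(examples):
    """Equation 2-2: info(S) = -sum (freq(Ci,S)/|S|) log2(freq(Ci,S)/|S|)."""
    n = len(examples)
    return -sum((k / n) * math.log2(k / n)
                for k in Counter(cls for _, cls in examples).values())

def info_x(examples, attr):
    """Equation 2-3: weighted information of the partition induced by `attr`."""
    n = len(examples)
    subsets = {}
    for ex, cls in examples:
        subsets.setdefault(ex[attr], []).append((ex, cls))
    return sum((len(s) / n) * info(s) for s in subsets.values())

def gain(examples, attr):
    """Equation 2-4: gain(X) = info(T) - info_X(T)."""
    return info(examples) - info_x(examples, attr)

# A perfectly discriminating attribute gains the full entropy of the set:
data = [({"x": 0}, "A"), ({"x": 0}, "A"), ({"x": 1}, "B"), ({"x": 1}, "B")]
print(info(data), gain(data, "x"))   # 1.0 1.0
```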

The Gain Ratio Criterion. This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains an attribute such as a social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.

Following steps similar to those above, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by:

split info(X) = − Σi=1..n (|Ti| / |T|) log2(|Ti| / |T|)        (2-5)

The gain ratio is given by:

gain ratio(X) = gain(X) / split info(X)        (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.

Example: Consider the following example, presented by Quinlan (1993). Table 2-2 shows the set of training examples.

First, determine the amount of information gained by selecting the attribute outlook to be the root of the decision tree. This attribute divides the training examples into three subsets: sunny, with five examples, two of which belong to the class Play; overcast, with four examples, all of which belong to the class Play; and rain, with five examples, three of which belong to the class Play. To determine info(T), the average amount of information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class Play and five belong to the class Don't Play.

info(T) = − 9/14 log2(9/14) − 5/14 log2(5/14) = 0.940 bits

When using outlook to divide the training examples, the information becomes:

infooutlook(T) = 5/14 (−2/5 log2(2/5) − 3/5 log2(3/5))
  + 4/14 (−4/4 log2(4/4) − 0/4 log2(0/4))
  + 5/14 (−3/5 log2(3/5) − 2/5 log2(2/5)) = 0.694 bits

By substituting in equation 2-4, the gain of information resulting from using the attribute outlook to split the training examples equals 0.246. The information gain for windy is 0.048.

Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for outlook is determined as follows:

split info(outlook) = − 5/14 log2(5/14) − 4/14 log2(4/14) − 5/14 log2(5/14) = 1.577 bits

The gain ratio for outlook = 0.246 / 1.577 = 0.156
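The arithmetic of this worked example can be checked mechanically. The class counts below are taken from the text (9 Play and 5 Don't Play overall; sunny 2/3, overcast 4/0, rain 3/2):

```python
import math

def entropy(counts):
    """info() of a set, given its per-class example counts (equation 2-2)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

info_T = entropy([9, 5])                                      # 0.940 bits
outlook = [[2, 3], [4, 0], [3, 2]]                            # sunny, overcast, rain
n = 14
info_outlook = sum(sum(s) / n * entropy(s) for s in outlook)  # 0.694 bits
gain_outlook = info_T - info_outlook                          # 0.247 (0.246 in the text,
                                                              # from rounded intermediates)
split_info = entropy([5, 4, 5])                               # 1.577 bits
gain_ratio = gain_outlook / split_info                        # 0.156
```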


Figure 2-3 A decision tree learned using the gain criterion for selecting attributes

The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.
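This two-interval split can be sketched as follows. One common way to choose the threshold, assumed here for illustration and close in spirit to C4.5's search over candidate cut points, is to try a cut between each pair of adjacent distinct values and keep the one with the highest information gain:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try a cut between each pair of adjacent distinct values; return the
    threshold t (and its gain) maximizing the gain of the <= t vs > t split."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # no cut between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        g = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if g > best[1]:
            best = (t, g)
    return best

print(best_threshold([64, 65, 68, 69, 70, 71], ["A", "A", "A", "B", "B", "B"]))
```

Here the classes separate cleanly between 68 and 69, so the cut at 68.5 recovers the full entropy of the set as gain.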

Tree pruning in C4.5 is a process of replacing subtrees that have small classification validity by leaves. The C4.5 system uses the Laplace ratio for estimating the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses the Chi-square statistic to measure the association between two attributes. When building decision trees, the method is implemented such that it determines the association between each attribute and the decision classes. The attribute selected is the one with the greatest value.

To determine the Chi-square value for an attribute, let aij be the number of examples in class number i for which the attribute A takes value number j; in other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by:

Chi-square(A) = Σi=1..n Σj=1..m (aij − Eij)² / Eij        (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,

Eij = (TCi × TVj) / T        (2-8)

where TCi and TVj are the total number of examples belonging to the decision class Ci and the total number of examples where the attribute A takes value vj, respectively, and T is the total number of examples.

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of the different combinations of values between the decision classes and both the outlook and the windy attributes. Table 2-4 shows the expected values, computed from the totals TCi and TVj, of the frequencies in Table 2-3 for the different attribute values and decision classes.

To determine the association values between the decision classes and both the attribute windy and the attribute outlook, the observed Chi-square values are:

Chi-square(windy, Class) = (3−3.9)²/3.9 + (3−2.1)²/2.1 + (6−5.1)²/5.1 + (2−2.9)²/2.9
  = 0.21 + 0.39 + 0.16 + 0.28 ≈ 1.03

Chi-square(outlook, Class) = (2−3.2)²/3.2 + (4−2.6)²/2.6 + (3−3.2)²/3.2 + (3−1.8)²/1.8 + (0−1.4)²/1.4 + (2−1.8)²/1.8
  = 0.45 + 0.75 + 0.01 + 0.80 + 1.40 + 0.02 ≈ 3.43


Applying the same method to the other attributes, the results favor the attribute outlook. Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset.
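The same association values can be computed directly from equations 2-7 and 2-8. The sketch below uses exact (unrounded) expected frequencies, so the totals differ slightly from the rounded hand calculation above, but the ordering of the attributes is unchanged:

```python
def chi_square(table):
    """Equations 2-7/2-8: rows = decision classes, columns = attribute values."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(r) for r in table]                         # T_Ci
    col_tot = [sum(r[j] for r in table) for j in range(n_cols)]  # T_Vj
    total = sum(row_tot)                                      # T
    chi = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            e = row_tot[i] * col_tot[j] / total               # E_ij = T_Ci * T_Vj / T
            chi += (table[i][j] - e) ** 2 / e
    return chi

windy = [[3, 6], [3, 2]]            # (Play, Don't Play) x (windy, not windy)
outlook = [[2, 4, 3], [3, 0, 2]]    # (Play, Don't Play) x (sunny, overcast, rain)
print(round(chi_square(windy), 3), round(chi_square(outlook), 3))   # 0.933 3.547
```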

Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5 Attribute selection criteria and their basic evaluation measure

Info Measure (IM), Gain, and Gain Ratio:  Entropy(S) = − Σi (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)
G-statistic:  G = 2N × IM  (N = number of examples)
Chi-square:  Chi-square(A, B) = Σi=1..n Σj=1..m (aij − Eij)² / Eij

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria done by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, the G-statistic, the Gini index of diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain Ratio criterion produced the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples that may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, a zero cell adds the maximum association between any two attributes, because the Chi-square value of a zero cell is the expected value of this cell.


Now let us examine results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees with eleven different criteria. In the final results, he compared the total number of nodes and the total error rate produced by each criterion over all the given problems. Table 2-8 shows the final results for five selected criteria only.


Table 2-8 Results comparing the total accuracy and size of decision trees obtained using different attribute selection criteria on four problems

This experiment was performed on four real-world data sets. These data are concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details, see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas of Imam and Michalski (1993b).

In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, the method builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary conclusion with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes containing only rules represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees that are used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is Safe, except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is Lost; except if x6=1, it is Safe; except if x7=1, it is Lost.
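Read this way, the exception structure is evaluated top-down, with each exception level overriding its parent's conclusion. The following sketch encodes that reading for this particular EDAG (an illustration of how the structure is interpreted, not Gaines's construction algorithm; the function name is hypothetical):

```python
def edag_classify(x):
    """Evaluate the example EDAG: Safe, except ... Lost, except ... Safe, except ... Lost."""
    decision = "Safe"                                  # root conclusion
    if (x["x1"] == 1 and x["x2"] == 1 and x["x3"] == 1
            and (x["x4"] == 3 or x["x5"] == 1)):
        decision = "Lost"                              # first exception overrides the root
        if x["x6"] == 1:
            decision = "Safe"                          # exception to the exception
            if x["x7"] == 1:
                decision = "Lost"                      # deepest exception wins
    return decision

# agrees with the rules of Figure 2-4, e.g. Safe <:: [x1=2]:
print(edag_classify({"x1": 2, "x2": 1, "x3": 1, "x4": 3, "x5": 1, "x6": 2, "x7": 1}))
```

Note how the nested conditionals mirror the shared conditions factored out at each level of the graph, which is what makes the EDAG more compact than the flat rule set.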

The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path; in other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once. However, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.


Safe <= [x1=2]
Safe <= [x2=2]
Safe <= [x3=2]
Safe <= [x4=1] & [x5=2]
Safe <= [x4=1] & [x5=3]
Safe <= [x6=1] & [x7=2]
Safe <= [x6=1] & [x7=3]
Safe <= [x4=2] & [x5=2]
Safe <= [x4=2] & [x5=3]

Lost <= [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a distinct combination of that attribute's values. For each subset, the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains the examples where A takes value 0 and belong to class C0, or where A takes value 1 and belong to class C1. The second subset contains the examples where A takes value 0 and belong to class C1, or where A takes value 1 and belong to class C0. The number of nodes of the first level (after the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced exponentially to one.
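The subset construction just described can be sketched as follows. This is my own illustrative code, not Kohavi's: it groups examples by the value-to-class mapping they induce within a shared context of the remaining attributes, which reproduces the two subsets of the example above.

```python
from collections import defaultdict

def split_by_attribute(examples, attr):
    """examples: list of (attribute-dict, class) pairs.

    Returns a dict keyed by a mapping from attr's values to classes;
    each key corresponds to one subset (one node at the new level).
    """
    # Context = values of all other attributes; within one context the
    # examples define a function from attr's values to classes.
    contexts = defaultdict(dict)
    for attrs, cls in examples:
        ctx = tuple(sorted((a, v) for a, v in attrs.items() if a != attr))
        contexts[ctx][attrs[attr]] = cls
    # Each distinct value->class function becomes one subset.
    subsets = defaultdict(list)
    for ctx, value_to_class in contexts.items():
        subsets[tuple(sorted(value_to_class.items()))].append(ctx)
    return subsets

# The text's example: attribute A with values 0/1 and classes C0/C1.
examples = [({"A": 0, "B": 0}, "C0"), ({"A": 1, "B": 0}, "C1"),
            ({"A": 0, "B": 1}, "C1"), ({"A": 1, "B": 1}, "C0")]
subsets = split_by_attribute(examples, "A")   # yields two subsets
```

The two resulting keys are exactly the two mappings named in the text: {A=0 -> C0, A=1 -> C1} and {A=0 -> C1, A=1 -> C0}.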

It is easy for the reader to figure out some major disadvantages of such an approach: the average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., strong patterns) or logical relationship in the data; the time used to learn such a decision structure is relatively very high compared to systems for learning decision trees from examples; and, finally, it could be better to search for an attribute that reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and those two approaches The

EDAG and HOODG systems are unreleased prototype systems

[Fragment of Table 2-9: decision structures produced by the proposed approach are easy to understand; EDAG structures are difficult to read; HOODG structures are easy to understand.]

CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.

The Learning Task
Given:
  A set of training examples describing the concept to be learned.
  A learning goal, which specifies the decision classes to be learned from the training examples.
  Background knowledge to control the learning process.
Determine:
  A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given:
  A set of decision rules in a conjunctive form.
  A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
  One or more examples that need to be tested under the given decision-making situation.
  A set of parameters to control the learning process.
Determine:
  A decision structure that suits the given decision-making situation.

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of


declarative knowledge are that they do not impose any order on the evaluation of the attributes and, due to this lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on-line.

Such virtual decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized, and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows an architecture of the proposed methodology.

[Diagram: data from the database flows through the learning component ("learning knowledge from database") into the decision-making process.]
Figure 3-1: Architecture of the AQDT approach


It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values. Some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system, specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called AQ, starts with a "seed" example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the "star" of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description that covers the largest number of positive examples (to minimize the total number of rules needed) and, with the second priority, that involves the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few examples or with many examples, and can optimize the description according to a variety of easily-modifiable hypothesis quality criteria.
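The covering loop just described can be sketched as follows. This is a drastically simplified stand-in, not AQ15 itself: real AQ generates a star of alternative maximally general complexes and ranks them with a quality criterion, whereas this sketch simply generalizes the seed by dropping any condition whose removal keeps the rule consistent with the negative examples.

```python
def matches(rule, example):
    """A rule is a dict mapping attributes to sets of admissible values."""
    return all(example[a] in vals for a, vals in rule.items())

def aq_cover(positives, negatives, attributes):
    """Return conjunctive rules covering all positives and no negative."""
    uncovered, rules = list(positives), []
    while uncovered:
        seed = uncovered[0]
        rule = {a: {seed[a]} for a in attributes}   # most specific rule
        for a in attributes:                        # generalize the seed:
            relaxed = {k: v for k, v in rule.items() if k != a}
            if not any(matches(relaxed, n) for n in negatives):
                rule = relaxed                      # drop condition on `a`
        rules.append(rule)
        # keep only positives not yet covered; repeat with a new seed
        uncovered = [p for p in uncovered if not matches(rule, p)]
    return rules

# Tiny illustrative data set (hypothetical attributes x, y).
positives = [{"x": 1, "y": 1}, {"x": 1, "y": 2}]
negatives = [{"x": 2, "y": 1}]
learned = aq_cover(positives, negatives, ["x", "y"])
```

On this data a single generalized rule, [x=1], covers both positive examples, so the loop terminates after one seed.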


The learned descriptions are represented in the form of a set of decision rules expressed in an attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski, 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have large flat top." A characteristic description of the tables would also include properties such as "have four legs," "have no back," "have four corners," etc. Discriminant descriptions are usually much shorter than characteristic descriptions.

Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different classes are logically disjoint. The DC-mode descriptions are usually more complex, both in the number of rules and the number of conditions. There is also a DL mode (a Decision List mode,


also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order. If ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.
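The DL-mode evaluation order can be sketched as follows (the encoding of a ruleset as a list of condition dictionaries is my own, for illustration):

```python
def satisfies(conditions, example):
    """True if the example meets every attribute=value condition."""
    return all(example.get(attr) == value for attr, value in conditions.items())

def classify_decision_list(ordered_rulesets, example):
    """Evaluate rulesets in their linear order; the first one satisfied
    by the example determines the decision."""
    for decision, ruleset in ordered_rulesets:
        if any(satisfies(rule, example) for rule in ruleset):
            return decision
    return None   # no ruleset fired

# A hypothetical two-class decision list.
ordered = [("C1", [{"x": 1}]), ("C2", [{"y": 2}])]
```

Because evaluation stops at the first satisfied ruleset, an example matching both rulesets is assigned C1; in IC or DC mode no such order dependence exists.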

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes and selects from them those most promising, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the US Congress. Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute State (of the Representative) should take the value northeast or northwest to satisfy the condition.

R1: [Gas_con_ban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"

The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record by a Democratic representative:

Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State From = northeast, State population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler corp. = not registered.

By expressing the elementary statements in the example as conditions, and linking the conditions by conjunction, the examples can be re-expressed as decision rules. Thus, decision rules and examples formally differ only in the degree of generality.
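This observation can be made concrete: both a rule and an example can be encoded as attribute-to-value-set mappings, where an example fixes every attribute to a single value. A sketch (the abbreviated attribute names below are my own rendering of the voting-record attributes):

```python
def covers(rule, example):
    """True if the example satisfies every condition; internal disjunction
    means the example's value may equal any of the listed values."""
    return all(example.get(attr) in values for attr, values in rule.items())

# Rule R1 from Figure 3-2, re-encoded as a dict of value sets.
r1 = {"Gas_con_ban": {"yes"}, "Soc_sec_cut": {"no", "not registered"}}

# An example is simply a fully specified record (here abbreviated).
voter = {"Gas_con_ban": "yes", "Soc_sec_cut": "no", "Draft": "no"}
```

Here r1 covers the voter record; changing the record's Soc_sec_cut value to "yes" would break the match, since that value is outside the condition's internal disjunction.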

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski, 1993a, b). A description of the AQDT-2 method for learning task-oriented decision structures from decision rules is also included, and, finally, the methodology is illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning, due to their simplicity. Decision trees built this way can be quite efficient, as long as they are used in decision-making situations for which they are optimized and these situations remain relatively stable. Problems arise when these situations significantly change and the assumptions under which the tree was built no longer hold. For example, in some situations it may be difficult to determine the value of the attribute assigned to some node. One would like to avoid measuring this attribute and still be able to classify the example, if this is potentially possible (Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also desirable if there is a significant change in the frequency of occurrence of examples from different classes. A restructuring of a decision tree to suit the above requirements is, however, difficult to do. The reason for this is that decision trees are a form of decision structure representation that imposes constraints on the evaluation order of the attributes that are not logically necessary.


One problem in developing a method for generating decision structures from decision rules is to design an attribute selection criterion that is based on the properties of the rules, rather than of the training examples. A decision rule normally describes a number of possible examples, only some of which have actually been observed, i.e., the training examples. An attribute selection criterion is needed that analyzes the role of each attribute in the rules. It cannot be based on counting the numbers of training examples covered by each attribute-value and the frequency of decision classes in the training examples, as is done in learning decision trees from examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision rules constitute a more powerful knowledge representation than decision trees. They can directly represent a description in an arbitrary disjunctive normal form, while decision trees can directly represent only descriptions in disjoint disjunctive normal form, in which all conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary decision rules into a decision tree, one faces the additional problem of handling logically intersecting rules.

The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2 system is based on the earlier work by Michalski (1978), which introduced a general method for generating decision trees from decision rules. The method aimed at producing decision trees with the minimum number of nodes or the minimum cost (where the cost was defined as the total cost of classifying unknown examples, given the cost of measuring individual attributes and the expected probability distribution of examples of different decision classes). More explanations are provided in the following section.

3.3.1 The AQDT-2 attribute selection method

This section describes the AQDT-2 method for building a decision structure from decision rules. The method for building a single-parent decision structure is similar to that used in standard


methods of building a decision tree from examples. The major difference is that it assigns tests (attributes) to the nodes using criteria based on the properties of the decision rules (including statistics about the examples covered by each rule, in the case of learning rules from examples), rather than statistics characterizing the frequency of training examples per decision class, per attribute-value, or per conjunctions of both. Other differences are that the branches may be assigned an internal disjunction of values (not only a single value, as in a typical decision tree), and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be attributes, or names standing for logical or mathematical expressions that involve several attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably (to distinguish between an attribute and a name standing for an expression, the latter is called a "constructed attribute").

At each step, the method chooses, from the available set of tests, the test that has the highest utility (see below) for the given set of decision rules. This test is assigned to the node. The branches stemming from this node are assigned test values or disjoint groups of values (in the form of a logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each branch is associated with a reduced set of rules, determined by removing the conditions in which the selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset indicate the same decision class, a leaf node is created and assigned this decision class. The process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further, because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of candidate decisions with associated probabilities (see Sec. 4.2).
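The construction loop just described can be sketched recursively. This is an illustrative condensation, not the AQDT-2 implementation: a rule is a (class, conditions) pair, conditions map attributes to value sets, and the test-selection criterion is passed in as a function; value grouping and the unavailable-attribute case are simplified away.

```python
def value_groups(rules, test):
    """One branch per value present in the rules (AQDT-2 may additionally
    merge values into 'or' branches)."""
    domain = set()
    for _cls, conds in rules:
        domain |= conds.get(test, set())
    return [{v} for v in sorted(domain)]

def build(rules, select_test):
    if not rules:
        return None                          # no rule reaches this branch
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:                    # reduced ruleset is single-class
        return classes.pop()                 # -> leaf node
    test = select_test(rules)                # highest-utility test
    node = {"test": test, "branches": {}}
    for values in value_groups(rules, test):
        reduced = []
        for cls, conds in rules:
            vals = conds.get(test)
            if vals is None or vals & values:      # rule survives this branch
                # drop the (now satisfied) condition on the selected test
                reduced.append((cls, {a: v for a, v in conds.items() if a != test}))
        node["branches"][frozenset(values)] = build(reduced, select_test)
    return node

# Two one-condition rules; any selection criterion must pick x1 here.
example_rules = [("Safe", {"x1": {2}}), ("Lost", {"x1": {1}})]
tree = build(example_rules, lambda rules: "x1")
```

For these rules the result is a one-node structure testing x1, with branch value 1 leading to the leaf Lost and value 2 to the leaf Safe.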

The test (attribute) utility is a combination of one or more of the following elementary criteria: 1) cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which captures the effectiveness of the test in discriminating among decision rules for different decision classes; 3) importance, which determines the importance of a test in the rules; 4) value distribution, which characterizes the distribution of the test importance over its set of values; and 5) dominance, which measures the test's presence in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of its class disjointness values, i.e., the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and decision rulesets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote the sets of values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm, respectively. If the ruleset for some class, say Ci, contains a rule that does not involve test A, then Vi is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by:

                 0   if Vi ⊆ Vj
  D(A, Ci, Cj) = 1   if Vi ⊃ Vj                                        (3-1)
                 2   if Vi ∩ Vj ≠ φ, and Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj
                 3   if Vi ∩ Vj = φ

where φ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) may seem to give an improved criterion; however, it would not clearly distinguish between the two cases (i.e., for both situations the disjointness would be similar). The current equation is better because it gives higher scores to attributes that classify different subsets of the two decision classes than to attributes that classify only a subset of one decision class.

Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the sum of the degrees of class disjointness of each decision class:


  Disjointness(A) = Σ (i=1 to m) D(A, Ci), where D(A, Ci) = Σ (j=1 to m, j≠i) D(A, Ci, Cj)    (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the attribute selected is the one with the smaller number of values.
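Equations (3-1) and (3-2) are straightforward to compute once the value sets Vi have been extracted from the rulesets (recall that a class whose ruleset contains a rule not involving A receives the full domain of A). The sketch below assumes my reconstruction of the piecewise values in (3-1):

```python
def degree(vi, vj):
    """D(A, Ci, Cj): compares the value sets of test A in two rulesets."""
    if vi <= vj:
        return 0          # Vi is a subset of (or equal to) Vj
    if vi > vj:
        return 1          # Vi is a proper superset of Vj
    if vi & vj:
        return 2          # partial overlap, neither set contained in the other
    return 3              # disjoint value sets

def disjointness(value_sets):
    """Equation (3-2): value_sets maps each class to the set of values of
    test A appearing in that class's ruleset."""
    return sum(degree(vi, vj)
               for ci, vi in value_sets.items()
               for cj, vj in value_sets.items() if ci != cj)
```

With two classes, disjoint value sets give the maximum 3m(m-1) = 6, identical sets give 0, and a proper-subset configuration gives 1, consistent with the stated range.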

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined, from the root of the tree to any leaf node, in order to reach a decision.

Definition: A decision structure is a one-node-per-level decision structure if at each level there is only one node and zero or more leaves. Such a decision structure can be generated by combining together all branches whose associated sets of decision rules belong to more than one decision class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not a subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the same set of values in both classes is a trivial one). Assume that branches leading to one subset with the same decision class are combined into one branch. In the first case, there will be only two branches: the first leads to a leaf node, and the other leads to an intermediate node where another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three branches should be created. Two branches lead to leaf nodes, where all values at each branch belong to only one (and a different) decision class. The third branch leads to an intermediate node where another attribute should be selected to further classify the decision classes. The minimum ANT in this case is 6/4. In the third case, only two branches will be generated, where each leads to a leaf node with a different decision class. In this case, the minimum ANT is 1.

Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that if more than one attribute-value occurs at branches leading to leaves belonging to one decision class, they are combined into one branch in the decision structure. The symbol "1" means that an attribute is needed to classify the two decision classes; in such cases, there will be at least two additional paths.

[Figure 3-3 panel labels: D(A, Ci) = 0, D(A, Cj) = 1; D(A, Ci) = 2, D(A, Cj) = 2; D(A, Ci) = 3, D(A, Cj) = 3]

Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes

The average number of tests required for making a decision in each possible case is determined in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks highly the attributes that reduce the average number of tests required for decision-making. The theorem can be proved in the general case.

[Figure 3-4 (diagrams of the three decision trees): ANT = 3/2, ANT = 5/3, ANT = 1]
("1" means at least one attribute is needed to complete the decision tree.)

Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes, A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, this means that there are more decision classes where D(A, Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj) than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute are 0, 1, 4, or 6. For all positive values of D(B) = 1, 4, or 6, it is clear that attribute B should have a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A. Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is a better classifier than A.

Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS), introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples covered by the rules involving this test. Decision rules learned by an AQ learning program are accompanied by information on their strength. Rule strength is characterized by the t-weight and the u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the t-weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes, C1, ..., Cm, involving n tests, A1, ..., An, and with the number of rules associated with class Ci denoted by ri, the importance score is defined as follows.


Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

  IS(Aj) = Σ (i=1 to m) IS(Aj, Ci)                                    (3-3.1)

where

  IS(Aj, Ci) = Σ (k=1 to ri) Rik(Aj)                                  (3-3.2)

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by:

  Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0 otherwise   (3-4)

where i = 1, ..., m; k = 1, ..., ri; and j = 1, ..., n.
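In effect, equations (3-3.1) through (3-4) sum, for each test, the t-weights of all rules whose condition parts mention it. A short sketch (my own encoding of rules as (class, attributes-used, t-weight) triples):

```python
def importance_scores(rules):
    """IS(Aj): sum of t-weights of all rules whose conditions involve Aj."""
    scores = {}
    for _cls, attrs, t_weight in rules:
        for a in attrs:
            scores[a] = scores.get(a, 0) + t_weight
    return scores

# Two rules for C1 (t-weights 5 and 2) and one rule for C2 (t-weight 3).
rules = [("C1", {"x1", "x2"}, 5), ("C1", {"x2"}, 2), ("C2", {"x1"}, 3)]
scores = importance_scores(rules)
```

For these rules, IS(x1) = 5 + 3 = 8 and IS(x2) = 5 + 2 = 7, so x1 would be ranked higher by this criterion.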

The importance score method has been separately compared, as a feature selection method, with a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method produced an equal or higher accuracy on three real-world problems than that reported for the GA method, while selecting fewer attributes. In addition, the IS method was significantly faster than the GA.

Value distribution: The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by:

  VD(Aj) = IS(Aj) / vj                                                (3-5)

where vj is the number of legal values of Aj.

Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore, for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is multiplied out to two rules, with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
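The multiplying-out step is a Cartesian product over the conditions' value sets. A sketch (a condition part is encoded here as a dict from attribute to value set, my own convention):

```python
from itertools import product

def multiply_out(condition_part):
    """Expand internal disjunctions into single-value condition parts."""
    attrs = sorted(condition_part)
    return [dict(zip(attrs, combo))
            for combo in product(*(sorted(condition_part[a]) for a in attrs))]

# The text's example: [x3=1 v 3] & [x4=1] expands to two condition parts.
expanded = multiply_out({"x3": {1, 3}, "x4": {1}})
```

The expansion yields [x3=1] & [x4=1] and [x3=3] & [x4=1], so this rule contributes a count of two to the dominance of both x3 and x4.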

The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). A LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed as a percentage. The criteria are applied to tests in the order defined by the LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (from the top value). The default LEF is:

  <Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>    (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percentages); their default values are 0. The default value of the cost of each test is 1.

The above LEF ranks attributes this way: first, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the importance criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the third criterion, the normalized IS (value distribution), is used, and then, similarly, the fourth criterion (dominance). If there is still a tie, the method selects the best attribute randomly.
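The ranking procedure above can be sketched as a filtering cascade. This illustrative version (not the AQDT-2 code) keeps, at each criterion, the tests scoring within the tolerance of the best, and stops as soon as a single test survives:

```python
def lef_select(tests, criteria):
    """criteria: list of (score_fn, tolerance_percent, maximize) triples,
    given in lexicographic order (e.g. cost, disjointness, importance, ...)."""
    candidates = list(tests)
    for score_fn, tol, maximize in criteria:
        scores = {t: score_fn(t) for t in candidates}
        best = max(scores.values()) if maximize else min(scores.values())
        margin = abs(best) * tol / 100.0
        # keep candidates within the tolerance band around the best score
        candidates = [t for t in candidates if abs(scores[t] - best) <= margin]
        if len(candidates) == 1:
            break
    return candidates[0]      # remaining ties would be broken randomly

# Hypothetical scores: equal cost, so disjointness decides the ranking.
cost = {"A": 1, "B": 1}
disjointness = {"A": 6, "B": 4}
chosen = lef_select(["A", "B"],
                    [(cost.get, 0, False), (disjointness.get, 0, True)])
```

With zero tolerances the cascade degenerates to a strict lexicographic order; widening a tolerance lets near-best tests survive to be separated by the later criteria.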

If there is a non-uniform frequency distribution of examples of different classes, then the selection criterion uses a modified definition of the disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified into a given class:

  Disjointness(A) = Σ (i=1 to m) D(A, Ci) · Frq(Ci)                   (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF:

  <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.

3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting, at each step, the best test according to the ranking criteria described above, and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is also connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first, in descending order of the number of rules

38

that contain that attribute and second in the ascending order of the number of the attributes

legal values

The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of an analysis of the value sets Vi performed while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.

To generate decision structures from rules, the AQDT-2 method prefers either characteristic or discriminant disjoint rule descriptions (given by an expert or learned by a system). Disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The AQDT algorithm is as follows.

The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent this attribute.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it a group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing a condition [A = i v ...]. If a branch is assigned values i v j, then associate with it all rules containing a condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with a given branch constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree end in leaf nodes, stop; otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
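The four steps above can be sketched as a recursive procedure. This is a simplified illustration, not the actual AQDT-2 implementation: the `rank` argument stands in for the full LEF criteria (here, simply the number of rules mentioning an attribute), branch membership is tested by value-set intersection, and the subsumption pruning of value sets used in compact mode is omitted. The sample rules are the software-testing rules of Figure 3-6:

```python
def build_tree(rules, rank):
    """Sketch of the AQDT-2 loop (Steps 1-4).

    `rules` is a list of (decision_class, conditions) pairs, where
    conditions maps an attribute name to the set of values the rule
    admits for it.  `rank` scores attributes (a stand-in for LEF)."""
    classes = {c for c, _ in rules}
    if len(classes) == 1:                        # Step 4: single class -> leaf
        return classes.pop()
    attrs = {a for _, cond in rules for a in cond}
    best = max(attrs, key=lambda a: rank(a, rules))          # Step 1
    branches = {}
    # Step 2 (compact mode): one branch per distinct value set of `best`
    for vs in {frozenset(cond[best]) for _, cond in rules if best in cond}:
        group = []                               # Step 3: distribute the rules
        for cls, cond in rules:
            if best not in cond:                 # rule silent on `best`:
                group.append((cls, cond))        # goes to every branch
            elif cond[best] & vs:                # condition satisfied by branch
                rest = {a: v for a, v in cond.items() if a != best}
                group.append((cls, rest))
        branches[vs] = build_tree(group, rank)
    return best, branches

# The decision rules of Figure 3-6, one (class, conditions) pair per rule:
rules = [
    ("T1", {"x1": {2}, "x2": {2}}),
    ("T1", {"x1": {3}, "x3": {1, 3}, "x4": {1}}),
    ("T2", {"x1": {1, 2}, "x2": {3, 4}}),
    ("T2", {"x1": {3}, "x3": {1, 2}, "x4": {2}}),
    ("T3", {"x1": {1}, "x2": {1}}),
    ("T3", {"x1": {4}, "x3": {2, 3}, "x4": {3}}),
]
count = lambda a, rs: sum(a in cond for _, cond in rs)
root, branches = build_tree(rules, count)
```

In this sketch x1 is selected as the root and the branch for value set {4} ends in a T3 leaf, matching the example worked through later in this chapter.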

To select an attribute to be a node in the decision tree (Steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF function. The second iteration evaluates each attribute's disjointness for each decision class against the other decision classes.


To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute-value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

r = Σ_{i=1}^{m} Ri    (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as:

Cmpx(Iter1) = O(r · s)

In the second iteration, the disjointness is calculated between the decision classes and all attributes. The complexity of the second iteration can be given by:

Cmpx(Iter2) = O(n · m)

Assume that, at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

l = max{m, r}    (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node in the decision tree, say the Node Complexity NC(AQDT), is given by:

NC(AQDT) = O(l · n)

Usually l equals the number of rules associated with the given node. Thus the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The Level Complexity of the AQDT algorithm, LC(AQDT), can be given by:

LC(AQDT) < O(l · n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm to be (l · s · o), where o is the number of non-leaf nodes at the given level. In such cases, either (l · o ≤ r) or (l · s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by:

LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)

a) per one level    b) per one path

Figure 3-5 Decision trees showing the maximum number of non-leaf nodes

Note also that after selecting an attribute to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structure of the algorithm. Also, if a leaf node is generated, the rules belonging to the corresponding branch will not be tested again.

Since the disjointness criterion selects the attribute that minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the smallest number of levels. The number of levels per decision tree is supposed to be less than or equal to the minimum of the number of attributes and the number of rules. Consider k as the number of levels in a given decision tree:

k ≤ min{n, r}    (3-10)

There are two cases that represent the most complex situations, Figure 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels will be a function of the logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by:

Complexity(AQDT) = O(l · n · log r)    (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels per decision tree equals one less than the number of decision rules, Figure 3-5-b. Using the disjointness criterion, it is not likely to obtain such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In that case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus, the level complexity of this decision tree is estimated as:

LC(AQDT) = O(l · log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT algorithm in such cases is given by:

Complexity(AQDT) = O(l · k · log n)    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11), and (3-12), the complexity of the AQDT algorithm is determined by:

Cmplx(AQDT) = O(r · k · log l)    (3-13)

3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1 The available tools and the factors that affect the process

T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6 Decision rules for selecting the best tool for testing software

These rules can be interpreted as

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing either in the requirement or the analysis phases and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing either in the requirement or the design phases and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing either in the requirement or the system usage phases and you need a semi-automated tool.


Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.

From the rules in Figure 3-6 we can also determine the disjoint groupings of attribute-values used for the compact mode of the algorithm. This is done as follows: determine for each attribute the sets of values that the attribute takes in individual decision rules, and remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). The value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2}, and {3, 4}.
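The subsumption-based grouping can be sketched as follows. This is a minimal illustration; `branch_value_sets` is a hypothetical helper, not part of AQDT-2:

```python
def branch_value_sets(value_sets):
    """For compact mode: keep only the value sets that do not subsume
    (are not strict supersets of) another value set of the attribute."""
    sets = {frozenset(s) for s in value_sets}
    return {s for s in sets if not any(t < s for t in sets)}

# x1 takes the value sets {2}, {3}, {1,2}, {1}, {4} in the rules of Figure 3-6;
# {1,2} is dropped because it subsumes {1} and {2}:
x1_branches = branch_value_sets([{2}, {3}, {1, 2}, {1}, {4}])

# x2 takes {2}, {3,4}, {1}, plus {1,2,3,4} contributed by rules silent on x2:
x2_branches = branch_value_sets([{2}, {3, 4}, {1}, {1, 2, 3, 4}])
```

The results reproduce the groupings described above: individual values for x1, and {1}, {2}, {3, 4} for x2.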


Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools to use for testing a given software system.

Figure 3-7 A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)

Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3, and 4). Rules are represented by collections of cells in the intersection of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules. Rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, collections of cells corresponding to some of the initial rules are marked R11, R21, R31, and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].

Figure 3-8 a) Decision rules; b) Derived decision tree

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2, and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was ignored in the data. These decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.


a) Ignoring the supporting metric    b) Ignoring the type of the tool

Figure 3-9 Decision trees learned ignoring the support metric and the type of the testing tool

It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10 A decision tree learned without the cost attribute

3.4 Tailoring Decision Structures to the Decision-making Situation

Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm is implemented in a new system, AQDT-2, which transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.

Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.


3.4.1 Learning Cost-Dependent Decision Structures

As described in Sec. 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) when developing a decision structure. In the default setting of LEF, the test's cost is the first criterion and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation involving the other elementary criteria. If an attribute has high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.

3.4.2 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution over the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have:

P(Ci | b1, ..., bk) = P(Ci) · P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from different classes, we have:


P(Ci) = twi / Σ_{j=1}^{m} twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σ_{j=1}^{m} wj / Σ_{j=1}^{m} twj    (3-12)

By substituting (3-10), (3-11), and (3-12) in (3-9), we obtain:

P(Ci | b1, ..., bk) = wi / Σ_{j=1}^{m} wj    (3-13)
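As the derivation shows, formula (3-13) reduces to normalizing the per-class counts at the node, since the class totals twi cancel. A minimal sketch with hypothetical counts:

```python
def class_probabilities(w):
    """Estimate P(Ci | b1,...,bk) at a node, as in (3-13): the per-class
    counts of training examples that reached the node, normalized.
    The class totals twi cancel out of the Bayesian derivation."""
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical node reached by 6 training examples of C1 and 2 of C2:
probs = class_probabilities([6, 2])
```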

A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

3.4.3 Coping with noise in training data

The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
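The truncation step itself is a simple filter. A minimal sketch, where rules are represented as hypothetical (class, condition, t-weight) triples and the t-weight is taken to be the number of training examples a rule covers:

```python
def truncate_rules(rules, threshold):
    """Rule truncation for noisy data: drop rules whose t-weight falls
    below the threshold reflecting the expected noise level."""
    return [(cls, cond, t) for cls, cond, t in rules if t >= threshold]

# Hypothetical ruleset; the lightly supported rule "[x2=3]" is assumed
# to be an artifact of noise and is truncated away:
rules = [("P", "[x1=1]", 40), ("P", "[x2=3]", 2), ("N", "[x1=2]", 35)]
kept = truncate_rules(rules, threshold=5)
```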


3.5 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes, and contains ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity ≤ 75]
Play <= [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building the decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute to be the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all other criteria.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.

The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criteria would perform better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria are applied to the learned rules, they provide very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute with the most balanced appearance of its values in different rules. The dominance criterion prefers attributes that exist in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.

3.6 Decision Structures vs. Decision Trees

This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.


Table 3-7 A comparison between decision structures and decision trees

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11, which are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes; this decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.


Figure 3-11 Decision structures learned by AQDT-2 using different criteria: a) using the disjointness criterion (P = Positive, N = Negative; 5 nodes); b) using the importance score criterion (7 nodes, 9 leaves)

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the "Imam's example", that represents a class of problems for which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example is that the information-based criteria rely on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class P and "-" means the example belongs to class N. Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples per value of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

Figure 3-12 The "Imam's example": a) training examples; b) the optimal decision tree. An example where learning decision structures (trees) from rules is better than learning them from examples

AQ15c learned the following rules from this data:

P <= [x1=1][x2=1] v [x1=2][x2=2]
N <= [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
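The failure mode can be checked directly on a scaled-down version of the example (two binary attributes instead of the full dataset of Figure 3-12): for the concept "P iff x1 = x2" with balanced values, every value of x1 splits the examples into equal halves of P and N, so the information gain of x1 alone is zero:

```python
from math import log2
from itertools import product

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(examples, attr):
    """Information gain of attribute index `attr` over (values, class) pairs."""
    labels = [cls for _, cls in examples]
    gain = entropy(labels)
    for v in {ex[attr] for ex, _ in examples}:
        subset = [cls for ex, cls in examples if ex[attr] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# All combinations of two binary attributes; class is P iff x1 = x2.
examples = [((x1, x2), "P" if x1 == x2 else "N")
            for x1, x2 in product([1, 2], repeat=2)]
gain_x1 = info_gain(examples, 0)   # zero: x1 alone carries no information
```

A gain-based learner therefore has no reason to place x1 (or x2) at the root, while the AQ rules name exactly these attributes.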

An example of a problem for which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <= [x1=2][x2=2]
N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10/9); 2) comparing the average number of tests required to make a decision (13n for the decision tree and 85 for decision rules); and 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using the new attribute "x1=x2=2" with values 0 for "no" and 1 for "yes".

Figure 3-13 An example where decision rules are simpler than decision trees: a) the training data; b) the correct decision tree

CHAPTER 4 Empirical Analysis and Comparative Study

This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the system's parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments are applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the

training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each

problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training,

that is, for learning a concept description. The remaining examples in each case were used for

testing the obtained descriptions, to determine the predictive accuracy of the descriptions.
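The sampling protocol above can be sketched as follows (a minimal illustration; the function and parameter names are mine, not the dissertation's):

```python
import random

def make_splits(examples, fractions=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
                samples_per_fraction=100, seed=0):
    """For each relative training size, draw `samples_per_fraction` random
    training samples; the complement of each sample is used for testing."""
    rng = random.Random(seed)
    for frac in fractions:
        k = round(len(examples) * frac)
        for _ in range(samples_per_fraction):
            train = rng.sample(examples, k)
            train_set = set(train)
            test = [e for e in examples if e not in train_set]
            yield train, test
```

With 9 sizes and 100 samples each, this yields the 900 training sets and 900 complementary testing sets per problem reported in this chapter.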


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided

into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind

Bracing problem) was used to test and analyze the approach. The second set of problems

(Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for

additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments done on the first set of problems. The

best settings (best path from top to bottom), in terms of accuracy, time, and complexity, were used as

default settings for experiments on the second set of problems. Each path from the top of the graph

to the bottom represents a single experiment. For each path, the experiment was repeated over 900

times with different sets of different sizes of training examples.

Figure 4-1: Design of a complete experiment.


For each of these experiments, the testing examples were selected as the complementary set of the

training examples. Other experiments were performed in which the learning system AQ17 was used

instead of AQ15c. Analysis of some experiments included visualization of the training examples

and the concept learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different

decision structures learned for different decision-making situations were visualized, as were different

but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database), 9 different relative sample sizes of training examples were selected (10%, ..., 90%); 100 random samples of each size were drawn from the original data for training; the 100 complementary sets which remain from the original data after drawing the training data were

used for testing (900 samples for training and their 900 complementary samples for testing).

162 different parametrical experiments per training dataset (18 x 9); 16,200 experiments per sample size (9 samples); 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2); 199,800 experiments per problem (first portion + C4.5 + constructive induction); 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.);

73 days (estimated running time).

The following subsection includes a complete experimental analysis of the wind bracing problem.

Each subsection following that will describe a partial or full experimental analysis of one of the other

problems.

4.2 Experiments with Average-Size, Complex, and Noise-Free Problems: Wind

Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for

determining the structural quality of a tall building design. The quality of the design is partitioned

into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is

characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3),

number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of

horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly

selected to serve as training examples and 115 (34%) were used for testing the obtained decision

structure. In the first phase, the training examples were used to determine a set of decision rules. This

was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules

obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents values

of the four elementary criteria for each attribute occurring in the rules, for the step of determining

the root of the decision structure. For each class, the row marked "values" lists the values occurring in

the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the

ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b

v ...], where a, b, ... are all the legal values of A.
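This padding step can be sketched as follows (a minimal illustration with rules represented as attribute-to-value-set mappings; the names are mine, and the disjointness scoring formula itself is not reproduced here):

```python
def pad_rule(rule, attribute, domain):
    """If a rule does not test `attribute`, treat it as if it contained the
    condition [attribute = a v b v ...] over the attribute's full domain."""
    if attribute not in rule:
        rule = dict(rule)            # do not mutate the caller's rule
        rule[attribute] = set(domain)
    return rule

# Example: a rule that does not mention x6 is padded with all legal values.
rule = {"x1": {1}, "x2": {1, 2}}
padded = pad_rule(rule, "x6", domain={1, 2, 3, 4})
```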

Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1,3] (t: 18, u: 18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t: 3, u: 3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t: 2, u: 2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t: 2, u: 2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t: 2, u: 2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t: 2, u: 2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t: 2, u: 2)

Decision class C2:
1. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t: 28, u: 19)
2. [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t: 17, u: 6)
3. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t: 10, u: 4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t: 10, u: 2)
5. [x1=3..5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t: 9, u: 4)
6. [x1=2][x2=1,2][x3=1,2][x5=1,2,3][x4=1][x6=1][x7=1] (t: 7, u: 6)
7. [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t: 6, u: 4)
8. [x1=3..5][x2=2][x3=1][x7=1][x4=1,2][x5=1,2,3][x6=1,3] (t: 5, u: 5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t: 4, u: 4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1,3] (t: 4, u: 4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t: 4, u: 2)

Decision class C3:
1. [x1=2..5][x2=1,2][x3=1,2][x7=1..4][x4=1,2][x5=1,3][x6=2..4] (t: 41, u: 32)
2. [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2,4] (t: 27, u: 20)
3. [x1=1,3][x2=1][x3=1,2][x7=1,4][x4=2][x5=1,2][x6=2,3] (t: 19, u: 6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t: 13, u: 8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2,4] (t: 5, u: 5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1,4] (t: 4, u: 4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t: 1, u: 1)

Figure 4-2: Decision rules determined by AQ15c from the wind bracing data.


Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single

highest and all other attributes are beyond the tolerance threshold, no other attributes are

considered). Branches stemming from the root are marked by values of x6 (in general, they could be

groups of values) according to the way they occur in the decision rules; groups subsumed by

other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the

rules containing these values. The process repeats for a branch until all rules assigned to each

branch are of the same class. That class is then assigned to the leaf.

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2

(using the default LEF). The structure was evaluated on the testing examples. The prediction

accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf C3.

Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are

recalculated only for those rules which contain [x6=1] as one of their conjunctions. In this

example, x1 has the highest importance score, so it was selected to be a node in the structure. This

process is repeated for each subset of rules until the decision structure is completed.
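The overall procedure can be sketched as follows (a simplified illustration, not the dissertation's implementation: rules are class-labeled attribute-to-value-set mappings, and a crude disjointness-style score stands in for the full LEF criteria):

```python
def score(rules, attr, domain):
    """Crude stand-in for the disjointness criterion: prefer attributes
    whose value sets differ most between decision classes."""
    by_class = {}
    for cls, conds in rules:
        vals = conds.get(attr, set(domain))   # missing attribute = all values
        by_class.setdefault(cls, set()).update(vals)
    classes = list(by_class)
    return sum(len(by_class[a] ^ by_class[b])
               for i, a in enumerate(classes) for b in classes[i + 1:])

def build_structure(rules, domains):
    """Grow a decision structure top-down from rules: pick the best-scoring
    attribute, branch on its values, and recurse until one class remains."""
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:
        return next(iter(classes))            # leaf: a single decision class
    if not domains:
        return sorted(classes)                # leaf with candidate decisions
    attr = max(domains, key=lambda a: score(rules, a, domains[a]))
    node = {}
    for value in domains[attr]:
        subset = [(cls, conds) for cls, conds in rules
                  if value in conds.get(attr, set(domains[attr]))]
        sub_domains = {a: d for a, d in domains.items() if a != attr}
        node[value] = build_structure(subset, sub_domains) if subset else None
    return (attr, node)
```

Applied to the toy rules of Figure 3-13 (P <= [x1=2][x2=2]; N otherwise), this yields a one-level structure whose branches all end in single-class leaves.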

For comparison, the program C4.5 for learning decision trees from examples was also applied to

this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window

setting (the maximum of 20% of the number of examples and twice the square root of the number of

examples), with the number of trials set to one. C4.5 was chosen for the comparative studies

because it is one of the most accurate and efficient systems for learning decision trees from examples,

and because it is widely available.

The C4.5 program has the capability of generating a decision tree over a window of examples (a

randomly selected subset of the training examples). It starts with a randomly selected window of

examples, generates a trial tree, tests this tree against the remaining examples, adds some

misclassified examples to the original ones, and continues until either all training examples are

classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was

learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97

examples were classified correctly and 18 were mismatched.
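The windowing loop described above can be sketched as follows (a schematic illustration, not C4.5 itself; `learn` is a placeholder for any tree-building routine, here a simple lookup table):

```python
import random

def learn(window):
    """Placeholder for the tree builder: memorize the window's labels."""
    table = {x: y for x, y in window}
    return lambda x: table.get(x)

def windowed_learning(examples, initial_size, seed=0):
    """Grow a model from a random window, repeatedly adding examples
    the current trial model misclassifies, until none remain."""
    rng = random.Random(seed)
    window = rng.sample(examples, initial_size)
    while True:
        model = learn(window)
        misclassified = [(x, y) for x, y in examples
                         if (x, y) not in window and model(x) != y]
        if not misclassified:
            return model
        window += misclassified
```

The real C4.5 stops early when no better tree can be produced; this sketch keeps only the grow-until-consistent part of the loop.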

Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (root: x6; complexity: 17 nodes, 43 leaves).

Figure 4-4 shows a decision structure learned, in the default setting of AQDT-2's parameters, from the

AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing

examples resulted in 102 examples matched correctly and 13 examples mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition

that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision

cannot be made without knowing x1. This incomplete decision structure was tested on the 115 testing

examples, from which the value of x1 was removed. The decision structure classified 71 examples

correctly, 14 incorrectly, and 30 were assigned the "?" (indefinite) decision. The "?" leaves can be

replaced by sets of candidate decisions with their corresponding probability distributions.

Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves).

Figure 4-5: A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves).

Figure 4-6 presents a decision structure from Figure 4-5 in which leaves were assigned candidate

decisions with decision class probability estimates. Let us consider node x2. The example

frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using

equation (11), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be

approximated as P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11.
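Equation (11) is defined earlier in the dissertation and is not reproduced in this chapter, but the reported estimates are consistent with a simple relative-frequency form, P(Ci) = wi / Σj wj, which can be checked directly (a sketch with my own function name):

```python
def leaf_probabilities(weights):
    """Estimate class probabilities at a leaf from example frequencies."""
    total = sum(weights)
    return [w / total for w in weights]

# Frequencies at node x2 for classes C1..C4 (w1=31, w2=11, w3=0, w4=5).
probs = leaf_probabilities([31, 11, 0, 5])
# Rounded to two decimals this gives .66, .23, 0, .11, as reported above.
```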

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were

truncated under the assumption of a 10% noise level (this means that the rules whose combined t-weight

represented 10% or less of the coverage of the training examples in a given class were removed).

The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89%

for the decision structure in Figure 4-4).
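This truncation step can be sketched as follows (a minimal illustration with made-up rule records; I read "combined t-weight" as dropping the lightest rules of each class whose cumulative coverage stays within the noise level, and the exact pruning rule in AQDT-2 may differ in details):

```python
def truncate_rules(rules, noise_level=0.10):
    """Within each class, drop the lightest rules whose combined t-weight
    is at or below `noise_level` of the class's total t-weight."""
    kept = []
    by_class = {}
    for rule in rules:                     # rule = (class, conditions, t_weight)
        by_class.setdefault(rule[0], []).append(rule)
    for cls, class_rules in by_class.items():
        budget = noise_level * sum(t for _, _, t in class_rules)
        removed = 0
        for rule in sorted(class_rules, key=lambda r: r[2]):
            if removed + rule[2] <= budget:
                removed += rule[2]         # truncated as presumed noise
            else:
                kept.append(rule)
    return kept
```

For a class with t-weights like C1's in Figure 4-2 (18, 3, 2, 2, 2, 2, 2; total 31), the 10% budget of 3.1 removes a single t=2 rule under this reading.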

Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves).

Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves).

To demonstrate changes in the concept description learned by AQDT-2 under different decision-making

situations, four attributes were selected for visualizing the change in the learned concept

after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in

the first situation x5 was given a high cost; AQDT-2 generated a decision structure with four nodes

and six leaves. The predictive accuracy of this decision structure was 86.1%. The second decision-making

situation had x1 given a high cost; AQDT-2 learned a decision structure with five

nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal

situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified

by using only the four attributes which were used in building the initial decision trees. The visualization

diagram indicates different decision classes with different shades. Another shade is used to illustrate

cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7).

Also, white cells indicate that an accurate decision cannot be derived from the rules without

knowing the value of the removed attribute. In such cases, multiple decisions can be provided with

their appropriate probabilities.

Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data ("?" means the system cannot produce a decision without the missing attribute).

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c

for a set of learning problems with 18 different parameter settings (two types of

decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or

ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that

gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>; see Table

4-2) were selected for experiments with Subsystem II.

These experiments were performed on four learning problems (the three MONK's problems (Thrun,

Mitchell & Cheng, 1991) and the Wind Bracing problem (Arciszewski et al., 1992)). The best two

parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2.

Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the

predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value

in this table is the average predictive accuracy of running either one of the two programs

100 times on 100 distinct, randomly selected training datasets of the given size. Each of these runs was

tested with testing examples that represent the complement of the training example set.


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c

and AQDT-2 under different parameter settings of AQ15c. The term <Intr> denotes intersecting

covers, <Disj> disjoint covers, <Char> characteristic rules, and <Disc> denotes

discriminant rules; the number is the width of the beam search.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed

and selected parameters of Subsystem II were modified. The experiments were performed on

characteristic decision rules that were learned in intersecting or disjoint modes. For each dataset,

the result reported from each experiment is calculated as the average of 100 runs on different

training data for 9 different sample sizes. The parameters changed in this experiment were the

threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2

algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples

covered by rules belonging to different decision classes at a given node of the decision

structure/tree.
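Interpreted as a stopping criterion, the generalization degree can be sketched as follows (my own reading and naming; the dissertation's exact formula is not reproduced in this chapter): when the examples covered by minority-class rules at a node fall at or below the given ratio, the node is generalized into a leaf of the majority class.

```python
def should_generalize(class_coverage, generalization_degree=0.10):
    """Decide whether a node becomes a leaf: True when all but the majority
    class together cover at most `generalization_degree` of the examples."""
    total = sum(class_coverage.values())
    majority = max(class_coverage.values())
    return (total - majority) <= generalization_degree * total

# At a node where C2 covers 95 examples and C3 covers 5 (5% minority),
# the node is turned into a C2 leaf under the default 10% degree.
```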

Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy (%)).

Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2

with different parameter settings. The "default" curve means predictive accuracy obtained in the

default setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization

degree is 10%. The results show that with the wind bracing data it is better to reduce the

generalization degree to 3%. However, changing the pre-pruning degree did not improve the

predictive accuracy.

Comparative Study: This subsection presents a comparison between the decision trees

obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems

were set to their default parameters. All the results reported here are the average of 100 runs. For


each dataset, we reported the predictive accuracy, the complexity of the learned decision trees, and

the time taken for learning. Figure 4-11 shows a simple summary of these experiments.

Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data (panels: <Disj, Char> and <Intr, Char>; x-axis: relative sample size (%) of the training data).

Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data (panels: predictive accuracy, complexity, and learning time vs. relative size of training examples (%)).

4.3 Experiments with Small-Size, Simple, and Noise-Free Problems: MONK-1

This subsection describes an experimental analysis of the AQDT approach on the MONK-1

problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification

rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists

of two decision classes, Positive and Negative, and six attributes: x1: head-shape (values are

octagonal, square, or round); x2: body-shape (values are octagonal, square, or round); x3: is-smiling

(values are yes or no); x4: holding (values are sword, flag, or balloon); x5: jacket-color

(values are red, yellow, green, or blue); and x6: has-tie (values are yes or no).


The original problem was to learn a concept from 124 training examples (62 positive and 62

negative). These training examples constitute 29% of all possible examples (432); thus the

density of the training examples is relatively high. Figure 4-12 shows a visualization diagram,

obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and

negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c

for the MONK-1 problem. Table 4-3 shows a comparison of the evaluations of the AQDT-2

criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned

when using different criteria.

Figure 4-12: A visualization diagram of the MONK-1 problem (axes: x6, x5, x4).

The AQDT-2 program, running in its default mode with the optimality criterion set to minimize

the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with

41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also

applied to this same problem.

Positive rules:
1. [x5=1]
2. [x1=3][x2=3]
3. [x1=2][x2=2]
4. [x1=1][x2=1]

Negative rules:
1. [x1=1][x2=2,3][x5=2..4]
2. [x1=2][x2=1,3][x5=2..4]
3. [x1=3][x2=1,2][x5=2..4]

Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.

The C4.5 program did not produce a consistent and complete decision tree when run with its

default window size (the maximum of 20% of the number of examples and twice the square root of the

number of examples), nor with a 100% window size. After 10 trials with different window sizes, we

succeeded in making C4.5 produce a decision tree as optimal as AQDT-2's (using a window

size of 72.5%). This tree is

presented in Figure 4-14. Also in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was

used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that

takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These

rules were:

Pos <= [x5=1] v [x1=x2]    and    Neg <= [x5≠1] & [x1≠x2]
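These two rules define the entire MONK-1 concept, so they translate directly into a tiny classifier (a sketch; the value encodings, e.g., 1 = red for jacket-color, follow the attribute description above):

```python
def classify(x1, x2, x5):
    """MONK-1 concept from the AQ17-DCI rules: Positive iff the jacket
    color is red (x5 = 1) or head-shape equals body-shape (x1 = x2)."""
    return "Pos" if x5 == 1 or x1 == x2 else "Neg"

# e.g., a robot whose head-shape equals its body-shape is Positive
# regardless of jacket color.
```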

Table 4-3: Evaluation of the attribute selection criteria for the MONK-1 problem.

From these rules, the system produced the compact decision structure presented in Figure 4-15b.

It should be noted that the decision structures in Figures 4-14, 4-15a, and 4-15b are all logically

equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they

represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler

decision structure was produced (Figure 4-15a).

Figure 4-14: The decision tree for the MONK-1 problem generated by C4.5 (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative).

Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: a) from AQ15 rules (complexity: 5 nodes, 7 leaves); b) from AQ17 rules (complexity: 2 nodes, 3 leaves). (P = Positive, N = Negative)

Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments

involved running AQ15c for a set of learning problems with 18 different parameter settings

(two types of decision rules: characteristic or discriminant; three coverage modes:

intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and

10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and

<Char, Intr, 1>) were selected for experiments with Subsystem II. These experiments were

performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by

AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from

the decision rules.

Each value in that table is the average predictive accuracy of running either one of the two

programs 100 times on 100 distinct, randomly selected training datasets of the given size. Each of these

runs was tested with a testing example set that represented the complement of the training example

set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between

AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers,

<Disj> disjoint covers, <Char> characteristic rules, and <Disc> discriminant

rules.

Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data).

Experiments with Subsystem II: The same experiments were performed on the MONK-1

problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were

modified. The experiments were performed on characteristic decision rules that were learned in

intersecting or disjoint modes. For each dataset, the results reported from each experiment

were calculated as the average over 100 runs on different training data for 9 different sample sizes.

The parameters changed in this experiment were the threshold of pre-pruning of the decision

rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure

4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with

different parameter settings. The "default" curve means predictive accuracy obtained in the default

setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree

lL shy ~rshy

rJ

K a- AQDT

1----- AQlSc

bull bull

--~ -- --~ --~ Il

1 rr J

III AQDT

----- AQlSc

bull bull I

75

96

102

secton+-~~~~~~~~ 5 OJII +-~OL=p-bllOOt~F--I--I 4)

~o~~~~~--4-4~~

is 10%. The results show that with the MONK-1 data it is slightly better to reduce the

generalization degree to 3%. However, increasing the pre-pruning degree did not improve the

predictive accuracy.

Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data (panels: <Disj, Char> and <Intr, Char>; x-axis: relative sample size (%) of the training data).

Comparative Study: This subsection presents a comparison between the decision trees

obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to

their default parameters. The experiments were divided into two parts. All the results reported here

are the average of 100 runs. For each dataset, we reported the predictive accuracy, the complexity

of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary

of these experiments.

Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem (panels: predictive accuracy, complexity, and learning time vs. relative size of training examples (%)).

4.4 Experiments with Small-Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be

easily described as a DNF expression using its original attributes). The problem is described in a

similar way to the MONK-1 problem. The data consists of two decision classes, Positive and

Negative, and six attributes: x1: head-shape (values are octagonal, square, or round); x2:

body-shape (values are octagonal, square, or round); x3: is-smiling (values are yes or no); x4: holding

(values are sword, flag, or balloon); x5: jacket-color (values are red, yellow, green, or blue); and

x6: has-tie (values are yes or no). The original problem was to learn a concept from 169 training

examples. These training examples constitute 40% of all the possible

examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and

negative) and the concept to be learned.

Figure 4-19: A visualization diagram of the MONK-2 problem (axes: x6, x5, x4).

Experiments with Subsystem I: The two settings that gave the best results in terms of predictive

accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>). They were

selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules

learned by AQ15c from examples and the predictive accuracy of decision structures learned by

AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy

of running either one of the two programs 100 times on 100 distinct, randomly selected training

datasets of the given size. Each of these runs was tested with testing examples representing the

complement of the training examples.

Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c

and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj>

disjoint covers, <Char> characteristic rules, and <Disc> discriminant rules; the

number is the width of the beam search.

Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data).

Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and

selected parameters of Subsystem II were modified. For each dataset, the results reported from each

experiment were calculated as the average of 100 runs on different training data for 9 different

sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of

the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam,

1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by

AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained in

the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default

generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to

reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not

improve the predictive accuracy.

Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data (panels: <Disj, Char> and <Intr, Char>; x-axis: relative sample size (%) of the training data).

Comparative Study: This subsection presents a comparison between the decision trees

obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to

their default parameters. The experiments were divided into two parts. All the results reported here

are the average of 100 runs. For each dataset, we reported the predictive accuracy, the complexity

of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary

of these experiments.

Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem (panels: predictive accuracy, complexity, and learning time vs. relative size of training examples (%)).

4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples. Noisy examples are examples that are assigned the wrong decision class.

Figure 4-23: A visualization diagram of the MONK-3 problem.

Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples, and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.
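As an illustration of this evaluation protocol, the sampling-and-complement-testing loop can be sketched as follows (a minimal sketch, not the actual experiment code; `learn` and `classify` are placeholders standing in for the AQ15c/AQDT-2 programs):

```python
import random

def evaluate(learn, classify, examples, sample_frac, runs=100):
    """Average predictive accuracy over `runs` random train/test splits.

    Each run trains on a random sample of the given relative size and
    tests on the complement of that sample, as in the experiments above.
    """
    total = 0.0
    for _ in range(runs):
        k = int(sample_frac * len(examples))
        train = random.sample(examples, k)
        # The test set is the complement of the training sample.
        test = [e for e in examples if e not in train]
        model = learn(train)
        correct = sum(1 for x, y in test if classify(model, x) == y)
        total += correct / len(test)
    return total / runs
```

Because the test set is the complement of each random sample, its size shrinks as the training fraction grows, which matters for interpreting the error-rate curves discussed later in this chapter.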


Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed, and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.


[Four plots, MONK-3 &lt;Disj. Char.&gt;, &lt;Intr. Char.&gt;, &lt;Disj. Disc.&gt;, and &lt;Intr. Disc.&gt;: predictive accuracy (%) vs. the relative sample size (%) of the training data, comparing AQ17, AQDT-2, and AQ15c.]

Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem.

[Two plots, MONK-3 &lt;Disj. Char.&gt; and &lt;Intr. Char.&gt;: predictive accuracy (%) vs. the relative sample size (%) of the training data, comparing the default setting against modified parameter settings.]

Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data.

Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
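The arithmetic behind this effect is straightforward; the sketch below uses illustrative test-set sizes (99 and 10 examples), not the actual experiment figures:

```python
def error_rate(num_errors, test_set_size):
    """Error rate (%) contributed by a fixed number of misclassifications."""
    return 100.0 * num_errors / test_set_size

# One misclassification weighs little against a large test set
# but heavily against a small one.
large_test = error_rate(1, 99)   # ~1.01% when testing on 99 examples
small_test = error_rate(1, 10)   # 10.0% when testing on 10 examples
```

Since the test set is the complement of the training sample, a larger training fraction means a smaller test set, so a single error produces a visibly larger dip in the accuracy curve.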

[Three plots for MONK-3: predictive accuracy (%), tree complexity (number of nodes), and learning time (s) vs. the relative size of the training examples (%), comparing AQDT-2 and C4.5.]

Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem.

4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3) Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).

In this experiment the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

[Three plots for Breast Cancer: predictive accuracy (%), tree complexity (number of nodes), and learning time (s) vs. the relative size of the training examples (%), comparing AQDT-2, AQ15c, and C4.5.]

Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem.

4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples. A random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes. These attributes are: 1) Cap-shape, 2) Cap-surface, 3) Cap-color, 4) Bruises, 5) Odor, 6) Gill-attachment, 7) Gill-spacing, 8) Gill-size, 9) Gill-color, 10) Stalk-shape, 11) Stalk-root, 12) Stalk-surface-above-ring, 13) Stalk-surface-below-ring, 14) Stalk-color-above-ring, 15) Stalk-color-below-ring, 16) Veil-type, 17) Veil-color, 18) Ring-number, 19) Ring-type, 20) Spore-print-color, 21) Population, and 22) Habitat.

To perform the experiment, the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.

In this problem C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning times are about the same.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

[Three plots for Mushroom: predictive accuracy (%), tree complexity (number of nodes), and learning time (s) vs. the relative size of the training examples (%), comparing AQDT-2, AQ15c, and C4.5.]

Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem.

4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains

Learning Task-oriented Decision Structures from Structural Data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes: eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by 6 different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was generated such that they could completely describe any car in the train; see Table 4-7. Each train was described by one example of varying length. To identify the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij): the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
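The flattening scheme described above can be sketched as follows (the attribute values in the example are hypothetical; the actual AQDT-2 input format differs in detail):

```python
def flatten_train(cars):
    """Encode a train (a list of per-car attribute-value lists) as one example.

    Attribute x_ij: i = car position (1..4), j = attribute number (1..8),
    so x32 labels the attribute describing the shape of the third car.
    """
    example = {}
    for i, car in enumerate(cars, start=1):
        for j, value in enumerate(car, start=1):
            example[f"x{i}{j}"] = value
    return example

# A two-car train: each car described by 8 attribute values (made-up here).
train = [["long", "rect", 2, "none", 3, 2, "circle", 1],
         ["short", "rect", 2, "flat", 2, 1, "triangle", 1]]
ex = flatten_train(train)
```

Because trains have different numbers of cars, examples produced this way have different lengths, which is why the system's support for varying-length examples matters here.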

Table 4-7: The set of attributes and their values used in the trains problem (the first digit of each attribute name stands for the car number, 1-4).

Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.

Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more (14/14). In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).

[Decision structure diagrams: a) decision structure learned using only descriptions of Car 1; b) decision structure learned using only descriptions of Car 2; c) decision structure learned using only descriptions of Car 3.]

Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations.

4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the second half in the other class).
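The default window rule quoted above can be written out directly (a sketch; `default_window_size` is an illustrative helper, not a C4.5 routine, and C4.5's own rounding may differ):

```python
import math

def default_window_size(n_examples):
    """C4.5-style initial window: the maximum of 20% of the examples
    and twice the square root of the number of examples."""
    return max(int(0.2 * n_examples), int(2 * math.sqrt(n_examples)))

# For the 216-example Congressional Voting data, 20% of the examples
# (43) exceeds twice the square root (about 29), so the 20% term wins.
window = default_window_size(216)
```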


Table 4-8 and Figures 4-30 a and b show the results graphically for the Congressional Voting-1984 problem. The results indicate that AQDT-2-generated decision trees had a higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change of the size of the training example set was smaller.

Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.

[Two plots: a) accuracy of the decision tree as a function of the size of the set of training examples; b) size of the decision tree as a function of the size of the set of training examples.]

Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2.

4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4-2 to 4-9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from these rules. This section also includes some examples describing different decision-making situations and the task-oriented decision structures learned for each situation.

Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others, than for another type of cover), the best cover is determined according to the best width of the beam search and the best rule type.

Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.

It was clear that AQDT-2 works better with characteristic rules than with discriminant ones. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise it is considered higher or lower; 2) if the average learning time is within ±0.1 seconds, the learning time is considered the same.
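These two summarization heuristics can be stated precisely as one small comparison rule (a sketch with hypothetical result values; `compare` is not part of either system):

```python
def compare(metric_a, metric_c, tie_margin):
    """Return 'AQDT-2', 'C4.5', or 'Same' under the tie-margin heuristic.

    metric_a and metric_c are measurements for AQDT-2 and C4.5 where
    higher is better; differences within tie_margin count as a tie.
    """
    if abs(metric_a - metric_c) <= tie_margin:
        return "Same"
    return "AQDT-2" if metric_a > metric_c else "C4.5"

# Accuracy within +/-2% counts as the same.
verdict_acc = compare(94.5, 93.0, tie_margin=2.0)
# Learning time within +/-0.1 s counts as the same; lower time is
# better, so negate the times before comparing.
verdict_time = compare(-0.35, -0.60, tie_margin=0.1)
```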

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparing the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same/X means similar performance, where AQDT-2 has the advantage if X=A and C4.5 has the advantage if X=C.

Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but accurate decision trees, while C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of decision trees learned by C4.5 grows relatively faster as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules and that C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is supposed to be much less than that of C4.5. However, on some data sets it takes more time, because in some situations there is not enough information to reach a decision and the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not implemented yet.

To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparison between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples for both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class. The white areas represent non-positive coverage.

Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem.

Figure 4-32 shows the testing results of the decision rules learned by AQ15c. All marked shaded cells indicate false positive errors (AQ15c classifies the cell as positive while it should be negative). All marked non-shaded cells indicate false negative errors (AQ15c classifies the cell as negative while it should be positive).

Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31.

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, one shading marks portions of the representation space that were classified as positive by both AQ15c and AQDT-2; a second shading marks portions that were classified as positive by AQ15c but as negative by AQDT-2; and a third marks portions where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).

This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.

Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem.

Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative errors. One marking indicates portions of the representation space with false positive errors; another represents portions with false negative errors. Comparing Figures 4-34 and 4-32, more errors occurred because of the over-generalization.

Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree.

Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%.

CHAPTER 5 CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.

The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated on line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology that, in order to determine a decision structure from examples, it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first, and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.

The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than the decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time of determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.

Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.

REFERENCES


Arciszewski T., Bloedorn E., Michalski R., Mustafa M. and Wnek J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano F., Giordana A., Saitta L., DeMarchi D. and Brancadori F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano F., Matwin S., Michalski R.S. and Zhang J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn E., Wnek J., Michalski R.S. and Kaufman K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec M. and Bratko I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko I. and Lavrac N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko I. and Kononenko I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman L., Friedman J.H., Olshen R.A. and Stone C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.

Clark P. and Niblett T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik B. and Bratko I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik B. and Karalic A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," in M. Bramer (Ed.), Research and Developments in Expert Systems, Cambridge: Cambridge University Press.

Hunt E., Marin J. and Stone P. (1966), Experiments in Induction, New York: Academic Press.

Imam I.F. and Michalski R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?," Lecture Notes in Artificial Intelligence (689), Komorowski J. and Ras Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer Verlag.

Imam I.F. and Michalski R.S. (1993b), "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg L., Ras Z. and Zemankova M. (Eds.), Kluwer Academic Publishers, MA.

Imam I.F., Michalski R.S. and Kerschberg L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.

Imam I.F. and Vafaie H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.

Imam I.F. and Michalski R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.

Kohavi R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi R. and Li C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian O.L. and Wolberg W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie D., Muggleton S., Page D. and Srinivasan A. (1994), International East-West Challenge, Oxford University, UK.

Michalski R.S. (1973), "AQVAL/1--Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.

Michalski R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.

Michalski R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).

Michalski R.S., Mozetic I., Hong J. and Lavrac N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.

Michalski R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Michalski R.S. and Imam I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.

Mingers J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.

Niblett T. and Bratko I. (1986), "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.

Quinlan J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan JR (1983) Learning efficient classification procedures and their application to chess end games in RS Michalski JG Carbonell and TM Mitchell (Eds) Machine Learning An Artificial Intelligence Approach Los Altos MOtgan Kaufmann

Quinlan JR (1986) Induction of Decision frees Machine Learning Vol 1 No1 (pp 81-106) Kluwer Academic Publishers

Quinlan JR (1987) Simplifying Decision frees International Journal of Man-Machine Studies 27 (pp 221-234)

Quinlan J R (1990) Probabilistic decision trees in Y Kodratoff and RS Michalski (Eds) Machine Learning An Artificial Intelligence Approach Vol III San Mateo CA MOtgan Kaufmann Publishers (pp 63-111) June

Quinlan J R (1993) C4S Programs for Machine Learning MOtgan Kaufmann Los Altos California

Smyth P Goodman RM and Higgins C (1990) A Hybrid Rule-basedlBayesian Classifier Proceedings ofECAI 90 Stockholm August

Sokal R and Rohlf F (1981) Biometry Freeman Pub San Francisco

Thrun SB Mitchell T and Cheng J (1991) (Eds) The MONKs Problems A Performance Comparison of Different Learning Algorithms Technical Report Carnegie Mellon University October

Wnek J and Michalski RS (1994) Hypothesis-driven Constructive Induction in AQI7-HCI A Method and Experiments Machine Learning Vol14 No2 pp 139-168 Kluwer Academic Publishers

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition.

Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the First International Workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-96). He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.



Dedication

To my mother, my brothers, and my sister

TABLE OF CONTENTS

TITLE Page

ABSTRACT 1

CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6

CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19

CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
3.3 Generating Decision Structures from Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structure to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decision Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53

CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments With Large Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large Size, Complex, and Noisy Problems: Mushroom Classifications 84
4.8 Experiments With Small Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88

CHAPTER 5
CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96

REFERENCES 98

VITA 102

LIST OF TABLES

No. TITLE Page

2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria from four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing a software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using condition of AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90

LIST OF FIGURES

No. TITLE Page

2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing a software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem 91
4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem 93
4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1 94

DERIVING TASK-ORIENTED DECISION STRUCTURES

FROM DECISION RULES

Ibrahim M. Fahmi Imam, Ph.D.

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, to generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge-base from the function of using the knowledge-base for decision-making. The first function focuses on learning accurate, consistent, and complete concept descriptions expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that decision structures learned by it usually outperform, in terms of accuracy and average size of the decision structures, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using applications to artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing Democratic and Republican voting records).

CHAPTER 1 INTRODUCTION

1.1 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.

A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation); the branches are assigned possible test outcomes or ranges of outcomes; and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when leaves are assigned single, definite decisions. Thus, the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
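To make the representation concrete, the definition above can be sketched as a small data type. This is a hypothetical illustration only; the names (`Node`, `Leaf`, `classify`) and the dictionary-based encoding are not taken from any system described in this dissertation.

```python
# A minimal sketch of a decision structure: tests at nodes, outcomes on
# branches, and leaves holding candidate decisions with probabilities
# (an empty dict stands for an undetermined decision).
from dataclasses import dataclass, field
from typing import Union

@dataclass
class Leaf:
    decisions: dict  # decision -> probability; {} means undetermined

@dataclass
class Node:
    test: str                                     # attribute (or derived test) evaluated here
    branches: dict = field(default_factory=dict)  # outcome -> Node | Leaf

def classify(node: Union[Node, Leaf], obj: dict) -> dict:
    """Apply tests in the order imposed by the structure until a leaf."""
    while isinstance(node, Node):
        node = node.branches[obj[node.test]]
    return node.decisions

# Example: a two-test structure over illustrative attributes x1 and x2.
structure = Node("x2", {
    0: Leaf({"A1": 1.0}),
    1: Leaf({"A2": 1.0}),
    2: Node("x1", {0: Leaf({"A1": 1.0}),
                   1: Leaf({"A3": 1.0}),
                   2: Leaf({"A2": 1.0})}),
})
print(classify(structure, {"x1": 1, "x2": 2}))  # {'A3': 1.0}
```

Note that this sketch already covers the general case (candidate decisions with probabilities at leaves); restricting every leaf to a single certain decision recovers an ordinary decision tree.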

Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain, and the gain ratio (Quinlan, 1979, 83, 86), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).

A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed, which is very expensive, or the tools needed are not available). In such situations, it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root), and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful either to modify the structure so that it does not contain that attribute or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).

A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation such as a set of decision rules: tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees), which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify to adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.

An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form and transform it to a decision structure when it is needed for decision-making. This method allows one to create the decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generation from training examples. Thus, this process could be done on-line, without any delay noticeable to the user. Such "virtual" decision structures are easy to tailor to any given decision-making situation.

This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations, it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus, such an approach has many potential advantages.

This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned either by the rule learning system AQ15 (Michalski et al., 1986) or by the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating "unknown" nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to inability to evaluate an attribute associated with an intermediate node.

To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments has been designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structure learned from them, and comparing decision trees learned by the AQDT-2 system with the well-known C4.5 system (Quinlan, 1993) for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2, and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes). MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5, on average, with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.

1.2 The Problem Statement

There are many limitations and problems that accompany using decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. The method proposed several attribute selection criteria. These criteria are of increasing power of the main criterion, the cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree, based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.

Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint. In other words, for any two rules, there exists a condition with the same attribute but with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram all the condition parts of the given rules, and marking them with the action specified by each rule.
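Definitions 2-1 through 2-3 can be checked mechanically. The sketch below is a hypothetical encoding (not taken from the systems discussed in this dissertation): a rule is represented as a mapping from attributes to sets of admissible values, and a cover is disjoint when its rules are pairwise logically disjoint in the sense of Definition 2-2.

```python
# Two rules are logically disjoint when they share an attribute whose
# admissible value sets do not overlap (Definition 2-2).
from itertools import combinations

def logically_disjoint(r1: dict, r2: dict) -> bool:
    return any(r1[a].isdisjoint(r2[a]) for a in r1.keys() & r2.keys())

def is_disjoint_cover(rules: list) -> bool:
    """A cover is disjoint if its rules are pairwise logically disjoint."""
    return all(logically_disjoint(a, b) for a, b in combinations(rules, 2))

# [x2=0], [x2=1], [x1=1][x2=2] are pairwise disjoint on x2:
print(is_disjoint_cover([{"x2": {0}}, {"x2": {1}}, {"x1": {1}, "x2": {2}}]))  # True
# [x2=0] and [x1=1] overlap (e.g., x1=1, x2=0 satisfies both):
print(is_disjoint_cover([{"x2": {0}}, {"x1": {1}}]))  # False
```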

Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree must have at least n leaves). Michalski (1978) has shown that if only one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1]; x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1 An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).

The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute to be selected is the one which breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).

Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.

Example: Learn a decision tree from the following decision table.

The minimal cover consists of the following rules:

A1 <= [x2=0] v [x1=0][x2=2]
A2 <= [x2=1] v [x1=2][x2=2]
A3 <= [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of the decision tree is x2. Then three branches are attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case, x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.

Figure 2-2 A decision tree learned from the decision table in Table 2-1
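The MAL evaluations in this example can be reproduced with a short sketch. This is a hypothetical encoding, not code from Michalski's implementation: a rule maps each constrained attribute to its set of admissible values; an attribute that is unmentioned in a rule ranges over its whole domain, so the rule spans several of that attribute's branches and is counted as broken. The domains below (all three-valued) are an assumption for illustration.

```python
def static_cost_estimate(attribute, rules, domains):
    """First-degree (MAL) cost estimate: the number of rules the
    attribute would break, i.e., rules admitting more than one of
    the attribute's values."""
    return sum(
        1 for rule in rules
        if len(rule.get(attribute, domains[attribute])) > 1
    )

# The minimal cover above, with assumed three-valued attributes:
domains = {"x1": {0, 1, 2}, "x2": {0, 1, 2}, "x3": {0, 1, 2}, "x4": {0, 1, 2}}
rules = [
    {"x2": {0}},                 # A1 <= [x2=0]
    {"x1": {0}, "x2": {2}},      # A1 <= [x1=0][x2=2]
    {"x2": {1}},                 # A2 <= [x2=1]
    {"x1": {2}, "x2": {2}},      # A2 <= [x1=2][x2=2]
    {"x1": {1}, "x2": {2}},      # A3 <= [x1=1][x2=2]
]
for a in ("x1", "x2", "x3", "x4"):
    print(a, static_cost_estimate(a, rules, domains))
# x1: 2, x2: 0, x3: 5, x4: 5 -- matching the text, so x2 becomes the root
```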

22 Learning Decision Trees from Examples

Decision tree learning is a field that concerned with generating decision tree that classifies a set

of examples according to the decision classes they belong to The essential aspect of any

inductive decision tree method is the attribute selection criterion The attribute selection

criterion measures how good the attributes are for discriminating among the given set of

decision classes The best attribute according to the selection criterion is chosen to be assigned

to a node in the tree The fIrst algorithm for generating decision trees from examples was

proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has subsequently been modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.
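The divide-and-conquer scheme can be illustrated with a minimal sketch; this is our own illustrative code (all names are ours, and the attribute selection criterion is passed in as a parameter), not Hunt's or Quinlan's actual implementation:

```python
def build_tree(examples, attributes, select):
    """Generic divide-and-conquer tree building. 'examples' is a list of
    (attribute-value dict, class) pairs; 'select' is the pluggable
    attribute selection criterion."""
    classes = {cls for _, cls in examples}
    if len(classes) == 1:                 # pure subset: make a leaf
        return classes.pop()
    if not attributes:                    # no attributes left: majority leaf
        return max(classes,
                   key=lambda c: sum(1 for _, cl in examples if cl == c))
    best = select(examples, attributes)   # choose the test attribute
    rest = [a for a in attributes if a != best]
    tree = {}
    for v in {ex[best] for ex, _ in examples}:   # one branch per value
        subset = [(ex, cl) for ex, cl in examples if ex[best] == v]
        tree[(best, v)] = build_tree(subset, rest, select)
    return tree
```

Any of the selection criteria discussed below (gain, gain ratio, Chi-square) could be plugged in as `select`.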

Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria for selecting attributes use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, "minimizing added leaves" (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory. These criteria measure the information conveyed

by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure and the gain criterion (Quinlan, 1979, 1983), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions for determining whether or not there is a correlation. The attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial complete decision tree and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring the probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.

2.2.1 Building Decision Trees Using Information-based Criteria

This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs for learning decision trees from examples.

Learning decision trees from examples requires a collection of examples. Each example is represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing a decision tree from a set of cases (Hunt, Marin & Stone, 1966).

The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the Gain Criterion, which uses the frequency of each decision class in the given set of training examples.

Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and partitions the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.

The Gain Criterion. The gain criterion is based on information theory: the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem X1, ..., Xn are the given attributes and C1, ..., Cm are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:


freq(Ci, S) = number of examples in S that belong to Ci    (2-1)

Suppose that |S| is the total number of examples in S; then the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|. The information conveyed by the message that a selected example belongs to a given decision class Ci is therefore -log2 (freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by:

info(S) = - Σ_i (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)  bits    (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples T, info(T) determines the average amount of information needed to identify the class of an example in T.
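Equation 2-2 translates directly into a few lines of code; the following is a small illustrative sketch (function name ours):

```python
from math import log2

def info(class_counts):
    """Entropy of a set S (equation 2-2), given the frequency of each
    decision class in S; returned in bits."""
    total = sum(class_counts)
    return -sum((n / total) * log2(n / total)
                for n in class_counts if n > 0)

# For the weather data used below (9 "Play" and 5 "Don't Play" examples):
print(round(info([9, 5]), 2))   # 0.94
```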

Suppose that we select an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets T1, ..., Tk, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, denoted info_X(T), is the sum over all subsets of the information conveyed by each subset, weighted by its probability:

info_X(T) = Σ_i (|Ti| / |T|) info(Ti)    (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by:

gain(X) = info(T) - info_X(T)    (2-4)

The attribute selected is the one with the maximum gain value.

The Gain Ratio Criterion. This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains attributes such as a social security

number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.

Following steps similar to those above, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by:

split info(T) = - Σ_i (|Ti| / |T|) log2 (|Ti| / |T|)    (2-5)

The gain ratio is given by:

gain ratio(X) = gain(X) / split info(X)    (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.

Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the set of training examples.

First, determine the amount of information gained by selecting the attribute "outlook" to be the root of the decision tree. This attribute divides the training examples into three subsets: "sunny", with five examples, two of which belong to the class "Play"; "overcast", with four examples, all of which belong to the class "Play"; and "rain", with five examples, three of which belong to the class "Play". To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class "Play" and five belong to the class "Don't Play".

info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.940 bits

When using "outlook" to divide the training examples, the information becomes:


info_outlook(T) = 5/14 (-2/5 log2 (2/5) - 3/5 log2 (3/5))
                + 4/14 (-4/4 log2 (4/4) - 0/4 log2 (0/4))
                + 5/14 (-3/5 log2 (3/5) - 2/5 log2 (2/5)) = 0.694 bits

By substituting in equation 2-4, the gain of information resulting from using the attribute "outlook" to split the training examples equals 0.246. The gain of information for "windy" is 0.048.

Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for "outlook" is determined as follows:

split info(T) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits

The gain ratio for "outlook" = 0.246 / 1.577 = 0.156
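The figures above can be reproduced with a short script (a sketch with our own variable names, using the per-value class counts quoted for "outlook"):

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(n / total * log2(n / total) for n in counts if n > 0)

# (Play, Don't Play) counts for each value of "outlook" in Table 2-2.
partition = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}
total = 14

info_T = info([9, 5])                                               # ≈ 0.940 bits
info_X = sum(sum(c) / total * info(c) for c in partition.values())  # ≈ 0.694 bits
gain = info_T - info_X                                              # ≈ 0.246
split_info = -sum(sum(c) / total * log2(sum(c) / total)
                  for c in partition.values())                      # ≈ 1.577
gain_ratio = gain / split_info                                      # ≈ 0.156
```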


Figure 2-3 A decision tree learned using the gain criterion for selecting attributes

The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.
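A simplified sketch of this binary split is shown below. It tries the midpoint between each pair of successive values and keeps the cut with the highest information gain; C4.5's actual threshold selection differs in detail (for instance, it takes the largest existing data value not exceeding the cut), and all names here are ours:

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(n / total * log2(n / total) for n in counts if n > 0)

def best_threshold(values, classes):
    """Choose a cut t for the test 'value <= t' on a continuous
    attribute by maximizing information gain."""
    pairs = sorted(zip(values, classes))
    labels = sorted(set(classes))
    base = info([classes.count(c) for c in labels])
    best_t, best_gain = None, -1.0
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        t = (pairs[i][0] + pairs[i + 1][0]) / 2      # candidate midpoint cut
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        split = (len(left) / len(pairs) * info([left.count(c) for c in labels])
                 + len(right) / len(pairs) * info([right.count(c) for c in labels]))
        if base - split > best_gain:
            best_t, best_gain = t, base - split
    return best_t, best_gain
```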

Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
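The Laplace ratio translates into a one-line function (name ours):

```python
def laplace_error(n, e):
    """Estimated error rate of a leaf covering n training examples,
    e of them misclassified: (e + 1) / (n + 2)."""
    return (e + 1) / (n + 2)

# Even a leaf with 10 examples and no errors keeps a nonzero error estimate:
print(round(laplace_error(10, 0), 4))   # 0.0833
```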

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses the Chi-square statistic to measure the association between two attributes. When building decision trees, the method is implemented such that it determines the association between each attribute and the decision classes. The attribute selected is the one with the greatest association value.

To determine the Chi-square value for an attribute, let aij be the number of examples in class number i for which the attribute A takes value number j. In other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by:

Chi-square(A) = Σ_i=1..n Σ_j=1..m [ (aij - Eij)^2 / Eij ]    (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,

Eij = (T_Ci * T_Vj) / T    (2-8)

where T_Ci and T_Vj are, respectively, the total number of examples belonging to the decision class Ci and the total number of examples where the attribute A takes value vj; T is the total number of examples.
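Equations 2-7 and 2-8 translate directly into code. The sketch below (function name ours; it assumes no empty row or column in the table) computes the expected frequencies exactly; the hand calculation that follows rounds them to one decimal, so its totals differ slightly:

```python
def chi_square(table):
    """Chi-square statistic for a contingency table table[i][j] of
    decision class i versus attribute value j (equations 2-7 and 2-8)."""
    row = [sum(r) for r in table]            # T_Ci: totals per class
    col = [sum(c) for c in zip(*table)]      # T_Vj: totals per value
    total = sum(row)
    chi = 0.0
    for i, r in enumerate(table):
        for j, a_ij in enumerate(r):
            e_ij = row[i] * col[j] / total   # expected frequency E_ij
            chi += (a_ij - e_ij) ** 2 / e_ij
    return chi

# Rows: classes (Play, Don't Play); columns: attribute values.
windy = [[3, 6], [3, 2]]            # windy = true / false
outlook = [[2, 4, 3], [3, 0, 2]]    # sunny / overcast / rain
```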

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of the different combinations of values between the decision class and both the "outlook" and the "windy" attributes. Table 2-4 shows the expected values, computed from T_Ci and T_Vj, of the frequencies in Table 2-3 of the different attribute values for the different decision classes.

To determine the association value between the decision classes and both the attribute "windy" and the attribute "outlook", the observed Chi-square values (using expected frequencies rounded to one decimal) are:

Chi-square(Windy, Class) = (3-3.9)^2/3.9 + (3-2.1)^2/2.1 + (6-5.1)^2/5.1 + (2-2.9)^2/2.9
= 0.21 + 0.39 + 0.16 + 0.28 = 1.04

Chi-square(Outlook, Class) = (2-3.2)^2/3.2 + (4-2.6)^2/2.6 + (3-3.2)^2/3.2 + (3-1.8)^2/1.8 + (0-1.4)^2/1.4 + (2-1.8)^2/1.8
= 0.45 + 0.75 + 0.01 + 0.80 + 1.40 + 0.02 = 3.43


Applying the same method to the other attributes, the results favor the attribute "outlook". Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset. Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5 Attribute selection criteria and their basic evaluation measures

Info Measure (IM), Gain, and Gain Ratio:   Entropy(S) = - Σ_i (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)
G-statistic:   G = 2N * IM   (N = number of examples)
Chi-square:   Chi-square(A, B) = Σ_i Σ_j [ (aij - Eij)^2 / Eij ]

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria done by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, G-statistic, Gini index of diversity, Marshall correction, and Gain Ratio. The overall results show that the Gain Ratio criterion gave the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples that may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information


theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, a zero cell adds the maximum association between any two attributes, because the Chi-square value of a zero cell is the expected value of this cell.


Now let us demonstrate results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees using eleven different criteria. In the final results, he compared the total number of nodes and the total error rate produced by each criterion over all the given problems. Table 2-8 shows the final results for five selected criteria only.


Table 2-8 Results comparing the total accuracy and size of decision trees for different attribute selection criteria on four problems

This experiment was performed on four real-world data sets. These data are concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of iris, and recognizing LCD display digits. The data was divided randomly, 70% for training and 30% for testing. For more details, see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas of Imam and Michalski (1993b).

In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a


new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes containing rules only represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees that are used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is "Safe", except that if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is "Lost", except that if x6=1 it is "Safe", except that if x7=1 it is "Lost".

The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path. In other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once; however, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.


Safe <= [x1=2]
Safe <= [x2=2]
Safe <= [x3=2]
Safe <= [x4=1] & [x5=2]
Safe <= [x4=1] & [x5=3]
Safe <= [x6=1] & [x7=2]
Safe <= [x6=1] & [x7=3]
Safe <= [x4=2] & [x5=2]
Safe <= [x4=2] & [x5=3]

Lost <= [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a combination of that attribute's values. For each subset the process is repeated, until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains the examples where A takes value 0 and belongs to class C0, or takes value 1 and belongs to class C1; the second subset contains the examples where A takes value 0 and belongs to class C1, or takes value 1 and belongs to class C0. The number of nodes at the first level (after the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced exponentially to one.

It is easy for the reader to identify some major disadvantages of this approach, including the following. The average size of such decision structures is estimated to be very large, especially when there

is no similarity (i.e., no strong patterns) or logical relationship in the data. The time used to learn such a decision structure is relatively very high compared to systems for learning decision trees from examples. Finally, it could be better to search for the attribute that reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and those two approaches. The EDAG and HOODG systems are unreleased prototype systems.

(From Table 2-9: AQDT decision structures are easy to understand; EDAG decision structures are difficult to read; HOODG decision structures are easy to understand.)

CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.

The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given:
- A set of decision rules in conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of

declarative knowledge are that they do not impose any order on the evaluation of the attributes, and that, due to the lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on line.

Such "virtual" decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once; then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized, and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows an architecture of the proposed methodology.

Figure 3-1 Architecture of the AQDT approach (its two components: learning knowledge from the database, and the decision-making process)


It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values; some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system, specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection gives a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called AQ, starts with a "seed" example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the "star" of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description that covers the largest number of positive examples (to minimize the total number of rules needed) and, with second priority, the one that involves the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).

If the selected description does not cover all the examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few examples or with many examples, and can optimize the description according to a variety of easily modifiable hypothesis quality criteria.
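The covering loop of this simplest algorithm can be sketched roughly as follows. This is a heavily simplified illustration with our own names: where AQ generates a star of maximally general complexes by extending the seed against the negative examples and then applies a preference criterion, the sketch merely drops conditions from the seed while no negative example becomes covered:

```python
def covers(rule, example):
    """A rule is a dict of attribute -> required value."""
    return all(example[a] == v for a, v in rule.items())

def generalize(seed, negatives):
    """Drop conditions from the seed while no negative is covered
    (a crude stand-in for AQ's star generation)."""
    rule = dict(seed)
    for a in list(rule):
        trial = {k: v for k, v in rule.items() if k != a}
        if not any(covers(trial, n) for n in negatives):
            rule = trial
    return rule

def aq_cover(positives, negatives):
    """Sequential covering: pick an uncovered seed, generalize it,
    keep the rule, and repeat until all positives are covered."""
    rules, uncovered = [], list(positives)
    while uncovered:
        rule = generalize(uncovered[0], negatives)
        rules.append(rule)
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return rules
```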


The learned descriptions are represented in the form of a set of decision rules expressed in an attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, i.e., one stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top". A characteristic description of the tables would also include properties such as "have four legs", "have no back", and "have four corners". Discriminant descriptions are usually much shorter than characteristic descriptions.

Another option provided in AQ15 controls the relationship among the generated descriptions ("rulesets" or "covers") of different decision classes. In the IC (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different classes are logically disjoint. The DC mode descriptions are usually more complex, both in the number of rules and in the number of conditions. There is also a DL mode (a Decision List mode, also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order: if ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In the IC and DC modes, rulesets can be evaluated in any order.

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes and selects from them the most promising ones, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the US Congress. Each rule is a conjunction of elementary conditions, and each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute "State" (of the Representative) should take the value "northeast" or "northwest" to satisfy the condition.

R1: [Gas_conc_ban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] & [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"

The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record of a Democratic representative:

Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State from = northeast, State population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler corp. = not registered

By expressing the elementary statements in the example as conditions, and linking the conditions by conjunction, the example can be re-expressed as a decision rule. Thus, decision rules and examples formally differ only in their degree of generality.

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski, 1993a, b). A description of the AQDT-2 method for learning task-oriented decision structures from decision rules is also included, and finally the methodology is illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning due to their simplicity. Decision trees built this way can be quite efficient, as long as they are used in decision-making situations for which they are optimized and these situations remain relatively stable. Problems arise when these situations change significantly and the assumptions under which the tree was built no longer hold. For example, in some situations it may be difficult to determine the value of the attribute assigned to some node. One would like to avoid measuring this attribute and still be able to classify the example, if this is potentially possible (Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also desirable if there is a significant change in the frequency of occurrence of examples from different classes. A restructuring of a decision tree to suit the above requirements is, however, difficult to do. The reason for this is that a decision tree is a form of decision structure representation that imposes constraints on the evaluation order of the attributes which are not logically necessary.


One problem in developing a method for generating decision structures from decision rules is to design an attribute selection criterion that is based on the properties of the rules rather than of the training examples. A decision rule normally describes a number of possible examples, only some of which are examples that have actually been observed, i.e., training examples. An attribute selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based on counting the numbers of training examples covered by each attribute-value, or on the frequency of decision classes in the training examples, as is done in learning decision trees from examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision rules constitute a more powerful knowledge representation than decision trees. They can directly represent a description in an arbitrary disjunctive normal form, while decision trees can directly represent only descriptions in the disjoint disjunctive normal form, in which all conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary decision rules into a decision tree, one faces the additional problem of handling logically intersecting rules.

The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2 system is based on the earlier work by Michalski (1978), which introduced a general method for generating decision trees from decision rules. The method aimed at producing decision trees with the minimum number of nodes or the minimum cost (where the cost was defined as the total cost of classifying unknown examples, given the cost of measuring individual attributes and the expected probability distribution of examples of different decision classes). More explanations are provided in the following section.

3.3.1 The AQDT-2 attribute selection method

This section describes the AQDT-2 method for building a decision structure from decision rules. The method for building a single-parent decision structure is similar to that used in standard methods of building a decision tree from examples. The major difference is that it assigns tests (attributes) to the nodes using criteria based on the properties of the decision rules (these include statistics about the examples covered by each rule, in the case of learning rules from examples), rather than statistics characterizing the frequency of training examples per decision class, per attribute-value, or per conjunctions of both. Other differences are that the branches may be assigned an internal disjunction of values (not only a single value, as in a typical decision tree), and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be attributes, or names standing for logical or mathematical expressions that involve several attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably (to distinguish between an attribute and a name standing for an expression, the latter is called a "constructed attribute").

At each step, the method chooses from the available set of tests the test that has the highest utility (see below) for the given set of decision rules. This test is assigned to the node. The branches stemming from this node are assigned test values or disjoint groups of values (in the form of a logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each branch is associated with a reduced set of rules, determined by removing conditions in which the selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset indicate the same decision class, a leaf node is created and assigned this decision class. The process continues until all nodes are leaf nodes. If it is not possible to reduce the rule set further because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of candidate decisions with associated probabilities (see Section 4.2).

The test (attribute) utility is a combination of one or more of the following elementary criteria: 1) cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which captures the effectiveness of the test in discriminating among decision rules for different decision classes; 3) importance, which determines the importance of a test in the rules; 4) value distribution, which characterizes the distribution of the test importance over its set of values; and 5) dominance, which measures the test presence in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, i.e., the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and the decision rule sets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote the sets of the values (outcomes) of A that are present in the rule sets for classes C1, C2, ..., Cm, respectively. If the ruleset for some class, say Ci, contains a rule that does not involve test A, then Vi is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by:

                { 0,  if Vi = Vj
D(A, Ci, Cj) =  { 1,  if Vi ⊂ Vj or Vi ⊃ Vj                            (3-1)
                { 2,  if Vi ∩ Vj ≠ ∅, Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj
                { 3,  if Vi ∩ Vj = ∅

where ∅ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) may seem to be an improved criterion. However, it does not clearly distinguish between the two cases (i.e., for both situations the disjointness would be similar). The current equation is better because it gives higher scores to attributes that classify different subsets of the two decision classes than to attributes that classify only a subset of one decision class.

Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the sum of the degrees of class disjointness of each decision class:

                   m                              m
Disjointness(A) =  Σ  D(A, Ci),  where D(A, Ci) = Σ  D(A, Ci, Cj)      (3-2)
                  i=1                            j=1
                                                 j≠i

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the attribute selected is the one with the fewer values.
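To make Definitions 3-1 and 3-2 concrete, the disjointness computation can be sketched in Python. This is an illustrative reading of the equations, not the AQDT-2 implementation; rules are assumed here to be dictionaries mapping each attribute to the set of values allowed by its condition.

```python
def class_value_set(rules, attr, domain):
    """Vi: values of `attr` appearing in a class's ruleset.

    If any rule omits the attribute, the whole domain is used (Def. 3-1 text).
    """
    values = set()
    for rule in rules:                  # each rule: dict attr -> set of values
        if attr not in rule:
            return set(domain)
        values |= rule[attr]
    return values

def pair_disjointness(vi, vj):
    """Degree of disjointness D(A, Ci, Cj), equation (3-1)."""
    if vi == vj:
        return 0
    if vi < vj or vi > vj:              # proper subset or superset
        return 1
    if vi & vj:                         # overlap, but neither contains the other
        return 2
    return 3                            # disjoint value sets

def disjointness(rulesets, attr, domain):
    """Disjointness(A): sum of D(A, Ci, Cj) over ordered class pairs, eq. (3-2)."""
    v = {c: class_value_set(rs, attr, domain) for c, rs in rulesets.items()}
    return sum(pair_disjointness(v[ci], v[cj])
               for ci in v for cj in v if ci != cj)
```

For two classes whose value sets are disjoint, each direction contributes 3, giving the pair the maximum total of 6.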

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined, from the root of the tree to any leaf node, in order to reach a decision.
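For illustration, the ANT of a tree can be computed by averaging the depths of all root-to-leaf paths. The tree encoding used here, a (test, {value: subtree}) pair with class-name strings as leaves, is a hypothetical representation chosen for this sketch.

```python
def average_number_of_tests(tree):
    """ANT: mean number of tests over all root-to-leaf paths of a tree
    encoded as (attribute, {value: subtree}), with strings as leaves."""
    def path_lengths(node, depth):
        if not isinstance(node, tuple):          # leaf: one complete path
            return [depth]
        _, branches = node
        return [d for child in branches.values()
                for d in path_lengths(child, depth + 1)]
    lengths = path_lengths(tree, 0)
    return sum(lengths) / len(lengths)
```

For a tree with one leaf at depth 1 and two leaves at depth 2 this gives (1+2+2)/3 = 5/3, the value appearing in the proof of Theorem 1 below.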

Definition: A decision structure is a one-node-per-level decision structure if at each level there is only one node and zero or more leaves.

Such a decision structure can be generated by combining into one branch all branches whose associated sets of decision rules belong to more than one decision class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes Ci and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not a subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the same set of values in both classes is a trivial one). Assume that branches leading to subsets with the same decision class are combined into one branch. In the first case, there will be only two branches. The first leads to a leaf node, and the other leads to an intermediate node where another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three branches are created. Two branches lead to leaf nodes, where all values at each branch belong to only one, and a different, decision class. The third branch leads to an intermediate node where another attribute should be selected that further classifies the decision classes. The minimum ANT in this case is 6/4. In the third case, only two branches are generated, each leading to a leaf node with a different decision class. In this case the minimum ANT is 1.

Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that when more than one attribute-value branch leads to leaves belonging to the same decision class, these branches are combined into one branch in the decision structure. The symbol "1" means that an attribute is needed to classify the two decision classes; in such cases there will be at least two additional paths.

D(A, Ci) = 0   D(A, Cj) = 1   D(A, Ci) = 2, D(A, Cj) = 2   D(A, Ci) = 3, D(A, Cj) = 3

Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes

The average number of tests required for making a decision in each possible case is determined in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks highly the attributes that reduce the average number of tests required for decision-making. The theorem can be proved in the general case.

ANT = 3/2   ANT = 5/3   ANT = 1

("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, this means that there are more decision classes where D(A, Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj) than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute are 0, 2, 4, or 6. For all positive values of D(B) = 2, 4, or 6, it is clear that attribute B should have a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A. Similarly, B is better at classifying more pairs of decision classes than A. This implies that B is a better classifier than A.

Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples covered by the rules involving this test. Decision rules learned by an AQ learning program are accompanied by information on their strength. Rule strength is characterized by the t-weight and the u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the total-weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes C1, ..., Cm, involving n tests A1, ..., An, with the number of rules associated with class Ci denoted by ri, the importance score is defined as follows.


Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

          m
IS(Aj) =  Σ  IS(Aj, Ci)                                                (3-3.1)
         i=1

where

              ri
IS(Aj, Ci) =  Σ  Rik(Aj)                                               (3-3.2)
             k=1

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by:

           { t-weight,  if Aj belongs to rule Rik
Rik(Aj) =  {                                                           (3-4)
           { 0,         otherwise

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.

The importance score method has been separately compared, as a feature selection method, with a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method produced an equal or higher accuracy on three real-world problems than that reported for the GA method, while selecting fewer attributes. In addition, the IS method was significantly faster than the GA.

Value distribution: The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by:

VD(Aj) = IS(Aj) / vj                                                   (3-5)

where vj is the number of legal values of Aj.
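Equations (3-3.1) through (3-5) reduce to a simple aggregation when, as assumed in this sketch, each rule is represented as a pair of its condition part and its t-weight; the function names here are illustrative, not part of AQDT-2.

```python
def importance_score(rulesets, attr):
    """IS(Aj), eq. (3-3.1): aggregate the t-weights (numbers of training
    examples covered) of all rules whose condition part contains `attr`."""
    return sum(t_weight
               for rules in rulesets.values()          # over classes Ci
               for conditions, t_weight in rules       # over rules Rik
               if attr in conditions)                  # Rik(Aj), eq. (3-4)

def value_distribution(rulesets, attr, n_legal_values):
    """VD(Aj) = IS(Aj) / vj, eq. (3-5): of two tests with equal importance,
    prefer the one with fewer legal values."""
    return importance_score(rulesets, attr) / n_legal_values
```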

Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore, for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3]&[x4=1] is multiplied out to two rules with condition parts [x3=1]&[x4=1] and [x3=3]&[x4=1].
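The multiplying-out step just described can be sketched as follows; the rule representation (attribute mapped to its set of disjoined values) is an assumption of this illustration.

```python
from itertools import product

def multiply_out(conditions):
    """Expand internal disjunctions into plain conjunctive rules, e.g.
    [x3=1 v 3]&[x4=1] -> [x3=1]&[x4=1] and [x3=3]&[x4=1]."""
    attrs = sorted(conditions)
    return [dict(zip(attrs, values))
            for values in product(*(sorted(conditions[a]) for a in attrs))]

def dominance(rulesets, attr):
    """Count the multiplied-out rules whose condition part contains `attr`."""
    return sum(len(multiply_out(conds))
               for rules in rulesets.values()
               for conds in rules
               if attr in conds)
```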

The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed as a percentage. The criteria are applied to tests in the order defined by LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (from the top value). The default LEF is:

<Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>      (3-6)

where t1, t2, t3, t4, and t5 are tolerance thresholds (in percentages); their default values are 0. The default value of the cost of each test is 1.

The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the second (importance) criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the third criterion, value distribution (normalized IS), is used, and then similarly the fourth criterion (dominance). If there is still a tie, the method selects the best attribute randomly.
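A minimal sketch of LEF-style selection, assuming each criterion is given as a scoring function plus a tolerance expressed as a fraction of the top score, and that criteria to be minimized (such as cost) are passed with negated scores. This is a generic reading of the scheme, not the AQDT-2 code.

```python
import random

def lef_select(attributes, criteria, seed=None):
    """Lexicographic evaluation functional with tolerances (LEF).

    `criteria` is an ordered list of (score_fn, tolerance) pairs; every
    score is maximized, so a criterion to be minimized should return a
    negated value.  Tolerance is a fraction of the top score.
    """
    candidates = list(attributes)
    for score_fn, tol in criteria:
        if len(candidates) == 1:
            break
        scores = {a: score_fn(a) for a in candidates}
        best = max(scores.values())
        margin = abs(best) * tol
        # keep only candidates scoring within the tolerance of the top value
        candidates = [a for a in candidates if scores[a] >= best - margin]
    return random.Random(seed).choice(candidates)   # residual ties: random
```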

If there is a non-uniform frequency distribution of examples of different classes, then the selection criterion uses a modified definition of the disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified into a given class:

                   m
Disjointness(A) =  Σ  D(A, Ci) × Frq(Ci)                               (3-7)
                  i=1

where Frq(Ci) is the expected frequency of examples of class Ci, and it is assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF:

<Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>     (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.

3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting at each step the best test according to the ranking criteria described above and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is also connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first in descending order of the number of rules that contain the attribute, and second in ascending order of the number of the attribute's legal values.

The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.

To generate decision structures from rules, the AQDT-2 method prefers either characteristic or discriminant disjoint rule descriptions (given by an expert or learned by a system). Disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The algorithm is as follows.

The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent this attribute.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it the group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing the condition [A = i v ...]; if a branch is assigned values i v j, then associate with it all rules containing the condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] ≡ [x=1]&[y=a] v [x=1]&[y=b], assuming that a and b are the only legal values of y.) All rules associated with the given branch constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree end in leaf nodes, stop. Otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
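Steps 1-4 above can be sketched as a recursive procedure (standard mode, one branch per legal value). The rule representation and the `select_attr` parameter, which stands in for the LEF ranking of Section 3.3.1, are assumptions of this illustration.

```python
def build_tree(context, attributes, select_attr, domains):
    """Recursive skeleton of Steps 1-4 of the AQDT-2 algorithm (standard
    mode).  `context` is a list of (conditions, decision_class) rules,
    where conditions map an attribute to its set of allowed values."""
    classes = {cls for _, cls in context}
    if len(classes) == 1:                        # Step 4: uniform -> leaf
        return classes.pop()
    a = select_attr(context, attributes)         # Step 1: best test
    branches = {}
    for value in domains[a]:                     # Step 2: branch per value
        reduced_context = []
        for conds, cls in context:               # Step 3: reduce the context
            if a not in conds:                   # consensus law: rule kept
                reduced_context.append((conds, cls))
            elif value in conds[a]:              # condition satisfied: drop it
                rest = {k: v for k, v in conds.items() if k != a}
                reduced_context.append((rest, cls))
        if reduced_context:                      # skip values no rule matches
            branches[value] = build_tree(reduced_context,
                                         [x for x in attributes if x != a],
                                         select_attr, domains)
    return (a, branches)
```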

To select an attribute to be a node in the decision tree (Steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses through all decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF. The second iteration evaluates each attribute's disjointness for each decision class against the other decision classes.


To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

     m
r =  Σ  Ri     (m is the number of decision classes)
    i=1

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as:

Cmpx(Iter1) = O(r × s)

In the second iteration, the disjointness is calculated between the decision classes and all attributes. The complexity of the second iteration can be given by:

Cmpx(Iter2) = O(n × m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

l = max{m, r}                                                          (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node in the decision tree, say the Node Complexity NC(AQDT), is given by:

NC(AQDT) = O(l × n)

Usually l equals the number of rules associated with the given node. Thus the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The Level Complexity of the AQDT algorithm, LC(AQDT), can be given by:

LC(AQDT) < O(l × n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree. However, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm equal to (l × s × o), where o is the number of non-leaf nodes at the given level. In such cases, either (l × o ≤ r) or (l × s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by:

LC(AQDT) = O(2 × s × r/2) < O(n × l) = NC(AQDT)

a) per one level    b) per one path
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes

Note also that after an attribute is selected to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structure of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch are not tested again.

Since the disjointness criterion selects the attribute which minimizes the average number of tests, ANT, the AQDT algorithm generates decision trees with the least number of levels. The number of levels per decision tree is supposed to be less than or equal to the minimum of the number of attributes and the number of rules. Consider k as the number of levels in a given decision tree:

k ≤ min{n, r}                                                          (3-10)

There are two cases that represent the most complex situations, Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels will be a function of the logarithm of the number of rules. In such a case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by:

Complexity(AQDT) = O(l × n × log r)                                    (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels per decision tree equals one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, it is not likely to get such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In such a case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus, the level complexity of this decision tree is estimated as:

LC(AQDT) = O(l × log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT algorithm in such cases is given by:

Complexity(AQDT) = O(l × k × log n)                                    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11), and (3-12), the complexity of the AQDT algorithm is determined by:

Cmplx(AQDT) = O(r × k × log l)                                         (3-13)

3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1); 2) the metrics that support the tool (x2); 3) the best phase for applying the tool (x3); and 4) the type of tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the process

T1 <= [x1=2]&[x2=2] v [x1=3]&[x3=1 v 3]&[x4=1]
T2 <= [x1=1 v 2]&[x2=3 v 4] v [x1=3]&[x3=1 v 2]&[x4=2]
T3 <= [x1=1]&[x2=1] v [x1=4]&[x3=2 v 3]&[x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software

These rules can be interpreted as follows:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phase, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phase, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phase, and you need a semi-automated tool.


Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.

From the rules in Figure 3-6, we can also determine the disjoint groupings of attribute-values used for the compact mode of the algorithm. This is done as follows: determine for each attribute the sets of values that the attribute takes in the individual decision rules, and remove those value sets that subsume other value sets. The remaining value sets are assigned to the branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). The value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2}, and {3, 4}.
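The disjoint-grouping step just described, collecting the value sets of an attribute and removing those that subsume others, can be sketched as follows, again with the illustrative attribute-to-value-set rule encoding.

```python
def branch_value_sets(rulesets, attr):
    """Disjoint value-set grouping for compact mode: collect the value sets
    of `attr` appearing in individual rules, then drop every set that
    subsumes (is a proper superset of) another collected set."""
    collected = []
    for rules in rulesets.values():
        for conds in rules:
            if attr in conds and conds[attr] not in collected:
                collected.append(conds[attr])
    return [s for s in collected
            if not any(t < s for t in collected)]   # t < s: t proper subset
```

Applied to the x1 value sets of Figure 3-6, this removes {1, 2} and leaves the four singleton sets, matching the individual-value branches described above.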


Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf, T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used for making decisions on which tools can be used for testing a given piece of software.

(Complexity: number of nodes: 4; number of leaves: 7)
Figure 3-7: A decision structure learned for classifying software testing tools

Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3, and 4). Rules are represented by collections of cells at the intersection of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules. Rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, collections of cells corresponding to some of the initial rules are marked R11, R21, R31, and R32. R11 denotes the first rule of class T1, i.e., [x1=2]&[x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2]&[x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1]&[x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4]&[x3=2 v 3]&[x4=3].

a) Decision rules    b) Derived decision tree
Figure 3-8: Visualization of the decision rules and the derived decision tree

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools; in other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T1 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T3. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, in which the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.


Figure 3-9: Decision trees learned ignoring a) the supporting metric and b) the type of the testing tool.

It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined as the highest ranked attribute, cannot be measured. The algorithm selected x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10: A decision tree learned without the cost attribute.

3.4 Tailoring Decision Structures to the Decision-Making Situation

Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy, but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This thesis presents an approach to building such task-oriented decision structures which advocates that they are built not from examples, but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm has been implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.

Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.


3.4.1 Learning Cost-Dependent Decision Structures

As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) when developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation, involving the other elementary criteria. If an attribute has a high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute, if possible.
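As an illustration (not AQDT-2's actual code; the function names and the additive tolerance are assumptions), the cost-first LEF step can be sketched as: filter out unmeasurable attributes, keep the cheapest ones within the tolerance, and break ties with the next elementary criterion.

```python
# Sketch of the cost-first LEF step described above: with tolerance 0, only
# the least expensive measurable attributes pass to the next criterion.

INFINITE = float("inf")          # cost of an attribute that cannot be measured

def lef_select(attrs, cost, next_score, tolerance=0.0):
    measurable = [a for a in attrs if cost[a] < INFINITE]
    cheapest = min(cost[a] for a in measurable)
    # attributes whose cost is within the tolerance of the cheapest survive
    survivors = [a for a in measurable if cost[a] <= cheapest + tolerance]
    return max(survivors, key=next_score)   # break ties by the next criterion
```

For example, with x1 unmeasurable and x2, x4 equally cheap, the next criterion (say, a disjointness score) decides between x2 and x4.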

3.4.2 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained, but a decision has to be made, it is useful to know the probability distribution for the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayes formula, we have

P(Ci | b1, ..., bk) = P(Ci) P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk, given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from the different classes, we have


P(Ci) = twi / Σj=1..m twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σj=1..m wj / Σj=1..m twj    (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain:

P(Ci | b1, ..., bk) = wi / Σj=1..m wj    (3-13)
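Equation (3-13) reduces the estimate to the example counts at the node. The sketch below (illustrative code, not from the dissertation) applies it to the counts w1=31, w2=11, w3=0, w4=5 that appear at node x2 in the wind bracing example of Section 4.2.

```python
# Estimating class probabilities at a node via equation (3-13):
# P(Ci | b1,...,bk) = wi / sum_j wj, where wi is the number of training
# examples of class Ci that passed the tests leading to the node.

def node_probabilities(w):
    total = sum(w.values())
    return {c: wi / total for c, wi in w.items()}

# Counts at node x2 in the wind bracing example (Section 4.2):
probs = node_probabilities({"C1": 31, "C2": 11, "C3": 0, "C4": 5})
# probs: P(C1) ≈ .66, P(C2) ≈ .23, P(C3) = 0, P(C4) ≈ .11
```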

A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

3.4.3 Coping with Noise in Training Data

The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree, and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
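A minimal sketch of this truncation step follows. It is illustrative, not the AQDT-2 implementation; the rule format and the per-rule thresholding are simplifying assumptions (the text above speaks of the combined t-weight per class).

```python
# Remove decision rules whose t-weight covers no more than `noise_level`
# of the training examples of their class, before building the structure.

def truncate_rules(rules, class_totals, noise_level=0.10):
    """rules: iterable of (cls, conditions, t_weight) triples."""
    return [(cls, cond, t) for cls, cond, t in rules
            if t / class_totals[cls] > noise_level]
```

For instance, with 31 training examples of class C1, a rule of t-weight 2 covers about 6% of them and would be dropped at the 10% noise level.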


3.5 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes, and contains ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <:: [outlook = overcast]
Play <:: [outlook = sunny] & [humidity ≤ 75]
Play <:: [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute for the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all of the criteria used preferred the first attribute (X) over the second (Y); however, the gain ratio criterion gave X a higher score than all the other criteria did.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "pos"; that is, when learning decision rules for a class C, all ambiguous examples belonging to C are considered as examples that belong to C only.

The table shows that the disjointness criterion outperformed all other criteria in Table 2-6, including the gain ratio criterion, when it was applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion performs better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria were applied to the learned rules, they provided very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the maximum balanced appearance of its values in different rules. The dominance criterion prefers attributes that appear in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first. In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.

3.6 Decision Structures vs. Decision Trees

This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.


Table 3-7: Comparison between Decision Structures and Decision Trees

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11. Both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes; this decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.


Figure 3-11: Decision structures learned by AQDT-2 using different criteria: a) using the disjointness criterion (x5 at the root; 5 nodes); b) using the importance score criterion (x1 at the root; 7 nodes, 9 leaves). P = Positive, N = Negative.

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called "Imam's example", that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example rests on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute. The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class P and "-" means the example belongs to class N. Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select

either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

Figure 3-12: Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples. a) Training examples; b) the optimal decision tree.

AQ15c learned the following rules from this data:

P <:: [x1=1][x2=1] v [x1=2][x2=2]
N <:: [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
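The failure mode can be reproduced on constructed data in the spirit of Imam's example (an illustrative reconstruction, not the exact dataset of Figure 3-12): since the class is P exactly when x1 = x2, each of x1 and x2 has zero information gain on its own, while an irrelevant but mildly class-correlated attribute x3 gets a positive gain and is preferred by a greedy information-gain criterion.

```python
# Demonstrates why information gain fails on an XOR-like concept:
# class = P iff x1 == x2, so gain(x1) = gain(x2) = 0, while a mildly
# class-correlated but irrelevant attribute x3 gets positive gain.
# The data is constructed for illustration, not taken from the dissertation.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    base = entropy([e["cls"] for e in examples])
    n = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        sub = [e["cls"] for e in examples if e[attr] == v]
        remainder += len(sub) / n * entropy(sub)
    return base - remainder

data = []
for x1 in (1, 2):
    for x2 in (1, 2):
        cls = "P" if x1 == x2 else "N"
        # x3 is irrelevant but unevenly distributed across classes (4:2 vs 2:4)
        for x3 in ([1] * 4 + [2] * 2 if cls == "P" else [1] * 2 + [2] * 4):
            data.append({"x1": x1, "x2": x2, "x3": x3, "cls": cls})

# gain(x1) == gain(x2) == 0, but gain(x3) > 0, so a greedy
# information-gain learner puts the irrelevant x3 at the root.
```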

An example of problems in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <:: [x1=2][x2=2]
N <:: [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem, learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10:9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and .85 for the decision rules); and 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using the new attribute "[x1=2] & [x2=2]", with values 0 for "no" and 1 for "yes".

Figure 3-13: An example where decision rules are simpler than decision trees. a) The training data; b) the correct decision tree.

CHAPTER 4 Empirical Analysis and Comparative Study

This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the system's parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes); MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracing for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic U.S. senators for 1984. The East-West Trains problem characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine the predictive accuracy of the descriptions.
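The sampling protocol above can be sketched as follows (an illustrative reconstruction; the function and variable names are assumptions): for each relative size, draw random training samples and use the complement of each sample for testing.

```python
# Generate (fraction, training sample, complementary test set) splits for
# a learning-curve experiment, as described above.
import random

def learning_curve_splits(examples, fractions, samples_per_size, seed=0):
    rng = random.Random(seed)
    for frac in fractions:
        k = round(frac * len(examples))
        for _ in range(samples_per_size):
            train = rng.sample(examples, k)      # without replacement
            chosen = set(train)
            test = [e for e in examples if e not in chosen]
            yield frac, train, test
```

The sketch assumes hashable examples; the protocol in the text used fractions 10% through 90% with 100 samples per size.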


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3 and the Wind Bracing problems) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting and the East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (best path from top to bottom), in terms of accuracy, time and complexity, were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.

Figure 4-1: Design of a complete experiment.


For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. The analysis of some experiments included visualization of the training examples and the concepts learned by each learning program (AQ15c, AQDT-2 and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database), 9 different relative sample sizes of training examples are selected (10%, ..., 90%). 100 random samples of each size are drawn from the original data for training; the 100 sets which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).

162 different parametrical experiments per training dataset (18 x 9)
16,200 experiments per sample size (9 sample sizes)
145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2)
199,800 experiments per problem (first portion + C4.5 + constructive induction)
999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
73 days (estimated running time)

The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsequent subsection describes a partial or full experimental analysis of one of the other problems.

4.2 Experiments with Average-Size, Complex and Noise-Free Problems: Wind Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples, and 115 (34%) were used for testing the obtained decision structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1..3][x5=1,2][x7=1..3] (t:18, u:18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1..3][x7=1,3,4] (t:3, u:3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t:2, u:2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t:2, u:2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t:2, u:2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1..3][x7=1..3][x5=3] (t:2, u:2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t:2, u:2)

Decision class C2:
1. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t:28, u:19)
2. [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t:17, u:6)
3. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t:10, u:4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2..4] (t:10, u:2)
5. [x1=3..5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1..4] (t:9, u:4)
6. [x1=2][x2=1,2][x3=1,2][x5=1..3][x4=1][x6=1][x7=1] (t:7, u:6)
7. [x1=3,4][x2=2][x3=2][x4=1..3][x5=1..3][x6=1][x7=1,2] (t:6, u:4)
8. [x1=3..5][x2=2][x3=1][x7=1][x4=1,2][x5=1..3][x6=1..3] (t:5, u:5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t:4, u:4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1..3] (t:4, u:4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1..3][x5=3][x7=1..4] (t:4, u:2)

Decision class C3:
1. [x1=2..5][x2=1,2][x3=1,2][x7=1..4][x4=1,2][x5=1..3][x6=2..4] (t:41, u:32)
2. [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2..4] (t:27, u:20)
3. [x1=1..3][x2=1][x3=1,2][x7=1..4][x4=2][x5=1,2][x6=2,3] (t:19, u:6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t:13, u:8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2..4] (t:5, u:5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1..3][x5=1][x6=1][x7=1..4] (t:4, u:4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t:1, u:1)

Figure 4-2: Decision rules determined by AQ15c from the wind bracing data.


Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by values of x6 (in general, they could be marked by groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for a branch until all rules assigned to that branch are of the same class. That class is then assigned to the leaf.

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 were misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 is ended by a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated, only for those rules which contain [x6=1] as one of their conjunctions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.

For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.

The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly-selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some misclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were misclassified.

Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves).

Figure 4-4 shows a decision structure learned, with the default settings of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples results in 102 examples matched correctly and 13 examples mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly and 14 incorrectly, and 30 were assigned the indefinite "?" decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.

Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves).

Figure 4-5: A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves).

Figure 4-6 presents the decision structure from Figure 4-5 in which the leaves were assigned candidate decisions with decision class probability estimates. Let us consider the node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (3-13), the probability estimates for classes C1, C2, C3 and C4 under the node x2 can be approximated as P(C1) = .66, P(C2) = .23, P(C3) = 0, and P(C4) = .11.

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that the rules whose combined t-weight represented 10% or less of the coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).

Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves).

Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules covering less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves).

To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves; the predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves; the predictive accuracy of this decision structure was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram indicates different decision classes with different shades. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4 or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute; in such cases, multiple decisions can be provided with their appropriate probabilities.

Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data ("?" means the system cannot produce a decision without the missing attribute).

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings for AQ15c (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5 and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and <Chr, Intr, 1>; see Table 4-2) were selected for the experiments with Subsystem II.

These experiments were performed on four learning problems (the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the Wind Bracing problem (Arciszewski et al., 1992)). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of the rules learned by AQ15c from examples, and the predictive accuracy of the decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with the testing examples that form the complement of the training example set.


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 for different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
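To illustrate how such a generalization degree can act as a stopping criterion, the following is a minimal sketch; the function name and interface are hypothetical simplifications, not the AQDT-2 implementation.

```python
def assign_leaf(class_counts, generalization_degree=0.10):
    """Decide whether a node can be generalized into a leaf.

    `class_counts` maps each decision class to the number of training
    examples covered at the node.  If the examples covered by classes
    other than the dominant one form a ratio below the generalization
    degree, the node is turned into a leaf labeled with the dominant
    class; otherwise the node must be expanded further.

    Illustrative simplification only, not the AQDT-2 code.
    """
    total = sum(class_counts.values())
    dominant = max(class_counts, key=class_counts.get)
    minority_ratio = (total - class_counts[dominant]) / total
    return dominant if minority_ratio <= generalization_degree else None

print(assign_leaf({"Positive": 97, "Negative": 3}))   # -> Positive (3% minority)
print(assign_leaf({"Positive": 60, "Negative": 40}))  # -> None (must expand)
```

Raising the degree trades accuracy for smaller structures, which is the trade-off examined in the experiments that follow.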

[Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem. Panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data.]

Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3 and the default generalization degree is 10. The results show that with the wind bracing data it is better to reduce the generalization degree to 3. However, changing the pre-pruning degree did not improve the predictive accuracy.

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.

[Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data; x-axis: relative sample size (%) of the training data.]

[Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data; x-axis: relative size of training examples (%).]

4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1

This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no).
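For reference, the MONK-1 target concept is (head-shape = body-shape) or (jacket-color = red) (Thrun, Mitchell & Cheng, 1991). A short sketch enumerating the full example space under the attribute encoding above:

```python
from itertools import product

# Attribute domains for the MONK problems (number of values per attribute).
DOMAINS = {"x1": 3, "x2": 3, "x3": 2, "x4": 3, "x5": 4, "x6": 2}

def monk1(e):
    """MONK-1 target concept: (head-shape = body-shape) or (jacket-color = red).

    x1 is head-shape, x2 is body-shape, x5 is jacket-color; value 1 of x5
    denotes red, following the encoding in the text.
    """
    return e["x1"] == e["x2"] or e["x5"] == 1

space = [dict(zip(DOMAINS, values))
         for values in product(*(range(1, n + 1) for n in DOMAINS.values()))]
print(len(space))                    # 432 possible examples
print(sum(monk1(e) for e in space))  # 216 positive examples
```

The concept splits the 432-example space exactly in half, which is consistent with the balanced 62/62 training sample described below.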


The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c from the MONK-1 problem. Table 4-3 shows a comparison of the evaluations of the AQDT-2 attribute selection criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.

[Figure 4-12: A visualization diagram of the MONK-1 problem.]

The AQDT-2 program, running in its default mode and with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion ranked first), produced a decision tree with 41 nodes. For comparison, the C4.5 program for learning decision trees from examples was also applied to this same problem.

Positive rules:
1. [x5 = 1]
2. [x1 = 3][x2 = 3]
3. [x1 = 2][x2 = 2]
4. [x1 = 1][x2 = 1]

Negative rules:
1. [x1 = 1][x2 = 2 v 3][x5 = 2..4]
2. [x1 = 2][x2 = 1 v 3][x5 = 2..4]
3. [x1 = 3][x2 = 1 v 2][x5 = 2..4]

Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.

The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% of the number of examples and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These rules were:

Pos <= [x5 = 1] v [x1 = x2] and Neg <= [x5 ≠ 1] & [x1 ≠ x2]

[Table 4-3: Evaluation of the attribute selection criteria for the MONK-1 problem.]

From these rules, the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15-a).
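Because the example space contains only 432 robots, the claimed logical equivalence can be checked exhaustively. A sketch comparing the AQ15c rules of Figure 4-13 with the compact rule obtained after constructive induction (the helper names are illustrative):

```python
from itertools import product

def aq15c_positive(e):
    """Positive rules learned by AQ15c (Figure 4-13)."""
    return (e["x5"] == 1
            or (e["x1"] == 3 and e["x2"] == 3)
            or (e["x1"] == 2 and e["x2"] == 2)
            or (e["x1"] == 1 and e["x2"] == 1))

def aq17_positive(e):
    """Compact rule after constructive induction: [x5=1] v [x1=x2]."""
    return e["x5"] == 1 or e["x1"] == e["x2"]

domains = {"x1": 3, "x2": 3, "x3": 2, "x4": 3, "x5": 4, "x6": 2}
space = [dict(zip(domains, v))
         for v in product(*(range(1, n + 1) for n in domains.values()))]
print(all(aq15c_positive(e) == aq17_positive(e) for e in space))  # True
```

Since x1 and x2 share the same three values, the three conjunctions [x1 = k][x2 = k] enumerate exactly the condition x1 = x2, which is why the two rule sets agree on every example.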

[Figure 4-14: The decision tree for the MONK-1 problem (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative).]

[Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: (a) from AQ15 rules (5 nodes, 7 leaves); (b) from AQ17 rules (2 nodes, 3 leaves). P = Positive, N = Negative.]

Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings for AQ15c (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and <Chr, Intr, 1>) were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.

Each value in that table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.

[Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem. Panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data.]

Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3 and the default generalization degree is 10. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3. However, increasing the pre-pruning degree did not improve the predictive accuracy.

[Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data; x-axis: relative sample size (%) of the training data.]

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.

[Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem; x-axis: relative size of training examples (%).]

4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using its original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.
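For reference, the MONK-2 target concept is "exactly two of the six attributes take their first value" (Thrun, Mitchell & Cheng, 1991), a counting condition that has no compact DNF expression over the original attributes. A short sketch:

```python
from itertools import product

# Attribute domains (number of values per attribute), as in the text.
DOMAINS = {"x1": 3, "x2": 3, "x3": 2, "x4": 3, "x5": 4, "x6": 2}

def monk2(e):
    """MONK-2 target concept: exactly two attributes take their first value."""
    return sum(v == 1 for v in e.values()) == 2

space = [dict(zip(DOMAINS, v))
         for v in product(*(range(1, n + 1) for n in DOMAINS.values()))]
positives = sum(monk2(e) for e in space)
print(len(space), positives)
```

Because membership depends on a count across all six attributes rather than on any fixed conjunction of values, every DNF cover over the original attributes requires many small rules, which is what makes this problem hard for the systems compared here.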

[Figure 4-19: A visualization diagram of the MONK-2 problem.]

Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Chr, Disj, 10> and <Chr, Intr, 1>). They were selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules.

Each value in that table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing example set that represents the complement of the training examples. Figure 4-20 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

[Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem. Panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data.]

Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3 and the default generalization degree is 10. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3. However, increasing the pre-pruning degree did not improve the predictive accuracy.

[Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data; x-axis: relative sample size (%) of the training data.]

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.

[Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem; x-axis: relative size of training examples (%).]

4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples. Noisy examples are examples that are assigned the wrong decision class.

[Figure 4-23: A visualization diagram of the MONK-3 problem.]

Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3 and the default generalization degree is 10. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.

[Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem. Panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data.]

[Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data; x-axis: relative sample size (%) of the training data.]

Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
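The arithmetic behind this effect can be checked directly; the test-set sizes below (99 and 10 examples) are illustrative values chosen to reproduce the rates quoted in the text:

```python
def error_rate(errors, test_set_size):
    """Error rate (%) contributed by a fixed number of misclassifications."""
    return 100.0 * errors / test_set_size

# One misclassification against a large held-out test set...
print(round(error_rate(1, 99), 2))  # -> 1.01
# ...versus the very same misclassification against a small test set.
print(round(error_rate(1, 10), 2))  # -> 10.0
```

The same absolute mistake is thus weighted roughly ten times more heavily at the largest training-sample sizes, producing apparent dips in the accuracy curves.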

[Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem; x-axis: relative size of training examples (%).]

4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number; 2) Clump Thickness; 3) Uniformity of Cell Size; 4) Uniformity of Cell Shape; 5) Marginal Adhesion; 6) Single Epithelial Cell Size; 7) Bare Nuclei; 8) Bland Chromatin; 9) Normal Nucleoli; and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).

In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are based on the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

[Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem; x-axis: relative size of training examples (%).]

4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes: 1) cap-shape; 2) cap-surface; 3) cap-color; 4) bruises; 5) odor; 6) gill-attachment; 7) gill-spacing; 8) gill-size; 9) gill-color; 10) stalk-shape; 11) stalk-root; 12) stalk-surface-above-ring; 13) stalk-surface-below-ring; 14) stalk-color-above-ring; 15) stalk-color-below-ring; 16) veil-type; 17) veil-color; 18) ring-number; 19) ring-type; 20) spore-print-color; 21) population; and 22) habitat.

To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.

In this problem, C4.5 produced better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning time is about the same. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

[Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem; x-axis: relative size of training examples (%).]

4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains

Learning Task-Oriented Decision Structures from Structural Data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes, eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by 6 different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To encode the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (i, j), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
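The (i, j) coding scheme can be sketched as follows; the per-car attribute names below are illustrative stand-ins for those of Table 4-7, and only the position-based naming convention is taken from the text:

```python
# Illustrative per-car attribute names; the j-th name yields code x<i><j>.
CAR_ATTRIBUTES = ["length", "shape", "walls", "roof",
                  "wheels", "load_number", "load_shape", "infront"]

def flatten_train(cars):
    """Encode a structured train (a list of per-car attribute dicts) as a
    flat attribute-value example with names like x32 = shape of car 3.

    Trains with different numbers of cars naturally yield examples of
    different length, as the text describes.
    """
    example = {}
    for i, car in enumerate(cars, start=1):                 # i: car position
        for j, name in enumerate(CAR_ATTRIBUTES, start=1):  # j: attribute no.
            if name in car:
                example[f"x{i}{j}"] = car[name]
    return example

train = [{"shape": "rectangle", "load_shape": "circle"},
         {"shape": "ellipse"},
         {"shape": "rectangle"}]
print(flatten_train(train)["x32"])  # shape of the third car -> rectangle
```

With this ordering, x17 would name the load shape of the first car, consistent with the way x17, x31, x34, and x37 are referenced in the decision-making situations below.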

[Table 4-7: The set of attributes and their values used in the trains problem; i stands for the car number (1-4).]

Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.

Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three or more cars. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).

[Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: (a) using only descriptions of Car 1 (4 nodes, 9 leaves); (b) using only descriptions of Car 2; (c) using only descriptions of Car 3 (6 leaves).]

4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the second half in the other class).

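The default window size mentioned above can be computed directly; a small sketch assuming the max(20% of N, 2√N) reading of C4.5's default:

```python
import math

def default_window(n_examples):
    """C4.5-style default initial window: the larger of 20% of the number
    of examples and twice the square root of the number of examples
    (as described in the text)."""
    return max(0.2 * n_examples, 2.0 * math.sqrt(n_examples))

for n in (100, 216, 1000):
    print(n, round(default_window(n), 1))
```

For the 216-example Congressional Voting data, the 20% term dominates, so the default window starts from about 43 examples.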

Table 4-8 and Figures 4-30-a and 4-30-b show the results graphically for the Congressional Voting-1984 problem. The results indicate that the decision trees generated by AQDT-2 had a higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change of the size of the training example set was smaller.

[Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.]

9 95 I 8 8

~ III 7IPa 94 0

6f Iie 93 S 5 lt

Col

i 492

3

91 2

5 10 15 20 25 30 35 40 45 50 55 60 5 10 15 20 25 30 35 40 45 50 55 60 Relathe slzeofthe tralDlng eumples (i) Relative size oItbe training eumples (i)

a) Accuracy of the decision tree as a function b) Size of the decision tree as a function of of the size of the set of training examples the size of the set of the training examples

Figurrt 4-30 Comparing decision trees for the Congressional bting-84 data learned by C4S amp AQf1f-2

4.10 Analysis of the Results

This section includes an analysis of the results presented in sections 4-2 to 4-9 The analysis

covers the relationship between different characteristics of the input data and the learning

parameters for both subfunctions of the approach A set of visualization diagrams are used to


illustrate the relationship between the concepts represented by decision rules and the concepts represented by the decision trees learned from these rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.

Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5 and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of covers it is higher with some widths of the beam search or with a certain rule type, and lower with others), the best cover is determined according to the best width of the beam search and the best rule type.

Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.

It was clear that AQDT-2 works better with characteristic rules rather than discriminant ones. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersected rules for learning decision trees. Generally, decision trees learned from intersected rules were slightly bigger than those learned from disjoint rules.
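The role of rule disjointness in attribute selection can be illustrated with a small sketch. This is not AQDT-2's exact criterion (Chapter 3 gives the full multi-criterion definition); it is a hypothetical simplification in which each rule lists an explicit value set for every attribute, and each pair of classes is scored by how their value sets for a candidate attribute relate (identical, containment, partial overlap, or disjoint).

```python
def value_sets(rules_by_class, attr):
    """Union of the values attribute `attr` may take in each class's rules.
    Assumes, for simplicity, that every rule constrains `attr`."""
    return {c: set().union(*(r[attr] for r in rs))
            for c, rs in rules_by_class.items()}

def pair_score(vi, vj):
    # One plausible scoring: 3 = disjoint, 2 = partial overlap,
    # 1 = one set contains the other, 0 = identical.
    if not (vi & vj):
        return 3
    if vi == vj:
        return 0
    if vi <= vj or vj <= vi:
        return 1
    return 2

def disjointness(rules_by_class, attr):
    """Sum the pairwise scores over all class pairs; higher means the
    attribute separates the classes' rules more cleanly."""
    vs = value_sets(rules_by_class, attr)
    classes = list(vs)
    return sum(pair_score(vs[a], vs[b])
               for i, a in enumerate(classes) for b in classes[i + 1:])

# Hypothetical rules: two disjoint rules for "pos", one for "neg".
rules = {"pos": [{"x1": {1}}, {"x1": {2}}], "neg": [{"x1": {3}}]}
print(disjointness(rules, "x1"))  # 3
```

Under this scoring, an attribute whose value sets are disjoint across classes (as x1 is above) is the most attractive test, which matches the observation that disjoint rules yield simpler trees.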

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracies of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, the predictive accuracy is considered higher or lower; 2) if the average learning times are within ±0.1 seconds, the learning time is considered the same.
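These tie-band heuristics can be expressed directly in code. The sketch below is an illustrative reconstruction, not part of AQDT-2 itself; the function name and the convention that higher values are better (so learning times are negated before comparison) are assumptions.

```python
def compare_metric(a, c, tol):
    """Summarize one metric per the thesis heuristic: differences within
    `tol` count as a tie ('Same'), with a trailing letter marking which
    system had the slight edge (A = AQDT-2, C = C4.5)."""
    diff = a - c
    if abs(diff) <= tol:
        if diff > 0:
            return "Same-A"
        if diff < 0:
            return "Same-C"
        return "Same"
    return "AQDT-2" if diff > 0 else "C4.5"

# Accuracy: higher is better, tie band = 2 percentage points.
print(compare_metric(95.1, 94.0, 2.0))   # Same-A
# Learning time: lower is better, so negate; tie band = 0.1 s.
print(compare_metric(-0.3, -1.2, 0.1))   # AQDT-2
```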

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparing the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same-X means similar performance of both systems, where AQDT-2 is slightly better if X=A and C4.5 is slightly better if X=C.

Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, while C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of decision trees learned by C4.5 grows relatively faster as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons are that AQDT-2 over-generalizes the decision rules, while C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be much less than that of C4.5. However, on some data sets it takes more time, because there are some


situations where there is not enough information to reach a decision, and the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not implemented yet.

To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparison between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class; the white areas represent non-positive coverage.


Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem.


Figure 4-32 shows the testing results of the decision rules learned by AQ15c. All marked shaded cells indicate false positive errors (AQ15c classifies the cell as positive while it should be negative), and all marked non-shaded cells indicate false negative errors (AQ15c classifies the cell as negative while it should be positive).


Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31.

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, cells with one shading indicate portions of the representation space that were classified as positive by both AQ15c and AQDT-2; cells with a second shading are portions of the representation space that were classified as positive by AQ15c but as negative by AQDT-2; and cells with a third shading represent portions of the representation space where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown by this diagram was learned with default settings (i.e., with a 10% generalization threshold).


This diagram shows the relationship between the input to and the output from AQDT-2. Unlike the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.


Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem.

Figure 4-34 is similar to Figure 4-33, but with illustration of the false positive and false negative errors. Cells shaded with one pattern indicate portions of the representation space with false positive errors; cells shaded with the other pattern represent portions of the representation space with false negative errors. Comparing Figures 4-34 and 4-32, more errors occurred because of the over-generalization.


Figure 4-34: A visualization diagram showing testing errors of the AQDT-2 decision tree.

Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%.

CHAPTER 5: CONCLUSIONS

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.
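The extra expressive power comes from allowing branches to share subtrees. A minimal sketch of such a structure, with hypothetical attributes x1 and x2, might look as follows; note that the node `shared` is reached from two branches, which a strict decision tree would have to duplicate.

```python
class Node:
    """A test node in a decision structure: an attribute to test plus a
    mapping from attribute values to children (Node) or leaf decisions
    (str). Children are plain references, so two branches may point at
    the same subtree, giving an acyclic graph rather than a strict tree."""
    def __init__(self, attr, branches):
        self.attr = attr
        self.branches = branches

def classify(node, example):
    """Follow the tests until a leaf decision (a string) is reached."""
    while isinstance(node, Node):
        node = node.branches[example[node.attr]]
    return node

# Hypothetical example: the subtree `shared` has two incoming branches.
shared = Node("x2", {0: "classA", 1: "classB"})
root = Node("x1", {0: shared, 1: shared, 2: "classB"})
print(classify(root, {"x1": 0, "x2": 1}))  # classB
```

In a single-parent structure (the kind the method produces), each node would have exactly one incoming branch, as in a tree; lifting that restriction yields the full-fledged structures discussed below.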

The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated on-line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology that, in order to determine a decision structure from examples, it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the methods in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is


usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.

The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of


the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time of determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.
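As an illustration of such tailoring, one can discount each candidate attribute's selection score by its measurement cost, so that expensive attributes sink toward the bottom of the structure or drop out when a cheaper test suffices. This is a simplified sketch with hypothetical scores, costs, and attribute names; it is not the actual AQDT-2 criterion, which combines several ranked measures.

```python
def pick_attribute(scores, costs, avoid=()):
    """Rank candidate test attributes for one node of a task-oriented
    structure: skip attributes the situation forbids, and divide each
    informativeness score by its measurement cost (default cost 1.0)
    so that expensive tests are chosen only when nothing cheaper helps."""
    usable = {a: s / costs.get(a, 1.0)
              for a, s in scores.items() if a not in avoid}
    return max(usable, key=usable.get) if usable else None

# Hypothetical medical attributes: x_ray is informative but costly.
scores = {"x_ray": 0.9, "temperature": 0.7, "pulse": 0.6}
costs = {"x_ray": 10.0, "temperature": 1.0, "pulse": 1.0}
print(pick_attribute(scores, costs))                          # temperature
print(pick_attribute(scores, costs, avoid=("temperature",)))  # pulse
```

Repeating this choice at every node, with `avoid` describing the attributes unavailable in the current decision-making situation, yields a structure tailored to that situation.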

Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program in most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of the rule learning step, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.

REFERENCES


Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., DeMarchi, D. and Brancadori, F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.

Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984), "Experience in the use of an inductive system in knowledge engineering," Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.

Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.

Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, from the proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer Verlag.

Imam, I.F. and Michalski, R.S. (1993b), "Learning Decision Trees from Decision Rules: A method and initial results from a comparative study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. & Zemankova, M. (Eds.), Kluwer Academic Pub., MA.

Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.

Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," in the Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.

Imam, I.F. and Michalski, R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," in the Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.

Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi, R. & Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," in the Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.

Michalski, R.S. (1973), "AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.

Michalski, R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.

Michalski, R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).

Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.

Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer Verlag, from the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.

Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.

Niblett, T. and Bratko, I. (1986), "Learning decision rules in noisy domains," Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.

Quinlan, J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan, J.R. (1983), "Learning efficient classification procedures and their application to chess end games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.

Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.

Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).

Quinlan, J.R. (1990), "Probabilistic decision trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI-90, Stockholm, August.

Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.

Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.

Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition.

Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-96). He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interest focuses on the areas of machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.


TABLE OF CONTENTS

TITLE Page

ABSTRACT 1

CHAPTER 1: INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6

CHAPTER 2: RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19

CHAPTER 3: DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
3.3 Generating Decision Structures From Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structures to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53

CHAPTER 4: EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments With Large Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large Size, Complex, and Noisy Problems: Mushroom Classifications 84
4.8 Experiments With Small Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88

CHAPTER 5: CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96

REFERENCES 98

VITA 102

LIST OF TABLES

No. TITLE Page

2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria on four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing a software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using conditions of AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90

LIST OF FIGURES

No. TITLE Page

2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing a software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1% 94

DERIVING TASK-ORIENTED DECISION STRUCTURES FROM DECISION RULES

Ibrahim M. Fahmi Imam, Ph.D.

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge-base from the function of using the knowledge-base for decision-making. The first function focuses on learning accurate, consistent and complete concept descriptions expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that decision structures learned by it usually outperform, in terms of accuracy and average size of the decision structures, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using applications to artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing Democratic and Republican voting records).

CHAPTER 1 INTRODUCTION

1.1 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.

A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation); the branches are assigned possible test outcomes or ranges of outcomes; and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when leaves are assigned single, definite decisions. Thus, the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
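To make the distinction concrete, such a structure can be sketched in a few lines of Python (an illustration added here, not part of the original text; the `Node`, `Leaf`, and `decide` names are invented):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Union

@dataclass
class Leaf:
    # A specific decision, or a set of candidate decisions with probabilities.
    decision: Union[str, Dict[str, float]]

@dataclass
class Node:
    test: Callable[[Any], Any]                     # a test assigned to the node
    branches: Dict[Any, Union["Node", Leaf]] = field(default_factory=dict)

def decide(node: Union[Node, Leaf], obj: Any):
    """Follow the branch matching each test's outcome until a leaf is reached."""
    while isinstance(node, Node):
        node = node.branches[node.test(obj)]
    return node.decision

# A degenerate decision structure that is an ordinary decision tree:
tree = Node(test=lambda o: o["x2"], branches={
    0: Leaf("A1"),
    1: Leaf("A2"),
    2: Node(test=lambda o: o["x1"],
            branches={0: Leaf("A1"), 1: Leaf("A3"), 2: Leaf("A2")}),
})
```

Here `decide(tree, {"x1": 1, "x2": 2})` yields "A3"; a general decision structure would additionally allow a node to have several parents and its leaves to carry probability distributions over candidate decisions.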

Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain and the gain ratio (Quinlan, 1979, 83, 86), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).


A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful either to modify the structure so that it does not contain that attribute or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).

Restructuring a decision structure (or a tree) to suit new requirements can be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation such as a set of decision rules: tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees), which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify and adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.

An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form and transform it into a decision structure when it is needed for decision-making.


This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generation from training examples. Thus, this process could be done online, without any delay noticeable to the user. Such virtual decision structures are easy to tailor to any given decision-making situation.

This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus, such an approach has many potential advantages.

This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15 (Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating "unknown" nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to inability to evaluate an attribute associated with an intermediate node.


To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments was designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structure learned from them, and comparing decision trees learned by the AQDT-2 system with the well-known C4.5 system (Quinlan, 1993) for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2 and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design: wind bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and the Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design: wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.

1.2 The Problem Statement

There are many limitations and problems that accompany the use of decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. The method proposed several attribute selection criteria. These criteria are of increasing power of the main criterion, the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.

Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint; in other words, for any two rules there exists a condition with the same attribute but with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram the condition parts of the given rules, and marking them with the action specified by each rule.

Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has shown that if only one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL introduced in (Michalski, 1978) prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1], and x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1 An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).


The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute to be selected is the one which breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).

Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. The DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.

Example: Learn a decision tree from the following decision table.

The minimal cover consists of the following rules:

A1 <= [x2=0] v [x1=0][x2=2]    A2 <= [x2=1] v [x1=2][x2=2]    A3 <= [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of the decision tree is x2. Then three branches are attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.

Figure 2-2 A decision tree learned from the decision table in Table 2-1
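The static cost estimate in this example can be reproduced with a small sketch (added for illustration, not the dissertation's own code). One plausible reading of "breaks" is used here: a rule is broken by an attribute whenever the rule admits more than one value of that attribute, so the attribute would split or replicate it across branches. The rule encoding and the domains assumed for x3 and x4 are assumptions:

```python
# A rule is a dict mapping attribute -> set of allowed values;
# attributes absent from the dict are unconstrained (whole domain allowed).
def breaks(rule, attr, domains):
    allowed = rule.get(attr, domains[attr])
    return len(allowed) >= 2

def mal(rules, attr, domains):
    """Static (first-degree) cost estimate: the number of rules broken by attr."""
    return sum(breaks(r, attr, domains) for r in rules)

domains = {"x1": {0, 1, 2}, "x2": {0, 1, 2}, "x3": {0, 1}, "x4": {0, 1}}
rules = [  # the minimal cover above: two rules for A1, two for A2, one for A3
    {"x2": {0}}, {"x1": {0}, "x2": {2}},
    {"x2": {1}}, {"x1": {2}, "x2": {2}},
    {"x1": {1}, "x2": {2}},
]
scores = {a: mal(rules, a, domains) for a in domains}
# scores == {"x1": 2, "x2": 0, "x3": 5, "x4": 5}; x2 is selected as the root
```

Under this reading, x3 and x4 appear in no rule, so each of the five rules would be replicated under them, giving the cost 5 reported in the text.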

2.2 Learning Decision Trees from Examples

Decision tree learning is a field concerned with generating a decision tree that classifies a set of examples according to the decision classes they belong to. The essential aspect of any inductive decision tree method is the attribute selection criterion. The attribute selection criterion measures how good the attributes are for discriminating among the given set of decision classes. The best attribute according to the selection criterion is chosen to be assigned to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has been subsequently modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.

Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria for selecting attributes use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory. These criteria measure the information conveyed by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure and the gain criteria (Quinlan, 1979, 83), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions for determining whether or not there is a correlation. The attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial complete decision tree, and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring the probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.

2.2.1 Building Decision Trees Using Information-based Criteria


This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs for learning decision trees from examples.

Learning decision trees from examples requires a collection of examples, each represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing a decision tree from a set of cases (Hunt, Marin & Stone, 1966).

The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the Gain Criterion. The Gain Criterion uses the frequency of each decision class in the given set of training examples.

Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and classifies the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.

The Gain Criterion: The gain criterion is based on information theory; that is, the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem x1, ..., xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:


freq(Ci, S) = number of examples in S that belong to Ci    (2-1)

Suppose that |S| is the total number of examples in S; the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.

The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2(freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by

info(S) = - Σ_{i=1}^{k} (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|) bits    (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.

Suppose that we select an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, infoX(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:

infoX(T) = Σ_{i=1}^{k} (|Ti| / |T|) info(Ti)    (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by

gain(X) = info(T) - infoX(T)    (2-4)

The attribute to be selected is the attribute with the maximum gain value.

The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains attributes such as a social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information that is gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.

Following steps similar to those used to obtain the information conveyed by dividing T into n subsets, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by

split info(T) = - Σ_{i=1}^{n} (|Ti| / |T|) log2(|Ti| / |T|)    (2-5)

The gain ratio is given by

gain ratio(X) = gain(X) / split info(X)    (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.

Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the set of training examples.

First, determine the amount of information gained by selecting the attribute "outlook" to be the root of the decision tree. This attribute divides the training examples into three subsets: "sunny", with five examples, two of which belong to the class "Play"; "overcast", with four examples, all of which belong to the class "Play"; and "rain", with five examples, three of which belong to the class "Play". To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class "Play" and five belong to the class "Don't Play".

info(T) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.940 bits

When using outlook to divide the training examples the information becomes

15

infO(T) = 514 -25 log2 (25) - 35 10g2 (35)

+ 414 -44 10g2 (44) - 04 log2 (04)

+ 514 -35 logl (35) - 25 logl (25) = 0694 bits

By substituting in equation 2-4 the gain of information results from using the attribute

outlook to split the training examples equal to 0246 The gain information for windy is

0048

Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for "outlook" is determined as follows:

split info(T) = - 5/14 log2(5/14) - 4/14 log2(4/14) - 5/14 log2(5/14) = 1.577 bits

The gain ratio for "outlook" = 0.246 / 1.577 = 0.156
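The hand calculation above can be checked with a short script (an added sketch; the last digits differ slightly from the text because the text rounds intermediate values):

```python
from math import log2

def entropy(counts):
    """info(S) in bits (equation 2-2), given the class counts of a set."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# The 14 weather examples: 9 "Play", 5 "Don't Play".
info_T = entropy([9, 5])                                   # ~ 0.940 bits

# Partition by outlook: sunny (2 Play / 3 Don't), overcast (4/0), rain (3/2).
subsets = [[2, 3], [4, 0], [3, 2]]
info_X = sum(sum(s) / 14 * entropy(s) for s in subsets)    # ~ 0.694 bits
gain = info_T - info_X                                     # ~ 0.247

# Split info is just the entropy of the subset sizes (equation 2-5).
split_info = entropy([5, 4, 5])                            # ~ 1.577 bits
gain_ratio = gain / split_info                             # ~ 0.156
```

Note that equation 2-5 has the same form as equation 2-2, which is why `entropy` over the subset sizes computes the split information directly.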


Figure 2-3 A decision tree learned using the gain criterion for selecting attributes

The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.

Tree pruning in C4.5 is a process of replacing subtrees of small classification validity by leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses the Chi-square statistic to measure the association between two attributes. When building decision trees, the method is implemented such that it determines the association between each attribute and the decision classes. The attribute to be selected is the one with the greatest value.

To determine the Chi-square value for an attribute, let aij be the number of examples in class number i where the attribute A takes value number j. In other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by

Chi-square(A) = Σ_{i=1}^{n} Σ_{j=1}^{m} [ (aij - Eij)² / Eij ]    (2-7)

where n is the number of decision classes and m is the number of values of a given attribute. Also,

Eij = (TCi × TVj) / T    (2-8)

where TCi and TVj are the total number of examples belonging to decision class Ci and the total number of examples where the attribute A takes value vj, respectively, and T is the total number of examples.

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequency of different combinations of values between the decision class and both the "outlook" and the "windy" attributes. Table 2-4 shows the expected values, computed from TCi and TVj, of the frequencies in Table 2-3 for the different attribute values and decision classes.

To determine the association between the decision classes and both the attribute "windy" and the attribute "outlook", the observed Chi-square values are:

Chi-square(Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/5.1] + [(2-2.9)²/2.9]
= 0.21 + 0.39 + 0.16 + 0.28 = 1.04

Chi-square(Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2] + [(3-1.8)²/1.8] + [(0-1.4)²/1.4] + [(2-1.8)²/1.8]
= 0.45 + 0.75 + 0.01 + 0.80 + 1.40 + 0.02 = 3.43
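The same statistic can be computed directly from the contingency tables (an added sketch; expected frequencies are kept exact here, so the totals come out slightly different from the hand calculation, which rounds each Eij to one decimal):

```python
def chi_square(table):
    """Chi-square association from a contingency table
    (rows = decision classes, columns = attribute values), per equation 2-7."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(row_tot)
    chi2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            e = row_tot[i] * col_tot[j] / total   # Eij = (TCi * TVj) / T
            chi2 += (table[i][j] - e) ** 2 / e
    return chi2

windy   = [[3, 6], [3, 2]]        # rows: Play, Don't Play; cols: true, false
outlook = [[2, 4, 3], [3, 0, 2]]  # cols: sunny, overcast, rain
# chi_square(windy) is about 0.93 and chi_square(outlook) about 3.55,
# so Outlook is still the preferred attribute.
```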


Applying the same method to the other attributes, the results favor the attribute "outlook". Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset.

Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5 Attribute selection criteria and their basic evaluation measure

Info Measure (IM), Gain, G-statistic, and Gain Ratio:
Entropy(S) = - Σ_i (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)

G-statistic = 2N × IM    (N = number of examples)

Chi-square(A, B) = Σ_{i=1}^{n} Σ_{j=1}^{m} [ (aij - Eij)² / Eij ]

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria done by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, the G statistic, the Gini index of diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, a zero cell adds the maximum association between any two attributes, because the Chi-square contribution of a zero cell is the expected value of that cell.


Now let us examine results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees for eleven different criteria. In the final results, he compared the total number of nodes and the total error rate provided by each criterion over all given problems. Table 2-8 shows the final results for five selected criteria only.


Table 2-8 Results comparing the total accuracy and size of decision trees for different attribute selection criteria from four domains

This experiment was performed on four real-world data sets, concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details, see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas of Imam and Michalski (1993b).

In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary conclusion with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a


new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes containing only rules represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees that are used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is Safe, except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is Lost, except if x6=1 it is Safe, except if x7=1 it is Lost.

The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph in which each attribute occurs at most once along any computational path. In other words, for each path from the root to the leaves of the decision structure, an attribute may occur as a node once at most. However, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets, such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.


Safe <= [x1=2]
Safe <= [x2=2]
Safe <= [x3=2]
Safe <= [x4=1] & [x5=2]
Safe <= [x4=1] & [x5=3]
Safe <= [x6=1] & [x7=2]
Safe <= [x6=1] & [x7=3]
Safe <= [x4=2] & [x5=2]
Safe <= [x4=2] & [x5=3]

Lost <= [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a combination of that attribute's values. For each subset the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains examples where A takes value 0 and belong to class C0, or takes value 1 and belong to class C1. The second subset of examples is the set where A takes value 0 and belong to class C1, or takes value 1 and belong to class C0. The number of nodes of the first level (after the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced exponentially to one.
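The level-construction step described above can be sketched in code. The following is an illustrative reconstruction of the idea only (grouping the projected examples by the value-to-class function they induce), not Kohavi's actual HOODG implementation; the example format is an assumption.

```python
from collections import defaultdict

def build_level(examples, attr):
    """One bottom-up step of oblivious decision-graph construction (sketch).

    `examples` is a list of (attribute_dict, cls). After removing `attr`,
    examples that agree on all remaining attributes induce a function from
    attr-values to classes; each distinct function becomes a node of the
    new level, so at most k**n nodes can appear (k classes, n values).
    """
    # Map: remaining-attribute assignment -> {value of attr: class}
    funcs = defaultdict(dict)
    for attrs, cls in examples:
        rest = frozenset((a, v) for a, v in attrs.items() if a != attr)
        funcs[rest][attrs[attr]] = cls
    # Group assignments that induce the same value-to-class function.
    nodes = defaultdict(list)
    for rest, mapping in funcs.items():
        nodes[frozenset(mapping.items())].append(rest)
    return nodes
```

With an XOR-like data set over attributes A and B, the two B-assignments induce two distinct functions of A, so the new level has two nodes.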

It is easy for the reader to figure out some major disadvantages of such an approach, including the following. The average size of such decision structures is estimated to be very large, especially when there


is no similarity (i.e., strong patterns) or logical relationship in the data. The time used to learn such a decision structure is relatively high compared to systems for learning decision trees from examples. And finally, it could be better to search for an attribute that reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system, which can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection, which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and these two approaches. The EDAG and HOODG systems are unreleased prototype systems.

[Table 2-9: comparison of the three approaches; e.g., decision structures produced by the proposed approach and by HOODG are easy to understand, while EDAG structures are difficult to read]

CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.

The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given:
- A set of decision rules in a conjunctive form.
- A description of the new decision-making situation (e.g., attributes' costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of


declarative knowledge are that they do not impose any order on the evaluation of the attributes, and, due to the lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on-line.

Such virtual decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized, and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows an architecture of the proposed methodology.

[Figure 3-1: Architecture of the AQDT approach, showing two components: learning knowledge from the database, and the decision-making process]


It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values. Some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system, specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called AQ, starts with a seed example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the star of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description that covers the largest number of positive examples (to minimize the total number of rules needed) and, with the second priority, that involves the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few examples or with many examples, and can optimize the description according to a variety of easily-modifiable hypothesis quality criteria.


The learned descriptions are represented in the form of a set of decision rules expressed in an attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski, 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, i.e., stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top." A characteristic description of the tables would also include properties such as "have four legs," "have no back," "have four corners," etc. Discriminant descriptions are usually much shorter than characteristic descriptions.

Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different classes are logically disjoint. The DC mode descriptions are usually more complex, both in the number of rules and the number of conditions. There is also a DL mode (a Decision List mode,


also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order. If ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes and selects from them the most promising ones, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the US Congress. Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute State (of the Representative) should take the value northeast or northwest to satisfy the condition.

R1: [Gas_conc_ban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"

The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record of a Democratic Representative:

Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State from = northeast, State population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler Corp. = not registered

By expressing elementary statements in the example as conditions, and linking the conditions by conjunction, the example can be re-expressed as a decision rule. Thus, decision rules and examples formally differ only in the degree of generality.
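The way a rule with internal disjunction covers an example can be illustrated in code. The rule format below (a mapping from attributes to sets of allowed values) and the attribute names are simplifications chosen for illustration, not AQ15's actual syntax:

```python
def matches(rule, example):
    """True if the example satisfies every condition of the rule.

    A condition with internal disjunction, e.g. [State = northeast v
    northwest], is modeled as a set of allowed values for that attribute.
    """
    return all(example.get(attr) in allowed for attr, allowed in rule.items())

def classify(rulesets, example):
    """Return the classes whose ruleset covers the example.

    In IC mode more than one class may match; in DC mode at most one.
    """
    return sorted(cls for cls, rules in rulesets.items()
                  if any(matches(rule, example) for rule in rules))
```

An example (a fully specified attribute-value vector) is just a maximally specific rule, which is the point made above about generality.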

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski, 1993a, b). Also, a description of the AQDT-2 method for learning task-oriented decision structures from decision rules is included, and finally the methodology is illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning, due to their simplicity. Decision trees built this way can be quite efficient, as long as they are used in decision-making situations for which they are optimized and these situations remain relatively stable. Problems arise when these situations significantly change and the assumptions under which the tree was built do not hold anymore. For example, in some situations it may be difficult to determine the value of the attribute assigned to some node. One would like to avoid measuring this attribute and still be able to classify the example, if this is potentially possible (Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also desirable if there is a significant change in the frequency of occurrence of examples from different classes. A restructuring of a decision tree to suit the above requirements is, however, difficult to do. The reason for this is that a decision tree is a form of decision structure representation that imposes constraints on the evaluation order of the attributes that are not logically necessary.


One problem in developing a method for generating decision structures from decision rules is to design an attribute selection criterion that is based on the properties of the rules rather than of the training examples. A decision rule normally describes a number of possible examples. Only some of them are examples that have actually been observed, i.e., training examples. An attribute selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based on counting the numbers of training examples covered by each attribute-value and the frequency of decision classes in the training examples, as is done in learning decision trees from examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision rules constitute a more powerful knowledge representation than decision trees. They can directly represent a description in an arbitrary disjunctive normal form, while decision trees can directly represent only descriptions in the disjoint disjunctive normal form. In such descriptions, all conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary decision rules into a decision tree, one faces an additional problem of handling logically intersecting rules.

The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2 system is based on the earlier work by Michalski (1978), which introduced a general method for generating decision trees from decision rules. The method aimed at producing decision trees with the minimum number of nodes or the minimum cost (where the cost was defined as the total cost of classifying unknown examples, given the cost of measuring individual attributes and the expected probability distribution of examples of different decision classes). More explanations are provided in the following section.

3.3.1 The AQDT-2 attribute selection method

This section describes the AQDT-2 method for building a decision structure from decision rules. The method for building a single-parent decision structure is similar to that used in standard


methods of building a decision tree from examples. The major difference is that it assigns tests (attributes) to the nodes using criteria based on the properties of the decision rules (this includes statistics about the examples covered by each rule, in the case of learning rules from examples), rather than statistics characterizing the frequency of training examples per decision class, per attribute-value, or per conjunctions of both. Other differences are that the branches may be assigned an internal disjunction of values (not only a single value, as in a typical decision tree), and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be attributes or names standing for logical or mathematical expressions that involve several attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably (to distinguish between an attribute and a name standing for an expression, the latter is called a constructed attribute).

At each step, the method chooses from the available set of tests the test that has the highest utility (see below) for the given set of decision rules. This test is assigned to the node. The branches stemming from this node are assigned test values or disjoint groups of values (in the form of a logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each branch is associated with a reduced set of rules, determined by removing conditions in which the selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset indicate the same decision class, a leaf node is created and assigned this decision class. The process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further, because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of candidate decisions with associated probabilities (see Sec. 4.2).
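The recursive construction just described can be outlined in code. This is a simplified sketch, not the AQDT-2 implementation: the attribute-ranking step is a stand-in (the count of rules mentioning each attribute) rather than the full utility criteria defined below, branches carry single values rather than grouped disjunctions, and probabilistic leaves are omitted.

```python
def build_structure(rules, attrs):
    """Recursive skeleton: pick a test, branch on its values, reduce the
    ruleset on each branch, and stop when one decision class remains.

    Each rule is (conditions, cls), with conditions = {attr: set_of_values}.
    """
    classes = {cls for _, cls in rules}
    if len(classes) == 1:
        return classes.pop()                     # leaf node
    if not attrs:
        return sorted(classes)                   # ambiguous leaf
    # Stand-in utility: how many rules mention the attribute.
    test = max(attrs, key=lambda a: sum(a in cond for cond, _ in rules))
    values = set().union(*(cond.get(test, set()) for cond, _ in rules))
    node = {}
    for v in sorted(values):
        # Keep rules consistent with test = v, dropping the satisfied condition.
        reduced = [({a: s for a, s in cond.items() if a != test}, cls)
                   for cond, cls in rules
                   if test not in cond or v in cond[test]]
        node[v] = build_structure(reduced, attrs - {test})
    return (test, node)
```

Rules that do not mention the selected test propagate unchanged to every branch, which is how logically intersecting rules are handled without duplicating training data.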

The test (attribute) utility is a combination of one or more of the following elementary criteria: 1) cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which captures the effectiveness of the test in discriminating among decision rules for different decision classes; 3) importance, which determines the importance of a test in the rules; 4) value distribution, which characterizes the distribution of the test importance over its set of values; and 5) dominance, which measures the test presence in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, that is, the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and decision rulesets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote the sets of the values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm, respectively. If a ruleset for some class, say Ck, contains a rule that does not involve test A, then Vk is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by:

D(A, Ci, Cj) = 0, if Vi = Vj;
               1, if Vi ⊂ Vj or Vi ⊃ Vj;
               2, if Vi ∩ Vj ≠ ∅ and Vi ∩ Vj ≠ Vi and Vi ∩ Vj ≠ Vj;
               3, if Vi ∩ Vj = ∅.        (3-1)

where ∅ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) may seem to yield an improved criterion. However, it would not clearly distinguish between the two cases (i.e., for both situations the disjointness would be similar). The current equation is better because it gives higher scores to attributes that classify different subsets of the two decision classes than to attributes that classify only a subset of one decision class.

Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the sum of the degrees of class disjointness of each decision class:


Disjointness(A) = Σ(i=1..m) D(A, Ci),  where  D(A, Ci) = Σ(j=1..m, j≠i) D(A, Ci, Cj)        (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the attribute to be selected is the one with the smaller number of values.
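Equations (3-1) and (3-2) translate directly into code. The following is a minimal illustrative sketch, with each class's value set Vi represented as a Python set (not the AQDT-2 implementation):

```python
def degree(vi, vj):
    """Degree of disjointness D(A, Ci, Cj) between two value sets, eq. (3-1)."""
    if vi == vj:
        return 0
    if vi < vj or vi > vj:       # proper subset or superset
        return 1
    if vi & vj:                  # overlap, but neither contains the other
        return 2
    return 3                     # disjoint value sets

def disjointness(value_sets):
    """Eq. (3-2): sum of pairwise degrees over all ordered class pairs.

    `value_sets` maps each class to the set of values of test A that
    appear in its ruleset.
    """
    return sum(degree(vi, vj)
               for ci, vi in value_sets.items()
               for cj, vj in value_sets.items() if ci != cj)
```

For m = 2 classes with disjoint value sets, the score is 3 + 3 = 6, which is the maximum 3m(m-1) stated above.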

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined from the root of the tree to any leaf node in order to reach a decision.
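For a tree represented as nested (test, branches) pairs, ANT can be computed as the mean root-to-leaf path length. A small illustrative sketch (the tree encoding and the uniform weighting over leaves are assumptions):

```python
def ant(tree):
    """Average Number of Tests: mean path length from the root to each leaf.

    A tree is either a leaf (a class label) or a pair (test, branches),
    where branches maps each test value to a subtree.
    """
    def path_lengths(node, depth):
        if not isinstance(node, tuple):          # leaf node reached
            return [depth]
        _test, branches = node
        return [d for sub in branches.values()
                for d in path_lengths(sub, depth + 1)]
    lengths = path_lengths(tree, 0)
    return sum(lengths) / len(lengths)
```

The first case of Theorem 1 below (one leaf at depth 1 plus two leaves at depth 2) gives ANT = (1 + 2 + 2) / 3 = 5/3.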

Definition: A decision structure is a one-node-per-level decision structure if at each level there is only one node and zero or more leaves.

Such a decision structure can be generated by combining together all branches whose associated sets of decision rules belong to more than one decision class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not a subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the same set of values in both classes is a trivial one). Assume that branches leading to subsets with the same decision class are combined into one branch. In the first case, there will be two branches only. The first leads to a leaf node, and the other leads to an intermediate node where another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three branches should be created. Two branches lead to leaf nodes, where all values at each branch belong to only one, and a different, decision class. The third branch leads to an intermediate node


where another attribute should be selected to further classify the decision classes. The minimum ANT in this case is 6/4 = 3/2. In the third case, only two branches will be generated, where each leads to a leaf node with a different decision class. In this case, the minimum ANT is 1.

Figure 3-4 shows decision trees equivalent to selecting the corresponding attribute. Note that, in case more than one attribute-value at a branch leads to leaves belonging to one decision class, they will be combined into one branch in the decision structure. The symbol "1" means that an attribute is needed to classify the two decision classes; in such cases there will be at least two additional paths.

D(A, Ci) = 0, D(A, Cj) = 1;   D(A, Ci) = 2, D(A, Cj) = 2;   D(A, Ci) = 3, D(A, Cj) = 3

Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes

The average number of tests required for making a decision in each possible case is determined in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks highly the attributes that reduce the average number of tests required for decision-making. The theorem can be proved similarly in the general case.

[Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3; ANT = 3/2, ANT = 5/3, ANT = 1. The symbol "1" means at least one attribute is needed to complete the decision tree.]


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes, A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, this means that there are more decision classes where D(A, Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj) than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute are 0, 2, 4, or 6. For all positive values of D(B) = 2, 4, or 6, it is clear that attribute B should have a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A. Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is a better classifier than A.

Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples that are covered by the rules involving this test. Decision rules learned by an AQ learning program are accompanied by information on their strength. Rule strength is characterized by t-weight and u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the total-weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes C1, ..., Cm, involving n tests A1, ..., An, where the number of rules associated with class Ci is denoted by ri, the importance score is defined as follows:


Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

IS(Aj) = Σ(i=1..m) IS(Aj, Ci)        (3-3.1)

where

IS(Aj, Ci) = Σ(k=1..ri) Rik(Aj)        (3-3.2)

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by:

Rik(Aj) = t-weight, if Aj belongs to rule Rik; 0, otherwise        (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
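Equations (3-3) and (3-4) amount to summing the t-weights of the rules that mention each test. An illustrative sketch with a simplified rule format (the set of attributes in the condition part, the class, and the t-weight); the format is an assumption, not AQ15's output syntax:

```python
def importance_scores(rules):
    """IS(Aj): sum of t-weights of all rules whose condition mentions Aj.

    Each rule is a triple (attrs_in_condition, cls, t_weight).
    """
    scores = {}
    for attrs, _cls, t_weight in rules:
        for attr in attrs:
            scores[attr] = scores.get(attr, 0) + t_weight
    return scores
```

Because t-weights already summarize example coverage, this statistic can be computed from the rules alone, without access to the training examples.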

The importance score method has been separately compared, as a feature selection method, with a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method produced an equal or higher accuracy on three real-world problems than that reported for the GA method, while selecting fewer attributes. In addition, the IS method was significantly faster than the GA.

Value distribution: The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: A value distribution, VD(Aj), of a test Aj is defined by:

VD(Aj) = IS(Aj) / vj        (3-5)

where vj is the number of legal values of Aj.

Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore,


for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is multiplied out to two rules, with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
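The multiplying-out step can be sketched as a Cartesian product over the value sets of a rule's conditions. This is an illustration of the counting convention, with the same set-based rule format assumed earlier:

```python
from itertools import product

def multiply_out(rule):
    """Expand a rule with internal disjunction into plain conjunctions.

    [x3=1 v 3] & [x4=1] expands to [x3=1]&[x4=1] and [x3=3]&[x4=1].
    """
    attrs = sorted(rule)
    return [dict(zip(attrs, combo))
            for combo in product(*(sorted(rule[a]) for a in attrs))]

def dominance(rules, attr):
    """Count multiplied-out rules whose condition part mentions `attr`."""
    return sum(len(multiply_out(rule)) for rule in rules if attr in rule)
```

A rule with internal disjunctions of sizes d1, ..., dk thus counts as d1 * ... * dk plain rules, which is what makes the count reflect the test's real presence.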

The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed in percent. The criteria are applied to tests in the order defined by LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (from the top value). The default LEF is:

<Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>        (3-6)

where t1, t2, t3, t4, and t5 are tolerance thresholds (in percentages); their default values are 0. The default value of the cost of each test is 1.

The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the second (importance) criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the third criterion (normalized IS) is used, and then similarly the fourth criterion (dominance). If there is still a tie, the method selects the best attribute randomly.
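The cascading filter described above can be sketched as follows (a hedged illustration, not the actual AQDT-2 code; criteria are assumed to be scoring functions, and a tolerance is taken as a fraction of the top score):

```python
def lef_rank(attributes, criteria):
    """Select the best attribute with a lexicographic evaluation
    functional with tolerances (LEF).

    criteria: list of (score_fn, tolerance, maximize) tuples, applied
    in order; an attribute survives a criterion if its score is within
    `tolerance` (a fraction) of the best score on that criterion."""
    candidates = list(attributes)
    for score_fn, tol, maximize in criteria:
        scores = {a: score_fn(a) for a in candidates}
        best = max(scores.values()) if maximize else min(scores.values())
        margin = abs(best) * tol
        if maximize:
            candidates = [a for a in candidates if scores[a] >= best - margin]
        else:
            candidates = [a for a in candidates if scores[a] <= best + margin]
        if len(candidates) == 1:
            break
    return candidates[0]  # ties after the last criterion: take any survivor

# Example: cost is minimized first (tolerance 0), then disjointness maximized.
cost = {"x1": 1, "x2": 1, "x3": 2}
disjointness = {"x1": 11, "x2": 9, "x3": 12}
best = lef_rank(["x1", "x2", "x3"],
                [(cost.get, 0.0, False), (disjointness.get, 0.0, True)])
print(best)  # x1
```

Here x3 is eliminated by the cost criterion despite its higher disjointness, which is exactly the behavior the tolerances are meant to control.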

If there is a non-uniform frequency distribution of examples of different classes, then the selection criterion uses a modified definition of the disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified to a given class:

Disjointness(A) = Σi=1..m D(A, Ci) * Frq(Ci)    (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, and it is assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF

<Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where the Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.

3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting at each step the best test according to the ranking criteria described above and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is also connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, the number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first in descending order of the number of rules that contain the attribute, and second in ascending order of the number of the attribute's legal values.

The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever it leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.

To generate decision structures from rules, the AQDT-2 method prefers disjoint rule descriptions, either characteristic or discriminant (given by an expert or learned by a system). Disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The AQDT algorithm is:

The AQDT-2 Algorithm

Step 1. Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent this attribute.

Step 2. Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3. For each branch, associate with it a group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing condition [A = i v ...]. If a branch is assigned values i v j, then associate with it all rules containing condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] ≡ [x=1]&[y=a] v [x=1]&[y=b], assuming that a and b are the only legal values of y.) All rules associated with the given branch constitute the ruleset context for this branch.

Step 4. If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree have leaf nodes, stop. Otherwise, repeat steps 1 to 4 for each branch that has no leaf.

To select an attribute to be a node in the decision tree (steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF function. The second iteration evaluates each attribute's disjointness for each decision class against the other decision classes.


To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

r = Σi=1..m Ri    (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as

Cmpx(Iter1) = O(r * s)

In the second iteration, the disjointness is calculated between the decision classes and all attributes. The complexity of the second iteration can be given by

Cmpx(Iter2) = O(n * m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

l = max{m, r}    (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node in the decision tree, say the node complexity NC(AQDT), is given by

NC(AQDT) = O(l * n)

Usually l equals the number of rules associated with the given node. Thus the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The level complexity of the AQDT algorithm, LC(AQDT), can be given by

LC(AQDT) < O(l * n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree. However, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm to be (l * s * q), where q is the number of non-leaf nodes at the given level. In such cases, either (l * q ≤ r) or (l * s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by

LC(AQDT) = O(2 * s * r/2) < O(n * l) = NC(AQDT)

Figure 3-5. Decision trees showing the maximum number of non-leaf nodes: a) per one level; b) per one path.

Note also that after an attribute is selected to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structure of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch are not tested again.

Since the disjointness criterion selects the attribute which minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels per decision tree should be less than or equal to the minimum of the number of attributes and the number of rules. Let k be the number of levels in a given decision tree:

k ≤ min{n, r}    (3-10)

Two cases represent the most complex situations, Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels will be a function of the logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by

Complexity(AQDT) = O(l * n * log r)    (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels per decision tree equals one less than the number of decision rules, Figure 3-5-b. Using the disjointness criterion, it is not likely to get such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In this case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus, the level complexity for this decision tree is estimated as

LC(AQDT) = O(l * log n)

The maximum number of levels in such a decision tree is k−1. Thus, the complexity of the AQDT algorithm in such cases is given by

Complexity(AQDT) = O(l * k * log n)    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT algorithm is determined by

Cmplx(AQDT) = O(l * k * log r)    (3-13)

3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool (automated, semi-automated or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1. The available tools and the factors that affect the process.

T1 <= [x1=2]&[x2=2] v [x1=3]&[x3=1 v 3]&[x4=1]
T2 <= [x1=1 v 2]&[x2=3 v 4] v [x1=3]&[x3=1 v 2]&[x4=2]
T3 <= [x1=1]&[x2=1] v [x1=4]&[x3=2 v 3]&[x4=3]

Figure 3-6. Decision rules for selecting the best tool for testing software.

These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phases, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phases, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phases, and you need a semi-automated tool.


Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.

From the rules in Figure 3-6 we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: 1) determine, for each attribute, the sets of values that the attribute takes in the individual decision rules, and 2) remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). Value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case, branches are assigned value sets {1}, {2}, and {3, 4}.
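This pruning of subsuming value sets can be sketched as follows (a minimal illustration; the value sets are those of x1 and x2 in the rules of Figure 3-6):

```python
def disjoint_value_sets(value_sets):
    """Remove value sets that subsume (are strict supersets of) other
    value sets; the survivors label the branches in compact mode."""
    unique = [set(v) for v in {frozenset(v) for v in value_sets}]
    return sorted(
        (s for s in unique
         if not any(other < s for other in unique)),  # drop strict supersets
        key=sorted)

# Value sets of x1 and x2 appearing in the rules of Figure 3-6:
print(disjoint_value_sets([{2}, {3}, {1, 2}, {1}, {4}]))
# [{1}, {2}, {3}, {4}]
print(disjoint_value_sets([{1}, {2}, {3, 4}, {1, 2, 3, 4}]))
# [{1}, {2}, {3, 4}]
```

For x1 the survivors are all singletons, so branches get individual values; for x2 the set {3, 4} survives and becomes an "or" branch.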


Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 is ended by a leaf T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools can be used for testing a given software system.

Figure 3-7. A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves).

Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3 and 4). Rules are represented by collections of cells in the intersection of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules. Rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2]&[x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2]&[x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1]&[x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4]&[x3=2 v 3]&[x4=3].

Figure 3-8. a) Decision rules; b) derived decision tree.

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, in which the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.


Figure 3-9. Decision trees learned ignoring the support metric (a) and the type of the testing tool (b).

It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm then selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10. A decision tree learned without the cost attribute.

3.4 Tailoring Decision Structures to the Decision-Making Situation

Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm has been developed and implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.

Decision-making situations can vary in several respects. In some situations complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.


3.4.1 Learning Cost-Dependent Decision Structures

As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test's cost is the first criterion and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation, involving the other elementary criteria. If an attribute has high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute, if possible.

3.4.2 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained, but a decision has to be made, it is useful to know the probability distribution over the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequencies at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have

P(Ci | b1, ..., bk) = P(Ci) * P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from different classes, we have


P(Ci) = twi / Σj=1..m twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σj=1..m wj / Σj=1..m twj    (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain

P(Ci | b1, ..., bk) = wi / Σj=1..m wj    (3-13)
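Formula (3-13) reduces to the relative class frequencies among the training examples that reach the node, which can be sketched as follows (a minimal illustration; `counts` maps each class to wi, the number of its training examples that passed the tests leading to the node):

```python
def class_distribution(counts):
    """Estimate P(Ci | b1, ..., bk) at a node as wi / sum_j wj,
    following formula (3-13)."""
    total = sum(counts.values())
    return {cls: w / total for cls, w in counts.items()}

# e.g. 6 training examples of T1 and 2 of T2 reached this node:
print(class_distribution({"T1": 6, "T2": 2}))  # {'T1': 0.75, 'T2': 0.25}
```

The most probable decision at an indeterminate leaf is then the class with the largest estimated probability.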

A related method for handling the problem of the unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

3.4.3 Coping with Noise in Training Data

The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than decision tree pruning methods, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
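The truncation step described above amounts to a simple filter (a hedged sketch, not the AQDT-2 implementation; rules are assumed to carry a t-weight, i.e., the number of training examples they cover):

```python
def truncate_rules(rules, min_t_weight):
    """Drop rules whose t-weight (number of covered training examples)
    falls below the threshold reflecting the expected noise level."""
    return [r for r in rules if r["t_weight"] >= min_t_weight]

rules = [{"cls": "P", "conds": "[x1=1]", "t_weight": 14},
         {"cls": "N", "conds": "[x1=2]", "t_weight": 1}]  # likely noise
print(truncate_rules(rules, min_t_weight=2))
# keeps only the rule with t-weight 14
```

Because the filter runs before the decision structure is built, any attribute mentioned only by truncated rules simply never becomes a candidate node, which is the freedom that subtree-based pruning lacks.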


3.5 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes. The problem has ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity ≤ 75]
Play <= [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute to be the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all the other criteria did.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for a class C, all ambiguous examples belonging to C are considered as examples that belong to C only.

The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion would perform well when evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria were applied to the learned rules, they provided very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the most balanced appearance of its values in different rules. The dominance criterion prefers attributes that appear in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.

3.6 Decision Structures vs. Decision Trees

This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.


Table 3-7. A comparison between decision structures and decision trees.

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11, which are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes; it is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.


Figure 3-11. Decision structures learned by AQDT-2 using different criteria (P = Positive, N = Negative): a) using the disjointness criterion (5 nodes); b) using the importance score criterion (7 nodes, 9 leaves).

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the Imam's example, that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example rests on the fact that the information-based criteria rely on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class P, and "−" means the example belongs to class N. Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

Figure 3-12. The Imam's example: a) training examples; b) the optimal decision tree. An example where learning decision structures (trees) from rules is better than learning them from examples.

AQ15c learned the following rules from this data:

P <= [x1=1][x2=1] v [x1=2][x2=2]
N <= [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.

An example of problems in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <= [x1=2][x2=2]
N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10 to 9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and .85 for the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined using the new attribute "x1&x2=2", with the value 0 for "no" and 1 for "yes".
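A minimal sketch of that constructed attribute, assuming (per the rules above) that it tests whether both x1 and x2 equal 2; the function names are illustrative, not taken from the programs:

```python
def x1_and_x2_eq_2(example):
    # constructed attribute: 1 ("yes") if x1 = 2 and x2 = 2, else 0 ("no")
    return 1 if example['x1'] == 2 and example['x2'] == 2 else 0

def classify(example):
    # the three-node tree: one test on the constructed attribute, two leaves
    return 'P' if x1_and_x2_eq_2(example) == 1 else 'N'

print(classify({'x1': 2, 'x2': 2}))  # P
print(classify({'x1': 1, 'x2': 3}))  # N
```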

Figure 3-13. An example where decision rules are simpler than decision trees: (a) the training data; (b) the correct decision tree.

CHAPTER 4 Empirical Analysis and Comparative Study

This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes); MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
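The sampling protocol just described can be sketched as follows (a simplified rendering; the function name and seed are illustrative, and the actual experiments used 100 samples per size):

```python
import random

def learning_curve_splits(n_examples, fractions, runs=100, seed=1):
    # For each relative size, draw `runs` random training samples;
    # the complement of each sample is the corresponding testing set.
    rng = random.Random(seed)
    for frac in fractions:
        k = int(n_examples * frac)
        for _ in range(runs):
            train = set(rng.sample(range(n_examples), k))
            test = [i for i in range(n_examples) if i not in train]
            yield frac, sorted(train), test

# e.g. a 10% split of a 335-example dataset (the wind bracing problem size)
splits = list(learning_curve_splits(335, [0.1], runs=2))
frac, train, test = splits[0]
print(len(train), len(test))  # 33 302
```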


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting, and the East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments performed on the first set of problems. The best settings (best path from top to bottom) in terms of accuracy, time, and complexity were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.

Figure 4-1. Design of a complete experiment.


For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. Analysis of some experiments included visualization of the training examples and the concepts learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database), 9 different relative sample sizes of training examples were selected (10%, ..., 90%). 100 random samples of each size were drawn from the original data for training; the 100 complementary sets, which remain from the original data after drawing the training data, were used for testing (900 samples for training and their 900 complementary samples for testing).

- 162 different parametrical experiments per training dataset (18 x 9)
- 16,200 experiments per sample size (9 samples)
- 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2)
- 199,800 experiments per problem (first portion + C4.5 + constructive induction)
- 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
- 73 days (estimated running time)

The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsequent subsection describes a partial or full experimental analysis of one of the other problems.

4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples, and 115 (34%) were used for testing the obtained decision structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.
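The disjointness evaluation can be sketched as follows. This is a hypothetical pairwise scoring scheme, not a reproduction of AQDT-2's exact formula: each pair of classes contributes more when the value sets of the attribute in their rulesets overlap less.

```python
def pair_disjointness(v1, v2):
    # Illustrative 0-3 scale for two classes' value sets of one attribute.
    if v1 == v2:
        return 0                 # identical value sets
    if v1 <= v2 or v2 <= v1:
        return 1                 # one set contains the other
    if v1 & v2:
        return 2                 # partial overlap
    return 3                     # completely disjoint

def attribute_disjointness(value_sets):
    # value_sets: class name -> set of values of the attribute appearing in
    # that class's rules (rules omitting the attribute contribute all values).
    classes = list(value_sets)
    return sum(pair_disjointness(value_sets[a], value_sets[b])
               for i, a in enumerate(classes) for b in classes[i + 1:])

# An attribute whose values fully separate the classes scores highest:
print(attribute_disjointness({'C1': {1}, 'C2': {2}, 'C3': {3, 4}}))      # 9
print(attribute_disjointness({'C1': {1, 2}, 'C2': {1, 2}, 'C3': {1, 2}}))  # 0
```

Under any scoring of this shape, an attribute like x6 below, whose value sets differ sharply across classes, dominates the root selection.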

Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1,3] (t: 18, u: 18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t: 3, u: 3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t: 2, u: 2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t: 2, u: 2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t: 2, u: 2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t: 2, u: 2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t: 2, u: 2)

Decision class C2:
1. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t: 28, u: 19)
2. [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t: 17, u: 6)
3. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t: 10, u: 4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t: 10, u: 2)
5. [x1=3,5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t: 9, u: 4)
6. [x1=2][x2=1,2][x3=1,2][x5=1,2,3][x4=1][x6=1][x7=1] (t: 7, u: 6)
7. [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t: 6, u: 4)
8. [x1=3,5][x2=2][x3=1][x7=1][x4=1,2][x5=1,2,3][x6=1,3] (t: 5, u: 5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t: 4, u: 4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1,3] (t: 4, u: 4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t: 4, u: 2)

Decision class C3:
1. [x1=2..5][x2=1,2][x3=1,2][x7=1..4][x4=1,2][x5=1,3][x6=2,4] (t: 41, u: 32)
2. [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2,4] (t: 27, u: 20)
3. [x1=1,3][x2=1][x3=1,2][x7=1,4][x4=2][x5=1,2][x6=2,3] (t: 19, u: 6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t: 13, u: 8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2,4] (t: 5, u: 5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1,4] (t: 4, u: 4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t: 1, u: 1)

Figure 4-2. Decision rules determined by AQ15c from the wind bracing data.


Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by the values of x6 (in general, they could be groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for a branch until all rules assigned to the branch are of the same class. That class is then assigned to the leaf.

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf, C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated only for those rules which contain [x6=1] as one of their conjunctions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.

For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.
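The default window size mentioned above can be expressed directly; this is a hedged reading of C4.5's default as described here, with an illustrative function name:

```python
import math

def default_window_size(n_examples):
    # the larger of 20% of the examples and twice the square root of their number
    return int(max(0.2 * n_examples, 2 * math.sqrt(n_examples)))

# For the 220 wind bracing training examples, 20% (44) exceeds
# 2 * sqrt(220) (about 29.7), so the window starts at 44 examples.
print(default_window_size(220))  # 44
```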

The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some unclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatched.

Figure 4-3. A decision tree learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves).

Figure 4-4 shows a decision structure learned, under the default settings of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the "?" (indefinite) decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distribution.

Figure 4-4. A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves).

Figure 4-5. A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves).

Figure 4-6 presents the decision structure from Figure 4-5 in which leaves were assigned candidate decisions with decision class probability estimates. Let us consider node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (11), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be approximated as P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11.
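The reported estimates are consistent with simply normalizing the w-weights at the node (31+11+0+5 = 47; 31/47 = .66, and so on). A minimal sketch, assuming equation (11) reduces to this normalization:

```python
def leaf_probabilities(w):
    # w: class -> number of training examples of that class matched at the node
    total = sum(w.values())
    return {c: round(wi / total, 2) for c, wi in w.items()}

print(leaf_probabilities({'C1': 31, 'C2': 11, 'C3': 0, 'C4': 5}))
# {'C1': 0.66, 'C2': 0.23, 'C3': 0.0, 'C4': 0.11}
```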

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that rules whose combined t-weight represented 10% or less coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).

Figure 4-6. A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves).

Figure 4-7. A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves).
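The truncation step described above can be sketched as follows (an illustrative rendering: rules are reduced to (class, t-weight) pairs, and a rule survives only if its t-weight exceeds the noise-level fraction of its class's total):

```python
def truncate_rules(rules, noise_level=0.10):
    # rules: list of (decision_class, t_weight) pairs
    class_totals = {}
    for cls, t in rules:
        class_totals[cls] = class_totals.get(cls, 0) + t
    return [(cls, t) for cls, t in rules
            if t / class_totals[cls] > noise_level]

# e.g. the C1 ruleset from Figure 4-2: t-weights 18, 3, 2, 2, 2, 2, 2 (total 31)
kept = truncate_rules([('C1', t) for t in (18, 3, 2, 2, 2, 2, 2)])
print(len(kept))  # 1 -- only the t=18 rule covers more than 10% of 31
```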

To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the costs of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram uses different shades for different decision classes. Another shade is used to mark cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.

Figure 4-8. Diagrammatic visualization of the decision trees learned for different decision-making situations for the wind bracing data ("?" means the system cannot produce a decision without the missing attribute).

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings of AQ15c (two types of decision rules--characteristic or discriminant; three coverage modes--intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths--1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>; see Table 4-2) were selected for experiments with Subsystem II.
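The 18 settings form a simple Cartesian product of the three parameters; sketched (the tuple names are illustrative):

```python
from itertools import product

rule_types = ('characteristic', 'discriminant')
coverage_modes = ('intersecting', 'disjoint', 'ordered')  # ordered = decision lists
beam_widths = (1, 5, 10)

settings = list(product(rule_types, coverage_modes, beam_widths))
print(len(settings))  # 18
```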

These experiments were performed for four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of the rules learned by AQ15c from examples and the predictive accuracy of the decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either one of the two programs on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with the testing examples that represent the complement of the training example set.


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in the intersecting or disjoint modes. For each data set, the result reported from each experiment is calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.

Figure 4-9. The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem, plotted against the relative sample size (%) of the training data (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>).

Figure 4-10 shows the changes in the predictive accuracy of the decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3%, and the default generalization degree is 10%. The results show that, with the wind bracing data, it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.

Figure 4-10. Analyzing different parameter settings of AQDT-2 using the wind bracing data (predictive accuracy vs. relative sample size (%) of the training data, for the <Disj, Char> and <Intr, Char> settings).

Figure 4-11. A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data (predictive accuracy vs. relative size (%) of the training examples).

4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1

This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no).


The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus, the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c for the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.

Figure 4-12. A visualization diagram of the MONK-1 problem.

The AQDT-2 program, running in its default mode with the optimality criterion set to minimize the number of nodes (i.e., with the disjointness criterion ranked first), produced a decision tree with 41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem.

Positive rules:             Negative rules:
1. [x5 = 1]                 1. [x1 = 1][x2 = 2,3][x5 = 2..4]
2. [x1 = 3][x2 = 3]         2. [x1 = 2][x2 = 1,3][x5 = 2..4]
3. [x1 = 2][x2 = 2]         3. [x1 = 3][x2 = 1,2][x5 = 2..4]
4. [x1 = 1][x2 = 1]

Figure 4-13. Decision rules learned by AQ15c for the MONK-1 problem.

The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These rules were:

Pos <= [x5=1] v [x1=x2]    and    Neg <= [x5≠1] & [x1≠x2]
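With the constructed equality attribute, classification reduces to two tests. A minimal sketch of the rules above (the function name is illustrative):

```python
def classify_monk1(x1, x2, x5):
    # Pos <= [x5=1] v [x1=x2]; Neg otherwise
    return 'Pos' if x5 == 1 or x1 == x2 else 'Neg'

print(classify_monk1(x1=3, x2=1, x5=1))  # Pos (jacket-color is red)
print(classify_monk1(x1=2, x2=2, x5=4))  # Pos (head-shape equals body-shape)
print(classify_monk1(x1=2, x2=3, x5=4))  # Neg
```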

Table 4-3. A comparison of the attribute selection criteria for the MONK-1 problem.

From these rules, the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15-a).

Figure 4-14. The decision tree for the MONK-1 problem generated by AQDT-2 (P = Positive, N = Negative; complexity: 13 nodes, 28 leaves).

Figure 4-15. Compact decision structures generated by AQDT-2 for the MONK-1 problem: (a) from the AQ15 rules (complexity: 5 nodes, 7 leaves); (b) from the AQ17 rules (complexity: 2 nodes, 3 leaves).

Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings of AQ15c (two types of decision rules--characteristic or discriminant; three coverage modes--intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search--1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>) were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of the rules learned by AQ15c from examples and the predictive accuracy of the decision structures learned by AQDT-2 from the decision rules.

Each value in that table is the average predictive accuracy over 100 runs of either one of the two programs on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.

Figure 4-16. The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem, plotted against the relative sample size (%) of the training data (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>).

Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in the intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of the decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3%, and the default generalization degree is 10%. The results show that, with the MONK-1 data, it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.

Figure 4-17. Analyzing different parameter settings of AQDT-2 with the MONK-1 data (predictive accuracy vs. relative sample size (%) of the training data, for the <Disj, Char> and <Intr, Char> settings).

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.

Figure 4-18. Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem (predictive accuracy vs. relative size (%) of the training examples).

4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using the original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no). The original problem was to learn a concept from 169 training examples (62 positive and 62 negative). These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.

Figure 4-19. A visualization diagram of the MONK-2 problem.


Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>). They were selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of the rules learned by AQ15c from examples and the predictive accuracy of the decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of either one of the two programs on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with the testing examples that represent the complement of the training examples.

Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Figure 4-20. The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem, plotted against the relative sample size (%) of the training data (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>).

Experiments with Subsystem II: Again, the parameters of Subsystem I are fixed and selected parameters of Subsystem II are modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained with the default settings of AQDT-2; the default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning threshold did not improve the predictive accuracy.
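To illustrate what a pre-pruning threshold of this kind does, the sketch below discards rules whose coverage falls below a given fraction of the covered examples. This is only an illustrative rendering of the idea (the rule representation is assumed; AQDT-2's actual criterion is applied to rule components during structure generation):

```python
def prune_rules(rules, threshold=0.03):
    """Pre-pruning sketch: drop rules whose coverage is below `threshold`
    (default 3%, mirroring the default described above).
    Each rule is (conditions, covered_examples)."""
    total = sum(len(covered) for _, covered in rules) or 1
    return [(conds, covered) for conds, covered in rules
            if len(covered) / total >= threshold]
```

Lightweight rules often reflect noise or rare exceptions, so removing them before tree generation trades a little coverage for a simpler structure.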

Figure 4-21. Analyzing different parameter settings of AQDT-2 with the MONK-2 data (two panels: <Disj, Char> and <Intr, Char>; axes: predictive accuracy vs. the relative sample size (%) of the training data).

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a summary of these experiments.

Figure 4-22. Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem (three panels: predictive accuracy, tree complexity, and learning time vs. the relative size (%) of training examples).

4.5 Experiments With Small, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples. Noisy examples are examples that are assigned the wrong decision class.

Figure 4-23. A visualization diagram of the MONK-3 problem.

Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in the table is the average predictive accuracy of running both programs 100 times on 100 distinct, randomly selected training data sets of the given size. Each run was tested with a testing set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.


Experiments with Subsystem II: In these experiments the parameters of Subsystem I, the learning process, were fixed, and selected parameters of Subsystem II, the decision-making process, were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.

Figure 4-24. The accuracy of AQDT-2 and AQ15c with different parameter settings (<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>) for the MONK-3 problem.

Figure 4-25. Analyzing different parameter settings of AQDT-2 with the MONK-3 data (axes: predictive accuracy vs. the relative sample size (%) of the training data).

Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
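The arithmetic behind this effect is simple. Assuming, for illustration, a dataset of 100 examples (the exact figures in the text depend on the actual dataset sizes):

```python
def one_error_rate(total_examples, train_pct):
    """Error-rate contribution (in %) of a single misclassification when
    testing on the complement of the training sample."""
    test_size = total_examples * (100 - train_pct) // 100
    return 100.0 / test_size

# With 100 examples: one mistake costs about 1.1% of accuracy when
# training on 10% of the data, but a full 10% when training on 90%.
for pct in (10, 90):
    print(f"train {pct}%: one error = {one_error_rate(100, pct):.2f}%")
```

So as the training fraction grows, the test set shrinks and each individual error produces a larger swing in the measured accuracy, which explains the non-monotonic curves.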

Figure 4-26. Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem (three panels: predictive accuracy, tree complexity, and learning time vs. the relative size (%) of training examples).

4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3) Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).

In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-27. Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem (three panels: predictive accuracy, tree complexity, and learning time vs. the relative size (%) of training examples).

4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes. These attributes are: 1) Cap-shape, 2) Cap-surface, 3) Cap-color, 4) Bruises, 5) Odor, 6) Gill-attachment, 7) Gill-spacing, 8) Gill-size, 9) Gill-color, 10) Stalk-shape, 11) Stalk-root, 12) Stalk-surface-above-ring, 13) Stalk-surface-below-ring, 14) Stalk-color-above-ring, 15) Stalk-color-below-ring, 16) Veil-type, 17) Veil-color, 18) Ring-number, 19) Ring-type, 20) Spore-print-color, 21) Population, and 22) Habitat.

To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.

In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees.


The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning time is about the same. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample: one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-28. Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem (three panels: predictive accuracy, tree complexity, and learning time vs. the relative size (%) of training examples).

4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains

Learning task-oriented decision structures from structural data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West trains problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes: eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was generated such that they could completely describe any car in the train; see Table 4-7. Each train was described by one example of varying length. To recognize the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (i,j), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
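This flattening scheme can be sketched as follows. The per-car attribute names below are hypothetical stand-ins (the actual eight attributes are listed in Table 4-7); only the two-digit coding convention is taken from the text:

```python
# Hypothetical names for the eight per-car attributes (j = 1..8);
# attribute 2 is the car shape, matching the x32 example above.
CAR_ATTRIBUTES = ["length", "shape", "wheels", "roof",
                  "sides", "walls", "load_kind", "load_shape"]

def encode_train(cars):
    """Flatten a structured train (a list of per-car attribute dicts)
    into one variable-length example keyed by the two-digit codes xij."""
    example = {}
    for i, car in enumerate(cars, start=1):
        for j, name in enumerate(CAR_ATTRIBUTES, start=1):
            example[f"x{i}{j}"] = car[name]
    return example
```

A two-car train thus yields 16 attribute-value pairs, a four-car train 32, which is how examples of different lengths arise.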

Table 4-7. The set of attributes and their values used in the trains problem (i stands for the car number, 1-4).

Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.

Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more (14/14). In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the remaining 6 trains correctly using a flexible matching method (Michalski et al., 1986).
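The idea behind flexible matching can be sketched as follows. This is a simplified rendering, not the actual degree-of-match computation of Michalski et al. (1986), and the rule representation is an assumption for illustration:

```python
def flexible_match(example, rules):
    """Classify an example that no rule matches strictly by picking the
    class whose rule satisfies the largest fraction of its conditions.
    Each rule is (decision_class, {attribute: set_of_allowed_values})."""
    best_class, best_degree = None, -1.0
    for cls, conditions in rules:
        satisfied = sum(1 for attr, allowed in conditions.items()
                        if example.get(attr) in allowed)
        degree = satisfied / len(conditions)
        if degree > best_degree:
            best_class, best_degree = cls, degree
    return best_class
```

This is how a structure built only from third-car attributes can still assign a plausible class to the six two-car trains that the structure cannot match strictly.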

Figure 4-29. Decision structures learned by AQDT-2 for different decision-making situations: a) decision structure learned using only descriptions of Car 1 (4 nodes, 9 leaves); b) decision structure learned using only descriptions of Car 2; c) decision structure learned using only descriptions of Car 3 (6 leaves).

4.9 Experiments With Small, Simple, and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the second half in the other class).


Table 4-8 and Figures 4-30a and 4-30b show the results for the Congressional Voting-1984 problem graphically. The results indicate that AQDT-2-generated decision trees had a higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change of the size of the training example set was smaller.

Table 4-8. A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.

Figure 4-30. Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2: a) accuracy of the decision tree as a function of the size of the set of training examples; b) size of the decision tree as a function of the size of the set of training examples.

4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to


illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from these rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.

Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others, than for another type of cover), the best cover is determined according to the best width of the beam search and the best rule type.
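The first heuristic can be stated precisely as a small function (an illustrative formalization, not code from AQDT-2):

```python
def prefer_smaller_beam(results):
    """Among (beam_width, accuracy%) pairs, pick the smallest width whose
    accuracy is within 2 percentage points of the best observed accuracy."""
    best_acc = max(acc for _, acc in results)
    return min(width for width, acc in results if best_acc - acc < 2.0)
```

The rationale is that a narrower beam search is cheaper, so a wider beam is only worth its cost when it buys a clearly better (more than 2%) accuracy.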

Table 4-9. Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.

It was clear that AQDT-2 works better with characteristic rules than with discriminant rules. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, the predictive accuracy is considered higher or lower; 2) if the average learning time is within ±0.1 seconds, the learning time is considered the same.

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparing the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10. Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same/X means similar performance, with AQDT-2 slightly better if X=A and C4.5 slightly better if X=C.

Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, while C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of decision trees learned by C4.5 grows relatively quickly as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules and that C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be much less than that of C4.5. However, on some data sets it takes more time, because in situations where there is not enough information to reach a decision the program goes into a loop of testing all attributes; a probabilistic approach for handling this problem has not been implemented yet.

To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparison between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class. The white areas represent non-positive coverage.

Figure 4-31. A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem.


Figure 4-32 shows the testing results of the decision rules learned by AQ15c. Shaded cells marked with a bullet indicate false positive errors (AQ15c classifies the cell as positive while it should be negative). Non-shaded cells marked with a bullet indicate false negative errors (AQ15c classifies the cell as negative while it should be positive).

Figure 4-32. A visualization diagram showing the testing errors for the rules in Figure 4-31.

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, cells with one shading indicate portions of the representation space that were classified as positive by both AQ15c and AQDT-2. Cells with a second shading are portions of the representation space that were classified as positive by AQ15c but as negative by AQDT-2. Cells with a third shading represent portions of the representation space where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).


This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.

Figure 4-33. A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem.

Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative errors. Cells with one marking indicate portions of the representation space with false positive errors; cells with another marking represent portions of the representation space with false negative errors. Comparing Figures 4-34 and 4-32, more errors occurred because of the over-generalization.


Figure 4-34. A visualization diagram showing the testing errors of the AQDT-2 decision tree.

Figure 4-35. A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%.

CHAPTER 5: CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.

The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated on line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: that in order to determine a decision structure from examples, it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.

The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than the decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.

Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.

REFERENCES

98

99

Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), Constructive Induction in Structural Design, Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990), Integrated Learning in a Real Domain, Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System, Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), AQ17: A Multistrategy Learning System: The Method and User's Guide, Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994), Trading Accuracy for Simplicity in Decision Trees, Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986), Learning Diagnostic Rules from Incomplete and Noisy Data, in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.

Clark, P. and Niblett, T. (1987), Induction in Noisy Domains, in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991), On Estimating Probabilities in Tree Pruning, Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991), The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction, Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994), Exception DAGs as Knowledge Structures, Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984), Experience in the Use of an Inductive System in Knowledge Engineering, in M. Bramer (Ed.), Research and Developments in Expert Systems, Cambridge: Cambridge University Press.

Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.

Imam, I.F. and Michalski, R.S. (1993a), Should Decision Trees be Learned from Examples or from Decision Rules?, Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.


Imam, I.F. and Michalski, R.S. (1993b), Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study, Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. and Zemankova, M. (Eds.), Kluwer Academic Pub., MA.

Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques, Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.

Imam, I.F. and Vafaie, H. (1994), An Empirical Comparison Between Global and Greedy-like Search for Feature Selection, Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.

Imam, I.F. and Michalski, R.S. (1994), From Facts to Rules to Decisions: An Overview of the FRD-1 System, Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.

Kohavi, R. (1994), Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations, Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi, R. and Li, C. (1995), Oblivious Decision Trees, Graphs, and Top-Down Pruning, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian, O.L. and Wolberg, W.H. (1990), Cancer Diagnosis via Linear Programming, SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.

Michalski, R.S. (1973), AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition, Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.

Michalski, R.S. (1978), Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams, Technical Report No. 898, Urbana: University of Illinois, March.

Michalski, R.S. (1983), A Theory and Methodology of Inductive Learning, Artificial Intelligence, Vol. 20 (pp. 111-116).

Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains, Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.

Michalski, R.S. (1990), Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.


Michalski, R.S. and Imam, I.F. (1994), Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System, Lecture Notes in Artificial Intelligence, Springer-Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers, J. (1989a), An Empirical Comparison of Selection Measures for Decision-Tree Induction, Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.

Mingers, J. (1989b), An Empirical Comparison of Pruning Methods for Decision-Tree Induction, Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.

Niblett, T. and Bratko, I. (1986), Learning Decision Rules in Noisy Domains, Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.

Quinlan, J.R. (1979), Discovering Rules by Induction from Large Collections of Examples, in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan, J.R. (1983), Learning Efficient Classification Procedures and Their Application to Chess End Games, in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.

Quinlan, J.R. (1986), Induction of Decision Trees, Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.

Quinlan, J.R. (1987), Simplifying Decision Trees, International Journal of Man-Machine Studies, 27 (pp. 221-234).

Quinlan, J.R. (1990), Probabilistic Decision Trees, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Smyth, P., Goodman, R.M. and Higgins, C. (1990), A Hybrid Rule-based/Bayesian Classifier, Proceedings of ECAI-90, Stockholm, August.

Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.

Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), The MONK's Problems: A Performance Comparison of Different Learning Algorithms, Technical Report, Carnegie Mellon University, October.

Wnek, J. and Michalski, R.S. (1994), Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments, Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition.

Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium FLAIRS-95 and the program committee of the Florida Artificial Intelligence Research Symposium FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 and MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.



3.3.3 An Example Illustrating the Algorithm 42

3.4 Tailoring a Decision Structure to a Decision-Making Situation 47

3.4.1 Learning Cost-Dependent Decision Structures 49

3.4.2 Assigning Decisions Under Insufficient Information 49

3.4.3 Coping with Noise in Training Data 50

3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51

3.6 Decision Structures vs. Decision Trees 53

CHAPTER 4

EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58

4.1 Description of the Experimental Analysis 59

4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracings 60

4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1 69

4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2 76

4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3 79

4.6 Experiments With Large-Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83

4.7 Experiments With Large-Size, Complex, and Noisy Problems: Mushroom Classification 84

4.8 Experiments With Small-Size, Structured, and Noise-Free Problems: East-West Trains 85

4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87

4.10 Analysis of the Results 88

CHAPTER 5 CONCLUSIONS 95

5.1 Summary 95

5.2 Contributions 96

REFERENCES 98

VITA 102


LIST OF TABLES

No. TITLE Page

2-1 An example of a decision table 9

2-2 A set of training examples used to illustrate the C4.5 system 15

2-3 The frequency of different attribute values for different decision classes 17

2-4 The expected values of the frequency of examples in Table 2-3 17

2-5 Attribute selection criteria and their basic evaluation measure 17

2-6 The contingency tables of Mingers' example 18

2-7 Mingers' results for determining the goodness of split 19

2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria on four problems 19

2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22

3-1 The available tools and the factors that affect the process of testing software 43

3-2 Calculating the disjointness of each attribute 44

3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51

3-4 The data used in Mingers' first experiments 52

3-5 The performance of the AQDT-2 criteria (compare with the other criteria in Table 2-6) 52

3-6 The possible ranking domains and usage conditions of the AQDT-2 criteria 53

3-7 Comparison between decision structures and decision trees 54

4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62

4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67

4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71

4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73

4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77

4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81

4-7 The set of attributes and their values used in the trains problem 86

4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88

4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89

4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90


LIST OF FIGURES

No. TITLE Page

2-1 An example to illustrate how attributes break rules 8

2-2 A decision tree learned from the decision table in Table 2-1 10

2-3 A decision tree learned using the gain criterion for selecting attributes 15

2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21

3-1 Architecture of the AQDT approach 24

3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27

3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33

3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33

3-5 Decision trees showing the maximum number of non-leaf nodes 41

3-6 Decision rules for selecting the best tool for testing software 43

3-7 A decision structure learned for classifying software testing tools 45

3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46

3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47

3-10 A decision tree learned without the cost attribute 47

3-11 Decision structures learned by AQDT-2 using different criteria 55

3-12 Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples 56

3-13 An example where decision rules are simpler than decision trees 57

4-1 Design of a complete experiment 59


4-2 Decision rules determined by AQ15c from the wind bracing data 61

4-3 A decision tree learned by C4.5 for the wind bracing data 63

4-4 A decision structure learned from AQ15c wind bracing rules 64

4-5 A decision structure that does not contain attribute x1 64

4-6 A decision structure without x1, with candidate decisions assigned to leaves 65

4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65

4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66

4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68

4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69

4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69

4-12 A visualization diagram of the MONK-1 problem 70

4-13 Decision rules learned by AQ15c for the MONK-1 problem 71

4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72

4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72

4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74

4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75

4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75

4-19 A visualization diagram of the MONK-2 problem 76

4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78

4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79

4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79

4-23 A visualization diagram of the MONK-3 problem 80

4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82

4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82

4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83

4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84

4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85

4-29 Decision structures learned by AQDT-2 for different decision-making situations 87

4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 and AQDT-2 88

4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91

4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92

4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93

4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94

4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1 94

DERIVING TASK-ORIENTED DECISION STRUCTURES

FROM DECISION RULES

Ibrahim M. Fahmi Imam, Ph.D.

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge base from the function of using the knowledge base for decision-making. The first function focuses on learning an accurate, consistent, and complete concept description expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that the decision structures it learns usually outperform, in terms of accuracy and average size, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using applications to artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing Democratic and Republican voting records).

CHAPTER 1 INTRODUCTION

1.1 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.

A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation), the branches are assigned possible test outcomes or ranges of outcomes, and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when leaves are assigned single definite decisions. Thus, the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
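This notion can be made concrete with a minimal sketch (hypothetical names; this is an illustration, not the actual AQDT-2 implementation): each node holds a test, branches carry test outcomes, and leaves may carry several candidate decisions with probabilities.

```python
from dataclasses import dataclass, field

@dataclass
class Leaf:
    decisions: dict          # one definite decision {"A1": 1.0}, or candidates {"A1": 0.7, "A2": 0.3}

@dataclass
class Node:
    test: str                # attribute (or function of attributes) evaluated at this node
    branches: dict = field(default_factory=dict)   # test outcome -> Node or Leaf

def classify(node, example):
    """Follow the branch matching the example at each node until a leaf is reached."""
    while isinstance(node, Node):
        node = node.branches[example[node.test]]
    return node.decisions

# A structure that evaluates x2 first and consults x1 only when x2 = 2.
structure = Node("x2", {0: Leaf({"A1": 1.0}),
                        1: Leaf({"A2": 1.0}),
                        2: Node("x1", {0: Leaf({"A1": 1.0}),
                                       1: Leaf({"A3": 1.0}),
                                       2: Leaf({"A2": 1.0})})})
print(classify(structure, {"x1": 1, "x2": 2}))   # {'A3': 1.0}
```

When every leaf holds exactly one decision and every node has a single parent, such a structure degenerates to an ordinary decision tree.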

Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain, and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
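For illustration, the two most common families of measures can be sketched as follows (a generic sketch, not the exact formulation used by any of the cited systems):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of diversity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gain(examples, attr, target="cls"):
    """Entropy reduction achieved by splitting the examples on `attr`."""
    before = entropy([e[target] for e in examples])
    groups = {}
    for e in examples:
        groups.setdefault(e[attr], []).append(e[target])
    after = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return before - after

data = [{"x": 0, "cls": "A"}, {"x": 0, "cls": "A"},
        {"x": 1, "cls": "B"}, {"x": 1, "cls": "A"}]
print(round(gain(data, "x"), 3))                  # 0.311
print(round(gini([e["cls"] for e in data]), 3))   # 0.375
```

The gain ratio divides this gain by the entropy of the split itself, which penalizes attributes with many values.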


A decision tree or decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine the answers for all symptoms appearing in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful either to modify the structure so that it does not contain that attribute or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).

A restructuring of a decision structure (or a tree) in order to suit new requirements can be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation, such as a set of decision rules; tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees), which differ in the test ordering. Due to this lack of order constraints, a declarative representation (rules) is much easier to modify and adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.

An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form, and transform it to a decision structure when it is needed for decision-making. This method allows one to create the decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generation from training examples. Thus, this process could be done on line, without any delay noticeable to the user. Such "virtual" decision structures are easy to tailor to any given decision-making situation.

This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus, such an approach has many potential advantages.

This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15 (Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating "unknown" nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to an inability to evaluate an attribute associated with an intermediate node.


To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments has been designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structure learned from them, and comparing decision trees learned by the AQDT-2 system with those of C4.5 (Quinlan, 1993), the well-known system for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2, and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West Trains (Michie et al., 1994), Engineering Design-Wind Bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and the Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures: MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West Trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design-Wind Bracings data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.

1.2 The Problem Statement

There are many limitations and problems that accompany the use of decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. The method proposed several attribute selection criteria of increasing power, based on the main criterion, the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.

Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint; in other words, for any two rules there exists a condition with the same attribute but with non-overlapping values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram all the condition parts of the given rules, and marking them with the action specified by each rule.
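Under one simple representation (a rule as a mapping from attributes to sets of allowed values, with unmentioned attributes unconstrained; an assumed encoding, not a notation from the original method), the disjointness test of Definition 2-2 can be sketched as:

```python
def disjoint(r1, r2):
    """Two rules are logically disjoint if some attribute is constrained
    in both rules to non-overlapping sets of values."""
    return any(not (vals & r2[attr]) for attr, vals in r1.items() if attr in r2)

def is_disjoint_cover(rules):
    """A cover is disjoint if its rules are pairwise logically disjoint."""
    return all(disjoint(r1, r2)
               for i, r1 in enumerate(rules) for r2 in rules[i + 1:])

# The cover {[x2=0], [x2=1], [x1=0][x2=2]} is disjoint: every pair of
# rules constrains x2 to non-overlapping values.
cover = [{"x2": {0}}, {"x2": {1}}, {"x1": {0}, "x2": {2}}]
print(is_disjoint_cover(cover))            # True
print(disjoint({"x1": {0}}, {"x2": {1}}))  # False: no shared constrained attribute
```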

Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree must have at least n leaves). Michalski (1978) has shown that if even one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1]; x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1: An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).


The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute selected is the one which breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).

Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.

Example: Learn a decision tree from the following decision table (Table 2-1).

The minimal cover consists of the following rules:

A1 ⇐ [x2=0] v [x1=0][x2=2]    A2 ⇐ [x2=1] v [x1=2][x2=2]    A3 ⇐ [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of


the decision tree is x2. Then three branches are attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case, x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.


Figure 2-2 A decision tree learned from the decision table in Table 2-1
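The MAL computation in this example can be sketched in code. This is a simplified reading of the criterion: an attribute breaks a rule when the rule admits more than one of its values, which includes attributes absent from the rule; the domains of x3 and x4 are assumed here, since the decision table itself is not reproduced.

```python
# Sketch of the MAL (Minimizing Added Leaves) static cost estimate.
# A rule is a dict mapping attribute -> set of allowed values; an attribute
# absent from a rule implicitly allows every value in its domain.
# An attribute "breaks" a rule if the rule admits more than one of its values,
# because splitting on that attribute would copy the rule into several branches.

domains = {"x1": {0, 1, 2}, "x2": {0, 1, 2}, "x3": {0, 1}, "x4": {0, 1}}

# The minimal cover from the example above:
# A1 <- [x2=0] v [x1=0][x2=2];  A2 <- [x2=1] v [x1=2][x2=2];  A3 <- [x1=1][x2=2]
rules = [
    {"x2": {0}},               # A1, first rule
    {"x1": {0}, "x2": {2}},    # A1, second rule
    {"x2": {1}},               # A2, first rule
    {"x1": {2}, "x2": {2}},    # A2, second rule
    {"x1": {1}, "x2": {2}},    # A3
]

def mal(attribute):
    """Number of rules broken by the attribute (its static cost estimate)."""
    return sum(
        1 for rule in rules
        if len(rule.get(attribute, domains[attribute])) > 1
    )

costs = {a: mal(a) for a in domains}
print(costs)  # x2 breaks no rules, so it is chosen as the root
```

Running this reproduces the evaluations stated above: 2 for x1, 0 for x2, and 5 each for x3 and x4.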

2.2 Learning Decision Trees from Examples

Decision tree learning is a field concerned with generating decision trees that classify a set of examples according to the decision classes they belong to. The essential aspect of any

inductive decision tree method is the attribute selection criterion. The attribute selection criterion measures how good the attributes are for discriminating among the given set of decision classes. The best attribute according to the selection criterion is chosen to be assigned

to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin, and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has been subsequently modified by

Quinlan (1979) and applied by many researchers to a variety of learning problems

Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria for selecting attributes use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree, such as the MAL (minimizing added leaves) criterion (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory. These criteria measure the information conveyed


by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure and the gain criteria (Quinlan, 1979, 1983), the Gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The

statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions to determine whether or not there is a correlation. The attribute with the highest correlation is selected to be a node in the tree. Examples of

statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial complete decision tree and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994).

Pruning decision trees improves their simplicity but reduces their predictive accuracy on the

training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.

2.2.1 Building Decision Trees Using Information-based Criteria


This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs for learning decision trees from examples.

Learning decision trees from examples requires a collection of examples. Each example is represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing decision trees from a set of cases (Hunt, Marin, & Stone, 1966).

The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion

calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the Gain Criterion, which uses the frequency of each decision class in the given set of training examples.

Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and classifies the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.

The Gain Criterion. The gain criterion is based on information theory: the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem X1, ..., Xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:


freq(Ci, S) = number of examples in S belonging to Ci     (2-1)

Suppose that |S| is the total number of examples in S; then the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.

The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2 (freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by:

info(S) = - Σ (i=1..k) (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)  bits     (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.

Suppose that we select an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, infoX(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:

infoX(T) = Σ (i=1..k) (|Ti| / |T|) info(Ti)     (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by:

gain(X) = info(T) - infoX(T)     (2-4)

The attribute to be selected is the attribute with maximum gain value
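A minimal sketch of equations 2-2 through 2-4 in code may make the computation concrete (the class counts at the end are only an illustration):

```python
from math import log2

def info(class_counts):
    """Entropy of a set S, given the number of examples per decision class (eq. 2-2)."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)

def gain(class_counts, partition):
    """Information gained by splitting S into subsets (eqs. 2-3 and 2-4).
    `partition` lists the per-class counts of each subset induced by attribute X."""
    total = sum(class_counts)
    info_x = sum(sum(subset) / total * info(subset) for subset in partition)
    return info(class_counts) - info_x

# Illustrative two-class set: 9 positive and 5 negative examples,
# split by an attribute into three subsets.
print(round(info([9, 5]), 3))                            # 0.94
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247
```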

The Gain Ratio Criterion. This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: basically, it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains attributes such as a social security


number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to

this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.

Following steps similar to those used above, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by:

split info(T) = - Σ (i=1..n) (|Ti| / |T|) log2 (|Ti| / |T|)     (2-5)

The gain ratio is given by

gain ratio(X) = gain(X) / split info(X)     (2-6)

and it expresses the proportion of information generated by the split that is useful for

classification

Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the set of training examples.

First, determine the amount of information gained by selecting the attribute "outlook" to be the root of the decision tree. This attribute divides the training examples into three subsets: "sunny", with five examples, two of which belong to the class Play; "overcast", with four examples, all of which belong to the class Play; and "rain", with five examples, three of which belong to the class Play. To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class Play and five belong to the class Don't Play.

info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.940 bits

When using outlook to divide the training examples the information becomes


info_outlook(T) = 5/14 × (- 2/5 log2 (2/5) - 3/5 log2 (3/5))
               + 4/14 × (- 4/4 log2 (4/4) - 0/4 log2 (0/4))
               + 5/14 × (- 3/5 log2 (3/5) - 2/5 log2 (2/5)) = 0.694 bits

By substituting in equation 2-4, the gain of information resulting from using the attribute "outlook" to split the training examples equals 0.246. The information gain for "windy" is 0.048.

Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for "outlook" is determined as follows:

split info(T) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits

The gain ratio for "outlook" = 0.246 / 1.577 = 0.156
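This hand calculation can be checked with a short script; the class counts come from the weather example just quoted, and the tiny differences from the printed 0.246 are rounding.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Outlook splits the 14 training examples (9 Play / 5 Don't Play) into
# sunny (2/3), overcast (4/0), and rain (3/2).
subsets = [[2, 3], [4, 0], [3, 2]]
total = sum(sum(s) for s in subsets)

info_t = entropy([9, 5])
info_outlook = sum(sum(s) / total * entropy(s) for s in subsets)
gain = info_t - info_outlook                      # eq. 2-4
split_info = entropy([sum(s) for s in subsets])   # eq. 2-5 over subset sizes
gain_ratio = gain / split_info                    # eq. 2-6

print(round(info_t, 3), round(info_outlook, 3))   # 0.94 0.694
print(round(gain, 3), round(split_info, 3))       # 0.247 1.577
print(round(gain_ratio, 3))                       # 0.156
```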


Figure 2-3 A decision tree learned using the gain criterion for selecting attributes

The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.
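A simplified sketch of such a threshold search follows. It tries candidate cuts at midpoints between consecutive distinct values and keeps the one with the highest information gain; C4.5's actual implementation differs in details, and the humidity readings below are hypothetical.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Simplified C4.5-style search for a binary cut on a continuous attribute."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best = (0.0, None)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2                     # candidate cut point
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if g > best[0]:
            best = (g, t)
    return best[1]

# Hypothetical humidity readings with Play ("P") / Don't Play ("D") labels:
humidity = [65, 70, 70, 75, 80, 85, 90, 95]
labels   = ["P", "P", "P", "P", "D", "D", "D", "P"]
print(best_threshold(humidity, labels))  # 77.5
```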

Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses Chi-square statistics to measure the association between two attributes. When building decision trees, the method is implemented such that it determines the association between each attribute and the decision classes. The attribute to be selected is the one with the greatest value.

To determine the Chi-square value for an attribute, let aij be the number of examples in class number i where the attribute A takes value number j. In other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by:

Chi-square(A) = Σ (i=1..n) Σ (j=1..m) (aij - Eij)² / Eij     (2-7)

where n is the number of decision classes and m is the number of values of a given attribute. Also,


Eij = (TCi × TVj) / T     (2-8)

where TCi and TVj are the total number of examples belonging to the decision class Ci and the total number of examples where the attribute A takes value Vj, respectively; T is the total number of examples.

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of the different combinations of values between the decision classes and both the "outlook" and the "windy" attributes. Table 2-4 shows the expected values (computed from TCi and TVj) of the frequencies in Table 2-3 for the different attribute values and decision classes.

To determine the association values between the decision classes and both the attribute "windy" and the attribute "outlook", the observed Chi-square values are:

Chi-square(Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/5.1] + [(2-2.9)²/2.9]
= 0.21 + 0.39 + 0.16 + 0.28 ≈ 1.04

Chi-square(Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2] + [(3-1.8)²/1.8] + [(0-1.4)²/1.4] + [(2-1.8)²/1.8]
= 0.45 + 0.75 + 0.01 + 0.80 + 1.40 + 0.02 ≈ 3.43


Applying the same method to the other attributes, the results favor the attribute "outlook". Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset.
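The Chi-square computation can also be checked with a short script. It uses exact expected frequencies rather than values rounded to one decimal as in the hand calculation, so the totals differ slightly, but the ranking of the attributes is the same.

```python
def chi_square(table):
    """Chi-square association between rows (decision classes) and
    columns (attribute values) of a contingency table (eqs. 2-7, 2-8)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    return sum(
        (obs - exp) ** 2 / exp
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
        for exp in [row_tot[i] * col_tot[j] / total]  # expected frequency Eij
    )

# Rows: Play, Don't Play.  Columns: attribute values.
windy   = [[3, 6], [3, 2]]          # windy = true / false
outlook = [[2, 4, 3], [3, 0, 2]]    # sunny / overcast / rain

print(round(chi_square(windy), 2))    # 0.93
print(round(chi_square(outlook), 2))  # 3.55
assert chi_square(outlook) > chi_square(windy)  # Outlook is preferred
```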

Table 2-5 shows a summary of these criteria and their basic evaluation function

Table 2-5 Attribute selection criteria and their basic evaluation measures

Info Measure (IM), Gain, G-statistic, and Gain Ratio:  Entropy(S) = - Σ (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|);  G-statistic = 2N × IM (N = number of examples)

Chi-square:  Chi-square(A, B) = Σ (i=1..n) Σ (j=1..m) (aij - Eij)² / Eij

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria that was done by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, the G statistic, the Gini index of diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain Ratio criterion gave the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples may belong to more than one decision class) to observe how the selected criteria

evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information


theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, a value of zero adds the maximum association between any two attributes, because the Chi-square value of a zero cell is the expected value of this cell.


Now let us examine results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees for eleven different criteria. In the final results, he compared the total number of nodes and the total error rate provided by each criterion over all given problems. Table 2-8 shows the final results for five selected criteria only.


Table 2-8 Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four problems

This experiment was performed on four real-world data sets. These data are concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details, see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas by Imam and Michalski (1993b).

In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary conclusion with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a


new child node from the root and repeats the process until all rules are evaluated. In the decision structure, nodes containing only rules represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees that are used for decision-making.

Figure 2-4 shows an example set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is Safe, except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is Lost, except if x6=1 it is Safe, except if x7=1 it is Lost.

The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a

decision graph where each attribute occurs at most once along any computational path. In other words, for each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once; however, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.


Safe ⇐ [x1=2]    Safe ⇐ [x2=2]    Safe ⇐ [x3=2]
Safe ⇐ [x4=1] & [x5=2]    Safe ⇐ [x4=1] & [x5=3]
Safe ⇐ [x6=1] & [x7=2]    Safe ⇐ [x6=1] & [x7=3]
Safe ⇐ [x4=2] & [x5=2]    Safe ⇐ [x4=2] & [x5=3]

Lost ⇐ [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost ⇐ [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost ⇐ [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost ⇐ [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost ⇐ [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost ⇐ [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a complex combination of that attribute's values. For each subset the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1) and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains examples where A takes value 0 and belong to class C0, or takes value 1 and belong to class C1; the second subset is the set where A takes value 0 and belong to class C1, or takes value 1 and belong to class C0.

The number of nodes of the first level (after the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced exponentially to one.
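The two-subset example above can be made concrete with a small sketch. This is a simplified illustration only, not Kohavi's implementation, and the example data are hypothetical.

```python
# Attribute A has values {0, 1} and there are two decision classes, C0 and C1.
# Each example is a (value-of-A, class) pair. The first subset groups examples
# consistent with the mapping 0 -> C0, 1 -> C1; the second groups those
# consistent with the mapping 0 -> C1, 1 -> C0. Each subset can then be treated
# as a single node at the next level of the graph.
examples = [(0, "C0"), (1, "C1"), (0, "C1"), (1, "C0"), (0, "C0")]

mapping_1 = {0: "C0", 1: "C1"}
mapping_2 = {0: "C1", 1: "C0"}

subset_1 = [e for e in examples if mapping_1[e[0]] == e[1]]
subset_2 = [e for e in examples if mapping_2[e[0]] == e[1]]

print(subset_1)  # [(0, 'C0'), (1, 'C1'), (0, 'C0')]
print(subset_2)  # [(0, 'C1'), (1, 'C0')]
```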

It is easy for the reader to figure out some major disadvantages of such an approach, including: the average size of such decision structures is estimated to be very large, especially when there


is no similarity (i.e., strong patterns) or logical relationship in the data; the time used to learn such a decision structure is relatively very high compared to systems for learning decision trees from examples; and, finally, it could be better to search for an attribute that reduces the number of generated subsets of the data instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a

deterministic method for attribute selection, which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and these two approaches. The EDAG and HOODG systems are unreleased prototype systems.


CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an

inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed

when there is a need for assigning a decision to new data points in the database (e.g., a classification decision) by a program that transforms the obtained knowledge into a decision

structure optimized according to the given decision-making situation

The Learning Task
Given: a set of training examples describing the concept to be learned; a learning goal, which specifies the decision classes to be learned from the training examples; background knowledge to control the learning process.
Determine: a concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given: a set of decision rules in conjunctive form; a description of the new decision-making situation (e.g., attribute costs and order, preference, importance, or frequency of decision classes, etc.); one or more examples that need to be tested under the given decision-making situation; a set of parameters to control the learning process.
Determine: a decision structure that suits the given decision-making situation.

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of


declarative knowledge are that they do not impose any order on the evaluation of the attributes and, due to the lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done online.

Such virtual decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once; they can then be used many times for generating decision structures according to the changing requirements of decision-making tasks. The

method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures

from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows the architecture of the proposed methodology.

Figure 3-1 Architecture of the AQDT approach (learning knowledge from a database, and the decision-making process)


It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values. Some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system, specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called AQ, starts with a seed example of a given decision class and generates a set of the

most general conjunctive descriptions of the seed (alternative decision rules for the seed

example). Such a set is called the star of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description that covers the largest number of positive examples (to minimize the total number of rules needed) and, with second priority, the one that involves the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few or many examples and can optimize the description according to a variety of easily-modifiable hypothesis quality criteria.
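The covering loop just described can be sketched as follows. This is a heavily simplified illustration of the AQ idea, not the actual AQ15 implementation: the "star" here is merely the set of most general sub-conjunctions of the seed that exclude all negative examples, and the example data are hypothetical.

```python
from itertools import combinations

def covers(rule, example):
    """A rule is a dict of attribute -> required value (a conjunction)."""
    return all(example.get(a) == v for a, v in rule.items())

def star(seed, negatives):
    """Simplified 'star': the most general conjunctions of the seed's
    attribute-value pairs that cover the seed and no negative example."""
    items = sorted(seed.items())
    for size in range(1, len(items) + 1):      # most general (shortest) first
        candidates = [dict(c) for c in combinations(items, size)]
        good = [r for r in candidates if not any(covers(r, n) for n in negatives)]
        if good:
            return good
    return [dict(items)]

def aq_cover(positives, negatives):
    """Repeat: pick an uncovered seed, build its star, keep the rule covering
    the most positives, until every positive example is covered."""
    uncovered, rules = list(positives), []
    while uncovered:
        seed = uncovered[0]
        best = max(star(seed, negatives),
                   key=lambda r: sum(covers(r, p) for p in positives))
        rules.append(best)
        uncovered = [p for p in uncovered if not covers(best, p)]
    return rules

# Hypothetical examples described by two attributes:
pos = [{"color": "red", "size": "big"}, {"color": "red", "size": "small"}]
neg = [{"color": "blue", "size": "big"}, {"color": "green", "size": "small"}]
print(aq_cover(pos, neg))  # [{'color': 'red'}]
```

Here a single maximally general rule covers both positive examples, mirroring the default criterion of preferring rules with high positive coverage and few attributes.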


The learned descriptions are represented in the form of a set of decision rules expressed in an attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A

distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, i.e., stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given

concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top".

A characteristic description of the tables would also include properties such as "have four legs", "have no back", "have four corners", etc. Discriminant descriptions are usually much shorter than characteristic descriptions.

Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different classes are logically disjoint; DC mode descriptions are usually more complex both in the number of rules and the number of conditions. There is also a DL mode (a Decision List mode,


also called VL mode, for variable-valued logic mode) in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order: if ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes and selects from them those most promising, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the US Congress.

Each rule is a conjunction of elementary conditions, and each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute State (of the Representative) should take the value northeast or northwest to satisfy the condition.

R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] & [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2 A ruleset generated by AQ15 for the concept Voting pattern of Democratic Representatives

The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record by a Democratic representative:

Draft registration=no, Ban of aid to Nicaragua=no, Cut expenditure on MX missiles=yes, Federal subsidy to nuclear power stations=yes, Subsidy to national parks in Alaska=yes, Fair housing bill=yes, Limit on PAC contributions=yes, Limit on food stamp program=no, Federal help to education=no, State=northeast, State population=large, Occupation=unknown, Cut in social security spending=no, Federal help to Chrysler corp=not registered

By expressing the elementary statements in the example as conditions and linking the conditions by

conjunction, the example can be re-expressed as a decision rule. Thus, decision rules and

examples formally differ only in the degree of generality.

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision

rules (Imam & Michalski, 1993a,b). Also included is a description of the AQDT-2 method for learning

task-oriented decision structures from decision rules; finally, the methodology is

illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning

due to their simplicity. Decision trees built this way can be quite efficient as long as they are

used in decision-making situations for which they are optimized, and these situations remain

relatively stable. Problems arise when these situations change significantly and the assumptions

under which the tree was built no longer hold. For example, in some situations it may be

difficult to determine the value of the attribute assigned to some node. One would like to avoid

measuring this attribute and still be able to classify the example, if this is potentially possible

(Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure

the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also

desirable if there is a significant change in the frequency of occurrence of examples from

different classes. A restructuring of a decision tree to suit the above requirements is, however,

difficult to do. The reason is that decision trees are a form of decision structure

representation that imposes constraints on the evaluation order of the attributes that are not

logically necessary.


One problem in developing a method for generating decision structures from decision rules is to

design an attribute selection criterion that is based on the properties of the rules rather than of

the training examples. A decision rule normally describes a number of possible examples, only

some of which are examples that have actually been observed, i.e., training examples. An attribute

selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based

on counting the numbers of training examples covered by each attribute-value and the

frequency of decision classes in the training examples, as is done in learning decision trees from

examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision

rules constitute a more powerful knowledge representation than decision trees. They can directly

represent a description in an arbitrary disjunctive normal form, while decision trees can directly

represent only descriptions in the disjoint disjunctive normal form, in which all

conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary

decision rules into a decision tree, one faces the additional problem of handling logically

intersecting rules.

The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2

system is based on the earlier work by Michalski (1978), which introduced a general method for

generating decision trees from decision rules. The method aimed at producing decision trees

with the minimum number of nodes or the minimum cost (where the cost was defined as the

total cost of classifying unknown examples, given the cost of measuring individual attributes and

the expected probability distribution of examples of different decision classes). More

explanations are provided in the following section.

3.3.1 The AQDT-2 attribute selection method

This section describes the AQDT-2 method for building a decision structure from decision rules.

The method for building a single-parent decision structure is similar to that used in standard

methods of building a decision tree from examples. The major difference is that it assigns tests

(attributes) to the nodes using criteria based on the properties of the decision rules (including

statistics about the examples covered by each rule, in the case of rules learned from examples),

rather than statistics characterizing the frequency of training examples per decision class, per

attribute-value, or per conjunctions of both. Other differences are that the branches may be

assigned an internal disjunction of values (not only a single value, as in a typical decision tree),

and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be

attributes or names standing for logical or mathematical expressions that involve several

attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably

(to distinguish between an attribute and a name standing for an expression, the latter is called a

constructed attribute).

At each step, the method chooses, from the available set of tests, the test that has the highest utility

(see below) for the given set of decision rules. This test is assigned to the node. The branches

stemming from this node are assigned test values or disjoint groups of values (in the form of a

logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each

branch is associated with a reduced set of rules, determined by removing conditions in which the

selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset

indicate the same decision class, a leaf node is created and assigned this decision class. The

process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further,

because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of

candidate decisions with associated probabilities (see Section 4.2).

The test (attribute) utility is a combination of one or more of the following elementary criteria: 1)

cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which

captures the effectiveness of the test in discriminating among decision rules for different decision

classes; 3) importance, which determines the importance of a test in the rules; 4) value

distribution, which characterizes the distribution of the test importance over its set of values; and 5)

dominance, which measures the test's presence in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, i.e., the

disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and

decision rule sets for these classes have been determined. Given a test A, let V1, V2, ..., Vm denote the

sets of values (outcomes) of A that are present in the rule sets for classes C1, C2, ..., Cm,

respectively. If a ruleset for some class, say Ci, contains a rule that does not involve test A, then

Vi is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is

the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for

Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is

defined by

                 | 0,  if Vi ⊇ Vj
  D(A, Ci, Cj) = | 1,  if Vi ⊂ Vj                                        (3-1)
                 | 2,  if Vi ∩ Vj ≠ φ  &  Vi ∩ Vj ≠ Vi  &  Vi ∩ Vj ≠ Vj
                 | 3,  if Vi ∩ Vj = φ

where φ denotes the empty set. Note that exchanging the second and third conditions of equation

(3-1) may seem to yield an improved criterion; however, it would not clearly distinguish between the two

cases (i.e., in both situations the disjointness would be similar). The current equation is better

because it gives higher scores to attributes that classify different subsets of the two decision

classes than to attributes that classify only a subset of one decision class.

Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the

sum of the degrees of class disjointness of each decision class:

                    m                              m
  Disjointness(A) = Σ D(A, Ci),  where D(A, Ci) =  Σ   D(A, Ci, Cj)      (3-2)
                   i=1                           j=1, j≠i

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are

all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test

values. If two tests have the same disjointness value, the test selected is the one with the

smaller number of values.
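Equations (3-1) and (3-2) can be sketched as follows. This is an illustrative Python rendering under stated assumptions (value sets are represented as Python sets, keyed by class name), not the system's implementation:

```python
def degree_of_disjointness(vi, vj):
    # Equation (3-1): vi, vj are the value sets of test A appearing in the
    # rulesets of classes Ci and Cj, respectively.
    vi, vj = set(vi), set(vj)
    if vi >= vj:
        return 0          # Vi is a superset of (or equal to) Vj
    if vi < vj:
        return 1          # Vi is a proper subset of Vj
    if vi & vj:
        return 2          # the sets overlap without containment
    return 3              # the sets are disjoint

def disjointness(value_sets):
    # Equation (3-2): value_sets maps each class to the value set of A;
    # sum D(A, Ci, Cj) over all ordered pairs of distinct classes.
    return sum(degree_of_disjointness(value_sets[ci], value_sets[cj])
               for ci in value_sets for cj in value_sets if ci != cj)
```

For a pair of classes, the two directed degrees sum to 0, 1, 4 or 6, consistent with the proof of Theorem 2 below.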

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree

is defined as the average number of tests (attributes) to be examined from the root of the tree to

any leaf node in order to reach a decision.
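As an illustration, ANT can be computed from a simple nested-tuple tree encoding (a hypothetical representation assumed for this sketch, not the system's internal one):

```python
def average_number_of_tests(tree):
    # A node is (attribute, {branch_label: subtree}); anything else is a leaf.
    def path_lengths(node, depth):
        if not isinstance(node, tuple):
            return [depth]                     # reached a leaf: depth = tests used
        _attr, branches = node
        return [d for child in branches.values()
                  for d in path_lengths(child, depth + 1)]
    lengths = path_lengths(tree, 0)
    return sum(lengths) / len(lengths)
```

For instance, a tree with one leaf at depth 1 and two leaves at depth 2 has ANT = (1+2+2)/3 = 5/3, the value appearing in the proof of Theorem 1 below.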

Definition: A decision structure is a one-node-per-level decision structure if at each level there is

only one node and zero or more leaves.

Such a decision structure can be generated by combining together all branches whose associated

sets of decision rules belong to more than one decision class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two

decision classes. The disjointness criterion ranks first the attributes that add the minimum number of

tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes Ci

and Cj. There are three cases: 1) subset (equivalently, superset); 2) non-empty intersection, but not a

subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the

same set of values in both classes is a trivial one). Assume that branches leading to one subset

with the same decision class are combined into one branch. In the first case there will be only two

branches. The first leads to a leaf node and the other leads to an intermediate node where

another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three

branches are created. Two branches lead to leaf nodes, where all values at each branch

belong to only one (and a different) decision class. The third branch leads to an intermediate node

where another attribute must be selected that further classifies the decision classes. The minimum

ANT in this case is 6/4. In the third case, only two branches are generated, each leading

to a leaf node with a different decision class. In this case the minimum ANT is 1.

Figure 3-4 shows decision trees equivalent to selecting the corresponding attribute. Note that in

the case of more than one attribute-value on branches leading to leaves belonging to one decision

class, the branches are combined into one branch in the decision structure. The symbol "1" means

that at least one more attribute is needed to classify the two decision classes. In such cases there will be at least

two additional paths.

Case 1: D(A, Ci) = 0, D(A, Cj) = 1;  Case 2: D(A, Ci) = 2, D(A, Cj) = 2;  Case 3: D(A, Ci) = 3, D(A, Cj) = 3

Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes

The average number of tests required for making a decision in each possible case is determined

in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks

highly the attributes that reduce the average number of tests required for decision-making. The

theorem can be proved similarly in the general case.

ANT = 3/2          ANT = 5/3          ANT = 1

("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute for classifying

the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes

A and B where D(A) < D(B). Since the disjointness criterion considers the mutual relationship

between any two decision classes, this means that there are more decision classes where D(A,

Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness

for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj)

than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci,

Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute

are 0, 1, 4, or 6. For all positive values of D(B) = 1, 4, or 6, it is clear that attribute B has a

smaller ANT than attribute A, which has a lower disjointness. This means that if D(A, Ci, Cj) <

D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.

Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is

a better classifier than A.

Importance: The second elementary criterion, the importance of a test, is based on the

importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained

rules, each test is assigned a score that represents the total number of training examples that

are covered by the rules involving this test. Decision rules learned by an AQ learning program

are accompanied by information on their strength. Rule strength is characterized by t-weight

and u-weight. The t-weight (total weight) of a rule for some class is the number of examples of

that class covered by the rule. The importance score of a test is the aggregation of the total

weights of all rules that contain that test in their condition part. Given a set of decision rules for

m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated

with class Ci denoted by ri, the importance score is defined as follows:


Definition 3-3: The importance score, IS(Aj), of the test Aj, is determined by

            m
   IS(Aj) = Σ IS(Aj, Ci)                                      (3-3-1)
           i=1

where
                 ri
   IS(Aj, Ci) =  Σ Rik(Aj)                                    (3-3-2)
                k=1

and Rik(Aj), the weight of test Aj in rule Rik of class Ci, is given by

             | t-weight,  if Aj belongs to rule Rik
   Rik(Aj) = |                                                (3-4)
             | 0,         otherwise

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
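A minimal sketch of the importance score computation, assuming rules are stored as (conditions, t-weight) pairs grouped by class (an illustrative layout, not the system's data structures):

```python
def importance_score(rules_by_class, test):
    # IS(Aj): aggregate the t-weights of all rules, in every class, whose
    # condition part mentions the test (Definition 3-3 / Equation 3-4).
    return sum(t_weight
               for rules in rules_by_class.values()
               for conditions, t_weight in rules
               if test in conditions)
```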

The importance score method has been separately compared, as a feature selection method, with

a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method

produced an equal or higher accuracy on three real-world problems than those reported by the

GA method, while selecting fewer attributes. In addition, the IS method was significantly faster

than the GA.

Value distribution: The third elementary criterion, value distribution, concerns the number of

legal values of tests. Given two tests with equal importance scores, this criterion prefers the test

with the smaller number of legal values. Experiments have shown that this criterion is especially

useful when using discriminant decision rules.

Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by

   VD(Aj) = IS(Aj) / vj                                       (3-5)

where vj is the number of legal values of Aj.

Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large

numbers of rules, as this indicates their high relevance for discriminating among the rulesets of

the given decision classes. Since some conditions in the rules have values linked by internal

disjunction, counting such rules directly would not properly reflect their relevance. Therefore,

for computing the dominance, the rules are counted as if they were converted to rules that do not

have internal disjunction. Such a conversion is done by multiplying out the condition parts of

the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is

multiplied out into two condition parts: [x3=1] & [x4=1] and [x3=3] & [x4=1].
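The multiplying-out of internal disjunctions, and the resulting dominance count, can be illustrated as follows (assumed data layout: a condition part is a dict mapping an attribute to its set of disjoined values):

```python
from itertools import product

def multiply_out(conditions):
    # Expand internal disjunction: [x3=1 v 3] & [x4=1] becomes the two
    # disjunction-free condition parts [x3=1]&[x4=1] and [x3=3]&[x4=1].
    attrs = list(conditions)
    return [dict(zip(attrs, combo))
            for combo in product(*(sorted(conditions[a]) for a in attrs))]

def dominance(rules, test):
    # Count the expanded (disjunction-free) rules whose condition part
    # contains the test.
    return sum(len(multiply_out(conds)) for conds in rules if test in conds)
```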

The above criteria are combined into one general test measure using the lexicographic

evaluation functional with tolerances (LEF) (Michalski, 1973). A LEF is a list of some or all of

the above elementary criteria, each associated with a "tolerance threshold" expressed in percent. The

criteria are applied to tests in the order defined by the LEF. A test passes to the next criterion only if

it scores on the previous criterion within the range defined by the tolerance (measured from the top value).

The default LEF is

   <Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>    (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percent); their default values are 0%.

The default value of the cost of each test is 1.

The above LEF ranks attributes as follows. First, attributes are evaluated on the basis of their cost.

If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two

or more attributes still share the same top score, or their scores differ by less than the assumed

tolerance threshold t2, the method evaluates these attributes using the next (importance)

criterion. If again two or more attributes share the same top score, or their scores differ by less than

the tolerance threshold t3, then the next criterion, the value distribution (the normalized IS), is used,

and then similarly the last criterion (dominance). If there is still a tie, the method selects the best attribute

randomly.
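The LEF ranking scheme can be sketched generically as follows. This is an illustrative rendering, not the AQDT-2 code: score functions are maximized (so a cost criterion would supply negated costs), tolerances are given as fractions rather than percentages, and remaining ties are broken by list order rather than randomly:

```python
def lef_select(tests, criteria):
    # criteria: ordered list of (score_fn, tolerance) pairs; at each stage,
    # keep the tests scoring within `tolerance` (a fraction of the best
    # score) of the top value, then pass the survivors to the next criterion.
    candidates = list(tests)
    for score, tol in criteria:
        best = max(score(t) for t in candidates)
        candidates = [t for t in candidates
                      if score(t) >= best - abs(best) * tol]
        if len(candidates) == 1:
            break
    return candidates[0]  # any remaining tie: pick the first survivor
```

With zero tolerances this reduces to a plain lexicographic ordering of the criteria.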

If there is a non-uniform frequency distribution of examples of different classes, then the

selection criterion uses a modified definition of the disjointness. Namely, the previously defined

disjointness for each class is multiplied by the frequency of the class occurrence. The class

occurrence is the expected number of future examples that are to be classified into a given class:

                    m
  Disjointness(A) = Σ D(A, Ci) * Frq(Ci)                      (3-7)
                   i=1

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given

by the user. The attribute ranking criterion in this case is defined by the LEF

   <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other

elementary criteria are treated the same way as in the default version.

3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively

selecting, at each step, the best test according to the ranking criteria described above and

assigning it to a new node. The process stops when the algorithm creates terminal branches that

are assigned decision classes. To facilitate this process, the system creates a special data

structure for each concept description (ruleset). This structure has fields such as the number of

rules, the number of decision classes, and the number of attributes present in the rules. A set of

pointers connects this data structure to a set of data structures, each representing one decision class.

The decision class structure contains fields with information on the number of rules belonging to

that class, the frequency of the decision class, etc. It is also connected to a set of data structures

representing the decision rules within each decision class. The system independently creates a set

of data structures, each corresponding to one attribute. Each attribute description contains the

attribute's name, domain, type, the number of legal values, a list of the values, the number of

rules that contain that attribute, and the values of that attribute in each rule. The attributes are

arranged in an array in lexicographic order: first in descending order of the number of rules

that contain the attribute, and second in ascending order of the number of the attribute's

legal values.

The system can work in two modes. In the standard mode, the system generates standard

decision trees, in which each branch has a specific attribute-value assigned. In the compact

mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this

leads to simpler structures. For example, if a node assigned attribute A has a branch marked by

values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The

program creates "or" branches on the basis of the analysis of the value sets Vi while computing

the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or

mathematical combinations of the original attributes. To produce decision structures with

derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The

AQ17 rules may contain conditions involving attributes constructed by the program rather than

those originally given.

To generate decision structures from rules, the AQDT-2 method prefers disjoint rule descriptions,

either characteristic or discriminant (given by an expert or learned by a system); disjoint rules

are more suitable for building decision structures. Assume that the description of each class is in

the form of a ruleset, and that this set is the initial ruleset context. The AQDT-2 algorithm is:

The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking

measure. Select the highest-ranked attribute; let A represent this highest-ranked

attribute.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and

assign to it the attribute A. In standard mode, create as many branches from the node as

there are legal values of attribute A, and assign these values to the branches. In

compact mode (decision structures), create as many branches as there are disjoint value

sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it a group of rules from the ruleset context that contain a

condition satisfied by the value(s) assigned to this branch. For example, if a branch is

assigned value i of attribute A, then associate with it all rules containing the condition [A=

i v ...]. If a branch is assigned values i v j, then associate with it all rules containing the

condition [A= i v j v ...]. Remove these conditions from the rules. If there are rules in

the ruleset context that do not contain attribute A, add these rules to all rule groups

associated with the branches stemming from the node assigned attribute A. (This step is

justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming

that a and b are the only legal values of y.) All rules associated with a given branch

constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context of some branch belong to the same class, create a leaf

node and assign that class to it. If all branches of the tree end in leaf nodes, stop;

otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
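Steps 1-4 above can be sketched as a recursive procedure. This is a simplified illustration of the compact mode, not the system's implementation: it assumes clean, disjoint input rules (so the recursion terminates), represents a rule as a (conditions, class) pair with conditions a dict from attribute to value set, and takes the LEF-based selector as a parameter:

```python
def build_tree(rules, select_test):
    classes = {cls for _conds, cls in rules}
    if len(classes) == 1:
        return classes.pop()                       # Step 4: create a leaf
    attr = select_test(rules)                      # Step 1: highest-ranked test
    branches = {}
    # Step 2: one branch per value set of the attribute occurring in the rules
    value_sets = {frozenset(conds[attr]) for conds, _ in rules if attr in conds}
    for vs in value_sets:
        reduced = []                               # Step 3: branch's ruleset context
        for conds, cls in rules:
            if attr not in conds:                  # consensus law: keep on every branch
                reduced.append((conds, cls))
            elif vs <= frozenset(conds[attr]):     # condition satisfied by branch values
                rest = {a: v for a, v in conds.items() if a != attr}
                reduced.append((rest, cls))
        branches[vs] = build_tree(reduced, select_test)
    return (attr, branches)
```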

To select an attribute to become a node of the decision tree (Steps 1 and 2 of the algorithm), the

algorithm performs two independent iterations. In the first iteration, it parses all the decision

rules and determines information about each attribute. This information includes the importance

score of each attribute, the number of rules containing a given attribute, the disjoint value sets of

each attribute, and the attribute values used in describing each decision class. The second

iteration is performed only if the disjointness criterion is ranked first in the LEF. The

second iteration evaluates each attribute's disjointness for each decision class against the other

decision classes.


To determine the complexity of this process, suppose that the maximum number of conditions

formed by a single attribute-value in one rule is s (where s ≤ n, the number of attributes), and r is

the total number of decision rules (in all decision classes):

       m
   r = Σ Ri        (m is the number of decision classes)
      i=1

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be

determined as

   Cmpx(Iter1) = O(r * s)

In the second iteration, the disjointness is calculated between the decision classes for all

attributes. The complexity of the second iteration can be given by

   Cmpx(Iter2) = O(n * m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at

this node and the number of rules associated with this node:

   l = max{m, r}                                              (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm

for building one node of the decision tree, say the node complexity NC(AQDT), is given by

   NC(AQDT) = O(l * n)

Usually l equals the number of rules associated with the given node. Thus, the AQDT

complexity for building one node is a function of the number of attributes multiplied by the

number of rules associated with that node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The

level complexity of the AQDT algorithm, LC(AQDT), can be given by

   LC(AQDT) < O(l * n)

which is less than the complexity of generating the root of the decision tree. To explain this,

consider that the maximum possible number of non-leaf nodes at one level is half the number of the

initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at

the lowest level each node classifies only two rules, each belonging to a different decision class. This

decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level

will be at least twice the number of non-leaf nodes at the previous level. Consider the

level complexity of the AQDT algorithm to be (l * s * o), where o is the number of non-leaf

nodes at the given level. In such cases, either (l * o ≤ r) or (l * s < r). In Figure 3-5-a, the

complexity of the AQDT algorithm at any lower level is given by

   LC(AQDT) = O(2 * s * r/2) < O(n * l) = NC(AQDT)

a) per one level                    b) per one path

Figure 3-5: Decision trees showing the maximum number of non-leaf nodes

Note also that after an attribute is selected as the root of the decision structure, this attribute

and all conditions containing it are removed from the data structure of the algorithm.

Also, if a leaf node is generated, all rules belonging to the corresponding branch are not

tested again.

Since the disjointness criterion selects the attribute that minimizes the average number of tests,

ANT, the AQDT algorithm generates decision trees with the least number of levels.

The number of levels of a decision tree is at most the minimum of

the number of attributes and the number of rules. Let k be the number of levels in a

given decision tree:

   k ≤ min{n, r}                                              (3-10)

Two cases represent the most complex situations, shown in Figures 3-5-a and 3-5-b. In the first

case, where the decision rules are divided evenly, the number of levels will be a function of the

logarithm of the number of rules. In that case, the complexity of the AQDT algorithm for

generating a decision tree from a set of decision rules is given by

   Complexity(AQDT) = O(l * n * log r)                        (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The

maximum possible number of levels of a decision tree equals one less than the number of

decision rules (Figure 3-5-b). Using the disjointness criterion, it is unlikely to obtain such a decision

tree, because it has the maximum average number of tests (ANT) that can be determined from the

same set of nodes and leaves. However, such a decision tree can be generated if the number of

decision classes is one less than the number of attributes. In that case, any disjoint decision rule

should have a maximum length that is less than or equal to the floor of the logarithm of the number of

attributes. Thus, the level complexity of this decision tree is estimated as

   LC(AQDT) = O(l * log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT

algorithm in such cases is given by

   Complexity(AQDT) = O(l * k * log n)                        (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12)

r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT

algorithm is bounded by

   Cmplx(AQDT) = O(r * k * log l)                             (3-13)

3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT-2 is used in selecting the optimal set of

resources for testing software. Suppose there are three tools for testing software: 1)

modeling (T1); 2) checklist (T2); and 3) par_simul (T3). Also assume that there are four

different factors that affect the selection of a tool: 1) the cost of using the tool (x1); 2) the

metrics that support the tool (x2); 3) the best phase for applying the tool (x3); and 4) the type of tool

(automated, semi-automated or manual) (x4). Table 3-1 shows these attributes and their possible

values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6

shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the selection process

T1 <= [x1=2] & [x2=2]  v  [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4]  v  [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1]  v  [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software

These rules can be interpreted as follows:

Rule 1: Use the first tool for testing if you need average cost and the tool is

supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either

in the requirement or the analysis phase, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the

tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing,

either in the requirement or the design phase, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the

tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing,

either in the requirement or the system usage phase, and you need a semi-

automated tool.


Table 3-2 presents information on these rules and the disjointness values for all attributes. For

each class, the row marked "Values" lists the values occurring in the ruleset for that class. For

evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not

contain attribute A is treated as having the additional condition [A = a v b v ...], where a, b, ...

are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1

has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume

that the tolerances of all the elementary criteria are equal to 0.

From the rules in Figure 3-6 we can also determine the disjoint groupings of attribute-values used

in the compact mode of the algorithm. This is done as follows: 1) determine, for each attribute, the

sets of values that the attribute takes in the individual decision rules, and remove those value sets

that subsume other value sets. The remaining value sets are assigned to branches stemming from

the node marked by the given attribute. For example, x1 has the following value sets in the

individual decision rules: {2}, {3}, {1, 2}, {1} and {4} (Figure 3-6). The value set {1, 2} is

removed, as it subsumes {2} and {1}. In this case, branches are assigned the individual values of the

domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4} and {1, 2, 3, 4}. In this case,

branches are assigned the value sets {1}, {2} and {3, 4}.
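The removal of subsumed value sets described above can be sketched as follows (an illustrative helper, assuming value sets are given as Python sets):

```python
def branch_value_sets(value_sets):
    # Drop value sets that subsume (are proper supersets of) another set;
    # the surviving sets label the branches stemming from the node.
    sets = {frozenset(v) for v in value_sets}
    return {s for s in sets if not any(t < s for t in sets)}
```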


Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each one corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 is ended by a leaf T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools can be used for testing a given software.
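The tree-building loop just described can be sketched as follows. This is a simplified reconstruction: the attribute selection criterion is passed in as a function (here a trivial stand-in, not the actual disjointness ranking), and rules are encoded from the four rules R11, R21, R31 and R32 quoted later in this section:

```python
# A simplified sketch of the tree-building loop: pick an attribute,
# create one branch per value it takes in the rules, and stop with a
# leaf as soon as all rules on a branch belong to one class.

def build_tree(rules, attrs, select):
    """rules: list of (class_name, {attribute: set of allowed values})."""
    classes = {c for c, _ in rules}
    if len(classes) == 1:
        return classes.pop()            # leaf: all rules of one class
    if not attrs:
        return sorted(classes)          # indeterminate leaf (cf. Sec. 3.4)
    a = select(rules, attrs)            # e.g. highest-disjointness attribute
    values = set().union(*(conds.get(a, set()) for _, conds in rules))
    branches = {}
    for v in sorted(values):
        # a rule that does not mention attribute a matches every value
        subset = [(c, conds) for c, conds in rules
                  if v in conds.get(a, values)]
        branches[v] = build_tree(subset, [x for x in attrs if x != a], select)
    return (a, branches)

# Rules R11, R21, R31 and R32 from the software testing tools example:
rules = [("T1", {"x1": {2}, "x2": {2}}),
         ("T2", {"x1": {1, 2}, "x2": {3, 4}}),
         ("T3", {"x1": {1}, "x2": {1}}),
         ("T3", {"x1": {4}})]
tree = build_tree(rules, ["x1", "x2"], lambda rs, ats: ats[0])
print(tree[0])      # x1
print(tree[1][4])   # T3 -- the branch [x1=4] ends in a leaf
```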

Figure 3-7: A decision structure learned for classifying software testing tools (root: x1; complexity: 4 nodes, 7 leaves).

Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows the values 1, 2, 3 and 4). Rules are represented by collections of cells in the intersection of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules. Rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].

Figure 3-8: a) Decision rules; b) Derived decision tree.

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to know which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes the value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1 the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1 the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.


Figure 3-9: Decision trees learned ignoring: a) the supporting metric; b) the type of the testing tool.

It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined as the highest-ranked attribute, cannot be measured. The algorithm selected x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10: A decision tree (root: x4) learned without the cost attribute.

3.4 Tailoring Decision Structures to the Decision-Making Situation

Decision structures are among the simplest structures for organizing a decision-making process

A decision structure specifies explicitly the order in which attributes of an object or situation

need to be evaluated in the process of determining a decision A standard way to generate a


decision structure is to learn it from examples of decisions Such a process usually aims at

obtaining a decision structure that has the highest prediction accuracy that is the highest rate of

assigning correct decisions to given situations There can usually be a large number of logically

equivalent decision structures (Michalski 1990) As such they may have the same predictive

accuracy but differ in the way they organize the decision process and thus may differ in the cost

of arriving at a decision To minimize the average decision cost one needs to take into

consideration the distribution of the costs of attribute evaluation and the frequency of different

decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they are built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm, implemented in a new system, AQDT-2, transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the

problem of learning decision structures in the area of construction engineering (for determining

the best wind bracing for tall buildings) In the experiments AQDT-2 outperformed all other

programs applied to the same data

Decision-making situations can vary in several respects In some situations complete

information about a data item is available (ie values of all attributes are specified) in others the

information may be incomplete To reflect such differences the user specifies a set of parameters

that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions with an estimate of the likelihood of their correctness when the needed information cannot be provided.


3.4.1 Learning Cost-Dependent Decision Structures

As described in Sec. 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation involving other elementary criteria. If an attribute has a high cost or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute, if possible.
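The LEF evaluation with tolerances can be sketched as follows; the criterion functions, costs and disjointness scores below are illustrative assumptions, not values from the dissertation:

```python
# A sketch of lexicographic evaluation with tolerances (LEF). Criteria
# are applied in order; after each one, only the candidates scoring
# within the tolerance of the best survive.

def lef_select(candidates, criteria):
    """criteria: list of (score_fn, tolerance); higher scores are better."""
    survivors = list(candidates)
    for score, tol in criteria:
        best = max(score(c) for c in survivors)
        survivors = [c for c in survivors if score(c) >= best - tol]
        if len(survivors) == 1:
            break                      # a unique winner; stop early
    return survivors

cost = {"x1": 1, "x2": 5, "x3": 1, "x4": 2}          # assumed test costs
disjointness = {"x1": 11, "x2": 9, "x3": 4, "x4": 6}  # assumed scores
criteria = [(lambda a: -cost[a], 0),          # cheapest attributes only
            (lambda a: disjointness[a], 0)]   # then highest disjointness
print(lef_select(["x1", "x2", "x3", "x4"], criteria))   # ['x1']
```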

3.4.2 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution for the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have

P(Ci | b1, ..., bk) = P(Ci) P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from the different classes, we have


P(Ci) = twi / Σj=1..m twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σj=1..m wj / Σj=1..m twj    (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain

P(Ci | b1, ..., bk) = wi / Σj=1..m wj    (3-13)
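A minimal sketch of the estimate in equation (3-13); the node frequencies used below are the ones reported for node x2 of the wind bracing example in Section 4.2:

```python
# Equation (3-13): the probability of class Ci at a node reduces to
# wi / (w1 + ... + wm), where wi is the number of training examples of
# class Ci that passed the tests leading to the node.

def class_probabilities(w):
    """w: list of per-class example counts at the node."""
    total = sum(w)
    return [wi / total for wi in w]

# Node frequencies from Section 4.2: w1=31, w2=11, w3=0, w4=5
probs = class_probabilities([31, 11, 0, 5])
print([round(p, 2) for p in probs])   # [0.66, 0.23, 0.0, 0.11]
```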

A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

3.4.3 Coping with Noise in Training Data

The proposed methodology can be easily extended to handle the problem of learning from noisy training data. This is done based on ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
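The truncation step can be sketched as follows; the ruleset and t-weights below are hypothetical, not from the dissertation's data:

```python
# A sketch of t-weight truncation: rules whose t-weight falls at or
# below a threshold fraction of their class's total example coverage
# are removed before the decision structure is built.

def truncate_rules(rules, threshold=0.10):
    """rules: list of (class_name, rule_id, t_weight)."""
    totals = {}
    for cls, _, t in rules:
        totals[cls] = totals.get(cls, 0) + t
    return [(cls, rid, t) for cls, rid, t in rules
            if t / totals[cls] > threshold]

# Hypothetical ruleset: class A is covered by t-weights 40 and 3, class
# B by 20 and 5. With a 10% noise level, (A, r2) covers 3/43 (about 7%)
# and is truncated; (B, r4) covers 5/25 (20%) and is kept.
rules = [("A", "r1", 40), ("A", "r2", 3), ("B", "r3", 20), ("B", "r4", 5)]
print(truncate_rules(rules))
# -> [('A', 'r1', 40), ('B', 'r3', 20), ('B', 'r4', 5)]
```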


3.5 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes. The problem has ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <:: [outlook = overcast]
Play <:: [outlook = sunny] & [humidity <= 75]
Play <:: [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute for the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all the other criteria did.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for a class C, all ambiguous examples belonging to C are considered as examples that belong to C only.

The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when it was applied to both the original examples and the rules learned from those examples. It was clear that neither the importance score nor the value distribution criterion would perform better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria were applied to the learned rules, they provided very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the maximum balanced appearance of its values in different rules. The dominance criterion prefers attributes that appear in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.

3.6 Decision Structures vs. Decision Trees

This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.


Table 3-7: A comparison between Decision Structures and Decision Trees.

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11, which are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes; this decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.


Figure 3-11: Decision structures learned by AQDT-2 using different criteria: a) using the disjointness criterion (root: x5; P = Positive, N = Negative; 5 nodes); b) using the importance score criterion (root: x1; 7 nodes, 9 leaves).

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called Imam's example, that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example rests on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class P and "-" means the example belongs to class N. Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples per value of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

Figure 3-12: Imam's example, where learning decision structures (trees) from rules is better than learning them from examples: a) the training examples; b) the optimal decision tree.

AQ15c learned the following rules from this data:

P <:: [x1=1][x2=1] v [x1=2][x2=2]
N <:: [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
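The failure mode can be reproduced with a small XOR-style dataset in the spirit of Figure 3-12 (a hedged reconstruction, not the exact training data): the relevant attributes x1 and x2 get zero information gain, while an irrelevant attribute x3 with a skewed value distribution gets a positive gain:

```python
# XOR-style dataset: class is P iff x1 = x2, 12 examples per class.
# x3 is irrelevant but its values are unevenly distributed per class.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    labels = [e["class"] for e in examples]
    gain = entropy(labels)
    for v in {e[attr] for e in examples}:
        subset = [e["class"] for e in examples if e[attr] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

examples = []
for x1 in (1, 2):
    for x2 in (1, 2):
        cls = "P" if x1 == x2 else "N"
        for i in range(6):
            x3 = 1 if (cls == "P") == (i < 4) else 2   # skewed, irrelevant
            examples.append({"x1": x1, "x2": x2, "x3": x3, "class": cls})

print(round(info_gain(examples, "x1"), 3))   # 0.0 (relevant, zero gain)
print(round(info_gain(examples, "x2"), 3))   # 0.0
print(info_gain(examples, "x3") > 0)         # True (irrelevant, positive gain)
```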

An example of a class of problems in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <:: [x1=2] v [x2=2]
N <:: [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem, learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing

the number of nodes in the decision tree to the number of examples (10 to 9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and .85 for the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using the new attribute "x1=2 v x2=2" with values 0 for "no" and 1 for "yes".

Figure 3-13: An example where decision rules are simpler than decision trees: a) the training data; b) the correct decision tree.

CHAPTER 4 Empirical Analysis and Comparative Study

This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators for 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
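The protocol can be sketched as follows; the learner and evaluation function below are trivial stand-ins (AQ15c and AQDT-2 are not reimplemented here):

```python
# For each relative training size, draw random samples and test each
# learned description on the complement of its training sample.

import random

def learning_curve(examples, learner, evaluate, fractions, samples=100):
    results = {}
    for frac in fractions:
        k = int(len(examples) * frac)
        accs = []
        for _ in range(samples):
            train = random.sample(examples, k)
            test = [e for e in examples if e not in train]   # complement
            accs.append(evaluate(learner(train), test))
        results[frac] = sum(accs) / len(accs)                # mean accuracy
    return results

random.seed(0)
data = [(i, "P" if i % 2 else "N") for i in range(20)]
majority = lambda train: max(("P", "N"), key=[l for _, l in train].count)
accuracy = lambda cls, test: sum(l == cls for _, l in test) / len(test)
curve = learning_curve(data, majority, accuracy, [0.1, 0.5, 0.9], samples=20)
print(sorted(curve))   # [0.1, 0.5, 0.9]
```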


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushrooms, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (best path from top to bottom) in terms of accuracy, time and complexity were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.

Figure 4-1: Design of a complete experiment.


For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. The analysis of some experiments included visualization of the training examples and the concepts learned by each learning program (AQ15c, AQDT-2 and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database), 9 different relative sample sizes of training examples are selected (10%, ..., 90%). 100 random samples of each size are drawn from the original data for training, and the 100 sets which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).

162 different parametrical experiments per training dataset (18 x 9)
16,200 experiments / one sample size (9 samples)
145,800 experiments / first portion of a problem (AQ15c followed by AQDT-2)
199,800 experiments / problem (first portion + C4.5 + constructive induction)
999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
73 days (estimated running time)

The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection after that describes a partial or full experimental analysis of one of the other problems.

4.2 Experiments with Average-Size, Complex, and Noise-Free Problems: Wind Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision


structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

Decision class C1:
1. [x1=1][x6=1][x2=1..2][x3=1..2][x4=1..3][x5=1..2][x7=1..3] (t: 18, u: 18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1..3][x7=1,3,4] (t: 3, u: 3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2..3] (t: 2, u: 2)
4. [x1=1][x6=1][x2=2][x3=1..2][x4=3][x5=1..2][x7=4] (t: 2, u: 2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1..2] (t: 2, u: 2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1..3][x7=1..3][x5=3] (t: 2, u: 2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1..2][x4=3][x7=4] (t: 2, u: 2)

Decision class C2:
1. [x1=2..4][x2=1..2][x3=1..2][x4=3][x5=2..3][x6=1][x7=2..3] (t: 28, u: 19)
2. [x1=2..4][x2=2][x3=1..2][x4=3][x5=1..2][x6=1][x7=3..4] (t: 17, u: 6)
3. [x1=2..4][x2=1..2][x3=1..2][x4=3][x5=1][x6=1][x7=3..4] (t: 10, u: 4)
4. [x1=1,3,5][x2=1..2][x3=1..2][x4=3][x5=3][x6=1][x7=2..4] (t: 10, u: 2)
5. [x1=3..5][x2=1..2][x3=1..2][x4=3][x5=2..3][x6=1][x7=1..4] (t: 9, u: 4)
6. [x1=2][x2=1..2][x3=1..2][x5=1..3][x4=1][x6=1][x7=1] (t: 7, u: 6)
7. [x1=3..4][x2=2][x3=2][x4=1..3][x5=1..3][x6=1][x7=1..2] (t: 6, u: 4)
8. [x1=3..5][x2=2][x3=1][x7=1][x4=1..2][x5=1..3][x6=1..3] (t: 5, u: 5)
9. [x1=1][x2=1][x6=1][x3=1..2][x4=3][x5=1..2][x7=4] (t: 4, u: 4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1..2][x7=1..3] (t: 4, u: 4)
11. [x1=1..2][x2=1][x6=1][x3=1..2][x4=1..3][x5=3][x7=1..4] (t: 4, u: 2)

Decision class C3:
1. [x1=2..5][x2=1..2][x3=1..2][x7=1..4][x4=1..2][x5=1..3][x6=2..4] (t: 41, u: 32)
2. [x1=1..4][x2=1..2][x3=1..2][x4=2][x5=2][x6=2..3][x7=2..4] (t: 27, u: 20)
3. [x1=1..3][x2=1][x3=1..2][x7=1..4][x4=2][x5=1..2][x6=2..3] (t: 19, u: 6)
4. [x1=1,2,4][x2=1..2][x3=1..2][x4=2][x5=2..3][x6=3..4][x7=1] (t: 13, u: 8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1..2][x6=3][x7=2..4] (t: 5, u: 5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1..3][x5=1][x6=1][x7=1..4] (t: 4, u: 4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t: 1, u: 1)

Figure 4-2: Decision rules determined by AQ15c from the wind bracing data.


Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by the values of x6 (in general, it could be groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned subsets of the rules containing these values. The process repeats for a branch until all rules assigned to each branch are of the same class. That class is then assigned to the leaf.

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 is ended by a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated only for those rules which contain [x6=1] as one of their conjunctions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.

For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.

The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly-selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some misclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatches.

Figure 4-3: A decision tree (root: x6) learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves).

Figure 4-4 shows a decision structure learned, with the default setting of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples results in 102 examples matched correctly and 13 examples mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the "?" (indefinite) decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.

Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves).

Figure 4-5: A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves).

Figure 4-6 presents the decision structure from Figure 4-5 in which the leaves were assigned candidate decisions with decision class probability estimates. Let us consider the node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (3-13), the probability estimates for classes C1, C2, C3 and C4 under the node x2 can be approximated as P(C1)=.66, P(C2)=.23, P(C3)=0 and P(C4)=.11.

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that the rules whose combined t-weight represented 10% or less coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).

Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves).

Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves).

To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves; the predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves; the predictive accuracy of this decision structure was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, then when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram indicates different decision classes by different shades. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4 or x7).


Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.

Figure 4-8: Diagrammatic visualization of the decision trees learned for different decision-making situations for the wind bracing data ("?" means the system cannot produce a decision without the missing attribute).

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and <Chr, Intr, 1>; see Table 4-2) were selected for experiments with Subsystem II.
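The 18 settings arise from crossing the three parameter dimensions named above. As an illustrative sketch (the labels are hypothetical shorthand, not AQ15c's actual option names), the grid can be enumerated as:

```python
from itertools import product

# Hypothetical labels for the three AQ15c parameter dimensions named above;
# crossing them yields the 2 x 3 x 3 = 18 settings used in the experiments.
rule_types = ["characteristic", "discriminant"]
coverage_modes = ["intersecting", "disjoint", "ordered"]  # ordered = decision lists
beam_widths = [1, 5, 10]

settings = list(product(rule_types, coverage_modes, beam_widths))
print(len(settings))  # 18
```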

These experiments were performed for four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value in this table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training example set.
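The evaluation protocol just described (averaging over 100 random training samples, each tested on its complement) can be sketched as follows; `learn` and `classify` are hypothetical stand-ins for the learning program and its classifier, not the actual AQ15c/AQDT-2 interfaces:

```python
import random

def average_accuracy(examples, train_fraction, learn, classify,
                     runs=100, seed=0):
    """Average accuracy over `runs` random splits: train on a random sample
    of the given relative size, test on its complement, as in the
    experiments above. `learn` and `classify` are hypothetical stand-ins
    for the learning program and its classifier."""
    rng = random.Random(seed)
    n_train = int(train_fraction * len(examples))
    accuracies = []
    for _ in range(runs):
        train = rng.sample(examples, n_train)
        train_set = set(train)
        test = [e for e in examples if e not in train_set]
        model = learn(train)
        correct = sum(1 for x, y in test if classify(model, x) == y)
        accuracies.append(correct / len(test))
    return sum(accuracies) / runs
```

Note that because the test set is always the complement of the training sample, larger training fractions are evaluated on fewer examples, a point that matters when interpreting the curves later in this chapter.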


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Experiments with Subsystem II: In these experiments the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is a ratio comparing the numbers of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
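The generalization degree can be read as a stopping criterion: a node is closed as a leaf when the rules of one class sufficiently dominate its coverage. The following is a minimal sketch of that reading; the exact AQDT-2 formula is not reproduced in the text, so this is an illustration, not the algorithm's actual implementation:

```python
def can_become_leaf(class_coverage, generalization_degree=0.10):
    """Hypothetical reading of the generalization-degree parameter: close a
    node as a leaf when the examples covered by rules of the minority
    classes amount to no more than the given ratio of the majority-class
    coverage. `class_coverage` maps decision class -> number of training
    examples covered by that class's rules at the node."""
    total = sum(class_coverage.values())
    if total == 0:
        return True, None
    majority = max(class_coverage, key=class_coverage.get)
    minority = total - class_coverage[majority]
    if minority <= generalization_degree * class_coverage[majority]:
        return True, majority   # prune: assign the majority class
    return False, None          # keep expanding with another attribute
```

Under this reading, lowering the degree from 10% to 3% makes pruning stricter, which matches the experimental observation below that a 3% setting often improves accuracy.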

Figure 4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; axes: predictive accuracy vs. relative sample size (%) of the training data)

Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained with the default settings of AQDT-2: the default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.

Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.

Figure 4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data (predictive accuracy vs. relative sample size (%) of the training data)

Figure 4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data (predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%))

4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1

This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no).


The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained using the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 attribute selection criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.

Figure 4-12 A visualization diagram of the MONK-1 problem

The AQDT-2 program, running in its default mode with the optimality criterion set to minimize the number of nodes (i.e., with the disjointness criterion ranked first), produced a decision tree with 41 nodes. For comparison, the C4.5 program for learning decision trees from examples was also applied to this same problem.

Positive rules:
1. [x5 = 1]
2. [x1 = 3][x2 = 3]
3. [x1 = 2][x2 = 2]
4. [x1 = 1][x2 = 1]

Negative rules:
1. [x1 = 1][x2 = 2 v 3][x5 = 2..4]
2. [x1 = 2][x2 = 1 v 3][x5 = 2..4]
3. [x1 = 3][x2 = 1 v 2][x5 = 2..4]

Figure 4-13 Decision rules learned by AQ15c for the MONK-1 problem

The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These rules were:

Pos <= [x5 = 1] v [x1 = x2]    and    Neg <= [x5 ≠ 1] & [x1 ≠ x2]
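The compactness gain can be seen directly: the four positive AQ15c rules of Figure 4-13 collapse to the single condition expressed by AQ17-DCI's constructed attribute. A small sketch checking that equivalence over all value combinations (integer value codes are an assumption for illustration; only x1, x2, and x5 affect the class):

```python
from itertools import product

def pos_aq15c(x1, x2, x5):
    # The four positive AQ15c rules from Figure 4-13.
    return (x5 == 1 or (x1 == 3 and x2 == 3)
            or (x1 == 2 and x2 == 2) or (x1 == 1 and x2 == 1))

def pos_aq17(x1, x2, x5):
    # The compact AQ17-DCI rule: Pos <= [x5=1] v [x1=x2].
    return x5 == 1 or x1 == x2

# The two descriptions agree on every relevant value combination
# (x1, x2 in {1,2,3}; x5 in {1,2,3,4}); x3, x4, x6 are irrelevant.
assert all(pos_aq15c(a, b, e) == pos_aq17(a, b, e)
           for a, b, e in product([1, 2, 3], [1, 2, 3], [1, 2, 3, 4]))
```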

Table 4-3 Evaluation of the attribute selection criteria for the MONK-1 problem

From these rules, the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15-a).

Figure 4-14 The decision tree for the MONK-1 problem generated by AQDT-2 (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative)

Figure 4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem: a) compact decision structure for the AQ15 rules (5 nodes, 7 leaves); b) compact decision structure for the AQ17 rules (2 nodes, 3 leaves); P = Positive, N = Negative

Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and <Chr, Intr, 1>) were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.

Each value in the table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing example set that represents the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.

Figure 4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; axes: predictive accuracy vs. relative sample size (%) of the training data)

Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained with the default settings of AQDT-2: the default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.

Figure 4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data (predictive accuracy vs. relative sample size (%) of the training data, for <Disj, Char> and <Intr, Char> rules)

Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.

Figure 4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem (predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%))

4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using its original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.

Figure 4-19 A visualization diagram of the MONK-2 problem


Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Chr, Disj, 10> and <Chr, Intr, 1>). They were selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in the table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training examples. Figure 4-20 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Figure 4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; axes: predictive accuracy vs. relative sample size (%) of the training data)

Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained with the default settings of AQDT-2: the default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.

Figure 4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data (predictive accuracy vs. relative sample size (%) of the training data, for <Disj, Char> and <Intr, Char> rules)

Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.

Figure 4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem (predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%))

4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples, i.e., examples that are assigned the wrong decision class.

Figure 4-23 A visualization diagram of the MONK-3 problem

Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in the table is an average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.


Experiments with Subsystem II: In these experiments the parameters of Subsystem I (the learning process) were fixed and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.

Figure 4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; axes: predictive accuracy vs. relative sample size (%) of the training data)

Figure 4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data (predictive accuracy vs. relative sample size (%) of the training data)

Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The drop in predictive accuracy at some sample sizes is due to the fact that the testing data is not fixed for each sample: one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
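The point about incomparable error rates is simple arithmetic; the sketch below uses a hypothetical pool of 100 examples to make the contrast concrete:

```python
def one_error_rate(pool_size, test_fraction):
    """Error rate contributed by a single misclassified example when the
    test set is the complement of the training sample (pool_size is a
    hypothetical figure for illustration)."""
    test_size = round(test_fraction * pool_size)
    return 1.0 / test_size

# With a pool of 100 examples, one error in a 90%-of-the-data test set
# weighs far less than the same error in a 10% test set.
print(one_error_rate(100, 0.9))  # 1/90, about 1.1%
print(one_error_rate(100, 0.1))  # 1/10 = 10%
```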

Figure 4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem (predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%))


4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3) Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).

In this experiment the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are based on the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.

The drop in predictive accuracy at some sample sizes is due to the fact that the testing data is not fixed for each sample: one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem (predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%))

4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible and poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms and consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes: 1) cap-shape, 2) cap-surface, 3) cap-color, 4) bruises, 5) odor, 6) gill-attachment, 7) gill-spacing, 8) gill-size, 9) gill-color, 10) stalk-shape, 11) stalk-root, 12) stalk-surface-above-ring, 13) stalk-surface-below-ring, 14) stalk-color-above-ring, 15) stalk-color-below-ring, 16) veil-type, 17) veil-color, 18) ring-number, 19) ring-type, 20) spore-print-color, 21) population, and 22) habitat.

To perform the experiment, the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.

In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning times are about the same. The drop in predictive accuracy at some sample sizes is due to the fact that the testing data is not fixed for each sample: one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem (predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%))

4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains

Learning task-oriented decision structures from structural data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes, eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different lengths). To describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To encode the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (i, j), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
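The two-digit coding can be sketched as follows; the attribute numbers and example values are hypothetical, and only the (i, j) labeling scheme comes from the text:

```python
def encode_train(cars):
    """Encode a variable-length train (a list of per-car dicts mapping
    attribute number j in 1..8 to its value) into the flat attribute-value
    form x{i}{j} described above, where i is the car's position (1..4)."""
    example = {}
    for i, car in enumerate(cars, start=1):
        for j, value in sorted(car.items()):
            example[f"x{i}{j}"] = value
    return example

# A two-car train; attribute 2 is assumed here to be the car shape.
train = [{2: "rectangle", 7: 0}, {2: "u_shaped", 7: 1}]
print(encode_train(train))
# {'x12': 'rectangle', 'x17': 0, 'x22': 'u_shaped', 'x27': 1}
```

This flattening is what lets AQDT-2 treat examples of different lengths uniformly: a three-car train simply contributes attributes x31 through x38 that a two-car train lacks.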

Table 4-7 The set of attributes and their values used in the trains problem (i stands for the car number, 1..4)

Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains; it correctly classified 18 of the trains. Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three or more cars (14/14). In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).

Figure 4-29 Decision structures learned by AQDT-2 for different decision-making situations: a) decision structure learned using only descriptions of Car 1; b) decision structure learned using only descriptions of Car 2 (4 nodes, 9 leaves); c) decision structure learned using only descriptions of Car 3 (6 leaves)

4.9 Experiments with Small-Size, Simple, and Noisy Problems: Congressional

Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There

were two decision classes and a total of 216 examples. The experiments tested the change in the

number of nodes and the predictive accuracy when varying the number of training examples used for

generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two

window options: the default option (the maximum of 20% of the number of examples and twice the

square root of the number of examples) and a 100% window size (one trial per setting). In

the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%,

24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of

the examples were in one class and the second half in the other class).
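The default window size quoted above can be computed directly. The sketch below implements the rule as stated in the text (the larger of 20% of the training examples and twice the square root of their number, capped at the number of examples); the function name is ours, not C4.5's.

```python
import math

def default_window_size(n_examples):
    # Larger of 20% of the examples and 2*sqrt(n), capped at n itself.
    return min(n_examples, max(round(0.2 * n_examples),
                               round(2 * math.sqrt(n_examples))))

# For the 216-example Congressional Voting data:
print(default_window_size(216))  # 43 examples in the initial window
```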


Table 4-8 and Figures 4-30-a and b show the results for the Congressional Voting-1984

problem. The results indicate that the decision trees generated by AQDT-2 had higher predictive

accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of

AQDT-2's trees with the change in the size of the training example set was smaller.

Table 4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data

a) Accuracy of the decision tree as a function of the size of the set of training examples; b) Size of the decision tree as a function of the size of the set of training examples

Figure 4-30 Comparing decision trees for the Congressional Voting-84 data learned by C4.5 & AQDT-2

4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis

covers the relationship between different characteristics of the input data and the learning

parameters for both subfunctions of the approach. A set of visualization diagrams is used to


illustrate the relationship between the concepts represented by decision rules and the concepts represented

by the decision tree learned from these rules. This section also includes some examples of

describing different decision-making situations and the task-oriented decision structures learned for

each situation.

Table 4-9 shows the best parameter settings for learning decision rules with different databases.

The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2

from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5,

and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the

difference in predictive accuracy between two widths of the beam search is less than 2%, then the

smaller width is better. Another was: if the predictive accuracy of different types of covers varies

(i.e., for one type of cover it is higher with some widths of the beam search or with a certain

rule type, and lower with others, than for another type of cover), the best cover is determined

according to the best width of the beam search and the best rule type.

Table 4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics

It was clear that AQDT-2 works better with characteristic rules than with discriminant ones. In most

problems, when changing the width of the beam search of the AQ15c system, the changes in the

predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better

than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting

rules were slightly bigger than those learned from disjoint rules.

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of

heuristics was used to summarize the results. These heuristics are: 1) if the difference between

the average predictive accuracies of the two systems is within ±2%, the predictive accuracy is

considered to be the same; otherwise, one predictive accuracy is considered higher and the other

lower; 2) if the average learning times are within ±0.1 seconds of each other, the learning time is considered the same.

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary

includes comparisons of the predictive accuracy, the size of the learned decision trees, and the learning

time. The value in each cell refers to the system that performed better (possible values are AQDT-2,

C4.5, and Same). When the two systems produced similar or close results, a letter is associated

with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).
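The two heuristics and the cell labels of Table 4-10 can be combined into one small routine. This is only a sketch of the summarization rules as stated above (±2% for accuracy, ±0.1 s for learning time, with "Same(A)"/"Same(C)" marking the slight edge); the function name and label spellings are ours.

```python
def summarize(aqdt_value, c45_value, higher_is_better=True, tolerance=2.0):
    # Label which system performed better; within the tolerance the
    # result is "Same", annotated with the system holding the slight edge.
    diff = aqdt_value - c45_value
    if not higher_is_better:
        diff = -diff
    if abs(aqdt_value - c45_value) <= tolerance:
        if diff > 0:
            return "Same(A)"
        if diff < 0:
            return "Same(C)"
        return "Same"
    return "AQDT-2" if diff > 0 else "C4.5"

# Predictive accuracy (percent): within +/-2% counts as the same.
print(summarize(94.1, 93.0))  # Same(A)
# Learning time (seconds): lower is better; +/-0.1 s counts as the same.
print(summarize(0.5, 0.9, higher_is_better=False, tolerance=0.1))  # AQDT-2
```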

Table 4-10 Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system that performed better. Same(X) means similar performance of both systems; AQDT-2 is slightly better if X=A, and C4.5 is slightly better if X=C.

Some conclusions can be drawn from this comparison. When the training data represents a small

portion of the representation space, AQDT-2 produces bigger but more accurate decision trees,

while C4.5 produces smaller but less accurate ones. When the training data

represents a very large portion of the representation space, AQDT-2 usually produces smaller

decision trees with better accuracy, except with noisy data. The size of decision trees learned by

C4.5 grows relatively faster as the training data increases. Also, C4.5 works better than AQDT-2

with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules while

C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be

much less than that of C4.5. However, on some data sets it takes more time, because there are

situations in which there is not enough information to reach a decision and the program goes into a loop of

testing all attributes. The probabilistic approach for handling this problem is not implemented yet.

To explain the relationship between the input to and the output from AQDT-2, and to explain some

of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of

diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2

system. The experiment contains 169 training examples covering both the positive and negative decision

classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The

shaded areas represent decision rules of the positive decision class; the white areas represent

non-positive coverage.

Figure 4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem
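For reference, the MONK-2 target concept behind these diagrams can be stated compactly. In the standard formulation of the benchmark (Thrun, Mitchell & Cheng, 1991), an example is positive exactly when two of its six attributes take their first value; the tuple encoding below (attribute values numbered from 1) is an assumption of this sketch.

```python
def monk2_positive(example):
    # MONK-2: exactly two of the six attributes equal their first value (1).
    return sum(1 for v in example if v == 1) == 2

# Attributes x1..x6 have domain sizes (3, 3, 2, 3, 4, 2) in the benchmark.
print(monk2_positive((1, 2, 1, 3, 2, 2)))  # True: x1 and x3 are 1
print(monk2_positive((1, 1, 1, 2, 1, 1)))  # False: five attributes are 1
```

The "exactly two of six" condition is symmetric in all attributes, which is why it is hard to capture compactly with attribute-by-attribute rules or tree splits, and why the diagrams that follow show scattered positive regions.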


Figure 4-32 shows the testing results of the decision rules learned by AQ15c. The marked shaded

cells indicate false positive errors (AQ15c classified the cell as positive while it should be

negative), and the marked non-shaded cells indicate false negative errors (AQ15c classified the

cell as negative while it should be positive).

Figure 4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this

diagram, one shading marks portions of the representation space that were classified as

positive by both AQ15c and AQDT-2; another marks portions of the representation

space that were classified as positive by AQ15c but as negative by AQDT-2; and a third

marks portions of the representation space where AQDT-2 over-generalized decision rules

belonging to the positive decision class. The decision tree shown in this diagram was learned with

default settings (i.e., with a 10% generalization threshold).


This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the

MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy.

This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.

Figure 4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem

Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative

errors: one marking indicates portions of the representation space with false positive errors, and

another marks portions of the representation space with false negative errors. Comparing

Figures 4-34 and 4-32 shows that more errors occurred because of the over-generalization.


Figure 4-34 A visualization diagram showing the testing errors of the AQDT-2 decision tree

Figure 4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%
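The effect of varying the generalization degree between Figures 4-33 and 4-35 can be approximated by a generic pre-pruning sketch. This is not the AQDT-2 implementation, only one common interpretation of such a threshold: a node becomes a leaf when its majority class already covers all but the threshold fraction of the training examples reaching it.

```python
from collections import Counter

def generalize_to_leaf(class_counts, threshold=0.10):
    # Return the majority class if it covers at least (1 - threshold)
    # of the examples at this node; otherwise None (keep splitting).
    total = sum(class_counts.values())
    label, count = Counter(class_counts).most_common(1)[0]
    return label if count / total >= 1.0 - threshold else None

# With a 10% threshold, a 92%-pure node is generalized to a leaf ...
print(generalize_to_leaf({"positive": 92, "negative": 8}, 0.10))  # positive
# ... but with a 1% threshold the same node keeps being split.
print(generalize_to_leaf({"positive": 92, "negative": 8}, 0.01))  # None
```

Under this reading, a larger threshold produces smaller, more general trees, which helps on MONK-1 but, as the diagrams show, hurts accuracy on the symmetric MONK-2 concept.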

CHAPTER 5 CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for

efficiently determining a single-parent structure from a set of decision rules. A decision structure is

an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a

given object or situation. Having higher expressive power than the familiar decision tree, a

decision structure is able to represent some decision processes in a much simpler way than a

decision tree.
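The definition above can be made concrete with a minimal data sketch: a node tests an attribute and maps each of its values either to a child node or directly to a decision, so a leaf may simply be a decision label. The representation below is ours, chosen only to illustrate the definition, not AQDT-2's internal format.

```python
class Node:
    def __init__(self, attribute, branches):
        self.attribute = attribute  # attribute tested at this node
        self.branches = branches    # value -> child Node or decision (str)

def classify(node, example):
    # Follow the conditional order of tests until a decision is reached.
    while isinstance(node, Node):
        node = node.branches[example[node.attribute]]
    return node

# A tiny structure: test x1 first; test x2 only when x1 == "a".
structure = Node("x1", {
    "a": Node("x2", {"low": "class1", "high": "class2"}),
    "b": "class2",
})
print(classify(structure, {"x1": "a", "x2": "low"}))  # class1
print(classify(structure, {"x1": "b"}))               # class2
```

In a single-parent structure (as generated by the current method) every node has one parent, exactly as in this sketch; a full-fledged decision structure would additionally let several branches share one child node.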

The proposed methodology advocates storing the decision knowledge in the declarative form of

decision rules, which are determined by induction from examples or obtained from an expert. A decision

structure is generated on line, when it is needed, and in the form most suitable for the given

decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this

methodology: in order to determine a decision structure from examples, it is necessary to go

through two levels of processing, while there exist methods that produce decision trees efficiently

and directly from examples. Putting aside the issue that decision structures are more general than

decision trees, it is argued here that this methodology has many advantages that fully justify it. The

main advantages include: 1) the decision structures produced by the method in the experiments

conducted had higher predictive accuracy and were simpler (sometimes significantly so) than

decision trees produced from the same data; 2) decision structures produced from rules can be

easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive

attributes or can place them in the lowest parts of the structure; 3) by storing decision knowledge in

the declarative form of modular decision rules, the methodology makes it easy to modify the decision

knowledge to account for new facts or changing conditions; 4) the process of deriving a decision

structure from a set of rules is very fast and efficient, because the number of rules per class is


usually much smaller than the number of examples per class; and 5) the presented method produces

decision structures whose nodes can be original attributes or constructed attributes that extend the

original knowledge representation (this is due to the application of the constructive induction programs

AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate

decision rules first and then create decision structures from them. In the AQDT-2 method, this first

phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based

methods were computationally complex, the most recent implementation is very fast

(Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.

The current method has a number of limitations, and several issues need to be investigated further.

First of all, there is a need for further testing of the method. Although the experiments conducted so

far have produced more accurate and simpler decision structures than the decision trees obtained in a

standard way from the same input data, more experiments are necessary to arrive at conclusive

results. A mathematical analysis of the method has not been performed and is highly desirable.

The current method generates only single-parent decision structures (every node has only one

parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in

which a node can have several parents) will make it more powerful. It will enable the method to

represent much more simply those decision processes that are difficult to represent by a decision tree

(e.g., a symmetric logical function). The decision structures produced by the method are usually

more general than the decision rules from which they were created (they may assign decisions to

cases that the rules could not classify). Further research is needed to determine the relationship

between the certainty of decision rules and the certainty of decision structures derived from them.

The AQ-based program allows a user to generate both characteristic and discriminant decision rules

(Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating

decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for

efficiently determining a single-parent structure from a set of decision rules. A major advantage of

the proposed method is that it allows one to efficiently determine a decision structure that is

optimized for any given decision-making situation. For example, when some attribute is difficult

to measure, the method creates a decision structure that shows the situations in which measuring

this attribute can be avoided. The method is quite efficient, and the time for determining a decision

structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to

experiment with different criteria for structure generation in order to obtain the most desirable

structure.

Another advantage of the AQDT-2 method is that the decision structures obtained this way tend to be

simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e.,

directly from examples. In the experiments involving artificial problems and real-world problems,

AQDT-2-generated decision structures outperformed those generated by the well-known C4.5

decision tree learning program on most problems, both in terms of average predictive accuracy and

average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the

method is independent of the rule learning step, it could potentially be applied also with other decision rule learning

systems, or with decision rules acquired from an expert.

REFERENCES


Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.

Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.

Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.

Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer Verlag.

Imam, I.F. and Michalski, R.S. (1993b), "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. and Zemankova, M. (Eds.), Kluwer Academic Pub., MA.

Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington DC, July 11-12.

Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.

Imam, I.F. and Michalski, R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.

Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi, R. and Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), "International East-West Challenge," Oxford University, UK.

Michalski, R.S. (1973), "AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.

Michalski, R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.

Michalski, R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).

Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.

Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.

Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.

Niblett, T. and Bratko, I. (1986), "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.

Quinlan, J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan, J.R. (1983), "Learning Efficient Classification Procedures and Their Application to Chess End Games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.

Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.

Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).

Quinlan, J.R. (1990), "Probabilistic Decision Trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI 90, Stockholm, August.

Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.

Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.

Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in

Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He

received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, and conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence: two solutions obtained by that program ranked second and third in one competition,

and two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium FLAIRS-95 and the program committee of the Florida Artificial Intelligence Research Symposium FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interest focuses on the areas of machine learning, intelligent agents and adaptive

systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.


Congressional Voting Records (1984) 87

4.10 Analysis of the Results 88

CHAPTER 5 CONCLUSIONS 95

5.1 Summary 95

5.2 Contributions 96

REFERENCES 98

VITA 102


LIST OF TABLES

No TITLE Page

2-1 An example of a decision table 9

2-2 A set of training examples used to illustrate the C4.5 system 15

2-3 The frequencies of different attribute values for different decision classes 17

2-4 The expected values of the frequencies of examples in Table 2-3 17

2-5 Attribute selection criteria and their basic evaluation measures 17

2-6 The contingency tables of Mingers' example 18

2-7 Mingers' results for determining the goodness of split 19

2-8 Mingers' results for comparing the total accuracy and size of decision trees

provided by different attribute selection criteria from four problems 19

2-9 Comparing the AQDT approach with EDAG and HOODG approaches 22

3-1 The available tools and the factors that affect the process of testing software 43

3-2 Calculating the disjointness of each attribute 44

3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51

3-4 The data used in Mingers' first experiments 52

3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52

3-6 The possible ranking domains and conditions of use of the AQDT-2 criteria 53

3-7 Comparison between Decision Structures and Decision Trees 54

4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62

4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67

4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71

4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73

4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77

4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81

4-7 The set of attributes and their values used in the trains problem 86

4-8 A tabular summary of the predictive accuracy of decision trees obtained

by AQDT-2 and C4.5 for the Congressional Voting data 88

4-9 Summary of the best parameter settings for the first subfunction of the approach

with different data characteristics 89

4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90


LIST OF FIGURES

No TITLE Page

2-1 An example to illustrate how attributes break rules 8

2-2 A decision tree learned from the decision table in Table 2-1 10

2-3 A decision tree learned using the gain criterion for selecting attributes 15

2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21

3-1 Architecture of the AQDT approach 24

3-2 A ruleset generated by AQ15 for the concept Voting pattern of

Democratic Representatives 27

3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33

3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33

3-5 Decision trees showing the maximum number of non-leaf nodes 41

3-6 Decision rules for selecting the best tool for testing software 43

3-7 A decision structure learned for classifying software testing tools 45

3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46

3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47

3-10 A decision tree learned without the cost attribute 47

3-11 Decision structures learned by AQDT-2 using different criteria 55

3-12 Imam's example: a case where learning decision structures (trees)

from rules is better than learning them from examples 56

3-13 An example where decision rules are simpler than decision trees 57

4-1 Design of a complete experiment 59


4-2 Decision rules determined by AQ15c from the wind bracing data 61

4-3 A decision tree learned by C4.5 for the wind bracing data 63

4-4 A decision structure learned from AQ15c wind bracing rules 64

4-5 A decision structure that does not contain attribute x1 64

4-6 A decision structure without x1, with candidate decisions assigned to leaves 65

4-7 A decision structure determined from rules in Figure 4-4 under the assumption of a 10% classification error in the training data 65

4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66

4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68

4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69

4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69

4-12 A visualization diagram of the MONK-1 problem 70

4-13 Decision rules learned by AQ15c for the MONK-1 problem 71

4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72

4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72

4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74

4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75

4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75

4-19 A visualization diagram of the MONK-2 problem 76

4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78

4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79

4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79

4-23 A visualization diagram of the MONK-3 problem 80

4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82

4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82

4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83

4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84

4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85

4-29 Decision structures learned by AQDT-2 for different decision-making situations 87

4-30 Comparing decision trees for the Congressional Voting-84 data learned by C4.5 & AQDT-2 88

4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem 91

4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92

4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem 93

4-34 A visualization diagram showing the testing errors of the AQDT-2 decision tree 94

4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1% 94

DERIVING TASK-ORIENTED DECISION STRUCTURES

FROM DECISION RULES

Ibrahim M. Fahmi Imam, Ph.D.

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissenation is concerned with research on learning task-oriented decision structures from

decision rules The philosophy behind this research is that it is more appropriate to learn

knowledge and store it in a declarative form and then when a decision making situation occurs

generate from this knowledge the decision structure that is most suitable for the given decision

making situation Learning decision structures from decision rules was first introduced by

Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge-base from the function of using

the knowledge-base for decision-making The first function focuses on learning accurate

consistent and complete concept description expressed in a declarative form The second function

is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted

for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show

that decision structures learned by it usually outperform, in terms of accuracy and average size, the decision structures learned from examples by other well-known systems. The results

show also that the system does not work very well with noisy data The system is illustrated and

compared using applications to artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing Democratic and Republican voting records).

CHAPTER 1 INTRODUCTION

1.1 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge but also

to use this knowledge for decision-making The main step in the development of systems for

decision-making is the creation of a knowledge structure that characterizes the decision-making

process The form in which knowledge can be easily obtained may however differ from the form

in which it is most readily used for decision-making It is therefore important to identify the form

of knowledge representation that is most appropriate for learning (e.g., due to ease of its

modification) and the form that is most convenient for decision making

A simple and effective tool for describing decision processes is a decision structure which is a

directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to

arrive at a decision about that object The nodes of the structure are assigned individual tests

(which may correspond to a single attribute a function of attributes or a relation) the branches are

assigned possible test outcomes or ranges of outcomes and the leaves are assigned a specific

decision a set of candidate decisions with corresponding probabilities or an undetermined

decision A decision structure reduces to a familiar decision tree when each node is assigned a

single attribute and has at most one parent when the branches from each node are assigned single

values of that attribute and when leaves are assigned single definite decisions Thus the problem

of generating a decision structure is a generalization of the problem of generating a decision tree

Decision trees are typically generated from a set of examples of decisions The essential

characteristic of any such method is the attribute selection criterion used for choosing attributes to

be assigned to the nodes of the decision tree being built Such criteria include the entropy

reduction the gain and the gain ratio (Quinlan 1979 83 86) the gini index ofdiversity (Breiman

et al 1984) and others (Cestnik amp Bratko 1991 Cestnik amp Karalic 1991 Mingers 1989a)


A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine the answers to all symptoms appearing in the decision tree). Problems arise when these assumptions

do not hold For example in some situations measuring certain attributes may be difficult or costly

(eg in the doctor-patient example a brain or blood test is needed which is very expensive or the

tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the

root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far

away from the root) If an attribute cannot be measured at all it is useful to either modify the

structure so that it does not contain that attribute or-when this is impossible-to indicate

alternative candidate decisions and their probabilities A restructuring is also desirable if there is a

significant change in the frequency of occurrence of different decisions (eg in the doctor-patient

example the doctor may request a decision structure expressed in a specific set of symptoms

biased to classify one or more diseases or specify a certain order of testing)

A restructuring of a decision structure (or a tree) in order to suit new requirements can be quite

difficult This is because a decision structure is a procedural representation that imposes an

evaluation order on the tests In contrast no evaluation order is imposed by a declarative

representation such as a set of decision rules Tests (conditions) of rules can be evaluated in any

order Thus for a given set of rules one can usually build a huge number of logically equivalent

decision structures (trees) which differ in the test ordering Due to the lack of order constraints

a declarative representation (rules) is much easier to modify to adapt to different situations than a

procedural one (a decision structure or a tree) On the other hand to apply decision rules to make a

decision one needs to decide in which order tests are evaluated and thus needs a decision

structure

An attractive solution to these opposite requirements is to acquire and store knowledge in a

declarative form and transform it to a decision structure when it is needed for decision-making


This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each

rule is a generalization of a set of examples) generating a decision structure from decision rules

can potentially be performed much faster than generation from training examples. Thus this

process could be done on line without any delay noticeable to the user Such virtual decision

structures are easy to tailor to any given decision-making situation

This approach allows one to generate a decision structure that avoids or delays evaluating an

attribute that is difficult to measure in some decision-making situation or that fits well a particular

frequency distribution of decision classes In other situations it may be unnecessary to generate

a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus such an approach has many potential advantages.

This dissertation presents a new system called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15

(Michalski et al 1986) or system AQ17-DCI which has extensive constructive induction

capabilities (Bloedorn et al 1993)

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of

features including 1) enabling the system to include in the decision structure nodes corresponding

to new attributes constructed during the process of learning the decision rules 2) controlling the

degree of generalization needed during the development of the decision structure 3) providing four

new criteria for selecting an attribute to be a node in the decision structure that allow the system to

generate many different but equivalent decision structures from the same set of rules 4) generating

unknown nodes in situations when there is insufficient information for generating a complete

decision structure 5) learning decision structures from discriminant rules as well as

characteristic rules and 6) providing the most likely decision when the decision process stops

due to the inability to evaluate an attribute associated with an intermediate node.


To test the methodology of generating decision structures from decision rules, an extensive set of

planned experiments have been designed to test different aspects of the approach The experiments

include testing different combinations of parameters for each sub-function of the approach

analyzing the relationship between decision rules and the decision structures learned from them, and comparing decision trees learned by the AQDT-2 system with those learned by the well-known C4.5 (Quinlan, 1993)

system for learning decision trees from examples Different experiments were designed to examine

the new features of the AQDT approach for learning task-oriented decision structures. The

experiments were applied to artificial domains as well as real-world domains including MONK-I

MONK-2, and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al.,

1994) Engineering Design-wind bracings (Arciszewski et al 1992) Mushrooms Breast

Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The

MONKs problems are concerned with learning classification rules for robot-like figures MONK-l

requires learning a DNF-type description MONK-2 requires learning a non-DNF-type description

(one that cannot be easily described as a DNF rule using the original attributes) MONK-3 concerns

learning a DNF rule from noisy data The East-West trains dataset is a structural domain that

classifies two sets of trains (Eastbound and Westbound) The Engineering Design-wind

bracing data involves learning conditions for applying different types of wind bracing for tall

buildings The Mushrooms data is concerned with learning classification rules for distinguishing

between poisonous and non-poisonous mushrooms The Breast Cancer data involves learning

concept descriptions for recognizing breast cancer The congressional voting data includes voting

records on different issues. AQDT-2 outperformed C4.5 on average with respect to both the predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy

data or with problems that have many rules covering very few examples

1.2 The Problem Statement

There are many limitations and problems associated with using decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978) which

introduced an algorithm for generating decision trees from decision lists The method proposed

several attribute selection criteria. These criteria are of increasing power, based on the main criterion, the order cost estimate (the nth order cost estimates, n=1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree, based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.

Definition 2-1. A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2. A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint. In other words, for any two rules there exists a condition with the same attribute but with different values in each rule.

Definition 2-3. A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4. A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram the condition parts of the given rules, and marking them with the action specified by each rule.
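To make these definitions concrete, here is a small sketch that checks the pairwise disjointness of a cover (Definition 2-2). The encoding of a rule as a Python dict mapping attributes to sets of permitted values (an attribute absent from a rule is unconstrained), and the attribute domains used, are illustrative assumptions, not the dissertation's implementation.

```python
# Sketch: checking pairwise logical disjointness of a cover (Definition 2-2).
# A rule is a dict {attribute: set of allowed values}; an attribute absent
# from a rule may take any value in its domain.

DOMAINS = {"x1": {0, 1, 2}, "x2": {0, 1, 2}}

def allowed(rule, attr):
    """Values of `attr` consistent with `rule` (whole domain if unmentioned)."""
    return rule.get(attr, DOMAINS[attr])

def disjoint(r1, r2):
    """Two rules are logically disjoint if some attribute has no value
    satisfying both rules at once."""
    return any(not (allowed(r1, a) & allowed(r2, a)) for a in DOMAINS)

def is_disjoint_cover(rules):
    """A cover is disjoint if every pair of its rules is disjoint."""
    return all(disjoint(r1, r2)
               for i, r1 in enumerate(rules)
               for r2 in rules[i + 1:])

# The minimal cover used in the example of Section 2.1:
# A1 <:: [x2=0] v [x1=0][x2=2], A2 <:: [x2=1] v [x1=2][x2=2], A3 <:: [x1=1][x2=2]
cover = [{"x2": {0}}, {"x1": {0}, "x2": {2}},
         {"x2": {1}}, {"x1": {2}, "x2": {2}},
         {"x1": {1}, "x2": {2}}]
print(is_disjoint_cover(cover))  # True: every pair differs on x1 or x2
```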

Michalski's algorithm requires construction of a minimal cover. The minimal cover should be

consistent and complete The method is based on the fact that if there are n decision classes

any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal


decision tree (any consistent decision tree should have at least n leaves) Michalski (1978) has

shown that if only one rule is broken by a selected attribute then instead of having one leaf

(which could potentially represent this rule or the decision class in the tree) there will have to

be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL introduced in (Michalski 1978) prefers attributes that do

not break any rules or break as few as possible An attribute breaks a rule if the attribute can

divide the rule into two or more sub-rules Figure 2-1 shows two examples of two sets of rules

In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1], and x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1 An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first degree cost estimate, which assigns to each

attribute an integer equal to the number of rules broken by that attribute This criterion is also

called the static cost estimate of an attribute or the criterion of minimizing added leaves

(MAL)


The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the

estimated number of additional nodes in the decision tree being generated over a hypothetical

minimal decision tree When there is a tie between two attributes the attribute to be selected is

the one which breaks smaller rules (rules that cover fewer examples or more specialized

rules) AQDT-2 uses an approximate version of this criterion (the attribute dominance)

Another criterion introduced by Michalski was the DMAL criterion The DMAL criterion

(Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL but

is more complex because once an attribute is selected as a node in the tree some rules andor

parts of the broken rules at each branch are merged into one rule The DMAL ensures that the

value of the total cost estimate of an attribute is decreased by a value equal to the number of

merged rules minus one

Example: Learn a decision tree from the decision table in Table 2-1.

The minimal cover consists of the following rules

A1 <:: [x2=0] v [x1=0][x2=2]    A2 <:: [x2=1] v [x1=2][x2=2]    A3 <:: [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of


the decision tree is x2 Then three branches are attached to the root node and the decision rules

are divided into subsets, each corresponding to one branch. For x2=0 or 1, a leaf node is generated. For x2=2, another attribute is selected to be a node in the tree. In this case x1 has the

minimum MAL value Figure 2-2 shows the decision tree obtained using the MAL criterion


Figure 2-2 A decision tree learned from the decision table in Table 2-1
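The static (first-degree) cost estimate behind the MAL criterion can be sketched directly from this example. Rules are encoded as dicts mapping attributes to sets of allowed values (an illustrative assumption); the binary domains assumed for x3 and x4 are also guesses, since the example does not list them, though the rule counts are unaffected as long as each domain has more than one value.

```python
# Sketch of the MAL (static / first-degree) cost estimate: the number of
# rules an attribute breaks.  An attribute breaks a rule when more than one
# of its values is consistent with the rule, so the rule would be split
# across several branches of a node testing that attribute.

DOMAINS = {"x1": {0, 1, 2}, "x2": {0, 1, 2}, "x3": {0, 1}, "x4": {0, 1}}

# Minimal cover from the example above:
# A1 <:: [x2=0] v [x1=0][x2=2], A2 <:: [x2=1] v [x1=2][x2=2], A3 <:: [x1=1][x2=2]
RULES = [{"x2": {0}}, {"x1": {0}, "x2": {2}},
         {"x2": {1}}, {"x1": {2}, "x2": {2}},
         {"x1": {1}, "x2": {2}}]

def mal(attr, rules=RULES):
    """First-degree cost estimate: count the rules broken by `attr`."""
    return sum(1 for r in rules
               if len(r.get(attr, DOMAINS[attr])) > 1)

print({a: mal(a) for a in DOMAINS})
# {'x1': 2, 'x2': 0, 'x3': 5, 'x4': 5} -- x2 breaks no rules, so it is the root
```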

2.2 Learning Decision Trees from Examples

Decision tree learning is a field concerned with generating decision trees that classify a set

of examples according to the decision classes they belong to The essential aspect of any

inductive decision tree method is the attribute selection criterion The attribute selection

criterion measures how good the attributes are for discriminating among the given set of

decision classes The best attribute according to the selection criterion is chosen to be assigned

to a node in the tree. The first algorithm for generating decision trees from examples was

proposed by Hunt Marin and Stone (1966) Hunts algorithm uses a divide and conquer

algorithm for building decision trees This algorithm has been subsequently modified by

Quinlan (1979) and applied by many researchers to a variety of learning problems

Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria for selecting attributes

use logical relationships between the attributes and the decision classes to determine the best

attribute to be a node in the decision tree such as the MAL criterion minimizing added leaves

(Michalski 1978) which uses conjunction and disjunction operators The information-based

criteria are based on information theory. These criteria measure the information conveyed


by dividing the training examples into subsets Examples of such criteria include the

information measure 1M the entropy reduction measure and the gain criteria (Quinlan 1979

83) the gini index of diversity (Breiman et al 1984) Gain-ratio measure (Quinlan 1986) and

others (Clark amp Niblett 1987 Bratko amp Lavrac 1987 Cestnik amp Karalic 1991) The

statistics-based criteria measure the correlation between the decision classes and the attributes

These criteria use statistical distributions for determining whether or not there is a correlation

The attribute with the highest correlation is selected to be a node in the tree Examples of

statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984;

Mingers 1989a)

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the

method of learning decision trees to also handle data with noise (by pruning) Handling noise

extended the process of learning decision trees to include the creation of an initial complete

decision tree, followed by tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used

for simplifying decision trees even for problems without noise (Bohanec amp Bratko 1994)

Pruning decision trees improves their simplicity but reduces their predictive accuracy on the

training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring the probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by

the C4.5 learning system (Quinlan, 1993). C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a). The method is a statistics-based

method for selecting an attribute to be a node in the tree

2.2.1 Building Decision Trees Using Information-based Criteria


This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs for learning decision trees from examples.

Learning decision trees from examples requires a collection of examples Each example is

represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning

program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing decision trees from a set of cases (Hunt, Marin & Stone, 1966).

The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion

calculates the gain in classifying information based on the residual information needed to

classify cases in a set of training examples and the information yielded by the test based on the

relative frequencies of the possible outcomes (decision classes) The gain ratio criterion is

based on an earlier criterion used by ID3 called the Gain Criterion The Gain Criterion uses

the frequency of each decision class in the given set of training examples

Once an attribute is chosen to be a node in the tree the system generates as many links as the

number of its values and classifies the set of examples based on these values If all the

examples at a certain node belong to one decision class the system generates a leaf node and

assigns it to that class Otherwise the system searches for another attribute to be a node in the

tree

The Gain Criterion. The gain criterion is based on information theory; that is, the

information conveyed by a message depends on its probability and can be measured in bits as

minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem X1, ..., Xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:


freq(Ci, S) = the number of examples in S that belong to Ci     (2-1)

Suppose that |S| is the total number of examples in S; then the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.

The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2 (freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by

info(S) = - Σ_{i=1}^{k} (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|) bits     (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples T, info(T) determines the average amount of information needed to identify the class of an example in T.

Suppose that we selected an attribute X to be the root of the tree and suppose that X has k

possible values The training set T will be divided into k subsets each corresponding to one of

X's values. The expected information of selecting X to partition the training set T, infoX(T), can be found as the sum, over all subsets, of the information conveyed by each subset weighted by its probability:

infoX(T) = Σ_{i=1}^{k} (|Ti| / |T|) info(Ti)     (2-3)

The information gained by partitioning the training examples T into subset using the attribute

X is given by

gain(X) = info(T) - infoX(T)     (2-4)

The attribute to be selected is the attribute with maximum gain value

The Gain Ratio Criterion. This criterion indicates the proportion of information generated by

the split that appears helpful for classification Quinlan (1993) pointed out that the gain

criterion has a serious deficiency Basically it is strongly biased toward attributes with many

outcomes (values) For example for any data that contains attributes such as social security


number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to

this problem by introducing the gain ratio criterion which takes the ratio of the information that

is gained by partitioning the initial set of examples T by the attribute X to the potential

information generated by dividing T into n subsets

Following steps similar to those used to obtain the information conveyed by dividing T into n subsets, the expected information generated by dividing T into n subsets is, by analogy to equation 2-2, determined by:

split info(T) = - Σ_{i=1}^{n} (|Ti| / |T|) log2 (|Ti| / |T|)     (2-5)

The gain ratio is given by

gain ratio(X) = gain(X) / split info(X)     (2-6)

and it expresses the proportion of information generated by the split that is useful for

classification

Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the

set of training examples

First, determine the amount of information gained by selecting the attribute outlook to be the root of the decision tree. This attribute divides the training examples into three subsets: sunny, with five examples, two of which belong to the class Play; overcast, with four examples, all of which belong to the class Play; and rain, with five examples, three of

which belong to the class Play. To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes. Nine of these examples belong to the class Play and five belong to the class Don't Play.

info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.94 bits

When using outlook to divide the training examples the information becomes


info_outlook(T) = 5/14 × (- 2/5 log2 (2/5) - 3/5 log2 (3/5))
                + 4/14 × (- 4/4 log2 (4/4) - 0/4 log2 (0/4))
                + 5/14 × (- 3/5 log2 (3/5) - 2/5 log2 (2/5)) = 0.694 bits

By substituting in equation 2-4, the gain of information resulting from using the attribute outlook to split the training examples equals 0.246. The gain of information for windy is 0.048.

Figure 2-3 shows a decision tree learned for this problem using the gain criterion The split

information for outlook is determined as follows

split info(T) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits

The gain ratio for outlook = 0.246 / 1.577 = 0.156
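The weather-data arithmetic can be checked mechanically from equations 2-2 through 2-6. The sketch below recomputes info(T), the gain for outlook, the split information, and the gain ratio, using only the class counts from Quinlan's Table 2-2; the helper names are illustrative.

```python
from math import log2

def info(class_counts):
    """Equation 2-2: expected information (entropy) of a set,
    given the number of examples per decision class."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c)

# Quinlan's weather data: 9 Play vs 5 Don't Play examples overall;
# outlook splits them into sunny (2/3), overcast (4/0), rain (3/2).
T = [9, 5]
outlook_subsets = [[2, 3], [4, 0], [3, 2]]

n = sum(T)
info_T = info(T)                                             # eq. 2-2, ~0.940
info_x = sum(sum(s) / n * info(s) for s in outlook_subsets)  # eq. 2-3
gain = info_T - info_x                                       # eq. 2-4, ~0.246
split = info([sum(s) for s in outlook_subsets])              # eq. 2-5, ~1.577
gain_ratio = gain / split                                    # eq. 2-6, ~0.156
print(round(gain, 3), round(gain_ratio, 3))                  # 0.247 0.156
```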


Figure 2-3 A decision tree learned using the gain criterion for selecting attributes

The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.

Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
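The Laplace ratio is simple enough to state in two lines; the function name below is an illustrative choice, not C4.5's actual identifier.

```python
def laplace_error(n, e):
    """Laplace error estimate (e+1)/(n+2): n training examples at a leaf,
    e of them misclassified."""
    return (e + 1) / (n + 2)

# A pure leaf covering 10 examples is still assigned a nonzero error
# estimate, which penalizes subtrees supported by very few examples.
print(laplace_error(10, 0))  # ≈ 0.083
```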

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses Chi-square statistics to measure the association

between two attributes When building decision trees the method is implemented such that it

determines the association between each attribute and the decision classes The attribute to be

selected is the one with the greatest value.

To determine the Chi-square value for an attribute, let aij be the number of examples in decision class number i for which attribute A takes value number j. In other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by:

Chi-square(A) = Σ_{i=1}^{n} Σ_{j=1}^{m} [ (aij - Eij)² / Eij ]     (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,


Eij = (TCi × TVj) / T     (2-8)

where TCi and TVj are, respectively, the total number of examples belonging to decision class Ci and the total number of examples for which attribute A takes value vj; T is the total number of examples.

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequency of each combination of values between the decision class and both the Outlook and the Windy attributes. Table 2-4 shows the expected values of the frequencies in Table 2-3 for the different attribute values and decision classes.

To determine the association between the decision classes and the attributes Windy and Outlook, the observed Chi-square values are:

Chi-square (Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/5.1] + [(2-2.9)²/2.9]
= 0.21 + 0.39 + 0.25 + 0.25 = 1.1

Chi-square (Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2] + [(3-1.8)²/1.8] + [(0-1.4)²/1.4] + [(2-1.8)²/1.8]
= 0.45 + 0.75 + 0.2 + 0.8 + 1.4 + 0.02 = 3.62
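These contingency-table computations follow mechanically from equations 2-7 and 2-8. Note that the hand calculation rounds each expected value Eij to one decimal place; the sketch below keeps full precision and therefore gives slightly smaller totals (about 0.93 for Windy and 3.55 for Outlook), with the same ranking of the two attributes.

```python
def chi_square(table):
    """Equations 2-7/2-8: rows are decision classes, columns are the
    values of one attribute; expected counts come from the marginals."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    return sum((obs - exp) ** 2 / exp
               for i, row in enumerate(table)
               for j, obs in enumerate(row)
               for exp in [row_tot[i] * col_tot[j] / total])

# Observed frequencies from Quinlan's data (rows: Play, Don't Play):
windy = [[3, 6],       # windy = true / false
         [3, 2]]
outlook = [[2, 4, 3],  # sunny / overcast / rain
           [3, 0, 2]]

print(round(chi_square(windy), 2), round(chi_square(outlook), 2))  # 0.93 3.55
# Outlook shows the stronger association, so it is selected first.
```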


Applying the same method to the other attributes, the results favor the attribute Outlook. Once that attribute is selected to be a node in the tree, the remaining examples are divided into subsets, and the same process is repeated on each subset.

Table 2-5 shows a summary of these criteria and their basic evaluation function

Table 2-5 Attribute selection criteria and their basic evaluation measures

Info Measure (IM), Gain, and Gain Ratio:   Entropy(S) = - Σi (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)

G-statistic:   G = 2N × IM   (N = number of examples)

Chi-square:   Chi-square(A, B) = Σi Σj [ (aij - Eij)² / Eij ]

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria done by

Mingers (1989a) Mingers compared six attribute selection criteria that are used in decision

tree programs. These criteria are the Information Measure (IM), Chi-square, G-statistic, Gini index of diversity, Marshall correction, and Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples

(i.e., examples may belong to more than one decision class) to observe how the selected criteria

evaluate the given attributes The problem has two decision classes and two attributes X and

Y It was assumed that attribute X is better for classifying the examples than attribute Y The

training examples were unevenly spread between the two values of X Attribute Y has three

values and the examples were spread randomly among them Table 2-6 (a and b) shows the

contingency tables for both attributes Table 2-7 shows a summary of the goodness of split

provided by the six criteria Mingers noted that the measures that are not based on information


theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, a zero cell adds the maximum association between any two attributes, because the Chi-square contribution of a zero cell is the expected value of that cell.


Now let us examine results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees for eleven different criteria. In

the final results he compared the total number of nodes and the total error rate provided by

each criterion over all given problems Table 2-8 shows the final results for five selected

criteria only


Table 2-8 Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four domains

This experiment was performed on four real-world data sets. These data are concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas of Imam and Michalski (1993b).

In the first approach, Brian Gaines introduced a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. It starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary conclusion with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes that contain only rules represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees that are used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is Safe, except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is Lost, except if x6=1 it is Safe, except if x7=1 it is Lost.
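The nested "except if" reading above can be made concrete with a small interpreter. The sketch below is illustrative, not Gaines' implementation: it assumes an EDAG node is a dictionary holding an optional condition (a predicate over an example), an optional conclusion, and children that act as exceptions overriding the parent's conclusion.

```python
# Minimal sketch of evaluating an Exception Directed Acyclic Graph (EDAG).
# Node format (an assumption for illustration): {"cond": callable or absent,
# "conclusion": str or absent, "children": list of nodes}.

def evaluate_edag(node, example, default=None):
    cond = node.get("cond")
    if cond is not None and not cond(example):
        return default                      # node does not apply; keep current conclusion
    conclusion = node.get("conclusion", default)
    for child in node.get("children", []):  # children are exceptions to this conclusion
        conclusion = evaluate_edag(child, example, conclusion)
    return conclusion

# The Safe/Lost structure described in the text, encoded with hypothetical
# attribute names x1..x7.
edag = {
    "conclusion": "Safe",
    "children": [{
        "cond": lambda e: e.get("x1") == 1 and e.get("x2") == 1 and e.get("x3") == 1
                          and (e.get("x4") == 3 or e.get("x5") == 1),
        "conclusion": "Lost",
        "children": [{
            "cond": lambda e: e.get("x6") == 1,
            "conclusion": "Safe",
            "children": [{
                "cond": lambda e: e.get("x7") == 1,
                "conclusion": "Lost",
            }],
        }],
    }],
}
```

For instance, an example with x1=2 stops at the root and is classified Safe, while x1=x2=x3=1, x4=3, x6=1, x7=1 descends through all three exception levels and is classified Lost.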

The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path. In other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once. However, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets, such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.


Safe <= [x1=2]
Safe <= [x2=2]
Safe <= [x3=2]
Safe <= [x4=1] & [x5=2]
Safe <= [x4=1] & [x5=3]
Safe <= [x6=1] & [x7=2]
Safe <= [x6=1] & [x7=3]
Safe <= [x4=2] & [x5=2]
Safe <= [x4=2] & [x5=3]

Lost <= [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a combination of that attribute's values. For each subset, the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains the examples where A takes value 0 and belong to class C0, or A takes value 1 and belong to class C1; the second subset contains the examples where A takes value 0 and belong to class C1, or A takes value 1 and belong to class C0. The number of nodes at the first level (after the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute; it can increase exponentially before that number is reduced exponentially to one.

It is easy for the reader to figure out some major disadvantages of this approach: the average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., strong patterns) or logical relationship in the data; the time used to learn such a decision structure is relatively very high compared to systems for learning decision trees from examples; and, finally, it could be better to search for an attribute that reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system, which can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection, which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and these two approaches. The EDAG and HOODG systems are unreleased prototype systems.

(Table 2-9, readability row, for the three compared approaches: decision structures are easy to understand / decision structures are difficult to read / decision structures are easy to understand.)

CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions, and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.

The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description, in a declarative form of knowledge (decision rules), that satisfies the learning goal.

The Decision-making Task
Given:
- A set of decision rules in a conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of declarative knowledge are that they do not impose any order on the evaluation of the attributes, and that, due to the lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on-line.

Such virtual decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized, and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows an architecture of the proposed methodology.

Figure 3-1: Architecture of the AQDT approach (learning knowledge from the database, and the decision-making process).


It is assumed that the database is not static, but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values. Some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system, specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called AQ, starts with a "seed" example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the "star" of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description that covers the largest number of positive examples (to minimize the total number of rules needed) and, with the second priority, that involves the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few examples or with many examples, and can optimize the description according to a variety of easily-modifiable hypothesis quality criteria.
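The covering loop just described can be sketched in a few lines. This is a simplified toy version, not the actual AQ15 star-generation procedure: it assumes examples are attribute dictionaries, builds a "star" by dropping attributes from the seed (most general conjunctions first) while staying consistent with the negatives, and applies the default criterion (maximize positive coverage, then minimize the number of attributes).

```python
from itertools import combinations

def matches(rule, example):
    # A rule is a conjunction of attribute=value conditions.
    return all(example.get(a) == v for a, v in rule.items())

def star(seed, negatives):
    # Most general conjunctions of the seed's conditions that cover no negative.
    attrs = sorted(seed)
    for size in range(len(attrs) + 1):          # try fewest conditions first
        cands = [{a: seed[a] for a in subset}
                 for subset in combinations(attrs, size)]
        consistent = [r for r in cands
                      if not any(matches(r, n) for n in negatives)]
        if consistent:
            return consistent
    return [dict(seed)]                          # fall back to the seed itself

def aq_cover(positives, negatives):
    rules, uncovered = [], list(positives)
    while uncovered:
        seed = uncovered[0]
        # Default criterion: most positives covered, then fewest attributes.
        best = max(star(seed, negatives),
                   key=lambda r: (sum(matches(r, p) for p in positives),
                                  -len(r)))
        rules.append(best)
        uncovered = [p for p in uncovered if not matches(best, p)]
    return rules

# Hypothetical toy data for illustration.
pos = [{"color": "red", "size": "big"}, {"color": "red", "size": "small"}]
neg = [{"color": "blue", "size": "big"}]
rules = aq_cover(pos, neg)
```

On this toy data, a single rule on `color` suffices: it covers both positives and excludes the negative, so the loop terminates after one pass.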


The learned descriptions are represented in the form of a set of decision rules expressed in an attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski, 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.
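The two operators can be illustrated with a small condition evaluator. This sketch assumes a representation of our own choosing (not the VL1 syntax itself): a condition is a pair of an attribute and either a set of allowed values (internal disjunction) or a (low, high) tuple (range operator); a rule is a conjunction of such conditions.

```python
def satisfies(example, condition):
    attr, allowed = condition
    value = example[attr]
    if isinstance(allowed, tuple):          # range operator, e.g. Income = 2..5
        low, high = allowed
        return low <= value <= high
    return value in allowed                 # internal disjunction of values

def rule_matches(example, rule):            # a rule is a conjunction of conditions
    return all(satisfies(example, c) for c in rule)

# Roughly: [State = northeast v northwest] & [Income = 2..5]
# (attribute names and the numeric range are illustrative assumptions)
rule = [("State", {"northeast", "northwest"}), ("Income", (2, 5))]
```

A record with State = northeast and Income = 3 satisfies both conditions; changing either attribute outside its allowed set or range makes the conjunction fail.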

AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, i.e., one stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top". A characteristic description of the tables would also include properties such as "have four legs", "have no back", "have four corners", etc. Discriminant descriptions are usually much shorter than characteristic descriptions.

Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or "covers") of different decision classes. In the "IC" (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the "DC" (Disjoint Covers) mode, descriptions of different classes are logically disjoint. The DC mode descriptions are usually more complex, both in the number of rules and in the number of conditions. There is also a "DL" mode (a Decision List mode, also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order. If ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes, and selects from them the most promising ones, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the U.S. Congress. Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute State (of the Representative) should take the value northeast or northwest to satisfy the condition.

R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] & [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives".

The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record of a Democratic representative:

Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State from = northeast, State population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler corp. = not registered.

By expressing the elementary statements in the example as conditions, and linking the conditions by conjunction, the example can be re-expressed as a decision rule. Thus, decision rules and examples formally differ only in the degree of generality.

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski, 1993a,b). Also included is a description of the AQDT-2 method for learning task-oriented decision structures from decision rules; finally, the methodology is illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning, due to their simplicity. Decision trees built this way can be quite efficient, as long as they are used in decision-making situations for which they are optimized, and these situations remain relatively stable. Problems arise when these situations significantly change, and the assumptions under which the tree was built no longer hold. For example, in some situations it may be difficult to determine the value of the attribute assigned to some node. One would like to avoid measuring this attribute and still be able to classify the example, if this is potentially possible (Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also desirable if there is a significant change in the frequency of occurrence of examples from different classes. A restructuring of a decision tree to suit the above requirements is, however, difficult to do. The reason is that a decision tree is a form of decision structure representation that imposes constraints on the evaluation order of the attributes that are not logically necessary.


One problem in developing a method for generating decision structures from decision rules is to design an attribute selection criterion that is based on the properties of the rules, rather than of the training examples. A decision rule normally describes a number of possible examples; only some of them are examples that have actually been observed, i.e., training examples. An attribute selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based on counting the numbers of training examples covered by each attribute-value, and on the frequency of decision classes in the training examples, as is done in learning decision trees from examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision rules constitute a more powerful knowledge representation than decision trees. They can directly represent a description in an arbitrary disjunctive normal form, while decision trees can directly represent only descriptions in disjoint disjunctive normal form, in which all conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary decision rules into a decision tree, one faces the additional problem of handling logically intersecting rules.

The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2 system is based on earlier work by Michalski (1978), which introduced a general method for generating decision trees from decision rules. The method aimed at producing decision trees with the minimum number of nodes or the minimum cost (where the cost was defined as the total cost of classifying unknown examples, given the cost of measuring individual attributes and the expected probability distribution of examples of different decision classes). More explanations are provided in the following section.

3.3.1 The AQDT-2 attribute selection method

This section describes the AQDT-2 method for building a decision structure from decision rules. The method for building a single-parent decision structure is similar to that used in standard methods of building a decision tree from examples. The major difference is that it assigns tests (attributes) to the nodes using criteria based on the properties of the decision rules (this includes statistics about the examples covered by each rule, in the case of learning rules from examples), rather than statistics characterizing the frequency of training examples per decision class, per attribute-value, or per conjunctions of both. Other differences are that the branches may be assigned an internal disjunction of values (not only a single value, as in a typical decision tree), and that leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be attributes, or names standing for logical or mathematical expressions that involve several attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably (to distinguish between an attribute and a name standing for an expression, the latter is called a "constructed" attribute).

At each step, the method chooses, from the available set of tests, the test that has the highest utility (see below) for the given set of decision rules. This test is assigned to the node. The branches stemming from this node are assigned test values or disjoint groups of values (in the form of a logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each branch is associated with a reduced set of rules, determined by removing the conditions in which the selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset indicate the same decision class, a leaf node is created and assigned this decision class. The process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further, because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of candidate decisions with associated probabilities (see Sec. 4.2).
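The recursive loop just described can be sketched as follows. This is an illustrative skeleton only: it assumes rules are stored as (conditions, class) pairs, where conditions map an attribute to its set of allowed values (internal disjunction), and it uses a simple stand-in utility (the count of rules mentioning an attribute) in place of the full LEF-based ranking defined below; value grouping and probabilistic leaves are also omitted.

```python
def build(rules, attrs):
    # rules: list of (conditions, class); conditions: {attr: set of values}
    classes = {c for _, c in rules}
    if len(classes) == 1 or not attrs:
        return classes                      # leaf: one decision (or several candidates)
    # Stand-in utility: pick the attribute mentioned by the most rules.
    test = max(attrs, key=lambda a: sum(1 for conds, _ in rules if a in conds))
    values = set().union(*(conds.get(test, set()) for conds, _ in rules))
    branches = {}
    for v in values:
        # Keep rules consistent with this branch; drop the satisfied condition.
        reduced = [({a: s for a, s in conds.items() if a != test}, c)
                   for conds, c in rules
                   if test not in conds or v in conds[test]]
        branches[v] = build(reduced, [a for a in attrs if a != test])
    return (test, branches)

# Two toy rules: class A if x=1; class B if x=2 & y=1.
rules = [({"x": {1}}, "A"), ({"x": {2}, "y": {1}}, "B")]
tree = build(rules, ["x", "y"])
```

For the toy ruleset, the root tests x, and both branches immediately reduce to single-class leaves, so y is never measured, reflecting the point that rules, unlike trees, impose no fixed evaluation order.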

The test (attribute) utility is a combination of one or more of the following elementary criteria: 1) cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which captures the effectiveness of the test in discriminating among decision rules for different decision classes; 3) importance, which determines the importance of a test in the rules; 4) value distribution, which characterizes the distribution of the test importance over its values; and 5) dominance, which measures the test's presence in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, i.e., the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and decision rulesets for these classes have been determined. Given a test A, let V1, V2, ..., Vm denote the sets of values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm, respectively. If the ruleset for some class, say Ci, contains a rule that does not involve test A, then Vi is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci, is the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by:

D(A, Ci, Cj) = 0, if Vi = Vj
             = 1, if Vi ⊂ Vj or Vi ⊃ Vj
             = 2, if Vi ∩ Vj ≠ ∅, and Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj        (3-1)
             = 3, if Vi ∩ Vj = ∅

where ∅ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) may seem to yield an improved criterion; however, it would not clearly distinguish between the two cases (i.e., in both situations the disjointness would be similar). The current equation is better, because it gives higher scores to attributes that classify different subsets of the two decision classes than to attributes that classify only a subset of one decision class.

Definition 3-2: The disjointness of the test A, for evaluating a given set of decision rules, is the sum of the degrees of class disjointness of each decision class:

Disjointness(A) = Σ (i=1 to m) D(A, Ci), where D(A, Ci) = Σ (j=1 to m, j≠i) D(A, Ci, Cj)        (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the test selected is the one with the smaller number of values.
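Definitions 3-1 and 3-2 translate almost directly into code. The sketch below assumes the value sets V1, ..., Vm of a test (one set of values per decision class) have already been extracted from the rulesets, and uses Python's set comparison operators for the subset and intersection cases.

```python
def pair_disjointness(vi, vj):
    # Degree of disjointness D(A, Ci, Cj) per equation (3-1).
    if vi == vj:
        return 0
    if vi < vj or vi > vj:      # proper subset, either direction
        return 1
    if vi & vj:                 # overlapping, but neither contains the other
        return 2
    return 3                    # completely disjoint value sets

def disjointness(value_sets):
    # Disjointness(A) per equation (3-2): sum over all ordered class pairs.
    return sum(pair_disjointness(vi, vj)
               for i, vi in enumerate(value_sets)
               for j, vj in enumerate(value_sets) if i != j)
```

With m = 2 classes, fully disjoint value sets score 3m(m-1) = 6, identical sets score 0, and a subset relation contributes 1 per ordered pair, matching the stated range of the criterion.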

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined, from the root of the tree to any leaf node, in order to reach a decision.

Definition: A decision structure is a one-node-per-level decision structure if at each level there is only one node and zero or more leaves. Such a decision structure can be generated by combining into a single branch all branches whose associated sets of decision rules belong to more than one decision class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not a subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the same set of values in both classes is a trivial one). Assume that branches leading to one subset with the same decision class are combined into one branch. In the first case, there will be two branches only. The first leads to a leaf node, and the other leads to an intermediate node where another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three branches should be created. Two branches lead to leaf nodes, where all values at each branch belong to only one (and a different) decision class. The third branch leads to an intermediate node, where another attribute should be selected to further classify the decision classes. The minimum ANT in this case is 6/4. In the third case, only two branches will be generated, each leading to a leaf node with a different decision class. In this case, the minimum ANT is 1.

Figure 3-4 shows decision trees equivalent to selecting the corresponding attribute. Note that when more than one attribute-value occurs on branches leading to leaves belonging to one decision class, those branches are combined into one branch in the decision structure. The symbol "1" means that an attribute is needed to classify the two decision classes; in such cases there will be at least two additional paths.

D(A, Ci) = 0, D(A, Cj) = 1;  D(A, Ci) = 2, D(A, Cj) = 2;  D(A, Ci) = 3, D(A, Cj) = 3

Figure 3-3: Venn diagrams of the possible combinations of attribute values in two decision classes.

The average number of tests required for making a decision in each possible case is shown in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks highly the attributes that reduce the average number of tests required for decision-making. The theorem can be proved in the general case.

ANT = 3/2; ANT = 5/3; ANT = 1
("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3.


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute for classifying the decision classes.

Proof: Suppose that the number of decision classes is m. Assume also that there are two attributes, A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, this means that there are more decision classes where D(A, Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj) than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute are 0, 2, 4, or 6. For all positive values of D(B) = 2, 4, or 6, it is clear that attribute B should have a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A. Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is a better classifier than A.

Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples that are covered by the rules involving this test. Decision rules learned by an AQ learning program are accompanied by information on their "strength". Rule strength is characterized by the t-weight and u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the t-weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated with class Ci denoted by ri, the importance score is defined as follows:

Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

IS(Aj) = Σ (i=1 to m) IS(Aj, Ci)        (3-3.1)

where

IS(Aj, Ci) = Σ (k=1 to ri) Rik(Aj)        (3-3.2)

and Rik(Aj), the weight of a test Aj in the rule Rik of class Ci, is given by:

Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0, otherwise        (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
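The importance score amounts to summing t-weights over the rules that mention a test. A minimal sketch, assuming each rule is stored as a triple of (the set of attributes in its condition part, its t-weight, its class), with hypothetical attribute names:

```python
def importance_score(rules, attr):
    # IS(Aj): aggregate the t-weights of all rules whose condition part uses attr.
    return sum(t_weight for attrs, t_weight, _ in rules if attr in attrs)

# Illustrative rules: ({attributes in conditions}, t-weight, class).
rules = [({"x1", "x2"}, 10, "C1"),   # this rule covers 10 training examples
         ({"x1"},        5, "C1"),
         ({"x2"},         7, "C2")]
```

Here x2 scores 17 (10 + 7) and x1 scores 15 (10 + 5), so the importance criterion would prefer x2; the class field plays no role in the total, but would be used for the per-class sums IS(Aj, Ci) in (3-3.2).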

The importance score method has been separately compared, as a feature selection method, with a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method produced equal or higher accuracy on three real-world problems than that reported for the GA method, while selecting fewer attributes. In addition, the IS method was significantly faster than the GA.

Value distribution: The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by:

VD(Aj) = IS(Aj) / vj        (3-5)

where vj is the number of legal values of Aj.

Dominance: The fourth elementary criterion, dominance, prefers tests that appear in a large number of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore, for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is multiplied out to two rules, with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
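The multiplying-out step is a Cartesian product over the value sets of each condition. A short sketch, assuming a rule's condition part is stored as a mapping from attribute to its set of internally disjoined values:

```python
from itertools import product

def multiply_out(rule):
    # Expand internal disjunctions: {attr: set of values} -> plain rules.
    attrs = sorted(rule)
    return [dict(zip(attrs, combo))
            for combo in product(*(sorted(rule[a]) for a in attrs))]

def dominance(rules, attr):
    # Count expanded (disjunction-free) rules whose conditions mention attr.
    return sum(len(multiply_out(r)) for r in rules if attr in r)
```

For the example from the text, `multiply_out({"x3": {1, 3}, "x4": {1}})` yields the two plain rules [x3=1]&[x4=1] and [x3=3]&[x4=1], so that rule contributes 2 to the dominance of x3 rather than 1.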

The above criteria are combined into one general test measure, using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). The LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" (in percentage). The criteria are applied to tests in the order defined by the LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (from the top value). The default LEF is:

<Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>        (3-6)

where t1, t2, t3, t4, and t5 are tolerance thresholds (in percentages); their default values are 0%. The default value of the cost of each test is 1.

The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the importance criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the value distribution (normalized IS) criterion is used, and then similarly the dominance criterion. If there is still a tie, the method selects the best attribute randomly.
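This lexicographic filtering with tolerances can be sketched as follows. The function and variable names are hypothetical, and tolerances are expressed here as fractions of the best score rather than percentages:

```python
def lef_rank(candidates, criteria):
    """Lexicographic Evaluation Functional with tolerances (LEF).

    `criteria` is a list of (score_fn, tolerance, maximize) triples.
    Candidates scoring within `tolerance` of the best score on one
    criterion survive to the next; remaining ties are left to the
    caller to break (e.g. randomly, as in the text)."""
    for score_fn, tol, maximize in criteria:
        scores = {c: score_fn(c) for c in candidates}
        best = max(scores.values()) if maximize else min(scores.values())
        margin = abs(best) * tol
        if maximize:
            candidates = [c for c in candidates if scores[c] >= best - margin]
        else:
            candidates = [c for c in candidates if scores[c] <= best + margin]
        if len(candidates) == 1:
            break
    return candidates

# Cost first (minimized), then disjointness (maximized), both with 0 tolerance:
cost = {"x1": 1, "x2": 1, "x3": 1, "x4": 1}
disjointness = {"x1": 11, "x2": 8, "x3": 5, "x4": 5}
print(lef_rank(list(cost), [(cost.get, 0.0, False),
                            (disjointness.get, 0.0, True)]))  # ['x1']
```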

If there is a non-uniform frequency distribution of examples of different classes, then the selection criterion uses a modified definition of the disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified to a given class:

Disjointness(A) = Σ(i=1..m) D(A, Ci) · Frq(Ci)    (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, and it is assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF

<Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.

332 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting at each step the best test according to the ranking criteria described above and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is also connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, the number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first in descending order of the number of rules that contain the attribute, and second in ascending order of the number of the attribute's legal values.
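The bookkeeping described above might be modeled as in the following minimal sketch; the field names and ordering helper are assumptions for illustration, not the actual AQDT-2 structures:

```python
from dataclasses import dataclass, field

@dataclass
class AttributeInfo:
    name: str
    legal_values: list            # the attribute's domain
    rule_count: int = 0           # number of rules containing the attribute

@dataclass
class DecisionClass:
    name: str
    frequency: float = 1.0        # expected class frequency
    rules: list = field(default_factory=list)

def order_attributes(attrs):
    """Arrange attributes as in the text: descending by the number of
    rules containing them, then ascending by the number of legal values."""
    return sorted(attrs, key=lambda a: (-a.rule_count, len(a.legal_values)))

attrs = [AttributeInfo("x2", [1, 2, 3, 4], 4),
         AttributeInfo("x1", [1, 2, 3, 4], 6),
         AttributeInfo("x4", [1, 2, 3], 4)]
print([a.name for a in order_attributes(attrs)])  # ['x1', 'x4', 'x2']
```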

The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.

To generate decision structures from rules, the AQDT-2 method prefers either characteristic or discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset and that this set is the initial ruleset context. The AQDT algorithm is:

The AQDT-2 Algorithm

Step 1. Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent this attribute.

Step 2. Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3. For each branch, associate with it a group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing the condition [A = i v ...]. If a branch is assigned values i v j, then associate with it all rules containing the condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with the given branch constitute the ruleset context for this branch.

Step 4. If all the rules in a ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree have leaf nodes, stop; otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
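In standard mode, the four steps can be sketched as a short recursion. This is an illustrative reconstruction: rules are (class, conditions) pairs, `conditions` maps each attribute to the set of values in its condition, and `select` stands in for the LEF attribute ranking:

```python
def build_tree(rules, attributes, select):
    """Recursive skeleton of the four steps above.  `attributes` maps each
    attribute to its legal values; `select` picks the best attribute."""
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:                     # Step 4: single class -> leaf
        return classes.pop()
    attr = select(rules, attributes)          # Step 1: pick the best test
    node = {}
    for value in attributes[attr]:            # Step 2: one branch per value
        branch = []                           # Step 3: build branch context
        for cls, conds in rules:
            if attr not in conds:             # consensus law: add everywhere
                branch.append((cls, conds))
            elif value in conds[attr]:        # condition satisfied: drop it
                branch.append((cls, {a: v for a, v in conds.items() if a != attr}))
        if branch:
            node[value] = build_tree(branch, attributes, select)
    return (attr, node)

# Toy run: two classes separated by attribute x1.
attributes = {"x1": [1, 2]}
rules = [("P", {"x1": {1}}), ("N", {"x1": {2}})]
print(build_tree(rules, attributes, lambda rs, ats: "x1"))
# ('x1', {1: 'P', 2: 'N'})
```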

To select an attribute to be a node in the decision tree (Steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses through all decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF function; it evaluates each attribute's disjointness for each decision class against the other decision classes.

To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

r = Σ(i=1..m) Ri    (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as

Cmpx(Itc1) = O(r · s)

In the second iteration, the disjointness is calculated between the decision classes and all attributes. The complexity of the second iteration can be given by

Cmpx(Itc2) = O(n · m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

l = max{m, r}    (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node in the decision tree, the node complexity NC(AQDT), is given by

NC(AQDT) = O(l · n)

Usually l equals the number of rules associated with the given node. Thus, the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The level complexity of the AQDT algorithm, LC(AQDT), can be given by

LC(AQDT) < O(l · n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm equal to (l · s · o), where o is the number of non-leaf nodes at the given level; in such cases either (l · o ≤ r) or (l · s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by

LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)

a) per one level    b) per one path

Figure 3-5 Decision trees showing the maximum number of non-leaf nodes

Note also that after selecting an attribute to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structure of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch will not be tested again.

Since the disjointness criterion selects the attribute that minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels of a decision tree is expected to be less than or equal to the minimum of both the number of attributes and the number of rules. Consider k as the number of levels in a given decision tree:

k ≤ min{n, r}    (3-10)

Two cases represent the most complex situations, Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels will be a function of the logarithm of the number of rules. In such a case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by

Complexity(AQDT) = O(l · n · log r)    (3-11)

The other situation occurs when the generated decision tree has the maximum number of levels. The maximum possible number of levels of a decision tree equals one less than the number of decision rules, Figure 3-5-b. Using the disjointness criterion, it is not likely to obtain such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In such a case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus, the level complexity of this decision tree is estimated as

LC(AQDT) = O(l · log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT algorithm in such cases is given by

Complexity(AQDT) = O(l · k · log n)    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT algorithm is determined by

Cmplx(AQDT) = O(r · k · log l)    (3-13)

333 An example illustrating the algorithm

The following simple example illustrates how AQDT is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool (automated, semi-automated or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1 The available tools and the factors that affect the process

T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6 Decision rules for selecting the best tool for testing software

These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phase, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phase, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very-high cost for testing, either in the requirement or the system usage phase, and you need a semi-automated tool.


Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0%.

From the rules in Figure 3-6, we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: 1) determine for each attribute the sets of values that the attribute takes in individual decision rules, and 2) remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1} and {4} (Figure 3-6). Value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {2}, {3, 4}, {1} and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2} and {3, 4}.
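The subsumption-removal step can be sketched as follows; this is an illustrative fragment, not AQDT-2's actual code:

```python
def branch_value_sets(value_sets):
    """Drop value sets that subsume (are strict supersets of) other value
    sets; the remaining sets label the branches in compact mode."""
    sets = [frozenset(s) for s in value_sets]
    return sorted({s for s in sets
                   if not any(other < s for other in sets)},
                  key=sorted)

# x1 in the example rules takes value sets {2}, {3}, {1,2}, {1}, {4};
# {1,2} subsumes {1} and {2} and is removed:
print([sorted(s) for s in branch_value_sets([{2}, {3}, {1, 2}, {1}, {4}])])
# [[1], [2], [3], [4]]
```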


Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf, T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools to use for testing a given piece of software.

Figure 3-7 A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)

Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3 and 4). Rules are represented by collections of cells in the intersection of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules; rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, collections of cells corresponding to some of the initial rules are marked R11, R21, R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].

Figure 3-8 a) Decision rules; b) Derived decision tree

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools; in other words, suppose that we would like to select the best tools independently of the metrics that support them (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, in which the type of the tool was removed from the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.

Figure 3-9 Decision trees learned ignoring a) the supporting metric and b) the type of the testing tool

It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10 A decision tree learned without the cost attribute

34 Tailoring Decision Structures to the Decision-making situation

Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This chapter presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm, implemented in a new system, AQDT-2, transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.

Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.


341 Learning Cost-Dependent Decision Structures

As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) when developing a decision structure. In the default setting of LEF, the test's cost is the first criterion and its tolerance is 0%. This means that only the least expensive attributes pass to the next step of evaluation involving other elementary criteria. If an attribute has high cost or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.

342 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution over the different candidate decisions (Smyth, Goodman & Higgins, 1990); the most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have

P(Ci | b1, ..., bk) = P(Ci) · P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from different classes, we have

P(Ci) = twi / Σ(j=1..m) twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σ(j=1..m) wj / Σ(j=1..m) twj    (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain

P(Ci | b1, ..., bk) = wi / Σ(j=1..m) wj    (3-13)
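Formula (3-13) reduces to a simple frequency ratio at the node, since the per-class training totals twi cancel out of the derivation. A minimal sketch:

```python
def class_distribution(examples_at_node):
    """Estimate P(Ci | b1, ..., bk) by formula (3-13): wi / sum_j wj,
    where wi is the number of training examples of class Ci that passed
    the tests leading to the node."""
    total = sum(examples_at_node.values())
    return {c: w / total for c, w in examples_at_node.items()}

# e.g. 3 examples of T1 and 1 of T2 reach an indeterminate leaf:
print(class_distribution({"T1": 3, "T2": 1}))  # {'T1': 0.75, 'T2': 0.25}
```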

A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

343 Coping with noise in training data

The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.


35 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes, and contains ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem the following are disjoint rules learned by AQ15c from the given data

Play lt [outlook =overcast] Play lt [outlook =sunny]amp [humidity ~ 75] Play lt [outlook = rain] amp [windy = false]

For any combination of the LEF function AQDT-2Iearns a decision tree similar to the one in

Figure 2-3 Table 3-3 shows the values of AQDT-2 criteria for different attributes of the given

data It is clear that all of the AQDT-2 selection criteria preferred the attribute Outlook over

the other attributes

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute for the root of the tree. The goal of this experiment is first to


select the correct attribute, and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y); however, the gain ratio criterion gave X a higher score than all other criteria.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.

The table shows that the disjointness criterion outperformed all other criteria in Table 2-6, including the gain ratio criterion, when applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion would perform better when evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria are applied to the learned rules, they provide very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute that has the most balanced appearance of its values across different rules. The dominance criterion prefers attributes that appear in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.

36 Decision Structures vs Decision Trees

This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.

Table 3-7 Comparison between Decision Structures and Decision Trees

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11; both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes; it is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.

Figure 3-11 Decision structures learned by AQDT-2 using different criteria: a) using the disjointness criterion; b) using the importance score criterion (P = Positive, N = Negative)

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the "Imam's example", that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example rests on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1=x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "-" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per each value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

~ ~

-~ t-) r- shy~

-I-shy

t-) t-)

I-shy~

1~_1_1~~_1__3~__~2__~1 11 a) Training examples b) The optimal decision tree

Figure 3-12 The Imams example Example where learning decision structures (trees) from rules is better than learning them from examples

AQ15c learned the following rules from this data

P lt [xl=l][x2=l] [xl=2][x2=2] N lt [xl=1][x2=2] [xl=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.

An example of problems in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <= [x1=2][x2=2]    N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10/9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and 8.5 for decision rules); and 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules obtained with constructive induction, a decision tree with only three nodes can be determined, using the new attribute "x1&x2=2" (true when both x1 and x2 equal 2), with values 0 for "no" and 1 for "yes".

[Figure panels omitted in source: a) The training data; b) The correct decision tree]

Figure 3-13: An example where decision rules are simpler than decision trees

CHAPTER 4 Empirical Analysis and Comparative Study

This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the system's parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot easily be described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracing in tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their predictive accuracy.
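The sampling protocol described above can be sketched as follows. This is an illustrative reconstruction; the `learn` and `evaluate` functions below are trivial stand-ins for the actual learning programs (AQ15c/AQDT-2), which are not reproduced here.

```python
import random

def learning_curve(examples, learn, evaluate, samples=100, seed=0):
    """For each relative size 10%..90%, draw `samples` random training
    sets, train on each, and test on the complementary examples."""
    rng = random.Random(seed)
    curve = {}
    for pct in range(10, 100, 10):
        k = round(len(examples) * pct / 100)
        accs = []
        for _ in range(samples):
            train = rng.sample(examples, k)          # training sample
            test = [e for e in examples if e not in train]  # complement
            model = learn(train)
            accs.append(evaluate(model, test))
        curve[pct] = sum(accs) / len(accs)           # average accuracy
    return curve

# Toy stand-ins: "learn" memorizes the majority class; "evaluate" is accuracy.
def learn(train):
    labels = [c for _, c in train]
    return max(set(labels), key=labels.count)

def evaluate(model, test):
    return sum(c == model for _, c in test) / len(test)

data = [((i,), 'P' if i % 3 else 'N') for i in range(30)]
curve = learning_curve(data, learn, evaluate, samples=20)
print(curve[10], curve[90])
```

Each point of the reported learning curves is such an average over 100 random train/test complements of a given relative size.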


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (best path from top to bottom) in terms of accuracy, time, and complexity were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.

Figure 4-1: Design of a complete experiment


For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. Analysis of some experiments included visualization of the training examples and the concept learned by each learning program (AQ15c, AQDT-2 and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database): 9 different relative sample sizes of training examples are selected (10%, ..., 90%); 100 random samples of each size are drawn from the original data for training; the 100 complementary sets remaining after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).
162 different parametrical experiments per training dataset (18 x 9)
16,200 experiments per sample size
145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2)
199,800 experiments per problem (first portion + C4.5 + constructive induction)
999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
73 days (estimated running time)

The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsequent subsection describes a partial or full experimental analysis of one of the other problems.

4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1..3] (t:18, u:18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t:3, u:3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t:2, u:2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t:2, u:2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t:2, u:2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t:2, u:2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t:2, u:2)

Decision class C2:
1. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t:28, u:19)
2. [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t:17, u:6)
3. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t:10, u:4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t:10, u:2)
5. [x1=3,5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t:9, u:4)
6. [x1=2][x2=1,2][x3=1,2][x5=1..3][x4=1][x6=1][x7=1] (t:7, u:6)
7. [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t:6, u:4)
8. [x1=3,5][x2=2][x3=1][x7=1][x4=1,2][x5=1..3][x6=1,3] (t:5, u:5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t:4, u:4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1,3] (t:4, u:4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t:4, u:2)

Decision class C3:
1. [x1=2..5][x2=1,2][x3=1,2][x7=1..4][x4=1,2][x5=1,3][x6=2,4] (t:41, u:32)
2. [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2,4] (t:27, u:20)
3. [x1=1,3][x2=1][x3=1,2][x7=1,4][x4=2][x5=1,2][x6=2,3] (t:19, u:6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t:13, u:8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2,4] (t:5, u:5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1,4] (t:4, u:4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t:1, u:1)

Figure 4-2: Decision rules determined by AQ15c from the wind bracing data


Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest and all other attributes are beyond the tolerance threshold, no other attributes were considered). Branches stemming from the root are marked by values of x6 (in general, these could be groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for a branch until all rules assigned to the branch are of the same class. That class is then assigned to the leaf.

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated only for those rules which contain [x6=1] as one of their conditions. In this example, x1 has the highest importance score, so it was selected as the next node in the structure. This process is repeated for each subset of rules until the decision structure is completed.
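The overall loop just described — pick the best attribute, split the ruleset by that attribute's values, and recurse until every branch holds rules of a single class — can be sketched as follows. This is a simplified reconstruction: the real AQDT-2 applies the LEF over several criteria, groups values, and removes subsumed groups, none of which is modeled here, and the `choose` function is a trivial stand-in.

```python
def build_structure(rules, attrs, choose):
    """rules: list of (class, {attribute: allowed values}) pairs.
    choose: attribute-selection function (stand-in for the LEF)."""
    classes = {cls for cls, _ in rules}
    if len(classes) <= 1:                      # all rules agree: a leaf
        return classes.pop() if classes else None
    if not attrs:                              # ambiguous: candidate decisions
        return sorted(classes)
    attr = choose(rules, attrs)                # pick the test for this node
    node = {'test': attr, 'branches': {}}
    values = set()
    for _, conds in rules:
        values |= set(conds.get(attr, []))
    for v in sorted(values):
        # a rule goes to branch v if it allows (or does not mention) that value
        subset = [(c, r) for c, r in rules if v in r.get(attr, [v])]
        node['branches'][v] = build_structure(subset, attrs - {attr}, choose)
    return node

# Hypothetical mini-ruleset in the spirit of the MONK-1 rules in the text.
rules = [
    ('P', {'x5': [1]}),
    ('P', {'x1': [1], 'x2': [1], 'x5': [2, 3, 4]}),
    ('N', {'x1': [1], 'x2': [2, 3], 'x5': [2, 3, 4]}),
]
def choose(rules, attrs):                      # trivial stand-in criterion
    return sorted(attrs)[0]
tree = build_structure(rules, {'x1', 'x2', 'x5'}, choose)
print(tree['test'])
```

The recursion mirrors the text: each branch inherits only the rules consistent with its value, and a branch becomes a leaf as soon as its rules all belong to one class.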

For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples) and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.

The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some misclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were misclassified.

[Decision tree omitted in source]
Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (17 nodes, 43 leaves)

Figure 4-4 shows a decision structure learned, with the default settings of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision structure was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the (indefinite) decision "?". The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.

[Decision structure omitted in source]
Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (5 nodes, 9 leaves)

[Decision structure omitted in source]
Figure 4-5: A decision structure that does not contain attribute x1 (6 nodes, 8 leaves)

Figure 4-6 presents a decision structure derived from Figure 4-5 in which the leaves were assigned candidate decisions with decision-class probability estimates. Consider the node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (11), the probability estimates for classes C1, C2, C3 and C4 under node x2 can be approximated as P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11.
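The quoted probabilities are consistent with a simple normalization of the w-weights at the node (equation (11) is not reproduced in this excerpt, so the normalization below is an assumption that happens to match the reported numbers):

```python
def leaf_probabilities(weights):
    """Normalize per-class example weights at a leaf into a
    probability distribution over candidate decisions."""
    total = sum(weights.values())
    return {cls: w / total for cls, w in weights.items()}

# Weights reported for node x2 in the text: w1=31, w2=11, w3=0, w4=5.
probs = leaf_probabilities({'C1': 31, 'C2': 11, 'C3': 0, 'C4': 5})
print({c: round(p, 2) for c, p in probs.items()})
# -> {'C1': 0.66, 'C2': 0.23, 'C3': 0.0, 'C4': 0.11}
```

With such a distribution attached, a "?" leaf can return ranked candidate decisions instead of refusing to answer.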

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that rules whose combined t-weight represented 10% or less of the coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
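The truncation step can be sketched as follows. For simplicity this sketch prunes each rule individually by its own t-weight share, which is a slight simplification of the combined-weight criterion described above; the ruleset literals are hypothetical.

```python
def truncate_rules(rulesets, noise_level=0.10):
    """Remove rules whose t-weight is at most `noise_level` of the
    total t-weight (training-example coverage) of their class."""
    pruned = {}
    for cls, rules in rulesets.items():
        total = sum(t for _, t in rules)
        pruned[cls] = [(cond, t) for cond, t in rules
                       if t > noise_level * total]
    return pruned

# Hypothetical ruleset: (condition string, t-weight) pairs per class,
# loosely patterned on the weights in Figure 4-2.
rules = {
    'C1': [('[x1=1][x6=1]', 18), ('[x1=3]...', 3), ('[x1=5]...', 2)],
    'C4': [('[x1=5]...', 4), ('[x1=5][x3=1]...', 1)],
}
pruned = truncate_rules(rules)
print(len(pruned['C1']), len(pruned['C4']))  # lightest C1 rule dropped
```

Dropping light rules before tree construction is what trades a small amount of accuracy (89% to 88% here) for a much smaller structure.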

[Decision structure omitted in source]
Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (5 nodes, 7 leaves)

[Decision structure omitted in source]
Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules covering less than 10% of the training examples per decision class (3 nodes, 5 leaves)

To demonstrate changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost; AQDT-2 generated a decision structure with four nodes and six leaves, whose predictive accuracy was 86.1%. In the second decision-making situation x1 was given a high cost; AQDT-2 learned a decision structure with five nodes and seven leaves, whose predictive accuracy was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified, using only the four attributes which were used in building the initial decision trees. The visualization diagram uses different shades for different decision classes. Another shade is used to mark cells that require an additional attribute (e.g., x3, x4 or x7) to correctly classify the testing examples. White cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute; in such cases, multiple decisions can be provided with their appropriate probabilities.

[Visualization omitted in source; white cells mean the system cannot produce a decision without the missing attribute]
Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5 and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and <Chr, Intr, 1>; see Table 4-2) were selected for experiments with Subsystem II.

These experiments were performed on four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value in this table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each run was tested with the testing examples that form the complement of the training example set.


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
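The generalization degree can be read as a stopping rule: a node stops being expanded once the examples covered by rules of all but the majority class fall below the given ratio. The sketch below is one interpretation of that description, not the exact AQDT-2 rule.

```python
def should_generalize(class_coverage, degree=0.10):
    """Stop splitting a node (turn it into a leaf for the majority
    class) when rules of all other classes cover at most `degree`
    of the examples reaching the node."""
    total = sum(class_coverage.values())
    majority = max(class_coverage.values())
    return (total - majority) / total <= degree

print(should_generalize({'C1': 95, 'C2': 5}))         # True: make a leaf
print(should_generalize({'C1': 60, 'C2': 40}))        # False: keep splitting
print(should_generalize({'C1': 96, 'C2': 4}, 0.03))   # False at a 3% degree
```

A larger degree yields smaller, more general structures; a smaller degree (e.g., the 3% setting discussed below) keeps splitting longer and, on this data, improved accuracy.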

[Charts omitted in source; four panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy of AQDT-2 and AQ15c]
Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem

Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.

Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All results reported here are averages of 100 runs. For


each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.

[Charts omitted in source; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy]
Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data

[Charts omitted in source; x-axis: relative size of training examples (%); panels report predictive accuracy, complexity, and learning time]
Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data

4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1

This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no).


The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c for the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.

[Visualization diagram omitted in source; attributes x6, x5, x4 along the axes]
Figure 4-12: A visualization diagram of the MONK-1 problem

The AQDT-2 program, running in its default mode with the optimality criterion set to minimize the number of nodes (i.e., with the disjointness criterion ranked first), produced a decision tree with 41 nodes. For comparison, the C4.5 program for learning decision trees from examples was also applied to this same problem.

Positive rules:
1. [x5 = 1]
2. [x1 = 3][x2 = 3]
3. [x1 = 2][x2 = 2]
4. [x1 = 1][x2 = 1]

Negative rules:
1. [x1 = 1][x2 = 2,3][x5 = 2..4]
2. [x1 = 2][x2 = 1,3][x5 = 2..4]
3. [x1 = 3][x2 = 1,2][x5 = 2..4]

Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem

The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value "T" when the value of x1 equals the value of x2, and the value "F" otherwise. The resulting rules were:

Pos <= [x5=1] v [x1=x2]    Neg <= [x5≠1] & [x1≠x2]
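The constructed attribute and the resulting compact classifier can be reproduced directly. The sketch below is hypothetical code, not AQ17-DCI itself; it only mirrors the two rules quoted above.

```python
def add_constructed_attribute(example):
    """AQ17-DCI-style constructed attribute: 'T' iff x1 equals x2."""
    example = dict(example)
    example['x1_eq_x2'] = 'T' if example['x1'] == example['x2'] else 'F'
    return example

def classify(example):
    """Compact MONK-1 rules: Pos <= [x5=1] v [x1=x2], otherwise Neg."""
    e = add_constructed_attribute(example)
    return 'Pos' if e['x5'] == 1 or e['x1_eq_x2'] == 'T' else 'Neg'

print(classify({'x1': 2, 'x2': 2, 'x5': 3}))  # Pos (head-shape = body-shape)
print(classify({'x1': 1, 'x2': 3, 'x5': 1}))  # Pos (jacket-color is red)
print(classify({'x1': 1, 'x2': 3, 'x5': 2}))  # Neg
```

With the equality relation made explicit as a single attribute, the whole MONK-1 concept collapses to a two-node structure, which is why Figure 4-15-b is so much smaller than the trees built from the original attributes.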

[Table 4-3 omitted in source: evaluation of the attribute selection criteria for the MONK-1 problem]

From these rules the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15-a).

[Decision tree omitted in source; P = Positive, N = Negative]
Figure 4-14: The decision tree for the MONK-1 problem generated by AQDT-1 (13 nodes, 28 leaves)

[Decision structures omitted in source; P = Positive, N = Negative. a) Compact decision structure for the AQ15 rules (5 nodes, 7 leaves); b) compact decision structure for the AQ17 rules (2 nodes, 3 leaves)]
Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem

Experiments with Subsystem I: As mentioned earlier, the initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5 and 10). The two settings that gave the best results in terms of predictive accuracy (<Ch, Disj, 10> and <Ch, Intr, 1>) were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.

Each value in that table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each run was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.

[Charts omitted in source; four panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy]
Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem

Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained in the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree


is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.

[Charts omitted in source; panels: <Disj, Char> and <Intr, Char>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy]
Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data

Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All results reported here are averages of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.

[Charts omitted in source; x-axis: relative size of training examples (%); panels report predictive accuracy, complexity, and learning time]
Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem


4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot easily be described as a DNF expression using the original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.

[Visualization diagram omitted in source; attributes x6, x5, x4 along the axes]
Figure 4-19: A visualization diagram of the MONK-2 problem


Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Ch, Disj, 10> and <Ch, Intr, 1>). They were selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each run was tested with the testing examples that form the complement of the training examples.

Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

[Charts omitted in source; four panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy]
Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem

Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the pre-pruning threshold for the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained with the default settings of AQDT-2: a pre-pruning threshold of 3% and a generalization degree of 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. Increasing the pre-pruning threshold, however, did not improve the predictive accuracy.
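One plausible reading of the pre-pruning threshold is that, before the decision structure is built, rules whose coverage falls below a given fraction of the training examples are discarded. The sketch below illustrates that reading only; the `(condition, covered_count)` rule representation is a hypothetical stand-in, not AQDT-2's actual data structure.

```python
def pre_prune(rules, total_examples, threshold=0.03):
    """Drop rules whose relative coverage is below the threshold.

    `rules` is a list of (condition, covered_example_count) pairs; the
    representation is illustrative, not AQDT-2's.  With the default
    threshold of 3%, a rule must cover at least 3% of the examples to
    survive pruning.
    """
    return [(cond, n) for cond, n in rules
            if n / total_examples >= threshold]
```

Under this reading, raising the threshold removes more low-coverage rules before tree construction, which is why it trades rule detail against noise resistance.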

Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data (x-axis: the relative sample size (%) of the training data).

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are averages of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 gives a simple summary of these experiments.


Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem.

4.5 Experiments with Small, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems; the data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples, i.e., examples assigned the wrong decision class.

Figure 4-23: A visualization diagram of the MONK-3 problem.

Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training sets of the given size. Each run was tested on the set of examples that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.


Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed, and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.

Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem.

Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data.

Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 summarizes the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent about a 1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
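The arithmetic behind this effect is easy to check directly: the error-rate contribution of a single misclassified example is the reciprocal of the test-set size, which shrinks as the complement of the training sample grows. The dataset size of 110 below is illustrative only, not the actual size used in the experiments.

```python
def one_error_rate(total, train_frac):
    """Error rate contributed by one misclassified test example when
    testing on the complement of a training sample of the given
    relative size."""
    test_size = total - int(total * train_frac)
    return 1 / test_size

total = 110  # illustrative dataset size, not from the experiments
# testing against ~90% of the data: one error is about a 1% error rate
print(round(100 * one_error_rate(total, 0.10), 2))  # 1.01
# testing against ~10% of the data: the same error is about 9-10%
print(round(100 * one_error_rate(total, 0.90), 2))  # 9.09
```

This is why a single additional mistake at a large training fraction moves the accuracy curve far more than at a small one, producing the apparent dips.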

Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem.

4.6 Experiments with Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3) Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).

In this experiment, the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by AQDT-2 and C4.5. All the results reported here are based on the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.

The reason there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent about a 1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem.

4.7 Experiments with Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes. These attributes are: 1) cap-shape, 2) cap-surface, 3) cap-color, 4) bruises, 5) odor, 6) gill-attachment, 7) gill-spacing, 8) gill-size, 9) gill-color, 10) stalk-shape, 11) stalk-root, 12) stalk-surface-above-ring, 13) stalk-surface-below-ring, 14) stalk-color-above-ring, 15) stalk-color-below-ring, 16) veil-type, 17) veil-color, 18) ring-number, 19) ring-type, 20) spore-print-color, 21) population, and 22) habitat.

To perform the experiment, the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 gives a simple summary of these experiments.

In this problem, C4.5 produced better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning time is about the same.

The reason there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent about a 1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem.

4.8 Experiments with Small, Structured, and Noise-Free Problems: East-West Trains

Learning task-oriented decision structures from structural data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes, eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was generated such that they could completely describe any car in the train; see Table 4-7. Each train was described by one example of varying length. To identify the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, in the attribute name x32, the number 3 refers to the third car and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
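The two-digit attribute code described above can be illustrated with a short decoding routine; the function name and the textual attribute labels are illustrative, not part of AQDT-2.

```python
def decode_attribute(name):
    """Split a train attribute name such as 'x32' into its two-digit
    code: (car position i, attribute number j)."""
    assert name[0] == "x" and len(name) == 3 and name[1:].isdigit()
    return int(name[1]), int(name[2])

# x32 -> third car, second attribute (the car shape)
car, attr = decode_attribute("x32")
print(car, attr)  # 3 2
```

This flat encoding is what lets a structural, variable-length train description be presented to the learner as an ordinary attribute-value vector.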

Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1-4).

Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.

Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three or more cars. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).

Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: a) using only descriptions of Car 1 (4 nodes, 9 leaves); b) using only descriptions of Car 2; c) using only descriptions of Car 3 (6 leaves).

4.9 Experiments with Small, Simple, and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the training example sets were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of examples (216 examples in total; half of the examples were in one class and the other half in the other class).


Table 4-8 and Figures 4-30a and b show the results graphically for the Congressional Voting-1984 problem. The results indicate that AQDT-2-generated decision trees had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation in the size of AQDT-2's trees with the change in the size of the training example set was smaller.

Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.

a) Accuracy of the decision tree as a function of the size of the set of training examples; b) Size of the decision tree as a function of the size of the set of training examples.

Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2.

4.10 Analysis of the Results

This section presents an analysis of the results given in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from these rules. This section also includes some examples describing different decision-making situations and the task-oriented decision structures learned for each situation.

Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others, than for another type of cover), the best cover is determined according to the best width of the beam search and the best rule type.
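The first heuristic, prefer the smaller beam width when the accuracy gain of a larger one is under 2 percentage points, can be written down directly. The sketch below is an illustration of that heuristic only; the dictionary interface is an assumption, not part of the AQDT-2 tooling.

```python
def best_beam_width(results, margin=2.0):
    """Pick a beam width from a {width: accuracy_percent} mapping.

    Returns the smallest width whose accuracy is within `margin`
    percentage points of the best accuracy, i.e., the 'smaller is
    better when the difference is under 2%' heuristic."""
    best = max(results, key=lambda w: results[w])
    for width in sorted(results):
        if results[best] - results[width] < margin:
            return width
    return best
```

For example, with accuracies {1: 90.0, 10: 91.5} the heuristic selects width 1, since the 1.5-point gain from the wider search does not clear the 2-point margin.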

Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.

It was clear that AQDT-2 works better with characteristic rules than with discriminant ones. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracies of the two systems is within ±2%, the predictive accuracies are considered the same; otherwise, one is considered higher and the other lower; 2) if the average learning times are within ±0.1 seconds, the learning times are considered the same.
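The two tolerance heuristics above amount to a single comparison routine applied with different tolerances. The sketch below illustrates them; the function and its labels are illustrative, not part of the dissertation's software.

```python
def compare(metric_a, metric_c, tolerance):
    """Compare AQDT-2 (a) against C4.5 (c) on one metric.

    Values within `tolerance` of each other count as 'Same';
    otherwise the system with the larger value wins.  For metrics
    where lower is better (e.g., learning time), negate both values
    before calling."""
    if abs(metric_a - metric_c) <= tolerance:
        return "Same"
    return "AQDT-2" if metric_a > metric_c else "C4.5"

# accuracy within +/-2% counts as the same
print(compare(91.0, 90.2, 2.0))   # Same
# learning time with +/-0.1 s tolerance (negated because lower is better)
print(compare(-0.5, -0.9, 0.1))   # AQDT-2
```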

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparisons of the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system that performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is attached to the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system that performed better; Same/X means similar performance of both, with AQDT-2 slightly better if X=A and C4.5 slightly better if X=C.

Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, while C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of the decision trees learned by C4.5 grows relatively quickly as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules, and C4.5 uses a window for learning decision trees. The learning time of AQDT-2 should be much less than that of C4.5; however, on some data sets it takes more time, because in situations where there is not enough information to reach a decision the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not implemented yet.

To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparison between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class; the white areas represent non-positive coverage.

Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem.


Figure 4-32 shows the testing results of the decision rules learned by AQ15c. Shaded cells marked with a dot indicate false positive errors (AQ15c classified the cell as positive while it should be negative); unshaded cells marked with a dot indicate false negative errors (AQ15c classified the cell as negative while it should be positive).
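The false-positive/false-negative bookkeeping behind these diagrams can be sketched as a small routine over the cells of the representation space. The `(predicted, actual)` cell encoding below is illustrative, not the DIAV diagram format.

```python
def confusion_counts(cells):
    """Count false positives and false negatives over representation-
    space cells, where each cell is a (predicted, actual) pair with
    values 'pos' or 'neg', mirroring shaded vs. unshaded diagram cells."""
    fp = sum(1 for pred, act in cells if pred == "pos" and act == "neg")
    fn = sum(1 for pred, act in cells if pred == "neg" and act == "pos")
    return fp, fn
```

Summing these two counts over all cells and dividing by the number of cells gives the error rate that the diagrams display spatially.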

Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31.

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, cells with one kind of shading indicate portions of the representation space that were classified as positive by both AQ15c and AQDT-2; cells with a second shading are portions that were classified as positive by AQ15c but as negative by AQDT-2; a third shading marks portions of the representation space where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).


This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.

Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem.

Figure 4-34 is similar to Figure 4-33, but illustrates the false positive and false negative errors. Cells of one marking indicate portions of the representation space with false positive errors; cells of another marking represent portions with false negative errors. Comparing Figures 4-34 and 4-32 shows that more errors occurred because of the over-generalization.


Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree.

Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%.

CHAPTER 5 CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.

The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated online, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: in order to determine a decision structure from examples, it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can easily be tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.

The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than the decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful; it will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time to determine a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.

Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and have higher predictive accuracy than those obtained in the conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of the rule learner, it could potentially be applied with other decision rule learning systems, or with decision rules acquired from an expert.

REFERENCES


Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (Eds.) (1987), Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.

Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.

Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.

Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.

100

Imam IE and Michalski RS (1993b) Learning Decision Trees from Decision Rules A method and initial results f1um a comparative study in Journal of Intelligent Information Systems JIIS hI 2 No3 pp 279-304 KerschbeIg L Ras Z amp Zemankova M (Eds) Kluwer Academic Pub MA

Imam IE Michalski RS and Kerschberg L (1993) Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques Proceeding of the AAA International Workshop on Knowledge Discovery in Database Washington DC July 11-12

Imam IE and vafaie H (1994) An Empirical Comparison Between Global and Greedyshylike Seanh for Feature Selection in the Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94) pp 66-70 Pensacola Beach Florida May

Imam, I.F. and Michalski, R.S. (1994), From Fact to Rules to Decisions: An Overview of the FRD-1 System, in the Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.

Kohavi, R. (1994), Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations, Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi, R. & Li, C. (1995), Oblivious Decision Trees, Graphs, and Top-Down Pruning, in the Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian, O.L. and Wolberg, W.H. (1990), Cancer Diagnosis via Linear Programming, SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.

Michalski, R.S. (1973), AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition, Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington DC, October 30-November 1.

Michalski, R.S. (1978), Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams, Technical Report No. 898, Urbana: University of Illinois, March.

Michalski, R.S. (1983), A Theory and Methodology of Inductive Learning, Artificial Intelligence, Vol. 20 (pp. 111-116).

Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains, Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.

Michalski, R.S. (1990), Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Michalski, R.S. and Imam, I.F. (1994), Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System, Lecture Notes in Artificial Intelligence, Springer-Verlag, from the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers, J. (1989a), An Empirical Comparison of Selection Measures for Decision-Tree Induction, Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.

Mingers, J. (1989b), An Empirical Comparison of Pruning Methods for Decision-Tree Induction, Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.

Niblett, T. and Bratko, I. (1986), Learning Decision Rules in Noisy Domains, Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.

Quinlan, J.R. (1979), Discovering Rules by Induction from Large Collections of Examples, in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan, J.R. (1983), Learning Efficient Classification Procedures and Their Application to Chess End Games, in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.

Quinlan, J.R. (1986), Induction of Decision Trees, Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.

Quinlan, J.R. (1987), Simplifying Decision Trees, International Journal of Man-Machine Studies, 27 (pp. 221-234).

Quinlan, J.R. (1990), Probabilistic Decision Trees, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Smyth, P., Goodman, R.M. and Higgins, C. (1990), A Hybrid Rule-based/Bayesian Classifier, Proceedings of ECAI-90, Stockholm, August.

Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.

Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), The MONK's Problems: A Performance Comparison of Different Learning Algorithms, Technical Report, Carnegie Mellon University, October.

Wnek, J. and Michalski, R.S. (1994), Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments, Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his BSc in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his MS in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992.

Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition. Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition.

Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-96). He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI).

Ibrahim's PhD, titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.



LIST OF TABLES

No TITLE Page

2-1 An example of a decision table 9

2-2 A set of training examples used to illustrate the C4.5 system 15

2-3 The frequency of different attribute values for different decision classes 17

2-4 The expected values of the frequency of examples in Table 2-3 17

2-5 Attribute selection criteria and their basic evaluation measure 17

2-6 The contingency tables of Mingers' example 18

2-7 Mingers' results for determining the goodness of split 19

2-8 Mingers' results for comparing the total accuracy and size of decision trees

provided by different attribute selection criteria from four problems 19

2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22

3-1 The available tools and the factors that affect the process of testing software 43

3-2 Calculating the disjointness of each attribute 44

3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51

3-4 The data used in Mingers' first experiments 52

3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52

3-6 The possible ranking, domains, and usage conditions of the AQDT-2 criteria 53

3-7 Comparison between decision structures and decision trees 54

4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62

4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67

4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71

4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73

4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77

4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81

4-7 The set of attributes and their values used in the trains problem 86

4-8 A tabular summary of the predictive accuracy of decision trees obtained

by AQDT-2 and C4.5 for the congressional voting data 88

4-9 Summary of the best parameter settings for the first subfunction of the approach

with different data characteristics 89

4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90


LIST OF FIGURES

No TITLE Page

2-1 An example to illustrate how attributes break rules 8

2-2 A decision tree learned from the decision table in Table 2-1 10

2-3 A decision tree learned using the gain criterion for selecting attributes 15

2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21

3-1 Architecture of the AQDT approach 24

3-2 A ruleset generated by AQ15 for the concept Voting pattern of

Democratic Representatives 27

3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33

3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33

3-5 Decision trees showing the maximum number of non-leaf nodes 41

3-6 Decision rules for selecting the best tool for testing software 43

3-7 A decision structure learned for classifying software testing tools 45

3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46

3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47

3-10 A decision tree learned without the cost attribute 47

3-11 Decision structures learned by AQDT-2 using different criteria 55

3-12 Imam's example: a case where learning decision structures (trees)

from rules is better than learning them from examples 56

3-13 An example where decision rules are simpler than decision trees 57

4-1 Design of a complete experiment 59


4-2 Decision rules determined by AQ15c from the wind bracing data 61

4-3 A decision tree learned by C45 for the wind bracing data 63

4-4 A decision structure learned from AQ15c wind bracing rules 64

4-5 A decision structure that does not contain attribute xl 64

4-6 A decision structure without Xl with candidate decisions assigned to leaves 65

4-7 A decision structure determined from rules in Figure 4-4 under the

assumption of 10 classification error in the training data 65

4-8 Diagrammatic visualization of decision trees learned for different decision-making

situations for the wind bracing data 66

4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings

for the wind bracing problem 68

4-10 Analyzing different parameter setting of AQDT-2 using the wind bracing data 69

4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69

4-12 A visualization diagram of the MONK-1 problem 70

4-13 Decision rules learned by AQ15c for the MONK-1 problem 71

4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72

4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72

4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings

for the MONK-1 problem 74

4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75

4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75

4-19 A visualization diagram of the MONK-2 problem 76

4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings

for the MONK-2 problem 78

4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79

4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79

4-23 A visualization diagram of the MONK-3 problem 80

4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings

for the MONK-3 problem 82

4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82

4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83

4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84

4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85

4-29 Decision structures learned by AQDT-2 for different decision-making situations 87

4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88

4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91

4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92

4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93

4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94

4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing

the generalization degree to 1 94

DERIVING TASK-ORIENTED DECISION STRUCTURES

FROM DECISION RULES

Ibrahim M. Fahmi Imam, PhD

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge base from the function of using the knowledge base for decision-making. The first function focuses on learning an accurate, consistent, and complete concept description expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that decision structures learned by it usually outperform, in terms of accuracy and average size of the decision structures, those learned from examples by other well known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using applications to artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing Democratic and Republican voting records).

CHAPTER 1 INTRODUCTION

1.1 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.

A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation); the branches are assigned possible test outcomes or ranges of outcomes; and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when the leaves are assigned single, definite decisions. Thus, the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
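The node/branch/leaf description above can be made concrete with a small data type. The following is a minimal sketch (illustrative Python with hypothetical names, not code from AQDT-2) in which branches carry sets of test outcomes and leaves may hold several candidate decisions with probabilities; the decision tree is the special case noted in the text.

```python
# A minimal sketch of a decision structure: branches may be labeled with
# sets of test outcomes, and leaves may hold several candidate decisions.
# Names are illustrative only; they are not taken from AQDT-2 itself.

class Node:
    def __init__(self, test=None, decisions=None):
        self.test = test              # attribute (or function of attributes) to evaluate
        self.branches = {}            # frozenset of outcomes -> child Node
        self.decisions = decisions    # leaf: list of (decision, probability)

    def add_branch(self, outcomes, child):
        self.branches[frozenset(outcomes)] = child

    def classify(self, example):
        if self.test is None:
            return self.decisions
        value = example[self.test]
        for outcomes, child in self.branches.items():
            if value in outcomes:
                return child.classify(example)
        return self.decisions         # outcome not covered: undetermined decision

# A decision tree is the special case where every branch set is a singleton
# and every leaf holds exactly one decision.
root = Node(test="x2")
root.add_branch({0}, Node(decisions=[("A1", 1.0)]))
root.add_branch({1, 2}, Node(decisions=[("A1", 0.6), ("A2", 0.4)]))
print(root.classify({"x2": 1}))   # [('A1', 0.6), ('A2', 0.4)]
```

The example deliberately shows a non-tree feature: one branch covers two outcome values, and its leaf carries candidate decisions with probabilities rather than a single definite decision.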

Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain, and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).


A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed, which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root), and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful either to modify the structure so that it does not contain that attribute or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).

A restructuring of a decision structure (or a tree) in order to suit new requirements can be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation, such as a set of decision rules. Tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees), which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify, to adapt to different situations, than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.

An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form and transform it to a decision structure when it is needed for decision-making. This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generation from training examples. Thus, this process could be done on line, without any delay noticeable to the user. Such "virtual" decision structures are easy to tailor to any given decision-making situation.

This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus, such an approach has many potential advantages.

This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15 (Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating "unknown" nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to inability to evaluate an attribute associated with an intermediate node.


To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments has been designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structures learned from them, and comparing decision trees learned by the AQDT-2 system with those learned by C4.5 (Quinlan, 1993), the well known system for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2 and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.

1.2 The Problem Statement

There are many limitations and problems that accompany the use of decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision tables. The method proposed several attribute selection criteria of increasing power, based on the main criterion, the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree, based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.

Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint. In other words, for any two rules there exists a condition with the same attribute but with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram the condition parts of the given rules, and marking them with the action specified by each rule.
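Definition 2-2 can be stated operationally. The sketch below (illustrative Python, not code from the dissertation) represents a rule as a mapping from attributes to admissible value sets and checks the pairwise logical disjointness of a cover: two rules are disjoint exactly when some attribute they both constrain has non-overlapping value sets in the two rules.

```python
# Definition 2-2 as code: a cover is disjoint if every pair of its rules
# constrains some common attribute to non-overlapping value sets.
# A rule is {attribute: set_of_allowed_values}; an attribute absent from
# a rule is unconstrained. (Illustrative sketch, not AQDT-2 code.)

from itertools import combinations

def rules_disjoint(r1, r2):
    common = set(r1) & set(r2)
    return any(not (r1[a] & r2[a]) for a in common)

def is_disjoint_cover(rules):
    return all(rules_disjoint(r1, r2) for r1, r2 in combinations(rules, 2))

# The cover of class A1 from the example in Section 2.1: [x2=0] v [x1=0][x2=2]
a1 = [{"x2": {0}}, {"x1": {0}, "x2": {2}}]
print(is_disjoint_cover(a1))   # True: the two rules differ on x2
```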

Michalski's algorithm requires the construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that, if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal


decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has shown that if even one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or its decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1]; x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1 An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).


The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute selected is the one which breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).

Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.

Example: Learn a decision tree from the following decision table.

The minimal cover consists of the following rules:

A1 <- [x2=0] v [x1=0][x2=2]    A2 <- [x2=1] v [x1=2][x2=2]    A3 <- [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it will add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as a root of


the decision tree is x2. Then, three branches are attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case, x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.


Figure 2-2 A decision tree learned from the decision table in Table 2-1
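The static (first-degree) cost estimate used in this example can be sketched in a few lines. The attribute domains below are assumptions made for illustration (Table 2-1 itself defines the data; in particular, the domains of x3 and x4 are guessed as binary); under these assumptions the sketch reproduces the MAL values computed above: 2 for x1, 0 for x2, 5 for x3, and 5 for x4.

```python
# A sketch of Michalski's first-degree (static) cost estimate, i.e., the MAL
# criterion: for each attribute, count the rules it would break. An attribute
# breaks a rule when the rule's admissible values for it (the full domain if
# the rule does not constrain it) fall on more than one branch.

def mal(rules, domains, attr):
    return sum(1 for r in rules if len(r.get(attr, domains[attr])) > 1)

# Assumed domains for the attributes of Table 2-1 (illustrative only).
domains = {"x1": {0, 1, 2}, "x2": {0, 1, 2}, "x3": {0, 1}, "x4": {0, 1}}

# Minimal cover from the example: A1 <- [x2=0] v [x1=0][x2=2],
# A2 <- [x2=1] v [x1=2][x2=2], A3 <- [x1=1][x2=2]
rules = [{"x2": {0}}, {"x1": {0}, "x2": {2}},
         {"x2": {1}}, {"x1": {2}, "x2": {2}},
         {"x1": {1}, "x2": {2}}]

scores = {a: mal(rules, domains, a) for a in domains}
print(scores)   # {'x1': 2, 'x2': 0, 'x3': 5, 'x4': 5} -> x2 is selected as root
```

Note that x3 and x4, which appear in no rule of the cover, break all five rules, while x2, which every rule constrains to a single value, breaks none.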

22 Learning Decision Trees from Examples

Decision tree learning is a field that concerned with generating decision tree that classifies a set

of examples according to the decision classes they belong to The essential aspect of any

inductive decision tree method is the attribute selection criterion The attribute selection

criterion measures how good the attributes are for discriminating among the given set of

decision classes The best attribute according to the selection criterion is chosen to be assigned

to a node in the tree The fIrst algorithm for generating decision trees from examples was

proposed by Hunt Marin and Stone (1966) Hunts algorithm uses a divide and conquer

algorithm for building decision trees This algorithm has been subsequently modified by

Quinlan (1979) and applied by many researchers to a variety of learning problems

Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria for selecting attributes use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory; these criteria measure the information conveyed by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure and the gain criteria (Quinlan, 1979, 1983), the Gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions to determine whether or not there is a correlation; the attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial complete decision tree, and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.

2.2.1 Building Decision Trees Using Information-based Criteria


This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered one of the most stable, accurate, and fastest programs for learning decision trees from examples.

Learning decision trees from examples requires a collection of examples, each represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing decision trees from a set of cases (Hunt, Marin & Stone, 1966).

The C4.5 system uses an attribute selection criterion called the gain ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the gain criterion, which uses the frequency of each decision class in the given set of training examples.

Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and partitions the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.
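The loop just described (split on the best attribute, stop when a node is pure) can be sketched as a small recursive procedure. This is an illustrative ID3/C4.5-style sketch, not the actual C4.5 code: it uses the plain gain criterion of equations 2-2 to 2-4, and the dictionary-based tree format is an assumption made for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """info(S): expected number of bits to identify the class of an example in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, labels, attr):
    """Information gained by partitioning the examples on attr."""
    subsets = {}
    for ex, y in zip(examples, labels):
        subsets.setdefault(ex[attr], []).append(y)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def build_tree(examples, labels, attrs):
    """Grow a tree: emit a leaf when one class remains, else split on the
    attribute with maximum gain and recurse on each value's subset."""
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    best = max(attrs, key=lambda a: gain(examples, labels, a))
    tree = {'attr': best, 'branches': {}}
    for value in {ex[best] for ex in examples}:
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree['branches'][value] = build_tree(
            [examples[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return tree
```

On Quinlan's 14-example weather data (Table 2-2), this procedure selects "outlook" as the root and turns the all-"Play" overcast branch into a leaf.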

The Gain Criterion. The gain criterion is based on information theory: the information conveyed by a message depends on its probability, and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem x1, ..., xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:

13

freq(Ci, S) = number of examples in S that belong to Ci   (2-1)

Suppose that |S| is the total number of examples in S; the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.

The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2 (freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by

info(S) = - Σ(i=1..k) (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|) bits   (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.

Suppose that we selected an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, infoX(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:

infoX(T) = Σ(i=1..k) (|Ti| / |T|) info(Ti)   (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by

gain(X) = info(T) - infoX(T)   (2-4)

The attribute selected is the attribute with the maximum gain value.

The Gain Ratio Criterion. This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains an attribute such as social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.

Following steps similar to those used to obtain the information conveyed by dividing T into n subsets, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by

split info(T) = - Σ(i=1..n) (|Ti| / |T|) log2 (|Ti| / |T|)   (2-5)

The gain ratio is given by

gain ratio(X) = gain(X) / split info(X)   (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.

Example: Consider the following example presented by Quinlan (1993); Table 2-2 shows the set of training examples.

First, determine the amount of information gained by selecting the attribute "outlook" to be the root of the decision tree. This attribute divides the training examples into three subsets: "sunny", with five examples, two of which belong to the class "Play"; "overcast", with four examples, all of which belong to the class "Play"; and "rain", with five examples, three of which belong to the class "Play". To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class "Play" and five belong to the class "Don't Play".

info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.94 bits

When using "outlook" to divide the training examples, the information becomes

infoX(T) = 5/14 (-2/5 log2 (2/5) - 3/5 log2 (3/5))
         + 4/14 (-4/4 log2 (4/4) - 0/4 log2 (0/4))
         + 5/14 (-3/5 log2 (3/5) - 2/5 log2 (2/5)) = 0.694 bits

By substituting in equation 2-4, the information gain resulting from using the attribute "outlook" to split the training examples equals 0.246. The information gain for "windy" is 0.048.

Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for "outlook" is determined as follows:

split info(T) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits

The gain ratio for "outlook" = 0.246 / 1.577 = 0.156
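These hand computations can be checked mechanically. The short sketch below recomputes info(T), the gain for "outlook", the split information, and the gain ratio from the class counts quoted above; the small difference from the text (0.247 versus 0.246) is rounding.

```python
from math import log2

def entropy(counts):
    """info(S) from a list of class counts (equation 2-2)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# 14 training examples: 9 "Play", 5 "Don't Play".
info_T = entropy([9, 5])                                  # ~0.940 bits

# "outlook" splits T into sunny (2 Play / 3 Don't), overcast (4/0), rain (3/2).
subsets = [[2, 3], [4, 0], [3, 2]]
info_X = sum(sum(s) / 14 * entropy(s) for s in subsets)   # ~0.694 bits
gain_outlook = info_T - info_X                            # ~0.247

# Split information depends only on the subset sizes (5, 4, 5), equation 2-5.
split_info = entropy([5, 4, 5])                           # ~1.577 bits
gain_ratio = gain_outlook / split_info                    # ~0.156
```

Note that split info reuses the same formula as entropy, applied to subset sizes instead of class counts.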


Figure 2-3: A decision tree learned using the gain criterion for selecting attributes.

The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the values of that attribute are greater than the determined threshold, and the other where the value is less than or equal to the threshold.
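The threshold itself can be chosen by scoring candidate cut points. Below is a minimal sketch of this search, assuming the plain gain criterion and taking an observed attribute value as the threshold (as C4.5 does); it is an illustration, not C4.5's exact procedure.

```python
from math import log2

def entropy(labels):
    """info(S) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Binary split on a continuous attribute: try each boundary between
    adjacent sorted values and keep the threshold t maximizing the gain
    of the partition {<= t, > t}."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        t = pairs[i - 1][0]               # use an observed value as threshold
        left = [y for v, y in pairs[:i]]
        right = [y for v, y in pairs[i:]]
        g = (base
             - len(left) / len(pairs) * entropy(left)
             - len(right) / len(pairs) * entropy(right))
        if g > best[1]:
            best = (t, g)
    return best
```

For values 1, 2, 3 labeled "a" and 10, 11, 12 labeled "b", the search returns the clean cut at 3 with a full bit of gain.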

Tree pruning in C4.5 is a process of replacing subtrees that have small classification validity by leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
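As a sketch, the Laplace ratio and a resulting pruning test might look as follows. The rule of pruning when a collapsed leaf's estimate is no worse than the example-weighted estimate over the subtree's leaves is an assumption made for illustration, not C4.5's exact procedure.

```python
def laplace_error(n, e):
    """Laplace error ratio (e + 1) / (n + 2) for a leaf covering n training
    examples, e of them misclassified."""
    return (e + 1) / (n + 2)

def subtree_error(leaves):
    """Example-weighted Laplace estimate over a subtree's leaves,
    each given as an (n, e) pair."""
    total = sum(n for n, _ in leaves)
    return sum(n / total * laplace_error(n, e) for n, e in leaves)

def should_prune(collapsed_n, collapsed_e, leaves):
    """Prune when replacing the subtree by a single leaf (with the given
    counts) does not increase the estimated error."""
    return laplace_error(collapsed_n, collapsed_e) <= subtree_error(leaves)
```

For example, a subtree with leaves (6 examples, 0 errors) and (4 examples, 1 error) has a weighted estimate of about 0.208; collapsing it into a leaf with 1 error out of 10 estimates 2/12 ≈ 0.167, so the subtree would be pruned under this test.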

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses the Chi-square statistic to measure the association between two attributes. When building decision trees, the method is implemented so that it determines the association between each attribute and the decision classes. The attribute selected is the one with the greatest value.

To determine the Chi-square value for an attribute, let aij be the number of examples in class number i for which the attribute A takes value number j. In other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by

Chi-square(A) = Σ(i=1..n) Σ(j=1..m) [(aij - Eij)² / Eij]   (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,

Eij = (TCi × TVj) / T   (2-8)

where TCi and TVj are the total number of examples belonging to the decision class Ci and the total number of examples for which the attribute A takes value vj, respectively, and T is the total number of examples.

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequency of each combination of values between the decision class and both the "outlook" and the "windy" attributes. Table 2-4 shows the expected values (computed from TCi and TVj) of the frequencies in Table 2-3, for the different attribute values and decision classes.

To determine the association value between the decision classes and both the attribute "windy" and the attribute "outlook", the observed Chi-square values (with expected frequencies rounded to one decimal) are:

Chi-square(Windy, Class) = (3-3.9)²/3.9 + (3-2.1)²/2.1 + (6-5.1)²/5.1 + (2-2.9)²/2.9
= 0.21 + 0.39 + 0.16 + 0.28 = 1.04

Chi-square(Outlook, Class) = (2-3.2)²/3.2 + (4-2.6)²/2.6 + (3-3.2)²/3.2 + (3-1.8)²/1.8 + (0-1.4)²/1.4 + (2-1.8)²/1.8
= 0.45 + 0.75 + 0.01 + 0.80 + 1.40 + 0.02 = 3.43


Applying the same method to the other attributes, the results favor the attribute "outlook". Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset.
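A small check of equations 2-7 and 2-8 on the two contingency tables above. It keeps the expected frequencies Eij unrounded, so the totals (≈0.93 for "windy", ≈3.55 for "outlook") differ slightly from the hand computation, which rounds Eij to one decimal; the ranking of the attributes is the same.

```python
def chi_square(table):
    """Equations 2-7 and 2-8: table[i][j] is the number of examples in
    decision class i taking attribute value j."""
    row = [sum(r) for r in table]          # T_Ci: class totals
    col = [sum(c) for c in zip(*table)]    # T_Vj: value totals
    total = sum(row)                       # T
    return sum((table[i][j] - row[i] * col[j] / total) ** 2
               / (row[i] * col[j] / total)
               for i in range(len(row)) for j in range(len(col)))

windy = [[3, 6],       # Play:  windy = true, false
         [3, 2]]       # Don't: windy = true, false
outlook = [[2, 4, 3],  # Play:  sunny, overcast, rain
           [3, 0, 2]]  # Don't: sunny, overcast, rain
```

Here chi_square(outlook) exceeds chi_square(windy), so "outlook" is again the preferred attribute.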

Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5: Attribute selection criteria and their basic evaluation measures

Info Measure (IM), Gain, G-statistic, and Gain Ratio:
  Entropy(S) = - Σi (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)
  G-statistic = 2N × IM (N = number of examples)
Chi-square:
  Chi-square(A, B) = Σ(i=1..n) Σ(j=1..m) [(aij - Eij)² / Eij]

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria performed by Mingers (1989a). Mingers compared six attribute selection criteria used in decision tree programs: the Information Measure (IM), Chi-square, the G statistic, the Gini index of diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, a zero cell adds the maximum association between any two attributes, because the Chi-square contribution of a zero cell equals the expected value of that cell.


Now let us examine results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees with eleven different criteria. In the final results, he compared the total number of nodes and the total error rate produced by each criterion over all given problems. Table 2-8 shows the final results for five selected criteria only.


Table 2-8: Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on the four problems.

This experiment was performed on four real-world data sets, concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details, see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas of Imam and Michalski (1993b).

In the first approach, Brian Gaines introduced a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new conclusion C1, which is satisfied by the rule R0. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a


new child node from the root and repeats the process until all rules are evaluated. In the decision structure, nodes containing only rules represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is "Safe", except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is "Lost", except if x6=1 it is "Safe", except if x7=1 it is "Lost".

The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph in which each attribute occurs at most once along any computational path. In other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once; however, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets, such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph in which all nodes at a given level are labeled by the same attribute.


Safe <:: [x1=2]
Safe <:: [x2=2]
Safe <:: [x3=2]
Safe <:: [x4=1] & [x5=2]
Safe <:: [x4=1] & [x5=3]
Safe <:: [x6=1] & [x7=2]
Safe <:: [x6=1] & [x7=3]
Safe <:: [x4=2] & [x5=2]
Safe <:: [x4=2] & [x5=3]

Lost <:: [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a combination of that attribute's values. For each subset, the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains the examples where A takes value 0 and the class is C0, or A takes value 1 and the class is C1; the second subset contains the examples where A takes value 0 and the class is C1, or A takes value 1 and the class is C0. The number of nodes at the first level (after the leaf nodes) is expected to be at most k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced to one.

It is easy for the reader to identify some major disadvantages of this approach. The average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., no strong patterns) or logical relationship in the data. The time used to learn such a decision structure is very high relative to systems for learning decision trees from examples. Finally, it could be better to search for an attribute that reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system, which can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection, which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and those two approaches. The EDAG and HOODG systems are unreleased prototype systems. Among the compared properties, decision structures produced by the proposed approach are easy to understand, EDAGs are difficult to read, and HOODG decision structures are easy to understand.

CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.

The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given:
- A set of decision rules in conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of


declarative knowledge are that they do not impose any order on the evaluation of the attributes and, due to this lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to different decision-making tasks (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on line.

Such "virtual" decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized, and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows an architecture of the proposed methodology.

Figure 3-1: Architecture of the AQDT approach (learning knowledge from the database; the decision-making process).


It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values; some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system, specifically AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called "AQ", starts with a "seed" example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the "star" of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description covering the largest number of positive examples (to minimize the total number of rules needed) and, with second priority, involving the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few or many examples, and can optimize the description according to a variety of easily modifiable hypothesis quality criteria.
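The covering loop just described can be sketched as follows. This is a deliberately simplified stand-in: real star generation keeps a set of alternative maximally general descriptions and ranks them by the quality criterion, whereas this sketch greedily "extends against" each negative example on a single attribute; consistent training data (no negative identical to a seed) is assumed.

```python
def covers(rule, example):
    """rule: dict mapping attribute -> set of allowed values;
    an absent attribute places no constraint (maximal generality)."""
    return all(example[a] in vals for a, vals in rule.items())

def rule_for_seed(seed, negatives, domains):
    """Greedy stand-in for star generation: start from the most general
    rule and, for each negative example still covered, specialize by
    excluding the negative's value on one attribute where it differs
    from the seed. The seed's own values are never excluded."""
    rule = {}
    for neg in negatives:
        if not covers(rule, neg):
            continue
        attr = next(a for a in domains if seed[a] != neg[a])
        allowed = rule.get(attr, set(domains[attr]))
        allowed.discard(neg[attr])
        rule[attr] = allowed
    return rule

def aq_cover(positives, negatives, domains):
    """AQ covering loop: pick an uncovered seed, form a rule, remove the
    positives it covers, and repeat until the class is fully covered."""
    remaining, rules = list(positives), []
    while remaining:
        rule = rule_for_seed(remaining[0], negatives, domains)
        rules.append(rule)
        remaining = [p for p in remaining if not covers(rule, p)]
    return rules
```

Each iteration covers at least its seed, so the loop terminates with a set of rules that jointly cover every positive example while excluding every negative one.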


The learned descriptions are represented as a set of decision rules expressed in an attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, i.e., stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have large flat top". A characteristic description of the tables would also include properties such as "have four legs", "have no back", "have four corners", etc. Discriminant descriptions are usually much shorter than characteristic descriptions.

Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or "covers") of different decision classes. In the "IC" (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the "DC" (Disjoint Covers) mode, descriptions of different classes are logically disjoint. The DC mode descriptions are usually more complex, both in the number of rules and in the number of conditions. There is also a "DL" mode (a Decision List mode, also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order: if ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes and selects from them the most promising ones, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the US Congress. Each rule is a conjunction of elementary conditions, and each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute "State" (of the Representative) should take the value "northeast" or "northwest" to satisfy the condition.

R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives".

The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record of a Democratic representative:

Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State from = northeast, State population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler corp. = not registered

By expressing the elementary statements in the example as conditions, and linking the conditions by conjunction, the example can be re-expressed as a decision rule. Thus, decision rules and examples formally differ only in their degree of generality.
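The correspondence between rules and examples is easy to see operationally. Below is a minimal sketch of rule matching with internal disjunction, using R2 from Figure 3-2 and the voting record above; the shortened attribute names are an assumption made for the example.

```python
def satisfies(example, rule):
    """A rule is a conjunction of conditions; each condition is a pair
    (attribute, set of allowed values), the set realizing the internal
    disjunction, e.g. [State = northeast v northwest]."""
    return all(example.get(attr) in values for attr, values in rule)

# R2 from Figure 3-2, with shortened attribute names (assumed here).
r2 = [('Draft', {'yes', 'not registered'}),
      ('Alaska_parks', {'yes', 'not registered'}),
      ('Food_stamp_cap', {'no'}),
      ('State', {'northeast', 'northwest'})]

# The voting record quoted above, restricted to the attributes R2 tests.
record = {'Draft': 'no', 'Alaska_parks': 'yes',
          'Food_stamp_cap': 'no', 'State': 'northeast'}

# R2 does not fire on this record: the Draft condition fails.
```

Flipping a single attribute-value (e.g., Draft to "not registered") makes the conjunction succeed, which is exactly the sense in which an example is a maximally specific rule.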

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski, 1993a, b). A description of the AQDT-2 method for learning task-oriented decision structures from decision rules is also included, and finally the methodology is illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning

due to their simplicity Decision trees built this way can be quite efficient as long as they are

used in decision-making situations for which they are optimized and these situations remain

relatively stable Problems arise when these situations significantly change and the assumptions

under which the tree was built do not hold anymore For example in some situations it may be

difficult to determine the value of the attribute assigned to some node One would like to avoid

measuring this attribute and still be able to classify the example if this is potentially possible

(Quinlan 1990) If the cost of measuring various attributes changes it is desirable to restructure

the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also

desirable if there is a significant change in the frequency of occurrence of examples from

different classes A restructuring of a decision tree to suit the above requirements is however

difficult to do. The reason for this is that decision trees are a form of decision structure representation that imposes constraints on the evaluation order of the attributes which are not logically necessary.


One problem in developing a method for generating decision structures from decision rules is to

design an attribute selection criterion that is based on the properties of the rules rather than of

the training examples A decision rule normally describes a number of possible examples Only

some of them are examples that have actually been observed ie training examples An attribute

selection criterion is needed to analyze the role of each attribute in the rules It cannot be based

on counting the numbers of training examples covered by each attribute-value and the

frequency of decision classes in the training examples as is done in learning decision trees from

examples because the training examples are assumed to be unavailable

Another problem in learning decision trees from decision rules stems from the fact that decision

rules constitute a more powerful knowledge representation than decision trees They can directly

represent a description in an arbitrary disjunctive normal form while decision trees can represent

directly only descriptions in the disjoint disjunctive normal form In such descriptions all

conjunctions are mutually logically disjoint Therefore when transforming a set of arbitrary

decision rules into a decision tree one faces an additional problem of handling logically

intersecting rules

The solution to both problems (attribute selection and logically intersected rules) in the AQDT-2

system is based on the earlier work by Michalski (1978) which introduced a general method for

generating decision trees from decision rules The method aimed at producing decision trees

with the minimum number of nodes or the minimum cost (where the cost was defined as the

total cost of classifying unknown examples given the cost of measuring individual attributes and

the expected probability distribution of examples of different decision classes) More

explanations are provided in the following section

3.3.1 The AQDT-2 attribute selection method

This section describes the AQDT-2 method for building a decision structure from decision rules.

The method for building a single-parent decision structure is similar to that used in standard


methods of building a decision tree from examples The major difference is that it assigns tests

(attributes) to the nodes using criteria based on the properties of the decision rules (this includes

statistics about the examples covered by each rule in the case of learning rules from examples)

rather than statistics characterizing the frequency of training examples per decision classes per

attribute-values or per conjunctions of both Other differences are that the branches may be

assigned an internal disjunction of values (not only a single value as in a typical decision tree)

and leaves may be assigned a set of alternative decisions with probabilities Also the tests can be

attributes or names standing for logical or mathematical expressions that involve several

attributes or variables In the following we use the terms test and attribute interchangeably

(to distinguish between an attribute and a name standing for an expression the latter is called a

constructed attribute)

At each step the method chooses the test from an available set of tests that has the highest utility

(see below) for the given set of decision rules This test is assigned to the node The branches

stemming from this node are assigned test values or disjoint groups of values (in the form of

logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each

branch is associated with a reduced set of rules determined by removing conditions in which the

selected attribute assumes value(s) assigned to this branch If all rules in the reduced ruleset

indicate the same decision class a leaf node is created and assigned this decision class The

process continues until all nodes are leaf nodes If it is not possible to reduce further the rule set

because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of

candidate decisions with associated probabilities (see Sec 42)

The test (attribute) utility is a combination of one or more of the following elementary criteria 1)

cost which indicates the cost of using each attribute for making decision 2) disjointness which

captures the effectiveness of the test in discriminating among decision rules for different decision

classes, 3) importance, which determines the importance of a test in the rules, 4) value distribution, which characterizes the distribution of the test importance over its set of values, and 5) dominance, which measures the test presence in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of the class disjointnesses, i.e., the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and decision rule sets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote the sets of values (outcomes) of A that are present in the rule sets for classes C1, C2, ..., Cm, respectively. If the ruleset for some class, say Ck, contains a rule that does not involve test A, then Vk is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by:

               | 0, if Vi = Vj
D(A, Ci, Cj) = | 1, if Vi ⊂ Vj or Vi ⊃ Vj                                   (3-1)
               | 2, if Vi ∩ Vj ≠ φ, and Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj
               | 3, if Vi ∩ Vj = φ

where φ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) may seem to give an improved criterion; however, it would not clearly distinguish between the two cases (i.e., for both situations the disjointness would be similar). The current equation is better because it gives higher scores to attributes that classify different subsets of the two decision classes than to attributes that classify only a subset of one decision class.

Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the sum of the degrees of class disjointness of each decision class:


Disjointness(A) = Σ_{i=1..m} D(A, Ci),  where  D(A, Ci) = Σ_{j=1..m, j≠i} D(A, Ci, Cj)    (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the attribute to be selected is the one with the smaller number of values.
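As an illustration, the disjointness computation of Definition 3-1 and Equation (3-2) can be sketched as follows; the value sets at the end are hypothetical:

```python
# Sketch of the disjointness criterion (Definition 3-1 / Equation 3-2).
# Vi is the set of values of test A appearing in the ruleset of class Ci;
# a rule not mentioning A contributes the whole domain of A.

def pair_disjointness(vi, vj):
    """Degree of disjointness D(A, Ci, Cj) between two value sets."""
    vi, vj = set(vi), set(vj)
    if vi == vj:
        return 0
    if vi < vj or vi > vj:          # proper subset either way
        return 1
    if vi & vj:                     # overlap, but neither contains the other
        return 2
    return 3                        # no common values

def disjointness(value_sets):
    """Disjointness(A): sum of D(A, Ci, Cj) over all ordered class pairs."""
    return sum(pair_disjointness(vi, vj)
               for i, vi in enumerate(value_sets)
               for j, vj in enumerate(value_sets) if i != j)

# Hypothetical value sets of one test in the rulesets of three classes;
# the result lies between 0 and 3*m*(m-1), here between 0 and 18.
sets = [{2, 3}, {1, 2, 3}, {1, 4}]
print(disjointness(sets))
```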

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined from the root of the tree to any leaf node in order to reach a decision.

Definition: A decision structure is a one-node-per-level decision structure if at each level there is

only one node and zero or more leaves

Such a decision structure can be generated by combining together all branches whose associated sets of decision rules belong to more than one decision class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes Ci and Cj. There are three cases: 1) subset (same as superset), 2) non-empty intersection but not subset, and 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the same set of values in both classes is a trivial one). Consider that branches leading to one subset with the same decision class are combined into one branch. In the first case, there will be two branches only. The first leads to a leaf node, and the other leads to an intermediate node where another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three

branches should be created. Two branches lead to leaf nodes, where all values at each branch belong to only one, and a different, decision class. The third branch leads to an intermediate node where another attribute should be selected to further classify the decision classes. The minimum ANT in this case is 6/4. In the third case, only two branches will be generated, where each leads to a leaf node with a different decision class. In this case the minimum ANT is 1.
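These ANT values can be checked mechanically. The sketch below represents a tree as nested tuples, takes ANT as the average leaf depth, and assumes branches leading to same-class subsets have already been merged; the class labels are illustrative:

```python
# Sketch: computing the Average Number of Tests (ANT) of a decision tree.
# A tree is ("attribute", [children]); a leaf is any non-tuple class label.
from fractions import Fraction

def leaf_depths(tree, depth=0):
    if not isinstance(tree, tuple):          # a leaf: decision class label
        return [depth]
    _attr, children = tree
    return [d for child in children for d in leaf_depths(child, depth + 1)]

def ant(tree):
    depths = leaf_depths(tree)
    return Fraction(sum(depths), len(depths))

# The three cases from the proof (class labels are illustrative):
case1 = ("A", ["Ci", ("B", ["Ci", "Cj"])])           # subset case
case2 = ("A", ["Ci", "Cj", ("B", ["Ci", "Cj"])])     # overlap case
case3 = ("A", ["Ci", "Cj"])                          # disjoint case
print(ant(case1), ant(case2), ant(case3))  # 5/3, 6/4 (= 3/2), 1
```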

Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that in the case of having more than one attribute-value at branches leading to leaves belonging to one decision class, these branches are combined into one branch in the decision structure. The symbol "1" means that an attribute is needed to classify the two decision classes; in such cases there will be at least two additional paths.

[Figure: D(A, Ci) = D(A, Cj) = 1 (subset); D(A, Ci) = D(A, Cj) = 2 (intersection); D(A, Ci) = D(A, Cj) = 3 (disjoint)]

Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes

The average number of tests required for making a decision in each possible case is determined

in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks highly the attributes that reduce the average number of tests required for decision-making. The theorem can be proved similarly in the general case.

[Figure: three decision trees, with ANT = 3/2, ANT = 5/3, and ANT = 1]

("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify the decision classes.

Proof Suppose that the number of decision classes is n Assume also that there are two attributes

A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, this means that there are more decision classes where D(A, Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj) than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute are 0, 2, 4, or 6. For all positive values of D(B) = 2, 4, or 6, it is clear that attribute B should have a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.

Similarly B is better for classifying more pairs of decision classes than A This implies that B is

a better classifier than A

Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples that are covered by the rules involving this test. Decision rules learned by an AQ learning program

are accompanied by information on their strength. Rule strength is characterized by a t-weight and a u-weight. The t-weight (total weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the total weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes C1, ..., Cm, involving n tests A1, ..., An, with the number of rules associated with class Ci denoted by ri, the importance score is defined as follows:


Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

IS(Aj) = Σ_{i=1..m} IS(Aj, Ci)    (3-3.1)

where

IS(Aj, Ci) = Σ_{k=1..ri} Rik(Aj)    (3-3.2)

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by:

Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0 otherwise    (3-4)

where i = 1, ..., m; k = 1, ..., ri; and j = 1, ..., n.

The importance score method has been separately compared as a feature selection method with

a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method

produced an equal or higher accuracy on three real-world problems than those reported by the

GA method while selecting fewer attributes In addition the IS method was significantly faster

than the GA
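A minimal sketch of the importance-score computation, assuming rules are given as (class, t-weight, tests-used) triples; the rules and weights below are hypothetical:

```python
# Sketch of the importance score (Definition 3-3): for each test, sum the
# t-weights of all rules whose condition part mentions that test.
# Rules are (class, t_weight, set_of_tests_used) triples; data is hypothetical.

def importance_scores(rules):
    scores = {}
    for _cls, t_weight, tests in rules:
        for test in tests:
            scores[test] = scores.get(test, 0) + t_weight
    return scores

rules = [
    ("C1", 10, {"x1", "x2"}),   # a rule covering 10 training examples
    ("C1", 4,  {"x1", "x3"}),
    ("C2", 7,  {"x2"}),
]
print(importance_scores(rules))  # {'x1': 14, 'x2': 17, 'x3': 4}
```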

Value distribution: The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: The value distribution VD(Aj) of a test Aj is defined by:

VD(Aj) = IS(Aj) / vj    (3-5)

where vj is the number of legal values of Aj.

Dominance: The fourth elementary criterion, dominance, prefers tests that appear in a large number of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore,


for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is multiplied out to two rules with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
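This multiplying-out step can be sketched as follows; the dict-based rule representation is our own, not AQ15's format:

```python
# Sketch: multiplying out internal disjunctions before counting rules for
# the dominance criterion. A condition part is a dict: attr -> list of values.
from itertools import product

def multiply_out(condition):
    """Expand e.g. [x3=1 v 3] & [x4=1] into single-value condition parts."""
    attrs = sorted(condition)
    return [dict(zip(attrs, combo))
            for combo in product(*(condition[a] for a in attrs))]

expanded = multiply_out({"x3": [1, 3], "x4": [1]})
print(expanded)  # [{'x3': 1, 'x4': 1}, {'x3': 3, 'x4': 1}]

def dominance(rules):
    """Count, per attribute, the expanded rules that mention it."""
    counts = {}
    for cond in rules:
        for expanded_cond in multiply_out(cond):
            for attr in expanded_cond:
                counts[attr] = counts.get(attr, 0) + 1
    return counts
```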

The above criteria are combined into one general test measure using the lexicographic

evaluation functional with tolerances (LEF) (Michalski 1973) LEF is a list of some or all of

the above elementary criteria, each associated with a "tolerance threshold" in percentage. The

criteria are applied to tests in the order defined by LEF A test passes to the next criterion only if

it scores on the previous criterion within the range defined by the tolerance (from the top value)

The default LEF is

<Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>    (3-6)

where t1, t2, t3, t4, and t5 are tolerance thresholds (in percentages); their default values are 0. The default value of the cost of each test is 1.

The above LEF ranks attributes in the following way. First, attributes are evaluated on the basis of their cost.

If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the second (importance) criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the third criterion, normalized IS (value distribution), is used, and then similarly the fourth criterion (dominance). If there is still a tie, the method selects the best attribute randomly.
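A sketch of LEF-style ranking with tolerance thresholds follows; the candidate scores are hypothetical, and the remaining tie after all criteria is broken arbitrarily rather than randomly:

```python
# Sketch of LEF ranking with tolerances: apply criteria in order, keeping
# only candidates whose score is within `tol` percent of the best score.

def lef_select(candidates, criteria):
    """candidates: dict name -> {criterion: score}.
    criteria: list of (criterion_name, tolerance_percent, maximize)."""
    survivors = list(candidates)
    for crit, tol, maximize in criteria:
        scores = {c: candidates[c][crit] for c in survivors}
        best = max(scores.values()) if maximize else min(scores.values())
        margin = abs(best) * tol / 100.0
        survivors = [c for c in survivors if abs(scores[c] - best) <= margin]
        if len(survivors) == 1:
            break
    return survivors[0]   # any tie left after all criteria: pick arbitrarily

attrs = {  # hypothetical scores
    "x1": {"cost": 1, "disjointness": 11, "importance": 20},
    "x2": {"cost": 1, "disjointness": 9,  "importance": 30},
    "x3": {"cost": 1, "disjointness": 11, "importance": 25},
}
order = [("cost", 0, False), ("disjointness", 0, True), ("importance", 0, True)]
print(lef_select(attrs, order))  # x3: ties x1 on disjointness, wins on importance
```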

If there is a non-uniform frequency distribution of examples of different classes then the

selection criterion uses a modified definition of the disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class

occurrence is the expected number of future examples that are to be classified to a given class

Disjointness(A) = Σ_{i=1..m} D(A, Ci) · Frq(Ci)    (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF:

<Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.

3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively

selecting at each step the best test according to the ranking criteria described above and

assigning it to the new node The process stops when the algorithm creates terminal branches that

are assigned decision classes To facilitate such a process the system creates a special data

structure for each concept description (ruleset) This structure has fields such as the number of

rules the number of decision classes and the number of attributes present in the rules A set of

pointers connects this data structure to a set of data structures, each representing one decision class.

The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is also connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, the number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first in descending order of the number of rules that contain that attribute, and second in ascending order of the number of the attribute's legal values.

The system can work in two modes In the standard mode the system generates standard

decision trees in which each branch has a specific attribute-value assigned In the compact

mode the system builds a decision structure that may contain

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever it

leads to simpler structures For example if a node assigned attribute A has a branch marked by

values 1 v 2, then control passes along this branch whenever A takes the value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing

the degree of attribute disjointness

B) nodes that are assigned derived attributes that is attributes that are certain logical or

mathematical combinations of the original attributes To produce decision structures with

derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The

AQ17 rules may contain conditions involving attributes constructed by the program rather than

those originally given

To generate decision structures from rules the AQDT-2 method prefers either characteristic or

discriminant disjoint rule descriptions (given by an expert or learned by a system) Disjoint rules

are more suitable for building decision structures Assume that the description of each class is in

the form of a ruleset, and that this set is the initial ruleset context. The algorithm is as follows:

The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure, and select the highest-ranked attribute. Let A represent this highest-ranked attribute.

Step 2: Create a node of the tree (initially the root; afterwards, a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it a group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing a condition [A = i v ...]. If a branch is assigned values i v j, then associate with it all rules containing a condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with the given branch constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree end in leaf nodes, stop; otherwise, repeat steps 1 to 4 for each branch that has no leaf.
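The four steps can be sketched as follows. This is a minimal standard-mode rendering that ranks attributes by disjointness only (ignoring cost, importance, value distribution, and dominance); the rule representation and helper names are our own, not AQDT-2's actual code:

```python
# A rule is (condition, class), where a condition maps an attribute to its set
# of admissible values; an absent attribute admits any value, which realizes
# the consensus-law step of Step 3.

def value_set(rules, attr, domain):
    """Values of `attr` appearing in a ruleset (whole domain if absent)."""
    vs = set()
    for cond, _cls in rules:
        vs |= cond.get(attr, domain[attr])
    return vs

def pair_disj(vi, vj):
    if vi == vj:
        return 0
    if vi < vj or vi > vj:
        return 1
    return 2 if vi & vj else 3

def disjointness(rules, attr, domain):
    classes = sorted({cls for _, cls in rules})
    vsets = {c: value_set([r for r in rules if r[1] == c], attr, domain)
             for c in classes}
    return sum(pair_disj(vsets[a], vsets[b])
               for a in classes for b in classes if a != b)

def build(rules, attrs, domain):
    if not rules:
        return []
    classes = {cls for _, cls in rules}
    if len(classes) == 1 or not attrs:
        return sorted(classes)              # leaf (or set of candidate classes)
    best = max(attrs, key=lambda a: disjointness(rules, a, domain))
    branches = {}
    for val in sorted(domain[best]):
        # Step 3: keep rules whose condition admits `val`, drop that condition
        subset = [({a: v for a, v in cond.items() if a != best}, cls)
                  for cond, cls in rules
                  if val in cond.get(best, domain[best])]
        branches[val] = build(subset, attrs - {best}, domain)
    return (best, branches)

# The six disjuncts of the Figure 3-6 rules:
domain = {"x1": {1, 2, 3, 4}, "x2": {1, 2, 3, 4}, "x3": {1, 2, 3}, "x4": {1, 2, 3}}
rules = [({"x1": {2}, "x2": {2}}, "T1"),
         ({"x1": {3}, "x3": {1, 3}, "x4": {1}}, "T1"),
         ({"x1": {1, 2}, "x2": {3, 4}}, "T2"),
         ({"x1": {3}, "x3": {1, 2}, "x4": {2}}, "T2"),
         ({"x1": {1}, "x2": {1}}, "T3"),
         ({"x1": {4}, "x3": {2, 3}, "x4": {3}}, "T3")]
tree = build(rules, set(domain), domain)
print(tree[0], tree[1][4])  # x1 is the root; the x1=4 branch is a T3 leaf
```

Run on these rules, the sketch selects x1 for the root and closes the x1 = 4 branch with a T3 leaf, matching the trace given in the example later in this section.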

To select an attribute to be a node in the decision tree (steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses through all decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of

each attribute and the attribute values used in describing each decision class The second

iteration is only performed if the disjointness criterion is ranked first in the LEF function The

second iteration evaluates each attribute's disjointness for each decision class against the other

decision classes


To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

r = Σ_{i=1..m} Ri    (where m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as:

Cmpx(Iter1) = O(r · s)

In the second iteration, the disjointness is calculated between the decision classes for all attributes. The complexity of the second iteration can be given by:

Cmpx(Iter2) = O(n · m)

Assume that, at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

l = max{m, r}    (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node of the decision tree, the node complexity NC(AQDT), is given by:

NC(AQDT) = O(l · n)

Usually l equals the number of rules associated with the given node. Thus the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The level complexity of the AQDT algorithm, LC(AQDT), is given by:

LC(AQDT) < O(l · n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree. However, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm equal to (l · s · o), where o is the number of non-leaf nodes at the given level. In such cases, either (l · o ≤ r) or (l · s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by:

LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)

a) per one level   b) per one path

Figure 3-5: Decision trees showing the maximum number of non-leaf nodes

Note also that after selecting an attribute to be the root of the decision structure this attribute

and all conditions containing that attribute are removed from the data structure of the algorithm

Also if a leaf node is generated all rules belonging to the corresponding branch will not be

tested again

Since the disjointness criterion selects the attribute which minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels.

The number of levels per decision tree is less than or equal to the minimum of the number of attributes and the number of rules. Let k be the number of levels in a given decision tree:

k ≤ min{n, r}    (3-10)

Two cases represent the most complex situations, shown in Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels will be a function of the logarithm of the number of rules. In such a case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by:

Complexity(AQDT) = O(l · n · log r)    (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels per decision tree equals one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, it is not likely to get such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In such a case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus the level complexity of this decision tree is estimated as:

LC(AQDT) = O(l · log n)

The maximum number of levels in such a decision tree is k-1. Thus the complexity of the AQDT algorithm in such cases is given by:

Complexity(AQDT) = O(l · k · log n)    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11), and (3-12), the complexity of the AQDT algorithm is determined by:

Cmplx(AQDT) = O(r · k · log l)    (3-13)

3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the selection process

T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software

These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing either in the requirement or the analysis phases and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing either in the requirement or the design phases and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing either in the requirement or the system usage phases and you need a semi-automated tool.


Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (12) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.

From the rules in Figure 3-6 we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: 1) determine for each attribute the sets of values that the attribute takes in the individual decision rules, and 2) remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). Value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2}, and {3, 4}.
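The subsumption-removal step can be sketched as follows, reusing the x2 value sets just derived (the function name is our own):

```python
# Sketch: deriving branch value sets for compact mode -- collect the value
# sets an attribute takes in individual rules, then drop any set that
# subsumes (is a proper superset of) another set.

def branch_value_sets(value_sets):
    sets = [set(s) for s in value_sets]
    kept = [s for s in sets if not any(other < s for other in sets)]
    result = []                      # drop duplicates, preserving order
    for s in kept:
        if s not in result:
            result.append(s)
    return result

# x2 in Figure 3-6: rules yield {1}, {2}, {3, 4}; rules not mentioning x2
# contribute the full domain {1, 2, 3, 4}, which subsumes the others.
print(branch_value_sets([{2}, {1, 2, 3, 4}, {3, 4}, {1}]))
# [{2}, {3, 4}, {1}]
```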


Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each one corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 is ended by a leaf T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools can be used for testing a given software system.

[Figure: decision structure rooted at x1. Complexity: no. of nodes: 4; no. of leaves: 7]

Figure 3-7: A decision structure learned for classifying software testing tools

Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3, and 4). Rules are represented by collections of cells in the intersection of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules. Rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, collections of cells corresponding to some of the initial rules are marked R11, R21, R31, and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].

Figure 3-8: a) The decision rules; b) the derived decision tree.

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to know which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4. For the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was removed from the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.


Figure 3-9: Decision trees learned a) ignoring the supporting metric and b) ignoring the type of the testing tool.

It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selected x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10: A decision tree learned without the cost attribute.

3.4 Tailoring Decision Structures to the Decision-Making Situation

Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This work presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm, implemented in a new system, AQDT-2, transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.

Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; 4) providing alternative decisions with an estimate of the likelihood of their correctness when the needed information cannot be provided.


3.4.1 Learning Cost-Dependent Decision Structures

As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation, which involves the other elementary criteria. If an attribute has a high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.
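The default LEF behavior described above can be sketched as a sequential filter (a hypothetical illustration: the costs, scores, tolerance semantics and function names below are invented for the example, not taken from AQDT-2):

```python
def lef_select(candidates, criteria):
    """Apply (score_fn, tolerance, maximize) criteria in order; only
    candidates whose score is within `tolerance` (a fraction of the
    score range) of the best one pass to the next criterion."""
    for score_fn, tol, maximize in criteria:
        scores = {a: score_fn(a) for a in candidates}
        best = max(scores.values()) if maximize else min(scores.values())
        span = (max(scores.values()) - min(scores.values())) or 1
        candidates = [a for a in candidates
                      if abs(scores[a] - best) <= tol * span]
        if len(candidates) == 1:
            break
    return candidates[0]

# Hypothetical measurement costs and disjointness scores
cost = {"x1": 1, "x2": 5, "x3": 1, "x4": 2}
disjointness = {"x1": 6, "x2": 9, "x3": 8, "x4": 4}

# Cost first with tolerance 0: only the cheapest attributes (x1, x3)
# pass; the tie is then broken by the disjointness criterion.
chosen = lef_select(list(cost), [(cost.get, 0.0, False),
                                 (disjointness.get, 0.0, True)])
print(chosen)  # x3
```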

3.4.2 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution for the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayes formula, we have:

P(Ci | b1, ..., bk) = P(Ci) P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from different classes, we have:


P(Ci) = twi / Σ(j=1..m) twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σ(j=1..m) wj / Σ(j=1..m) twj    (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain:

P(Ci | b1, ..., bk) = wi / Σ(j=1..m) wj    (3-13)
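A quick numeric check (with illustrative counts, not data from the thesis) confirms that substituting (3-10)-(3-12) into (3-9) cancels the per-class totals twi and leaves formula (3-13):

```python
w = [7, 2, 5]       # examples per class reaching the node (illustrative)
tw = [40, 25, 35]   # total training examples per class (illustrative)
m, total = len(w), sum(tw)

p_ci = [tw[i] / total for i in range(m)]           # equation (3-10)
p_b_given_ci = [w[i] / tw[i] for i in range(m)]    # equation (3-11)
p_b = sum(w) / total                               # equation (3-12)

for i in range(m):
    via_bayes = p_ci[i] * p_b_given_ci[i] / p_b    # equation (3-9)
    direct = w[i] / sum(w)                         # equation (3-13)
    assert abs(via_bayes - direct) < 1e-12
print("substitution yields (3-13)")
```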

A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

3.4.3 Coping with Noise in Training Data

The proposed methodology can be easily extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
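The truncation step can be sketched as follows (a minimal sketch with hypothetical t-weights; the exact thresholding used by AQDT-2 may differ in detail):

```python
def truncate_rules(rules, noise_level=0.10):
    """Remove rules whose t-weight is noise_level (or less) of the
    total training-example coverage of their decision class."""
    class_total = {}
    for cls, t_weight in rules:
        class_total[cls] = class_total.get(cls, 0) + t_weight
    return [(cls, t) for cls, t in rules
            if t > noise_level * class_total[cls]]

# Hypothetical (class, t-weight) pairs: the light rules are removed
rules = [("C1", 18), ("C1", 1), ("C2", 28), ("C2", 2)]
print(truncate_rules(rules))  # [('C1', 18), ('C2', 28)]
```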


3.5 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes. The best attribute to be selected is "outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes. The problem has ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <:: [outlook = overcast]
Play <:: [outlook = sunny] & [humidity <= 75]
Play <:: [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute for the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion may evaluate the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than the other criteria did.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.

The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when it was applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion would perform better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria were applied to the learned rules, they provided very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the maximum balanced appearance of its values in different rules. The dominance criterion prefers attributes that occur in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
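As an illustration, one plausible reading of the disjointness criterion can be sketched as follows (the 0-3 pairwise scoring and the value sets below are assumptions made for the example, not a verbatim transcription of the definition in Section 2):

```python
def pair_score(vi, vj):
    """How well an attribute's value sets separate two classes:
    3 if disjoint, 2 if partially overlapping, 1 if one set
    contains the other, 0 if identical."""
    vi, vj = set(vi), set(vj)
    if vi == vj:
        return 0
    if vi <= vj or vj <= vi:
        return 1
    return 3 if not (vi & vj) else 2

def attribute_disjointness(value_sets_by_class):
    """Sum the pairwise separation scores over all ordered class pairs."""
    classes = list(value_sets_by_class)
    return sum(pair_score(value_sets_by_class[a], value_sets_by_class[b])
               for a in classes for b in classes if a != b)

# Hypothetical value sets of one attribute in three classes' rules:
# pairwise-disjoint sets give the maximum score 3 * m * (m - 1) = 18
print(attribute_disjointness({"T1": {2}, "T2": {3, 4}, "T3": {1}}))  # 18
# Overlapping value sets score lower
print(attribute_disjointness({"T1": {1, 2}, "T2": {2, 3}, "T3": {1, 2}}))  # 8
```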

3.6 Decision Structures vs. Decision Trees

This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.

Table 3-7: A comparison between decision structures and decision trees.

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11. Both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes. This decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.

Figure 3-11: Decision structures learned by AQDT-2 using different criteria: a) using the disjointness criterion (5 nodes); b) using the importance score criterion (7 nodes, 9 leaves). (P = Positive, N = Negative.)

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called "Imam's example", that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example is based on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "-" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples per value of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

Figure 3-12: Imam's example, where learning decision structures (trees) from rules is better than learning them from examples: a) the training examples; b) the optimal decision tree.

AQ15c learned the following rules from this data:

P <:: [x1=1][x2=1]
P <:: [x1=2][x2=2]
N <:: [x1=1][x2=2]
N <:: [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
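The failure mode can be reproduced with a small information-gain computation (the dataset below is illustrative, in the spirit of Imam's example, not the actual training data of Figure 3-12):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Information gain of splitting `examples` on attribute `attr`."""
    labels = [cls for _, cls in examples]
    gain = entropy(labels)
    for v in {x[attr] for x, _ in examples}:
        subset = [cls for x, cls in examples if x[attr] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# P iff x1 = x2; x3 is irrelevant but unevenly spread over the classes
data = [({"x1": 1, "x2": 1, "x3": 1}, "P"), ({"x1": 1, "x2": 1, "x3": 1}, "P"),
        ({"x1": 2, "x2": 2, "x3": 1}, "P"), ({"x1": 2, "x2": 2, "x3": 2}, "P"),
        ({"x1": 1, "x2": 2, "x3": 1}, "N"), ({"x1": 1, "x2": 2, "x3": 2}, "N"),
        ({"x1": 2, "x2": 1, "x3": 2}, "N"), ({"x1": 2, "x2": 1, "x3": 2}, "N")]

for a in ("x1", "x2", "x3"):
    print(a, round(info_gain(data, a), 3))
# x1 and x2 (the only relevant attributes) get zero gain, while the
# irrelevant x3 gets a positive score, so a gain-based learner prefers x3.
```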

An example of problems in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <:: [x1=2]
P <:: [x2=2]
N <:: [x1=1][x2=1 v 3]
N <:: [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (109); 2) comparing the average number of tests required to make a decision (13n for the decision tree and 85 for the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using the new attribute "x1=2 v x2=2" with values 0 for "no" and 1 for "yes".

Figure 3-13: An example where decision rules are simpler than decision trees: a) the training data; b) the correct decision tree.

CHAPTER 4 Empirical Analysis and Comparative Study

This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
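The sampling protocol just described can be sketched as follows (a simplified illustration; the names are not from the actual experiment scripts):

```python
import random

def learning_curve_splits(examples, fractions=(0.1, 0.2, 0.3, 0.4, 0.5,
                                               0.6, 0.7, 0.8, 0.9),
                          samples=100, seed=0):
    """For each relative training size, draw `samples` random training
    sets; the complement of each one is used for testing."""
    rng = random.Random(seed)
    for frac in fractions:
        k = round(frac * len(examples))
        for _ in range(samples):
            train = set(rng.sample(range(len(examples)), k))
            test = [i for i in range(len(examples)) if i not in train]
            yield frac, sorted(train), test

examples = list(range(50))            # stand-in for a dataset of 50 examples
frac, train, test = next(learning_curve_splits(examples))
print(frac, len(train), len(test))    # 0.1 5 45
```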


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (the best path from top to bottom), in terms of accuracy, time and complexity, were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.

Figure 4-1: Design of a complete experiment.


For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed where the learning system AQ17 was used instead of AQ15c. The analysis of some experiments included visualization of the training examples and the concepts learned by each learning program (AQ15c, AQDT-2 and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database), 9 different relative sample sizes of training examples are selected (10%, ..., 90%); 100 random samples of each size are drawn from the original data for training; the 100 complementary sets, which remain from the original data after drawing the training data, are used for testing (900 samples for training and their 900 complementary samples for testing).

162 different parametrical experiments per training dataset (18 x 9)
16,200 experiments per sample size (9 samples)
145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2)
199,800 experiments per problem (first portion + C4.5 + constructive induction)
999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
73 days (estimated running time)

The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection following that will describe a partial or full experimental analysis of one of the other problems.

4.2 Experiments with Average-Size, Complex, and Noise-Free Problems: Wind Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

Decision class Cl 1 [xl=l] [x6=l] [x2=I2][x3=I2] [x4=13][x5=I2][x7=1 3] (t 18 U 18) 2 [xl=3][x2=I][x3=I][x5=I][x6=1][x4=I3][x7=134] (t 3 u 3) 3 [xl=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=23] (t 2 u 2) 4 [xl=l][x6=I][x2=2][x3=12] [x4=3][x5=I2][x7=4] (t 2 u 2) 5 [xl=3][x2=1][x4=l][x6=l][x7=1][x3=2][x5=12] (t 2 u 2) 6 [x l=I][x3=l][x6=l][x2=2][x4=I3][x7=13][x5=3] (t 2 u 2) 7 [xl=2] [x5=2][x2=l][x6=l] [x3=I2] [x4=3][x7=4] (t 2 u 2)

Decision class C2 1 [xl=2 4][x2=12][x3=12][x4=3] [x5=23][x6=1] [x7=23] (t 28 U 19) 2 [xl=2bull4][x2=2][x3=I2][x4=3][x5=12][x6=1][x7=34] (t 17 U 6) 3 [xl=24][x2=12][x3=12][x4=3][x5=1][x6=l][x7=34] (t 10 U 4) 4 [xl=l35] [x2=l2][x3=I2) [x4=3] [x5=3][x6=l][x7=24] (t 10 U 2) 5 [xl=35][x2=I2][x3=12][x4=3][x5=23][x6=1][x7=14] (t 9 U 4) 6 [x1=2][x2=12][x3=l2][x5=123][x4=l][x6=l][x7=1] (t 7 U 6) 7 [x1=34](x2=2][x3=2][x4=13][x5=13] [x6=l][x7=12] (t 6 U 4) 8 [xl=35][x2=2] [x3=l][x7=I][x4=12] [x5=I23] [x6=13] (t 5 U 5) 9 [xl=l][x2=l] [x6=l] [x3=I2] [x4=3][x5=I2] [x7=4] (t 4 U 4) 10 [xl=l] [x5=1][x2=2][x4=2][x6=2)[x3=I2] [x7=1 3] (t 4 U 4) 11 [xl=I2][x2=1][x6=I][x3=12][x4=I3][x5=3][x7=I4] (t 4 U 2)

Decision class C3 1 [xl=25] [x2=I2] [x3=12][x7=1 4] [x4=12] [x5=13] [x6=24] (t 41 U 32) 2 [xl=1 4][x2=12][x3=12][x4=2][x5=2)[x6=23] [x7=24] (t 27 U 20) 3 [xl=I3] [x2=I][x3=12] [x7=14][x4=2] [x5=I2][x6=23] (t19 U 6) 4 [xl= 1 24] [x2=I2][x3=I2][x4=2] [x5=23][x6=34][x7=1] (t 13 U 8) 5 [xl=5][x2=2][x4=2] [x5=2] [x3=12][x6=3][x7=24] (t 5 U 5)

Decision class C4 1 [x1=5][x2=2][x32][x4=13][x5=1][x6=l][x7=14] (t 4 U 4) 2 [x1=5][x2=2][x3=l][x5=I][x6=I][x4=3][x7=3] (t I U 1)

Figure 4-2: Decision rules determined by AQ15c from the wind bracing data.


Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by values of x6 (in general, they could be groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for a branch until all rules assigned to the branch are of the same class. That class is then assigned to the leaf.
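The loop described above can be sketched as a short recursion (a simplified illustration: it skips value-set subsumption and substitutes a fixed attribute ordering for the ranking criteria; the rules are the four software-testing rules R11, R21, R31 and R32 from Section 3):

```python
def build_structure(rules, attributes, pick):
    """rules: list of (conditions, cls); conditions maps an attribute
    to its set of allowed values (an absent attribute matches any
    value).  Recursion stops when all rules agree on one class."""
    classes = {cls for _, cls in rules}
    if len(classes) == 1:
        return classes.pop()                      # leaf
    attr = pick(attributes)
    node = {}
    for value in {v for conds, _ in rules for v in conds.get(attr, ())}:
        subset = [(c, cls) for c, cls in rules
                  if attr not in c or value in c[attr]]
        node[(attr, value)] = build_structure(subset, attributes - {attr}, pick)
    return node

# The four software-testing rules (R11, R21, R31, R32)
rules = [({"x1": {2}, "x2": {2}}, "T1"),
         ({"x1": {1, 2}, "x2": {3, 4}}, "T2"),
         ({"x1": {1}, "x2": {1}}, "T3"),
         ({"x1": {4}, "x3": {2, 3}, "x4": {3}}, "T3")]

# Fixed ordering standing in for the LEF-ranked attribute selection
pick = lambda attrs: next(a for a in ("x1", "x2", "x3", "x4") if a in attrs)
tree = build_structure(rules, {"x1", "x2", "x3", "x4"}, pick)
print(tree[("x1", 4)])             # T3: the only [x1=4] rule's class
print(tree[("x1", 2)][("x2", 2)])  # T1
```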

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 is ended by a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated only for those rules which contain [x6=1] as one of their conjunctions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.

For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.

The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some unclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatches.

Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves).

Figure 4-4 shows a decision structure learned, in the default setting of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the indefinite "?" decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.

Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves).

Figure 4-5: A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves).

Figure 4-6 presents the decision structure from Figure 4-5 in which the leaves were assigned candidate decisions with decision-class probability estimates. Let us consider node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (3-13), the probability estimates for classes C1, C2, C3 and C4 under node x2 can be approximated as P(C1) = 0.66, P(C2) = 0.23, P(C3) = 0, and P(C4) = 0.11.
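Plugging the frequencies at node x2 into equation (3-13) reproduces these estimates (a small check; the function name is illustrative):

```python
def leaf_probabilities(w):
    """Equation (3-13): P(Ci | b1, ..., bk) = wi / sum_j wj.  The
    per-class totals twi cancel out and are not needed."""
    total = sum(w.values())
    return {c: wi / total for c, wi in w.items()}

# Example frequencies at node x2 (from the text): w1=31, w2=11, w3=0, w4=5
probs = leaf_probabilities({"C1": 31, "C2": 11, "C3": 0, "C4": 5})
print({c: round(p, 2) for c, p in probs.items()})
# {'C1': 0.66, 'C2': 0.23, 'C3': 0.0, 'C4': 0.11}
```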

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that the rules whose combined t-weight represented 10% or less coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).

Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves).

Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves).

To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram indicates different decision classes with different shades. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4 or x7).


Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.

Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data ("?" means the system cannot produce a decision without the missing attribute).

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings for AQ15c (two types of decision rules, characteristic or discriminant; three coverage modes, intersecting, disjoint or ordered, i.e., decision lists; and three beam search widths, 1, 5 and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and <Chr, Intr, 1>; see Table 4-2) were selected for the experiments with Subsystem II.

These experiments were performed on four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of the rules learned by AQ15c from examples and the predictive accuracy of the decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either one of the two programs on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with the testing examples that form the complement of the training example set.


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I are fixed

and selected parameters of Subsystem II are modified. The experiments were performed on

characteristic decision rules that were learned in intersecting or disjoint modes. For each data

set, the results reported from each experiment are calculated as the average of 100 runs on different

training data for 9 different sample sizes. The parameters changed in this experiment were the

threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2

algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples

covered by rules belonging to different decision classes at a given node of the decision

structure/tree.
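As an illustration of how such a ratio can drive tree construction, the sketch below closes a node as a leaf when the non-majority examples fall within the generalization degree. This is a reconstruction based on the description above, not the actual AQDT-2 code:

```python
def can_close_node(class_counts, generalization_degree=0.10):
    """Return True if the node may become a leaf of its majority class.

    `class_counts` maps each decision class to the number of training
    examples covered at the node.  The node is closed when the fraction
    of examples belonging to non-majority classes does not exceed the
    generalization degree (10% by default, matching the text).
    """
    total = sum(class_counts.values())
    if total == 0:
        return True
    minority = total - max(class_counts.values())
    return minority / total <= generalization_degree
```

A lower degree (e.g., 3%) forces more nodes to stay open, producing larger but more faithful trees, which matches the behavior reported in the experiments below.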

Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings (<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>) for the wind bracing problem.

Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2

with different parameter settings. The default curve shows the predictive accuracy obtained with the

default settings of AQDT-2; the default pre-pruning threshold is 3% and the default generalization

degree is 10%. The results show that with the wind bracing data it is better to reduce the

generalization degree to 3%. However, changing the pre-pruning degree did not improve the

predictive accuracy.

Comparative Study: This sub-section presents a comparison between the decision trees

obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems

were set to their default parameters. All the results reported here are the average of 100 runs. For


each data set we reported the predictive accuracy, the complexity of the learned decision trees, and

the time taken for learning. Figure 4-11 shows a simple summary of these experiments.

Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data.

Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data.

4.3 Experiments With Small Size, Simple, and Noise-Free Problems: MONK-1

This subsection describes an experimental analysis of the AQDT approach on the MONK-1

problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification

rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists

of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are

octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling

(values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color

(values are red, yellow, green, or blue); and x6, has-tie (values are yes or no).


The original problem was to learn a concept from 124 training examples (62 positive and 62

negative). These training examples constitute 29% of all possible examples (432); thus the

density of the training examples is relatively high. Figure 4-12 shows a visualization diagram,

obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and

negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c

from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2

criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned

when using different criteria.

Figure 4-12: A visualization diagram of the MONK-1 problem.

The AQDT-2 program, running in its default mode with the optimality criterion set to minimize

the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with

41 nodes. For comparison, the C4.5 program for learning decision trees from examples was also

applied to this same problem.

Positive rules:
1. [x5 = 1]
2. [x1 = 3][x2 = 3]
3. [x1 = 2][x2 = 2]
4. [x1 = 1][x2 = 1]

Negative rules:
1. [x1 = 1][x2 = 2, 3][x5 = 2..4]
2. [x1 = 2][x2 = 1, 3][x5 = 2..4]
3. [x1 = 3][x2 = 1, 2][x5 = 2..4]

Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.

The C4.5 program did not produce a consistent and complete decision tree when run with its

default window size (the maximum of 20% of the examples and twice the square root of the number of

examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in

making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 725). This tree is

presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was

used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that

takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These

rules were:

Pos <= [x5 = 1] v [x1 = x2] and Neg <= [x5 ≠ 1] & [x1 ≠ x2]
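These two rules can be checked directly against the full MONK-1 example space. A small sketch, with attribute values coded 1..n in the order listed earlier (so x5 = 1 means a red jacket); the function names are mine:

```python
from itertools import product

# Domain sizes of x1..x6, following the attribute list given earlier.
domain_sizes = [3, 3, 2, 3, 4, 2]

def is_positive(x1, x2, x3, x4, x5, x6):
    # Pos <= [x5 = 1] v [x1 = x2]; Neg is the complement.
    return x5 == 1 or x1 == x2

def dci_attribute(x1, x2):
    # The attribute constructed by AQ17-DCI: 'T' iff x1 equals x2.
    return "T" if x1 == x2 else "F"

space = list(product(*(range(1, n + 1) for n in domain_sizes)))
positives = sum(is_positive(*e) for e in space)
print(len(space), positives)  # 432 examples, 216 of them positive
```

The count confirms the figures quoted in the text: 432 robots in total, half of them positive under this concept.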

Table 4-3: Comparison of the attribute selection criteria for the MONK-1 problem.

From these rules the system produced the compact decision structure presented in Figure 4-15-b.

It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically

equivalent, and they all have 100% predictive accuracy on the testing examples (which means that they

represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler

decision structure was produced (Figure 4-15-a).

Figure 4-14: The decision tree for the MONK-1 problem (complexity: 13 nodes, 28 leaves).

Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: (a) compact decision structure for AQ15 rules (complexity: 5 nodes, 7 leaves); (b) compact decision structure for AQ17 rules (complexity: 2 nodes, 3 leaves).

Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments

involved running AQ15c for a set of learning problems with 18 different parameter settings

(two types of decision rules: characteristic or discriminant; three coverage modes:

intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and

10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and

<Char, Intr, 1>) were selected for experiments with Subsystem II. These experiments were

performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by

AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from

the decision rules.

Each value in that table is an average predictive accuracy over 100 runs of either program

on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested

with a testing example set that represented the complement of the training example set.

Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between

AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers,

<Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant

rules.

Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem.

Experiments with Subsystem II: The same experiments were performed on the MONK-1

problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were

modified. The experiments were performed on characteristic decision rules that were learned in

intersecting or disjoint modes. For each data set, the results reported from each experiment

were calculated as the average of 100 runs on different training data for 9 different sample sizes.

The parameters changed in this experiment were the threshold of pre-pruning of the decision

rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure

4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with

different parameter settings. The default curve shows the predictive accuracy obtained with the

default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree


is 10%. The results show that with the MONK-1 data it is slightly better to reduce the

generalization degree to 3%. However, increasing the pre-pruning degree did not improve the

predictive accuracy.

Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data.

Comparative Study: This sub-section presents a comparison between the decision trees

obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to

their default parameters. The experiments were divided into two parts. All the results reported here

are the average of 100 runs. For each data set we reported the predictive accuracy, the complexity

of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary

of these experiments.

Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem.

4.4 Experiments With Small Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be

easily described as a DNF expression using its original attributes). The problem is described in a

similar way to the MONK-1 problem. The data consists of two decision classes, Positive and

Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape

(values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding

(values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and

x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training

examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19

shows a visualization diagram of the training examples (positive and negative) and the concept

to be learned.

Figure 4-19: A visualization diagram of the MONK-2 problem.


Experiments with Subsystem I: The two settings that gave the best results in terms of predictive

accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>). They were

selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules

learned by AQ15c from examples and the predictive accuracy of decision structures learned by

AQDT-2 from these decision rules.

Each value in that table is an average predictive accuracy over 100 runs of either program

on 100 distinct, randomly selected training sets of the given size. Each of these runs is tested

with testing examples that represent the complement of the training examples.

Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c

and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj>

means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and

the number is the width of the beam search.

Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem.

Experiments with Subsystem II: Again, the parameters of Subsystem I are fixed and

selected parameters of Subsystem II are modified. For each data set, the results reported from each

experiment were calculated as the average of 100 runs on different training data for 9 different

sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of

the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam,

1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by

AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained

with the default settings of AQDT-2; the default pre-pruning threshold is 3% and the default

generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to

reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not

improve the predictive accuracy.

Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data.

Comparative Study: This sub-section presents a comparison between the decision trees

obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to

their default parameters. The experiments were divided into two parts. All the results reported here

are the average of 100 runs. For each data set we reported the predictive accuracy, the complexity

of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary

of these experiments.

Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem.

4.5 Experiments With Small Size, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a

similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same

domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a

visualization diagram of the training examples (positive and negative) and the concept to be

learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered

noisy examples. Noisy examples are examples that are assigned the wrong decision class.

Figure 4-23: A visualization diagram of the MONK-3 problem.

Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by

AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from

these decision rules. Each value in that table is an average predictive accuracy of running

both programs 100 times on 100 distinct, randomly selected training data sets of the

given size. Each of these runs was tested with testing examples that represented the complement

of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive

accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the

learning process) were fixed and selected parameters of Subsystem II (the decision-making

process) were changed. The results reported from each experiment were calculated as the average of

100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in

the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings.

The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results

show that with the MONK-3 data it is usually better to reduce the generalization degree. Also,

increasing the pre-pruning threshold does not improve the predictive accuracy.

Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem.

Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data.

Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by

AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a

summary of the predictive accuracy, the complexity of the learned decision trees, and the learning

time. The reason that there is a drop in the predictive accuracy at some sample sizes is the

fact that the testing data is not fixed for each sample. In other words, one error may represent a

1.01% error rate when testing against 90% of the data, and the same error may represent 10% when

testing against 10% of the data. These curves do not represent the learning curve.
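The arithmetic behind this remark is simple: a single misclassified example contributes an error rate inversely proportional to the size of the test set. A minimal sketch (the function name is mine):

```python
def one_error_percent(total_examples, test_fraction):
    """Error rate (in percent) contributed by a single misclassified
    example when testing on the given fraction of the data."""
    n_test = max(1, round(total_examples * test_fraction))
    return 100.0 / n_test

# Testing on 90% of 1000 examples, one error costs ~0.11%;
# testing on only 10% of them, the same error costs 1%.
```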

Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem.

4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing

Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are

based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990).

The data has 699 examples, represented using ten attributes and grouped into two decision classes

(Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3)

Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial

Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes

except the sample code number had a domain of ten values (they were scaled).

In this experiment the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the

experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the

results reported here were based on the average of 100 runs. For each data set we reported the

predictive accuracy, the complexity of the learned decision trees, and the time taken for learning.

Figure 4-27 summarizes these experiments.

The reason that there is a drop in the predictive accuracy at some sample sizes is the fact that

the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error

rate when testing against 90% of the data, and the same error may represent 10% when testing

against 10% of the data. These curves do not represent the learning curve.

Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem.

4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom

Classification

Learning from the Mushroom database involves classifying mushrooms into edible or

poisonous classes. The data was drawn from the Audubon Society Field Guide to North American

Mushrooms. The data consists of 8124 examples. A random sample of 810 examples was selected

to perform the experiment. Each example was described by 22 attributes. These attributes are: 1)

Cap-shape, 2) Cap-surface, 3) Cap-color, 4) Bruises, 5) Odor, 6) Gill-attachment, 7) Gill-spacing,

8) Gill-size, 9) Gill-color, 10) Stalk-shape, 11) Stalk-root, 12) Stalk-surface-above-ring, 13) Stalk-surface-below-ring,

14) Stalk-color-above-ring, 15) Stalk-color-below-ring, 16) Veil-type, 17) Veil-color,

18) Ring-number, 19) Ring-type, 20) Spore-print-color, 21) Population, and 22) Habitat.

To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults,

and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5.

All the results reported here are the average of 100 runs. For each data set we reported the

predictive accuracy, the complexity of the learned decision trees, and the time taken for learning.

Figure 4-28 shows a simple summary of these experiments.

In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the

size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees.

The average difference in accuracy is less than 2%, the average difference in tree complexity is

greater than 10 nodes, and the average learning times are about the same.

The reason that there is a drop in the predictive accuracy at some sample sizes is the fact that

the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error

rate when testing against 90% of the data, and the same error may represent 10% when testing

against 10% of the data. These curves do not represent the learning curve.

Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem.

4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West

Trains

Learning task-oriented decision structures from structural data: This subsection

briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision

structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was

to classify a set of trains into two classes, eastbound and westbound. The data was structured such

that each train consisted of two to four cars. Each car was described in terms of two main

features: the body of the car and the content of the car. The body of the car was described by 6

different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts

rules or examples in the form of an array of attribute-value assignments. It can also accept

examples with different numbers of attribute-value pairs (i.e., examples of different lengths). To


describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was

generated such that they could completely describe any car in the train; see Table 4-7. Each train

was described by one example of varying length. To identify the number (position) of a given car

in the train, each of the eight attributes was associated with a two-digit code (ij): the first digit

identifies the location of the car and the second identifies the number of the attribute itself. For

example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to

the second attribute (the car shape). In other words, attribute x32 is the label of the attribute

describing the shape of the third car.
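This encoding can be sketched as a small flattening routine. Only the x{i}{j} naming scheme comes from the text; the concrete attribute values below are illustrative:

```python
def flatten_train(cars):
    """Encode a structured train as a flat attribute-value example.

    `cars` is a list of 2 to 4 per-car attribute lists, each holding the
    eight car attributes in a fixed order.  Attribute x{i}{j} is the
    j-th attribute of the i-th car, so trains of different lengths yield
    examples with different numbers of attribute-value pairs.
    """
    example = {}
    for i, car in enumerate(cars, start=1):
        for j, value in enumerate(car, start=1):
            example["x%d%d" % (i, j)] = value
    return example

# A two-car train (two illustrative attributes per car, for brevity):
train = flatten_train([["long", "rectangle"], ["short", "oval"]])
# train == {"x11": "long", "x12": "rectangle", "x21": "short", "x22": "oval"}
```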

Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1..4).

Decision-making situations: In the first decision-making situation, a decision structure that

classifies any given train as either eastbound or westbound was learned using only attributes

describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out

of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was

hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation

where only attributes describing the second car are used in classifying the trains. It correctly

classified 18 of the trains.

Both decision structures have leaves with multiple decisions, which means there are identical first or

second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using

attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or

more. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were

given lower cost than x31. Both decision structures classified the 14 trains with three or more cars

correctly. These last two decision structures classified any train with three or more cars correctly

and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).

Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: (a) learned using only descriptions of Car 1 (4 nodes, 9 leaves); (b) learned using only descriptions of Car 2; (c) learned using only descriptions of Car 3 (6 leaves).

4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional

Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There

were two decision classes and a total of 216 examples. The experiments tested the change in the

number of nodes and the predictive accuracy when varying the number of training examples used for

generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two

window options: the default option (the maximum of 20% of the number of examples and twice the

square root of the number of examples) and a 100% window size (one trial per setting). In

the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%,

24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of

the examples were in one class and the second half in the other class).

Table 4-8 and Figures 4-30 a and b show the results graphically for the Congressional Voting-1984

problem. The results indicate that the decision trees generated by AQDT-2 had higher predictive

accuracy and were simpler than the decision trees produced by C4.5. Also, the variations of the size

of AQDT-2's trees with the change of the size of the training example set were smaller.

Table 4-8 A tabular summary of the predictive accuracy of decision trees obtained and C4S for the data

96 10

9 95 I 8 8

~ III 7IPa 94 0

6f Iie 93 S 5 lt

Col

i 492

3

91 2

5 10 15 20 25 30 35 40 45 50 55 60 5 10 15 20 25 30 35 40 45 50 55 60 Relathe slzeofthe tralDlng eumples (i) Relative size oItbe training eumples (i)

a) Accuracy of the decision tree as a function b) Size of the decision tree as a function of of the size of the set of training examples the size of the set of the training examples

Figure 4-30 Comparing decision trees for the Congressional Voting-84 data learned by C4.5 & AQDT-2

4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4-2 to 4-9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to


illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from those rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.

Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of beam search or with certain rule types, and lower with others, than for another type of cover), the best cover is determined according to the best width of the beam search and the best rule type.
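The first tie-breaking heuristic above can be sketched in a few lines of Python (a hypothetical illustration; the beam widths and accuracy figures are invented, not taken from the experiments):

```python
def best_beam_width(results):
    """Pick a beam-search width given results: dict mapping width -> accuracy (%).

    Widths whose accuracy is within 2% of the best are treated as ties,
    and the smallest (cheapest) such width is preferred.
    """
    top = max(results.values())
    candidates = [w for w, acc in results.items() if top - acc < 2.0]
    return min(candidates)

# Hypothetical accuracies for four beam widths:
accuracies = {1: 92.0, 5: 94.8, 10: 95.0, 20: 95.1}
print(best_beam_width(accuracies))  # -> 5
```

Widths 5, 10, and 20 all fall within the 2% band of the best result, so the smallest of them wins.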

Table 4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics

It was clear that AQDT-2 works better with characteristic rules than with discriminant ones. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, the predictive accuracy is considered higher or lower; 2) if the average learning time is within ±0.1 seconds, the learning time is considered the same.
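These equivalence bands can be expressed as a short Python sketch (hypothetical; the function name and labels are illustrative, not part of AQDT-2 or C4.5):

```python
def compare_accuracy(acc_aqdt, acc_c45, band=2.0):
    """Summarize an accuracy comparison (%) using the +/-2% equivalence band.

    Returns which system is considered better; within the band the result is
    'Same', annotated with the system that has the slight edge (A or C).
    """
    diff = acc_aqdt - acc_c45
    if diff > band:
        return "AQDT-2"
    if diff < -band:
        return "C4.5"
    return "Same-A" if diff >= 0 else "Same-C"

print(compare_accuracy(95.0, 91.5))  # -> AQDT-2
print(compare_accuracy(93.2, 94.0))  # -> Same-C
```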

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparisons of the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10 Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same-X means similar performance of both systems, where AQDT-2 has the edge if X=A and C4.5 if X=C

Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, while C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of decision trees learned by C4.5 grows relatively faster as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons for this are that AQDT-2 overgeneralizes the decision rules, and that C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be much less than that of C4.5. However, on some data sets it takes more time, because there are some


situations where there is not enough information to reach a decision, and the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not implemented yet.

To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class; the white areas represent non-positive coverage.


Figure 4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem


Figure 4-32 shows the testing results of the decision rules learned by AQ15c. All marked shaded cells indicate false positive errors (AQ15c classifies the cell as positive while it should be negative). All marked non-shaded cells indicate false negative errors (AQ15c classifies the cell as negative while it should be positive).


Figure 4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, cells with one shading indicate portions of the representation space that were classified as positive by both AQ15c and AQDT-2. Cells with a second shading are portions of the representation space that were classified as positive by AQ15c but as negative by AQDT-2. Cells with a third shading represent portions of the representation space where AQDT-2 overgeneralized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).


This diagram shows the relationship between the input to and the output from AQDT-2. Unlike the MONK-1 problem, overgeneralizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.


Figure 4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem

Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative errors. Some marked cells indicate portions of the representation space with false positive errors; others represent portions with false negative errors. Comparing Figures 4-34 and 4-32, more errors occurred because of the overgeneralization.


Figure 4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree

Figure 4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%

CHAPTER 5 CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.

The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or provided by an expert. A decision structure is generated on-line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology that, in order to determine a decision structure from examples, it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is


usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first, and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.

The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of


the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time of determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.

Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of the rule learning step, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.

REFERENCES


Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.

Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984), "Experience in the use of an inductive system in knowledge engineering," Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.

Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.

Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, from the proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.

Imam, I.F. and Michalski, R.S. (1993b), "Learning Decision Trees from Decision Rules: A method and initial results from a comparative study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. & Zemankova, M. (Eds.), Kluwer Academic Pub., MA.

Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.

Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.

Imam, I.F. and Michalski, R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.

Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi, R. & Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), "International East-West Challenge," Oxford University, UK.

Michalski, R.S. (1973), "AQVAL/1--Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.

Michalski, R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.

Michalski, R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).

Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.

Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer-Verlag, from the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.

Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.

Niblett, T. and Bratko, I. (1986), "Learning decision rules in noisy domains," Proceedings of Expert Systems 86, Brighton: Cambridge University Press.

Quinlan, J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan, J.R. (1983), "Learning efficient classification procedures and their application to chess end games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.

Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.

Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).

Quinlan, J.R. (1990), "Probabilistic decision trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.

Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI-90, Stockholm, August.

Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.

Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.

Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, and conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition. Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-96). He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.


4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77

4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81

4-7 The set of attributes and their values used in the trains problem 86

4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88

4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89

4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90


LIST OF FIGURES

No TITLE Page

2-1 An example to illustrate how attributes break rules 8

2-2 A decision tree learned from the decision table in Table 2-1 10

2-3 A decision tree learned using the gain criterion for selecting attributes 15

2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21

3-1 Architecture of the AQDT approach 24

3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27

3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33

3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33

3-5 Decision trees showing the maximum number of non-leaf nodes 41

3-6 Decision rules for selecting the best tool for testing software 43

3-7 A decision structure learned for classifying software testing tools 45

3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46

3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47

3-10 A decision tree learned without the cost attribute 47

3-11 Decision structures learned by AQDT-2 using different criteria 55

3-12 The Imam's example: an example where learning decision structures (trees) from rules is better than learning them from examples 56

3-13 An example where decision rules are simpler than decision trees 57

4-1 Design of a complete experiment 59


4-2 Decision rules determined by AQ15c from the wind bracing data 61

4-3 A decision tree learned by C4.5 for the wind bracing data 63

4-4 A decision structure learned from AQ15c wind bracing rules 64

4-5 A decision structure that does not contain attribute x1 64

4-6 A decision structure without x1, with candidate decisions assigned to leaves 65

4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65

4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66

4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68

4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69

4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69

4-12 A visualization diagram of the MONK-1 problem 70

4-13 Decision rules learned by AQ15c for the MONK-1 problem 71

4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72

4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72

4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74

4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75

4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75

4-19 A visualization diagram of the MONK-2 problem 76

4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78

4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79

4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79

4-23 A visualization diagram of the MONK-3 problem 80

4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82

4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82

4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83

4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84

4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85

4-29 Decision structures learned by AQDT-2 for different decision-making situations 87

4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88

4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91

4-32 A visualization diagram shows the testing errors for the rules in Figure 4-31 92

4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93

4-34 A visualization diagram shows testing errors of AQDT-2 decision tree 94

4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1% 94

DERIVING TASK-ORIENTED DECISION STRUCTURES

FROM DECISION RULES

Ibrahim M. Fahmi Imam, Ph.D.

School of Information Technology and Engineering

George Mason University, Fall 1995

Ryszard S. Michalski, Advisor

ABSTRACT

This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a,b).

This approach separates the function of generating a knowledge-base from the function of using the knowledge-base for decision-making. The first function focuses on learning accurate, consistent, and complete concept descriptions expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that decision structures learned by it usually outperform, in terms of accuracy and average size of the decision structures, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using applications to artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing democratic and republican voting records).

CHAPTER 1 INTRODUCTION

1.1 Motivation and Overview

Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.

A simple and effective tool for describing decision processes is a decision structure, which is a

directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to

arrive at a decision about that object. The nodes of the structure are assigned individual tests

(which may correspond to a single attribute, a function of attributes, or a relation), the branches are

assigned possible test outcomes or ranges of outcomes, and the leaves are assigned a specific

decision, a set of candidate decisions with corresponding probabilities, or an undetermined

decision. A decision structure reduces to a familiar decision tree when each node is assigned a

single attribute and has at most one parent, when the branches from each node are assigned single

values of that attribute, and when leaves are assigned single definite decisions. Thus, the problem

of generating a decision structure is a generalization of the problem of generating a decision tree.
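To make the definition concrete, the sketch below (hypothetical code, not part of the dissertation) shows a node type that admits the generalizations just listed: branches labeled with single outcomes or ranges of outcomes, and leaves holding either a definite decision or a set of candidate decisions with probabilities.

```python
# Illustrative sketch of a decision-structure node (names are mine, not from
# the text): a DAG node whose branches may be labeled with outcome ranges and
# whose leaves may hold probabilistic decisions.

class Node:
    """A node of a decision structure (directed acyclic graph)."""
    def __init__(self, test=None, decision=None):
        self.test = test            # e.g. an attribute name or a function of attributes
        self.branches = {}          # outcome (or tuple of outcomes) -> child Node
        self.decision = decision    # leaf: a class, or a dict {class: probability}

def classify(node, example):
    """Follow branches until a leaf; a tuple key represents a range of outcomes."""
    while node.decision is None:
        outcome = example[node.test]
        node = next(child for key, child in node.branches.items()
                    if outcome == key or (isinstance(key, tuple) and outcome in key))
    return node.decision

# A two-node structure: test x1; values 0 and 1 share a branch to a
# probabilistic leaf, while value 2 leads to a definite decision.
leaf_a = Node(decision={"A1": 0.8, "A2": 0.2})
leaf_b = Node(decision="A3")
root = Node(test="x1")
root.branches[(0, 1)] = leaf_a
root.branches[2] = leaf_b
print(classify(root, {"x1": 1}))   # {'A1': 0.8, 'A2': 0.2}
```

With every branch labeled by a single value and every leaf definite, the same type describes an ordinary decision tree, which is the reduction described above.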

Decision trees are typically generated from a set of examples of decisions. The essential

characteristic of any such method is the attribute selection criterion used for choosing attributes to

be assigned to the nodes of the decision tree being built. Such criteria include the entropy

reduction, the gain and the gain ratio (Quinlan, 1979; 1983; 1986), the gini index of diversity

(Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).


A decision tree/decision structure representation can be an effective tool for describing a decision

process, as long as all the required tests can be performed and the decision-making situations it

was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine

that answers for all symptoms appear in the decision tree). Problems arise when these assumptions

do not hold. For example, in some situations measuring certain attributes may be difficult or costly

(e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the

tools needed are not available). In such situations it is desirable to reformulate the decision

structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the

root), and the expensive attributes are evaluated only if necessary (by assignment to the nodes far

away from the root). If an attribute cannot be measured at all, it is useful either to modify the

structure so that it does not contain that attribute or, when this is impossible, to indicate

alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a

significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient

example, the doctor may request a decision structure expressed in a specific set of symptoms,

biased to classify one or more diseases, or specifying a certain order of testing).

A restructuring of a decision structure (or a tree) in order to suit new requirements can be quite

difficult. This is because a decision structure is a procedural representation that imposes an

evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative

representation such as a set of decision rules: tests (conditions) of rules can be evaluated in any

order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent

decision structures (trees) which differ in the test ordering. Due to the lack of order constraints,

a declarative representation (rules) is much easier to modify to adapt to different situations than a

procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a

decision, one needs to decide in which order tests are evaluated, and thus needs a decision

structure.
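The contrast can be made concrete with a minimal sketch (the rule format is illustrative, not AQ15 syntax; the rules are those of the minimal-cover example in Chapter 2). Because each rule is a conjunction of conditions, no evaluation order is imposed on the tests, and for a disjoint cover, scanning the rule set in any order yields the same decision:

```python
# Decision rules as a declarative representation (illustrative format):
# (class, {attribute: set of admitted values}).

RULES = [
    ("A1", {"x2": {0}}),
    ("A1", {"x1": {0}, "x2": {2}}),
    ("A2", {"x2": {1}}),
    ("A2", {"x1": {2}, "x2": {2}}),
    ("A3", {"x1": {1}, "x2": {2}}),
]

def classify(example, rules):
    """Return the class of the first rule whose conditions all hold.
    The conditions form a conjunction, so their evaluation order is free."""
    for decision, conds in rules:
        if all(example[attr] in values for attr, values in conds.items()):
            return decision
    return None  # no rule matches: decision undetermined

example = {"x1": 2, "x2": 2}
# Any permutation of a disjoint rule set gives the same answer:
print(classify(example, RULES))        # A2
print(classify(example, RULES[::-1]))  # A2
```

Turning this flat rule set into a decision structure amounts to fixing a test order, which is exactly the freedom the approach described next exploits.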

An attractive solution to these opposite requirements is to acquire and store knowledge in a

declarative form, and transform it to a decision structure when it is needed for decision-making.

This method allows one to create a decision structure that is most appropriate in a given decision-

making situation. Because the number of decision rules per decision class is usually small (each

rule is a generalization of a set of examples), generating a decision structure from decision rules

can potentially be performed much faster than generation from training examples. Thus, this

process could be done on line, without any delay noticeable to the user. Such virtual decision

structures are easy to tailor to any given decision-making situation.

This approach allows one to generate a decision structure that avoids or delays evaluating an

attribute that is difficult to measure in some decision-making situation, or that fits well a particular

frequency distribution of decision classes. In other situations it may be unnecessary to generate a

complete decision structure; it may be sufficient to generate only the part of it that concerns

the decision classes of interest. Thus, such an approach has many potential advantages.

This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-

oriented decision structure (a decision structure that is adapted to the given decision-making

situation) from decision rules. The decision rules are learned by either the rule learning system AQ15

(Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction

capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of

features, including: 1) enabling the system to include in the decision structure nodes corresponding

to new attributes constructed during the process of learning the decision rules; 2) controlling the

degree of generalization needed during the development of the decision structure; 3) providing four

new criteria for selecting an attribute to be a node in the decision structure, which allow the system to

generate many different but equivalent decision structures from the same set of rules; 4) generating

"unknown" nodes in situations where there is insufficient information for generating a complete

decision structure; 5) learning decision structures from discriminant rules as well as

characteristic rules; and 6) providing the most likely decision when the decision process stops

due to inability to evaluate an attribute associated with an intermediate node.


To test the methodology of generating decision structures from decision rules, an extensive set of

planned experiments was designed to test different aspects of the approach. The experiments

include testing different combinations of parameters for each sub-function of the approach,

analyzing the relationship between decision rules and the decision structures learned from them, and

comparing decision trees learned by the AQDT-2 system with those learned by C4.5 (Quinlan, 1993),

the well-known system for learning decision trees from examples. Different experiments were designed to examine

the new features of the AQDT approach for learning task-oriented decision structures. The

experiments were applied to artificial domains as well as real-world domains, including MONK-1,

MONK-2 and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al.,

1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast

Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The

MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1

requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description

(one that cannot be easily described as a DNF rule using the original attributes). MONK-3 concerns

learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that

classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind

bracing data involves learning conditions for applying different types of wind bracing for tall

buildings. The Mushrooms data is concerned with learning classification rules for distinguishing

between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning

concept descriptions for recognizing breast cancer. The congressional voting data includes voting

records on different issues. AQDT-2 outperformed C4.5 on average, with respect to both

predictive accuracy and tree size, for most problems. AQDT-2 did not work very well with noisy

data, or with problems that have many rules covering very few examples.

1.2 The Problem Statement

There are many limitations and problems associated with using decision trees for decision-making.

CHAPTER 2 RELATED RESEARCH

2.1 Learning Decision Trees from Decision Diagrams

The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which

introduced an algorithm for generating decision trees from decision lists. The method proposed

several attribute selection criteria, of increasing power, built on the main criterion, the

order cost estimate (the nth order cost estimates, n = 1, 2, ...). Michalski also analyzed two

specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree,

based on properties extracted from the decision diagram. In order to better explain the method,

it is necessary to define some terms. These terms will be used later in the dissertation.

Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all

examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically

disjoint; in other words, for any two rules there exists a condition with the same attribute but

with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules

among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing

in a two-dimensional space all possible combinations of attribute values, locating on the

diagram the condition parts of the given rules, and marking them with the action specified

by each rule.

Michalski's algorithm requires construction of a minimal cover. The minimal cover should be

consistent and complete. The method is based on the fact that if there are n decision classes,

any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal

decision tree (any consistent decision tree must have at least n leaves). Michalski (1978) has

shown that if only one rule is broken by a selected attribute, then instead of having one leaf

(which could potentially represent this rule or the decision class in the tree), there will have to

be at least two leaves representing this rule in the final decision tree.

The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do

not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can

divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules.

In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] &

[x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1]; x3 breaks three rules: the rule [x4=2]

& [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] &

[x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1: An example illustrating how attributes break rules

One of the criteria defined by Michalski is the first degree cost estimate, which assigns to each

attribute an integer equal to the number of rules broken by that attribute. This criterion is also

called the static cost estimate of an attribute, or the criterion of minimizing added leaves

(MAL).
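The static cost estimate can be sketched in a few lines of code (an illustration, not the dissertation's implementation). One reading consistent with the example above: a rule is broken by an attribute when its condition on that attribute admits more than one value, so a split on that attribute divides the rule among several branches; rules with no condition on the attribute are not counted here, which is a simplifying assumption.

```python
# The three rules said in the text to be broken by x3, written as
# attribute -> set of admitted values:
RULES = [
    {"x4": {2}, "x1": {2}, "x3": {2, 3}},
    {"x4": {1}, "x1": {3}, "x3": {1, 3}},
    {"x4": {3}, "x1": {4}, "x3": {2, 3}},
]

def mal(attribute, rules):
    """Static (first-degree) cost estimate: the number of rules whose
    condition on `attribute` spans more than one value."""
    return sum(1 for rule in rules if len(rule.get(attribute, ())) > 1)

print(mal("x3", RULES))  # 3 -- x3 breaks all three rules
print(mal("x1", RULES))  # 0 -- x1 pins a single value in each rule
```

Selecting the attribute with the smallest such count is the MAL preference; ties would then be resolved in favor of the attribute that breaks the smaller (more specialized) rules, as described below.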


The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the

estimated number of additional nodes in the decision tree being generated, over a hypothetical

minimal decision tree. When there is a tie between two attributes, the attribute to be selected is

the one which breaks smaller rules (rules that cover fewer examples, or more specialized

rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).

Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion

(Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but

is more complex, because once an attribute is selected as a node in the tree, some rules and/or

parts of the broken rules at each branch are merged into one rule. DMAL ensures that the

value of the total cost estimate of an attribute is decreased by a value equal to the number of

merged rules minus one.

Example: Learn a decision tree from the following decision table.

The minimal cover consists of the following rules:

A1 <= [x2=0] v [x1=0][x2=2]    A2 <= [x2=1] v [x1=2][x2=2]    A3 <= [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5

for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two

leaves to the optimal number of leaves. It is clear that the attribute to be selected as a root of


the decision tree is x2. Then three branches are attached to the root node, and the decision rules

are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is

generated. For x2 = 2, another attribute is selected to be a node in the tree. In this case, x1 has the

minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.

Figure 2-2: A decision tree learned from the decision table in Table 2-1

2.2 Learning Decision Trees from Examples

Decision tree learning is a field concerned with generating decision trees that classify a set

of examples according to the decision classes they belong to. The essential aspect of any

inductive decision tree method is the attribute selection criterion. The attribute selection

criterion measures how good the attributes are for discriminating among the given set of

decision classes. The best attribute according to the selection criterion is chosen to be assigned

to a node in the tree. The first algorithm for generating decision trees from examples was

proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer

strategy for building decision trees. This algorithm has been subsequently modified by

Quinlan (1979) and applied by many researchers to a variety of learning problems.

Attribute selection criteria can be divided into three categories: logic-based, information-based,

and statistics-based. The logic-based criteria for selecting attributes

use logical relationships between the attributes and the decision classes to determine the best

attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves

(Michalski, 1978), which uses conjunction and disjunction operators. The information-based

criteria are based on information theory. These criteria measure the information conveyed

by dividing the training examples into subsets. Examples of such criteria include the

information measure IM, the entropy reduction measure and the gain criteria (Quinlan, 1979;

1983), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and

others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The

statistics-based criteria measure the correlation between the decision classes and the attributes.

These criteria use statistical distributions for determining whether or not there is a correlation.

The attribute with the highest correlation is selected to be a node in the tree. Examples of

statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984;

Mingers, 1989a).

Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the

method of learning decision trees to also handle data with noise (by pruning). Handling noise

extended the process of learning decision trees to include the creation of an initial complete

decision tree, and tree pruning, which is done by removing subtrees with small statistical validity

and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used

for simplifying decision trees, even for problems without noise (Bohanec & Bratko, 1994).

Pruning decision trees improves their simplicity, but reduces their predictive accuracy on the

training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-

value problem by exploring the probabilities of an example belonging to different classes.

The rest of this section includes a brief description of the attribute selection criterion used by

the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting

an attribute to be a node in the tree. The section also includes a brief description of the Chi-

square method for attribute selection (Mingers, 1989a), a statistics-based

method for selecting an attribute to be a node in the tree.

2.2.1 Building Decision Trees Using Information-based Criteria

This section presents a description of the inductive decision tree learning system C4.5. The

C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs

for learning decision trees from examples.

Learning decision trees from examples requires a collection of examples, each

represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning

program that induces classification decision trees from a set of given examples. The C4.5

learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on

Hunt's method for constructing decision trees from a set of cases (Hunt, Marin & Stone, 1966).

The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion

calculates the gain in classifying information based on the residual information needed to

classify cases in a set of training examples and the information yielded by the test, based on the

relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is

based on an earlier criterion used by ID3, called the Gain Criterion, which uses

the frequency of each decision class in the given set of training examples.

Once an attribute is chosen to be a node in the tree, the system generates as many links as the

number of its values, and partitions the set of examples based on these values. If all the

examples at a certain node belong to one decision class, the system generates a leaf node and

assigns it to that class. Otherwise, the system searches for another attribute to be a node in the

tree.

The Gain Criterion: The gain criterion is based on information theory; that is, the

information conveyed by a message depends on its probability, and can be measured in bits as

minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for

a given problem X1, ..., Xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is

any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S

is the number of examples in S that belong to class Ci.


freq(Ci, S) = the number of examples in S that belong to Ci        (2-1)

Suppose that |S| is the total number of examples in S; the probability that an example selected

at random from S belongs to class Ci is freq(Ci, S) / |S|.

The information conveyed by the message that a selected example belongs to a given decision

class Ci is determined by -log2(freq(Ci, S) / |S|) bits.

The expected information from such a message stating class membership is given by

info(S) = - Σ_{i=1..k} (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)  bits        (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples,

info(T) determines the average amount of information needed to identify the class of an

example in T.

Suppose that we select an attribute X to be the root of the tree, and suppose that X has k

possible values. The training set T will be divided into k subsets, each corresponding to one of

X's values. The expected information of selecting X to partition the training set T, infoX(T),

is found as the sum, over all subsets, of the information conveyed by each subset

weighted by its probability:

infoX(T) = Σ_{i=1..k} (|Ti| / |T|) info(Ti)        (2-3)

The information gained by partitioning the training examples T into subsets using the attribute

X is given by

gain(X) = info(T) - infoX(T)        (2-4)

The attribute to be selected is the attribute with the maximum gain value.
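Equations 2-1 through 2-4 can be transcribed directly into code (a sketch; the function names are mine, not from the text):

```python
import math

def info(class_counts):
    """Entropy of a set, given its per-class example counts (equation 2-2)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def gain(total_counts, subset_counts):
    """info(T) - infoX(T): subset_counts lists the per-class counts of each
    subset Ti produced by splitting on attribute X (equations 2-3 and 2-4)."""
    n = sum(total_counts)
    info_x = sum(sum(s) / n * info(s) for s in subset_counts)
    return info(total_counts) - info_x

# A perfectly informative binary split of 8 examples (4 per class) gains
# the full entropy of the set, 1 bit:
print(round(gain([4, 4], [[4, 0], [0, 4]]), 3))  # 1.0
```

The `if c > 0` guard implements the usual convention that 0·log2(0) = 0, which is also needed for the worked example below, where one subset is pure.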

The Gain Ratio Criterion: This criterion indicates the proportion of information generated by

the split that appears helpful for classification. Quinlan (1993) pointed out that the gain

criterion has a serious deficiency: it is strongly biased toward attributes with many

outcomes (values). For example, for any data that contains attributes such as a social security

number, the gain criterion will select that attribute to be the root of the decision tree. However,

selecting such attributes increases the size of the decision tree. Quinlan provided a solution to

this problem by introducing the gain ratio criterion, which takes the ratio of the information that

is gained by partitioning the initial set of examples T by the attribute X to the potential

information generated by dividing T into n subsets.

Following steps similar to those used to obtain the information conveyed by dividing T into n

subsets, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is

determined by

split info(T) = - Σ_{i=1..n} (|Ti| / |T|) log2(|Ti| / |T|)        (2-5)

The gain ratio is given by

gain ratio(X) = gain(X) / split info(X)        (2-6)

and it expresses the proportion of information generated by the split that is useful for

classification.

Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the

set of training examples.

First, determine the amount of information gained by selecting the attribute "outlook" to be the

root of the decision tree. This attribute divides the training examples into three subsets:

"sunny", with five examples, two of which belong to the class "Play"; "overcast", with four

examples, all of which belong to the class "Play"; and "rain", with five examples, three of

which belong to the class "Play". To determine info(T), the average information needed to

identify the class of an example in T: there are 14 training examples and two decision classes;

nine of these examples belong to the class "Play" and five belong to the class "Don't Play".

info(T) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94 bits

When using "outlook" to divide the training examples, the information becomes

info_outlook(T) = 5/14 (-2/5 log2(2/5) - 3/5 log2(3/5))

+ 4/14 (-4/4 log2(4/4) - 0/4 log2(0/4))

+ 5/14 (-3/5 log2(3/5) - 2/5 log2(2/5)) = 0.694 bits

By substituting in equation 2-4, the gain of information that results from using the attribute

"outlook" to split the training examples equals 0.246. The information gain for "windy" is

0.048.

Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split

information for "outlook" is determined as follows:

split info(T) = - 5/14 log2(5/14) - 4/14 log2(4/14) - 5/14 log2(5/14) = 1.577 bits

The gain ratio for "outlook" = 0.246 / 1.577 = 0.156
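The numbers in this example can be reproduced with a few lines of code (a sketch; the class counts come from the text: 9 Play / 5 Don't Play, and "outlook" splits the 14 examples into 5 sunny / 4 overcast / 5 rain):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

T = [9, 5]                           # Play / Don't Play
subsets = [[2, 3], [4, 0], [3, 2]]   # per-class counts for sunny, overcast, rain

info_T = entropy(T)                                           # ~0.94 bits
info_outlook = sum(sum(s) / 14 * entropy(s) for s in subsets) # ~0.694 bits
gain = info_T - info_outlook          # ~0.247 (0.246 in the text, which
                                      # rounds the intermediate values)
split_info = entropy([5, 4, 5])       # ~1.577 bits
gain_ratio = gain / split_info        # ~0.156

print(round(info_T, 2), round(info_outlook, 3),
      round(split_info, 3), round(gain_ratio, 3))
```

Note that split info is just the entropy of the subset sizes themselves, which is why the same `entropy` helper computes both quantities.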


Figure 2-3: A decision tree learned using the gain criterion for selecting attributes

The C4.5 system handles discrete values as well as continuous values. To handle an attribute

with continuous values, C4.5 uses a threshold to transform the continuous domain into two

intervals; in other words, for each continuous attribute, C4.5 generates two branches, one

where the value of that attribute is greater than the determined threshold, and the other where the

value is less than or equal to the threshold.

Tree pruning in C4.5 is a process of replacing subtrees that have small classification validity by

leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees.

This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is

the number of misclassified examples at a given leaf.
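The Laplace ratio just described is a one-line computation (sketch; the example numbers are mine):

```python
def laplace_error(e, n):
    """Laplace error estimate for a leaf: n examples reach it, e are
    misclassified ((e + 1) / (n + 2), as described in the text)."""
    return (e + 1) / (n + 2)

# A leaf covering 6 examples with no errors still gets a nonzero estimate,
# which is what lets pruning compare it fairly against small subtrees:
print(laplace_error(0, 6))   # 0.125
```

The +1 and +2 act as a smoothing prior, so a leaf supported by very few examples is not trusted with a zero error rate.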

2.2.2 Building Decision Trees Using Statistics-based Criteria

The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a)

in building decision trees. The method uses Chi-square statistics to measure the association

between two attributes. When building decision trees, the method is implemented such that it

determines the association between each attribute and the decision classes. The attribute to be

selected is the one with the greatest value.

To determine the Chi-square value for an attribute, let aij be the number of examples in

class number i where the attribute A takes value number j; in other words, aij is the frequency

of the combination of decision class number i and attribute value number j. The Chi-square

value for attribute A is given by

Chi-square(A) = Σ_{i=1..n} Σ_{j=1..m} (aij - Eij)² / Eij        (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,

Eij = (TCi × TVj) / T        (2-8)

where TCi and TVj are the total number of examples belonging to the decision class Ci and the total

number of examples where the attribute A takes value vj, respectively, and T is the total number of

examples.
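Equations 2-7 and 2-8 translate directly into code (a sketch; note that it uses unrounded expected frequencies, so its results differ slightly from a hand computation with expected values rounded to one decimal place, as in the tables below):

```python
def chi_square(observed):
    """Chi-square association between class and attribute (equations 2-7, 2-8).
    observed[i][j] = number of examples of class i taking attribute value j."""
    T = sum(sum(row) for row in observed)
    class_totals = [sum(row) for row in observed]          # the TCi
    value_totals = [sum(col) for col in zip(*observed)]    # the TVj
    return sum((observed[i][j] - e) ** 2 / e
               for i, ci in enumerate(class_totals)
               for j, vj in enumerate(value_totals)
               for e in [ci * vj / T])                     # e = Eij

# Windy vs. class from Quinlan's data: Play (3 windy, 6 not),
# Don't Play (3 windy, 2 not).
print(round(chi_square([[3, 6], [3, 2]]), 2))   # 0.93
```

Running the same function on the outlook table ([[2, 4, 3], [3, 0, 2]]) gives about 3.55, confirming that "outlook" is far more strongly associated with the class than "windy".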

Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of different

combinations of values between the decision class and both the "outlook" and the "windy"

attributes. Table 2-4 shows the expected values (from TCi and TVj) of the frequencies in Table 2-3,

for different attribute values and different decision classes.

To determine the association between the decision classes and both the attribute "windy"

and the attribute "outlook", the observed Chi-square values are:

Chi-square(Windy, Class) = (3-3.9)²/3.9 + (3-2.1)²/2.1 + (6-5.1)²/5.1 + (2-2.9)²/2.9

= 0.21 + 0.39 + 0.16 + 0.28 = 1.04

Chi-square(Outlook, Class) = (2-3.2)²/3.2 + (4-2.6)²/2.6 + (3-3.2)²/3.2 + (3-1.8)²/1.8

+ (0-1.4)²/1.4 + (2-1.8)²/1.8 = 0.45 + 0.75 + 0.01 + 0.8 + 1.4 + 0.02 = 3.43


Applying the same method to the other attributes, the results favor the attribute "outlook".

Once that attribute is selected to be a node in the tree, the remaining set of examples is divided

into subsets, and the same process is repeated on each subset.

Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5 Attribute selection criteria and their basic evaluation measure

Info Measure (IM), Gain and Gain Ratio:
    Entropy(S) = - Σ_i (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)

G-statistic:
    G = 2N × IM    (N = number of examples)

Chi-square:
    Chi-square(A, B) = Σ_{i=1..n} Σ_{j=1..m} (aij - Eij)² / Eij

2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria that was done by

Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision

tree programs: the Information Measure (IM), Chi-square, the G statistic, the Gini

index of diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain

Ratio criterion gave the strongest results.

In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples

(i.e., examples may belong to more than one decision class) to observe how the selected criteria

evaluate the given attributes. The problem has two decision classes and two attributes, X and

Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The

training examples were unevenly spread between the two values of X. Attribute Y has three

values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the

contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split

provided by the six criteria. Mingers noted that the measures that are not based on information

theory give radiation (attribute X here) less weight. This may be because the zero in the first

row of radiation has a greater influence in the log calculation. In the case of the Chi-

square criterion, a zero cell adds the maximum association between any two attributes,

because the Chi-square value of a zero cell is the expected value of this cell.


Now let us consider results from another experiment done by Mingers. In this experiment,

Mingers used four different data sets to generate decision trees for eleven different criteria. In

the final results, he compared the total number of nodes and the total error rate provided by

each criterion over all the given problems. Table 2-8 shows the final results for five selected

criteria only.


Table 2-8: Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four problems

This experiment was performed on four real-world data sets. These data are concerned with

profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types

of Iris, and recognizing LCD display digits. The data was divided randomly, 70% for training

and 30% for testing. For more details, see Mingers (1989a).

2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of

research are described in this section. Gaines (1994) and Kohavi (1994) proposed two

approaches for generating decision structures that share some of the earlier ideas of Imam and

Michalski (1993b).

In the first approach, Brian Gaines introduced a method for transforming decision rules or

decision trees into exceptional decision structures. The method builds an Exception Directed

Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning

either a rule, say R0, or a conclusion, say C0, or both, to the root node. If it assigns a conclusion

to the root node, it places it on a temporary conclusion list. Then it generates a new child node

and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the

conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if

the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new

conclusion C1, which is satisfied by the rule R0. The same process is repeated until all rules

that have common conditions with the rule at the root are evaluated. The method then creates a


new child node from the root, and repeats the process until all rules are evaluated. In the decision

structure, nodes containing only rules represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a

decision structure. Also, such a structure is more complex than the traditional decision trees

that are used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure.

The decision structure can be read as follows: it is "Safe", except if x1=1 & x2=1 & x3=1 and

either x4=3 or x5=1, then it is "Lost", except if x6=1 it is "Safe", except if x7=1 it is "Lost".

The second approach, introduced by Ronny Kohavi, learns decision structures from examples

using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing

Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a

decision graph where each attribute occurs at most once along any computational path. In other

words, along each path from the root to the leaves of the decision structure, an attribute may occur

as a node at most once; however, there may be more than one node with the same attribute in

the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are

partitioned into a sequence of pairwise disjoint sets, such that outgoing edges from each level

terminate at the next level. An oblivious decision graph is a decision graph where all nodes at

a given level are labeled by the same attribute.
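The read-once property can be checked mechanically, which makes the definition concrete (a sketch; the graph encoding `node -> (attribute, children)` is hypothetical, not Kohavi's):

```python
def is_read_once(graph, root):
    """DFS from the root, carrying the set of attributes seen along the path.
    Returns False if any root-to-leaf path tests the same attribute twice."""
    def visit(node, seen):
        attr, children = graph[node]
        if attr is None:                 # leaf node
            return True
        if attr in seen:
            return False
        return all(visit(c, seen | {attr}) for c in children)
    return visit(root, set())

# Two nodes may share an attribute as long as no single path passes
# through both (here n1 and n2 sit on disjoint paths):
graph = {
    "n0": ("x1", ["n1", "n2"]),
    "n1": ("x2", ["leafA"]),
    "n2": ("x2", ["leafB"]),
    "leafA": (None, []),
    "leafB": (None, []),
}
print(is_read_once(graph, "n0"))         # True
graph["n1"] = ("x1", ["leafA"])          # x1 now repeated along a path
print(is_read_once(graph, "n0"))         # False
```

In an oblivious graph the check is even simpler, since all nodes of a level share one attribute: it suffices that no attribute labels two different levels.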


Safe <= [x1=2]
Safe <= [x2=2]
Safe <= [x3=2]
Safe <= [x4=1] & [x5=2]
Safe <= [x4=1] & [x5=3]
Safe <= [x6=1] & [x7=2]
Safe <= [x6=1] & [x7=3]
Safe <= [x4=2] & [x5=2]
Safe <= [x4=2] & [x5=3]

Lost <= [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]

The algorithm starts by generating a leaf node for each decision class. It then applies a nondeterministic method to select an attribute with which to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a combination of that attribute's values. For each subset, the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains the examples where A takes value 0 and belong to class C0, or take value 1 and belong to class C1; the second subset contains the examples where A takes value 0 and belong to class C1, or take value 1 and belong to class C0. The number of nodes of the first level (after the leaf nodes) is expected to be at most k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and the number of nodes can increase exponentially before it is finally reduced to one.
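The two-subset split in the worked example above can be sketched as follows (a minimal illustration of just this binary case, not Kohavi's implementation; the example representation is assumed):

```python
def split_binary(examples):
    """Split (a_value, class_index) examples into the two subsets described
    in the text: one consistent with the mapping A=0->C0, A=1->C1, the
    other with A=0->C1, A=1->C0. Values and classes are coded as 0/1."""
    subsets = {0: [], 1: []}
    for a, c in examples:
        subsets[a ^ c].append((a, c))   # XOR selects the consistent mapping
    return subsets

data = [(0, 0), (1, 1), (0, 1), (1, 0), (0, 0)]
parts = split_binary(data)
print(parts[0])   # [(0, 0), (1, 1), (0, 0)] -- consistent with the identity mapping
```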

It is easy to identify some major disadvantages of such an approach. The average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., strong patterns) or logical relationship in the data. The time needed to learn such a decision structure is very high compared to systems that learn decision trees from examples. Finally, it could be better to search for an attribute that reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection, which minimizes the width of the penultimate level of the graph.

Table 2-9 shows a comparison between the proposed approach and those two approaches. The EDAG and HOODG systems are unreleased prototype systems.

(Table 2-9 fragment; only one row survived extraction: two of the approaches produce decision structures that are easy to understand, while the third's are difficult to read.)

CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.

The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description, in a declarative form of knowledge (decision rules), that satisfies the learning goal.

The Decision-making Task
Given:
- A set of decision rules in conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.

The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of declarative knowledge are that they do not impose any order on the evaluation of the attributes and, due to this lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on-line.

Such virtual decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized, and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows the architecture of the proposed methodology.

Figure 3-1: Architecture of the AQDT approach (two components: learning knowledge from the database, and the decision-making process).


It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values; some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.

3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system, specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.

AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called AQ, starts with a "seed" example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the "star" of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description that covers the largest number of positive examples (to minimize the total number of rules needed) and, with second priority, that involves the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few or with many examples, and can optimize the description according to a variety of easily modifiable hypothesis quality criteria.
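The covering loop just described can be sketched as follows (a simplified illustration only; the star generator and quality criterion shown here, single-condition rules scored by positive coverage, are toy stand-ins for AQ's actual procedures):

```python
def aq_cover(positives, negatives, generate_star, quality):
    """Covering loop: pick a seed, generate its star, keep the best
    description, and repeat on the still-uncovered positives."""
    uncovered = list(positives)
    ruleset = []
    while uncovered:
        star = generate_star(uncovered[0], negatives)
        best = max(star, key=quality)            # pick by the quality criterion
        ruleset.append(best)
        uncovered = [e for e in uncovered if not best["covers"](e)]
    return ruleset

def star_of(seed, negatives):
    """Toy star: all single-condition rules [attr=value] taken from the
    seed that cover no negative example."""
    return [{"cond": (a, v),
             "covers": (lambda e, a=a, v=v: e.get(a) == v)}
            for a, v in seed.items()
            if all(neg.get(a) != v for neg in negatives)]

pos = [{"x1": 1, "x2": 2}, {"x1": 1, "x2": 3}]
neg = [{"x1": 2, "x2": 2}]
rules = aq_cover(pos, neg, star_of,
                 quality=lambda r: sum(r["covers"](e) for e in pos))
print([r["cond"] for r in rules])   # [('x1', 1)]
```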


The learned descriptions are represented in the form of a set of decision rules expressed in an attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, i.e., stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top." A characteristic description of the tables would also include properties such as "have four legs," "have no back," "have four corners," etc. Discriminant descriptions are usually much shorter than characteristic descriptions.

Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or "covers") of different decision classes. In the "IC" (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the "DC" (Disjoint Covers) mode, descriptions of different classes are logically disjoint. The DC-mode descriptions are usually more complex, both in the number of rules and in the number of conditions. There is also a "DL" mode (a Decision List mode, also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order: if ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.
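The DL-mode evaluation order can be sketched as follows (a schematic illustration; the rule representation, a list of conjunctive attribute-value conditions per ruleset, is assumed):

```python
def classify_decision_list(ordered_rulesets, example):
    """Evaluate linearly ordered rulesets: the first ruleset satisfied by
    the example determines the decision (DL mode); IC/DC rulesets could be
    evaluated in any order instead."""
    for decision, rules in ordered_rulesets:
        # a ruleset is satisfied if any of its conjunctive rules matches
        if any(all(example.get(a) == v for a, v in rule.items())
               for rule in rules):
            return decision
    return None   # no ruleset matched

rulesets = [("Safe", [{"x1": 2}]),
            ("Lost", [{"x1": 1, "x2": 1}])]
print(classify_decision_list(rulesets, {"x1": 1, "x2": 1}))   # Lost
```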

Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes and selects from them the most promising, based on an attribute quality criterion.

To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the US Congress. Each rule is a conjunction of elementary conditions, and each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute State (of the Representative) should take the value northeast or northwest to satisfy the condition.

R1: [Gas_con_ban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] & [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"

The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record of a Democratic Representative:

Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State = northeast, State population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler Corp. = not registered.

By expressing the elementary statements in the example as conditions, and linking the conditions by conjunction, examples can be re-expressed as decision rules. Thus, decision rules and examples formally differ only in their degree of generality.
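The re-expression described above can be sketched as follows (a minimal illustration; the arrow notation and the attribute names are illustrative):

```python
def example_to_rule(decision, example):
    """Turn an example into a maximally specific decision rule: one
    condition per attribute-value pair, linked by conjunction."""
    conds = " & ".join(f"[{a}={v}]" for a, v in example.items())
    return f"{decision} <= {conds}"

print(example_to_rule("Democrat", {"Draft": "no", "Income": "low"}))
# Democrat <= [Draft=no] & [Income=low]
```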

3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski, 1993a, 1993b). A description of the AQDT-2 method for learning task-oriented decision structures from decision rules is also included, and finally the methodology is illustrated by two examples.

Methods for learning decision trees from examples have been very popular in machine learning due to their simplicity. Decision trees built this way can be quite efficient, as long as they are used in decision-making situations for which they are optimized and these situations remain relatively stable. Problems arise when these situations significantly change and the assumptions under which the tree was built no longer hold. For example, in some situations it may be difficult to determine the value of the attribute assigned to some node. One would like to avoid measuring this attribute and still be able to classify the example, if this is potentially possible (Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also desirable if there is a significant change in the frequency of occurrence of examples from different classes. Restructuring a decision tree to suit the above requirements is, however, difficult to do. The reason is that decision trees are a form of decision structure representation that imposes constraints on the evaluation order of the attributes that are not logically necessary.


One problem in developing a method for generating decision structures from decision rules is to design an attribute selection criterion that is based on the properties of the rules rather than of the training examples. A decision rule normally describes a number of possible examples, only some of which are examples that have actually been observed, i.e., training examples. An attribute selection criterion is needed that analyzes the role of each attribute in the rules. It cannot be based on counting the numbers of training examples covered by each attribute-value and the frequency of decision classes in the training examples, as is done when learning decision trees from examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision rules constitute a more powerful knowledge representation than decision trees. They can directly represent a description in an arbitrary disjunctive normal form, while decision trees can directly represent only descriptions in disjoint disjunctive normal form, in which all conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary decision rules into a decision tree, one faces the additional problem of handling logically intersecting rules.

The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2 system is based on earlier work by Michalski (1978), which introduced a general method for generating decision trees from decision rules. The method aimed at producing decision trees with the minimum number of nodes or the minimum cost (where the cost was defined as the total cost of classifying unknown examples, given the cost of measuring individual attributes and the expected probability distribution of examples of different decision classes). More explanation is provided in the following section.

3.3.1 The AQDT-2 attribute selection method

This section describes the AQDT-2 method for building a decision structure from decision rules. The method for building a single-parent decision structure is similar to that used in standard methods of building a decision tree from examples. The major difference is that it assigns tests (attributes) to the nodes using criteria based on the properties of the decision rules (including statistics about the examples covered by each rule, in the case of rules learned from examples), rather than statistics characterizing the frequency of training examples per decision class, per attribute-value, or per conjunction of both. Other differences are that the branches may be assigned an internal disjunction of values (not only a single value, as in a typical decision tree), and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be attributes, or names standing for logical or mathematical expressions that involve several attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably (to distinguish between an attribute and a name standing for an expression, the latter is called a "constructed attribute").

At each step, the method chooses from the available set of tests the test that has the highest utility (see below) for the given set of decision rules. This test is assigned to the node. The branches stemming from this node are assigned test values or disjoint groups of values (in the form of a logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each branch is associated with a reduced set of rules, determined by removing the conditions in which the selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset indicate the same decision class, a leaf node is created and assigned this decision class. The process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further, because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of candidate decisions with associated probabilities (see Section 4.2).
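The recursive construction just described can be sketched as follows (a skeleton only; the selection, rule-reduction, and leaf-test hooks shown are hypothetical stand-ins for AQDT-2's actual procedures):

```python
def build_structure(rules, tests, select_test, reduce_rules, decision_of):
    """Recursively build a decision structure from rules: pick the
    highest-utility test, branch on its value groups, and recurse on the
    reduced rulesets until each indicates a single class."""
    cls = decision_of(rules)
    if cls is not None:                      # all rules agree: make a leaf
        return ("leaf", cls)
    test = select_test(rules, tests)         # e.g. the LEF-ranked attribute
    return ("node", test,
            {values: build_structure(subset, tests - {test},
                                     select_test, reduce_rules, decision_of)
             for values, subset in reduce_rules(rules, test).items()})

# Toy instantiation: rules are (class, conditions) pairs.
def _decision_of(rs):
    classes = {c for c, _ in rs}
    return classes.pop() if len(classes) == 1 else None

def _select_test(rs, tests):
    return sorted(tests)[0]                  # stand-in for LEF ranking

def _reduce_rules(rs, test):
    groups = {}
    for cls, conds in rs:
        rest = {a: v for a, v in conds.items() if a != test}
        groups.setdefault(conds[test], []).append((cls, rest))
    return groups

tree = build_structure([("Safe", {"x1": 1}), ("Lost", {"x1": 2})],
                       {"x1"}, _select_test, _reduce_rules, _decision_of)
print(tree)   # ('node', 'x1', {1: ('leaf', 'Safe'), 2: ('leaf', 'Lost')})
```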

The test (attribute) utility is a combination of one or more of the following elementary criteria: 1) cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which captures the effectiveness of the test in discriminating among decision rules for different decision classes; 3) importance, which determines the importance of a test in the rules; 4) value distribution, which characterizes the distribution of the test importance over its set of values; and 5) dominance, which measures the presence of the test in the rules. These criteria are defined below.

Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of its class disjointness values, i.e., the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and decision rulesets for these classes have been determined. Given a test A, let V1, V2, ..., Vm denote the sets of values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm, respectively. If the ruleset for some class, say Ct, contains a rule that does not involve test A, then Vt is the set of all possible values of A (the domain of A).

Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by:

                  0, if Vi ⊆ Vj
  D(A, Ci, Cj) =  1, if Vi ⊃ Vj                                          (3-1)
                  2, if Vi ∩ Vj ≠ ∅, Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj
                  3, if Vi ∩ Vj = ∅

where ∅ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) may seem to yield an improved criterion; however, it would not clearly distinguish between the two cases (i.e., for both situations the disjointness would be similar). The current equation is better because it gives higher scores to attributes that classify different subsets of the two decision classes than to attributes that classify only a subset of one decision class.

Definition 3-2: The disjointness of the test A, for evaluating a given set of decision rules, is the sum of the degrees of class disjointness of each decision class:


  Disjointness(A) = Σ(i=1..m) D(A, Ci),   where   D(A, Ci) = Σ(j=1..m, j≠i) D(A, Ci, Cj)      (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the test selected is the one with the smaller number of values.
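The disjointness computation can be sketched directly on the value sets (a sketch; the 0-3 case scores follow the four cases of equation (3-1) as read here):

```python
def pair_disjointness(vi, vj):
    """Degree of disjointness D(A, Ci, Cj) between two value sets."""
    vi, vj = set(vi), set(vj)
    if vi <= vj:
        return 0          # Vi is a subset of (or equal to) Vj
    if vi > vj:
        return 1          # Vi is a proper superset of Vj
    if vi & vj:
        return 2          # overlapping, neither contains the other
    return 3              # completely disjoint

def disjointness(value_sets):
    """Disjointness(A) per equation (3-2): value_sets maps each class to
    the set of values of attribute A appearing in that class's rules."""
    classes = list(value_sets)
    return sum(pair_disjointness(value_sets[ci], value_sets[cj])
               for ci in classes for cj in classes if ci != cj)

# Two classes with disjoint value sets score the maximum 3m(m-1) = 6.
print(disjointness({"C1": {1, 2}, "C2": {3}}))   # 6
```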

Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined, from the root of the tree to a leaf node, in order to reach a decision.
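For a decision tree represented as nested tuples, ANT can be computed by averaging the number of tests over all root-to-leaf paths (a sketch that weights all leaves equally; the tuple representation is assumed):

```python
def ant(tree):
    """Average Number of Tests: mean count of internal (test) nodes on
    root-to-leaf paths, with leaves weighted equally."""
    def depths(node, d):
        if node[0] == "leaf":
            return [d]
        _, _test, branches = node
        return [x for child in branches.values() for x in depths(child, d + 1)]
    ds = depths(tree, 0)
    return sum(ds) / len(ds)

# A node whose second branch needs one more test: paths of length 1, 2, 2.
t = ("node", "A", {1: ("leaf", "C1"),
                   2: ("node", "B", {1: ("leaf", "C1"), 2: ("leaf", "C2")})})
print(ant(t))   # 1.666... = 5/3
```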

Definition: A decision structure is a one-node-per-level decision structure if at each level there is only one node and zero or more leaves. Such a decision structure can be generated by combining into one branch all branches whose associated sets of decision rules belong to more than one decision class.

Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci and Cj. There are three cases: 1) one value set is a subset (equivalently, a superset) of the other; 2) the value sets have a non-empty intersection, but neither is a subset of the other; 3) the value sets have no intersection. Figure 3-3 shows all possible distributions (the case of having the same set of values in both classes is a trivial one). Assume that branches leading to subsets with the same decision class are combined into one branch. In the first case, there are only two branches: one leads to a leaf node, and the other leads to an intermediate node where another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three branches are created: two branches lead to leaf nodes, where all values at each branch belong to only one (and a different) decision class, and the third branch leads to an intermediate node where another attribute should be selected to further classify the decision classes. The minimum ANT in this case is 6/4. In the third case, only two branches are generated, each leading to a leaf node with a different decision class. In this case, the minimum ANT is 1.

Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that when more than one attribute-value on branches leads to leaves belonging to one decision class, those branches are combined into one branch in the decision structure. The symbol "1" means that an attribute is needed to classify the two decision classes; in such cases there will be at least two additional paths.

D(A, Ci) = 0, D(A, Cj) = 1;   D(A, Ci) = 2, D(A, Cj) = 2;   D(A, Ci) = 3, D(A, Cj) = 3

Figure 3-3: Venn diagrams of the possible combinations of attribute values in two decision classes

The average number of tests required for making a decision in each possible case is determined in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks highly the attributes that reduce the average number of tests required for decision-making. The theorem can be proved similarly in the general case.

ANT = 3/2;  ANT = 5/3;  ANT = 1

("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3


Theorem 2: The attribute with the highest non-zero disjointness is the best attribute for classifying the decision classes.

Proof: Suppose there are m decision classes. Assume also that there are two attributes, A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, this means that there are more decision classes for which D(A, Ci) < D(B, Ci) than classes for which D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence, there are more pairs of decision classes for which D(A, Ci, Cj) < D(B, Ci, Cj) than pairs for which D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the pairwise disjointness of any attribute are 0, 1, 4, or 6. For all positive values of D(B) = 1, 4, or 6, it is clear that attribute B should have a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A. Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is a better classifier than A.

Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples that are covered by the rules involving this test. Decision rules learned by an AQ learning program are accompanied by information on their strength. Rule strength is characterized by the t-weight and the u-weight. The t-weight (total weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the t-weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated with class Ci denoted by ri, the importance score is defined as follows:


Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

  IS(Aj) = Σ(i=1..m) IS(Aj, Ci)                    (3-3.1)

where

  IS(Aj, Ci) = Σ(k=1..ri) Rik(Aj)                  (3-3.2)

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by:

  Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0 otherwise      (3-4)

where i = 1, ..., m; k = 1, ..., ri; and j = 1, ..., n.
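Definition 3-3 amounts to summing t-weights over the rules that mention a test; a minimal sketch (the (class, t-weight, attributes) rule representation is assumed):

```python
def importance_scores(rules):
    """Importance score IS(A): for each attribute A, sum the t-weights of
    all rules whose condition part mentions A."""
    scores = {}
    for _cls, t_weight, attrs in rules:
        for a in attrs:
            scores[a] = scores.get(a, 0) + t_weight
    return scores

rules = [("Safe", 5, {"x1"}), ("Safe", 3, {"x4", "x5"}),
         ("Lost", 4, {"x1", "x5"})]
print(importance_scores(rules))   # {'x1': 9, 'x4': 3, 'x5': 7}
```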

The importance score method has been separately compared, as a feature selection method, with a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method produced equal or higher accuracy on three real-world problems than that reported for the GA method, while selecting fewer attributes. In addition, the IS method was significantly faster than the GA.

Value distribution: The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by:

  VD(Aj) = IS(Aj) / vj                             (3-5)

where vj is the number of legal values of Aj.

Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore, for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is multiplied out to two rules, with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
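The multiplying-out step can be sketched as follows (a sketch; a condition part is represented as a mapping from each attribute to its list of internally disjoined values):

```python
from itertools import product

def multiply_out(rule):
    """Expand a condition part with internal disjunction into the
    equivalent set of single-value condition parts."""
    attrs = list(rule)
    return [dict(zip(attrs, combo))
            for combo in product(*(rule[a] for a in attrs))]

# [x3=1 v 3] & [x4=1] expands to [x3=1]&[x4=1] and [x3=3]&[x4=1].
print(multiply_out({"x3": [1, 3], "x4": [1]}))
```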

The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). A LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed as a percentage. The criteria are applied to tests in the order defined by the LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (measured from the top value). The default LEF is:

  <Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>      (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percentages); their default values are 0. The default value of the cost of each test is 1.

The above LEF ranks attributes in the following way. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the next (importance) criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the next criterion, value distribution (the normalized IS), is used, and then, similarly, the dominance criterion. If there is still a tie, the method selects among the tied attributes randomly.
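The LEF filtering scheme described above can be sketched as follows (a sketch assuming non-negative scores to be maximized; a cost criterion would be passed in negated):

```python
import random

def lef_select(tests, criteria, tolerances):
    """Apply criteria in LEF order: each criterion keeps only candidates
    scoring within its tolerance (percent) of the best; any ties left
    after the last criterion are broken randomly."""
    candidates = list(tests)
    for crit, tol in zip(criteria, tolerances):
        best = max(crit(t) for t in candidates)
        candidates = [t for t in candidates
                      if crit(t) >= best * (1 - tol / 100.0)]
        if len(candidates) == 1:
            return candidates[0]
    return random.choice(candidates)

disj = {"a": 10, "b": 9, "c": 5}          # disjointness scores
imp = {"a": 1, "b": 2, "c": 0}            # importance scores
# With a 15% tolerance on disjointness, "b" survives and wins on importance.
print(lef_select(["a", "b", "c"], [disj.get, imp.get], [15, 0]))   # b
```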

If there is a non-uniform frequency distribution of examples of the different classes, then the selection criterion uses a modified definition of disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence, where the class occurrence is the expected number of future examples to be classified to the given class:

  Disjointness(A) = Σ(i=1..m) D(A, Ci) · Frq(Ci)                   (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF:

  <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>      (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated in the same way as in the default version.

3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting, at each step, the best test according to the ranking criteria described above, and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision-class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is in turn connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first, in descending order of the number of rules that contain the attribute, and second, in ascending order of the number of the attribute's legal values.

The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program, rather than only those originally given.

To generate decision structures from rules, the AQDT-2 method prefers either characteristic or discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The AQDT algorithm is:

The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute, and let A denote it.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it the group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing a condition [A = i v ...]; if a branch is assigned values i v j, then associate with it all rules containing a condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with a given branch constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context of some branch belong to the same class, create a leaf node and assign that class to it. If all branches of the tree end in leaf nodes, stop. Otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
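Steps 1-4 translate naturally into a recursive procedure. The sketch below (standard mode only; the rule representation and the trivial attribute selector are my own simplifications, not AQDT-2's data structures) shows the branching and rule-distribution logic, including the consensus-law copying of rules that do not mention the selected attribute:

```python
def build_tree(rules, select_attr):
    """Build a decision tree from decision rules (Steps 1-4, standard mode).

    rules: list of (class_label, {attribute: frozenset_of_allowed_values}).
    select_attr(attrs, rules): returns the attribute to test next (e.g. an LEF).
    """
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:                        # Step 4: uniform context -> leaf
        return classes.pop()
    attrs = sorted({a for _, conds in rules for a in conds})
    best = select_attr(attrs, rules)             # Step 1: highest-ranked attribute
    values = sorted({v for _, conds in rules     # Step 2: one branch per value
                     if best in conds for v in conds[best]})
    node = {"attr": best, "branches": {}}
    for val in values:                           # Step 3: ruleset context per branch
        branch_rules = []
        for cls, conds in rules:
            if best not in conds:                # consensus law: copy to every branch
                branch_rules.append((cls, conds))
            elif val in conds[best]:             # condition satisfied: remove it
                rest = {a: s for a, s in conds.items() if a != best}
                branch_rules.append((cls, rest))
        node["branches"][val] = build_tree(branch_rules, select_attr)
    return node

def classify(tree, example):
    """Follow branches until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attr"]]]
    return tree
```

For instance, with the XOR-style rules P <= [x1=1][x2=1] v [x1=2][x2=2] and N <= [x1=1][x2=2] v [x1=2][x2=1], and a selector that picks x1 first, the sketch yields a two-level tree testing x1 and then x2.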

To select an attribute to be a node in the decision tree (Steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all the decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF function; it evaluates each attribute's disjointness for each decision class against the other decision classes.


To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and that r is the total number of decision rules (in all decision classes):

r = Σ_{i=1}^{m} Ri (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as:

Cmpx(Iter1) = O(r × s)

In the second iteration, the disjointness is calculated between the decision classes for all attributes. The complexity of the second iteration can be given by:

Cmpx(Iter2) = O(n × m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

l = max{m, r} (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node of the decision tree, the Node Complexity NC(AQDT), is given by:

NC(AQDT) = O(l × n)

Usually l equals the number of rules associated with the given node. Thus, the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with the node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The Level Complexity of the AQDT algorithm, LC(AQDT), can be given by:

LC(AQDT) < O(l × n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be at least twice the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm to be (l × s × o), where o is the number of non-leaf nodes at the given level. In such cases, either (l × o ≤ r) or (l × s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by:

LC(AQDT) = O(2 × s × r/2) < O(n × l) = NC(AQDT)

a) per one level; b) per one path
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes

Note also that after an attribute is selected to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structure of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch are not tested again.

Since the disjointness criterion selects the attribute which minimizes the average number of tests, ANT, the AQDT algorithm generates decision trees with the least number of levels. The number of levels of a decision tree is expected to be less than or equal to the minimum of the number of attributes and the number of rules. Let k be the number of levels in a given decision tree:

k ≤ min{n, r} (3-10)

Two cases represent the most complex situations, Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels is a function of the logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by:

Complexity(AQDT) = O(l × n × log r) (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels of a decision tree equals one less than the number of decision rules, Figure 3-5-b. Using the disjointness criterion, it is unlikely to obtain such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In that case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus, the level complexity of this decision tree is estimated as:

LC(AQDT) = O(l × log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT algorithm in such cases is given by:

Complexity(AQDT) = O(l × k × log n) (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT algorithm is determined by:

Cmplx(AQDT) = O(r × k × log l) (3-13)

3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four factors that affect the selection of any tool: 1) the cost of using the tool (x1); 2) the metrics that support the tool (x2); 3) the best phase for applying the tool (x3); and 4) the type of tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that a domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the process

T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software

These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing either in the requirement or the analysis phase and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system-usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing either in the requirement or the design phase and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error-rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing either in the requirement or the system-usage phase and you need a semi-automated tool.


Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume that the tolerances for each elementary criterion equal 0.

From the rules in Figure 3-6, we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: determine, for each attribute, the sets of values that the attribute takes in the individual decision rules, and remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1} and {4} (Figure 3-6). The value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4} and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2} and {3, 4}.
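The subsumption-removal step just described can be sketched directly; the value sets are plain Python sets, and sorting the output is an arbitrary choice for readability:

```python
def disjoint_value_groups(value_sets):
    """Drop every value set that strictly subsumes (is a proper superset of)
    another; the surviving sets label the branches in compact mode."""
    sets = {frozenset(s) for s in value_sets}
    return sorted((set(s) for s in sets
                   if not any(other < s for other in sets)),
                  key=sorted)

# x1's value sets from Figure 3-6: {1, 2} subsumes {1} and {2}, so it is removed.
print(disjoint_value_groups([{2}, {3}, {1, 2}, {1}, {4}]))  # [{1}, {2}, {3}, {4}]
# x2's value sets: {1, 2, 3, 4} is removed, leaving {1}, {2} and {3, 4}.
print(disjoint_value_groups([{2}, {3, 4}, {1}, {1, 2, 3, 4}]))
```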


Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools to use for testing given software.

Figure 3-7: A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)

Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3 and 4). Rules are represented by collections of cells in the intersections of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules; rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].

Figure 3-8: a) Decision rules; b) Derived decision tree

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of the concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to know which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metrics they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, in which the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.


a) Ignoring the supporting metric; b) Ignoring the type of the tool
Figure 3-9: Decision trees learned ignoring the support metric and the type of the testing tool

It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10: A decision tree learned without the cost attribute

3.4 Tailoring Decision Structures to the Decision-Making Situation

Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This work presents an approach to building such task-oriented decision structures which advocates that they be built not from examples but from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm, implemented in the new system AQDT-2, transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.

Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.


3.4.1 Learning Cost-Dependent Decision Structures

As described in Sec. 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation involving the other elementary criteria. If an attribute has high cost or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.

3.4.2 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision in some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution over the candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequencies at the given node. Consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values, and let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have:

P(Ci | b1, ..., bk) = P(Ci) × P(b1, ..., bk | Ci) / P(b1, ..., bk) (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequencies of training examples from the different classes, we have:

P(Ci) = twi / Σ_{j=1}^{m} twj (3-10)

P(b1, ..., bk | Ci) = wi / twi (3-11)

P(b1, ..., bk) = Σ_{j=1}^{m} wj / Σ_{j=1}^{m} twj (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain:

P(Ci | b1, ..., bk) = wi / Σ_{j=1}^{m} wj (3-13)
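Since (3-13) reduces the whole Bayesian estimate to the counts wi at the node, its computation is trivial; a sketch (the function name and dictionary representation are my own):

```python
def class_distribution(w):
    """Estimate P(Ci | b1,...,bk) at a node from wi, the number of training
    examples of each class Ci that passed the tests leading to the node,
    using formula (3-13): P(Ci | b1,...,bk) = wi / sum_j wj."""
    total = sum(w.values())
    return {cls: wi / total for cls, wi in w.items()}

# If 6 examples of T1 and 2 of T2 reach a node, T1 is reported with P = 0.75.
print(class_distribution({"T1": 6, "T2": 2}))  # {'T1': 0.75, 'T2': 0.25}
```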

A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.

3.4.3 Coping with noise in training data

The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision-tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of the evaluation order (unlike decision-tree pruning, which can only prune attributes within a subtree and thus cannot freely choose which attributes to prune). Examples are presented in Section 4.
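The truncation step can be sketched as a simple filter over rules annotated with their t-weights (the dictionary representation below is an assumption; AQ15's actual output format differs):

```python
def truncate_rules(rules, min_t_weight):
    """Remove rules whose t-weight (number of training examples covered)
    falls below the expected-noise threshold. Unlike tree pruning, this can
    discard any rule, wherever its attributes would appear in a tree."""
    return [r for r in rules if r["t_weight"] >= min_t_weight]

rules = [{"rule": "[x1=2][x2=2]", "t_weight": 12},
         {"rule": "[x3=1]", "t_weight": 1}]        # covers 1 example: likely noise
print(len(truncate_rules(rules, min_t_weight=2)))  # 1
```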


3.5 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection presents an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion ranks that attribute first. The first problem was introduced by Quinlan in 1993. It has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes, and contains ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity ≤ 75]
Play <= [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute to be the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all the other criteria did.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.

The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion would perform better when evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria were applied to the learned rules, they provided very good results.

The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the most balanced appearance of its values in different rules. The dominance criterion prefers attributes that occur in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.

3.6 Decision Structures vs. Decision Trees

This subsection presents a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.

Table 3-7: Comparison between Decision Structures and Decision Trees

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11; both are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes and is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16, but it is equivalent to a decision tree with 37 nodes.


a) Using the disjointness criterion (P = Positive, N = Negative; no. of nodes: 5); b) Using the importance score criterion (no. of nodes: 7, no. of leaves: 9)
Figure 3-11: Decision structures learned by AQDT-2 using different criteria

To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the Imam's example, that represents a class of problems in which the information gain criteria of decision-tree learning programs do not work properly. The basic idea behind this example is that information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means that the example belongs to class P and "-" means that it belongs to class N. Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.

a) Training examples; b) The optimal decision tree
Figure 3-12: The Imam's example: a problem where learning decision structures (trees) from rules is better than learning them from examples

AQ15c learned the following rules from this data:

P <= [x1=1] & [x2=1] v [x1=2] & [x2=2]
N <= [x1=1] & [x2=2] v [x1=2] & [x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
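The failure mode is easy to verify: for an equivalence-style concept such as P iff x1 = x2, splitting on x1 or x2 alone leaves both classes equally represented in every branch, so the information gain of the genuinely relevant attributes is exactly zero. A small check using the standard entropy-based gain (not C4.5's exact gain-ratio code):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Information gain of testing `attr`, for (features, label) examples."""
    labels = [y for _, y in examples]
    branches = {}
    for x, y in examples:
        branches.setdefault(x[attr], []).append(y)
    remainder = sum(len(ys) / len(examples) * entropy(ys)
                    for ys in branches.values())
    return entropy(labels) - remainder

# P iff x1 = x2: every value of x1 (or x2) sees one P and one N example,
# so the gain of the genuinely relevant attributes is zero.
xor = [({"x1": 1, "x2": 1}, "P"), ({"x1": 2, "x2": 2}, "P"),
       ({"x1": 1, "x2": 2}, "N"), ({"x1": 2, "x2": 1}, "N")]
print(info_gain(xor, "x1"), info_gain(xor, "x2"))  # 0.0 0.0
```

With extra attributes x3 and x4 whose value frequencies are skewed (as in Figure 3-12-a), an information-gain learner can be drawn toward them instead, which is the effect the Imam's example exploits.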

An example of a problem in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <= [x1=2] v [x2=2]
N <= [x1=1] & [x2=1 v 3] v [x1=3] & [x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem, learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10 to 9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree vs. 0.85 for the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with only three nodes can be obtained, using the new attribute "x1=2 v x2=2" with values 0 for "no" and 1 for "yes".

a) The training data; b) The correct decision tree
Figure 3-13: An example where decision rules are simpler than decision trees

CHAPTER 4 Empirical Analysis and Comparative Study

This chapter presents empirical results from extensive testing of the method on different problems, using different sizes of training data and applying different settings of the system's parameters. For comparison, it also presents results from applying a well-known decision-tree learning system (C4.5) to the same problems. It also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.

The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot easily be described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracing for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.

In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.


4.1 Description of the Experimental Analysis

This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (best path from top to bottom), in terms of accuracy, time, and complexity, were used as default settings for experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.

Figure 4-1 Design of a complete experiment


For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. Analysis of some experiments included visualization of the training examples and the concept learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.

For each problem (i.e., database), 9 different relative sample sizes of training examples were selected (10%, ..., 90%); 100 random samples of each size were drawn from the original data for training; the 100 complementary sets remaining from the original data after drawing the training data were used for testing (900 samples for training and their 900 complementary samples for testing).

- 162 different parametrical experiments per training dataset (18 x 9)
- 16,200 experiments per sample size (9 samples)
- 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2)
- 199,800 experiments per problem (first portion + C4.5 + constructive induction)
- 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
- 73 days (estimated running time)

The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection following that describes a partial or full experimental analysis of one of the other problems.

4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracing

This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.

These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1..3] (t: 18, u: 18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t: 3, u: 3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t: 2, u: 2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t: 2, u: 2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t: 2, u: 2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t: 2, u: 2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t: 2, u: 2)

Decision class C2:
1. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t: 28, u: 19)
2. [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t: 17, u: 6)
3. [x1=2,4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t: 10, u: 4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t: 10, u: 2)
5. [x1=3,5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t: 9, u: 4)
6. [x1=2][x2=1,2][x3=1,2][x5=1,2,3][x4=1][x6=1][x7=1] (t: 7, u: 6)
7. [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t: 6, u: 4)
8. [x1=3,5][x2=2][x3=1][x7=1][x4=1,2][x5=1,2,3][x6=1,3] (t: 5, u: 5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t: 4, u: 4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1..3] (t: 4, u: 4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t: 4, u: 2)

Decision class C3:
1. [x1=2,5][x2=1,2][x3=1,2][x7=1..4][x4=1,2][x5=1,3][x6=2,4] (t: 41, u: 32)
2. [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2,4] (t: 27, u: 20)
3. [x1=1,3][x2=1][x3=1,2][x7=1,4][x4=2][x5=1,2][x6=2,3] (t: 19, u: 6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t: 13, u: 8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2,4] (t: 5, u: 5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1,4] (t: 4, u: 4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t: 1, u: 1)

Figure 4-2 Decision rules determined by AQ15c from the wind bracing data


Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by values of x6 (in general, it could be groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned subsets of the rules containing these values. The process repeats for a branch until all rules assigned to each branch are of the same class. That class is then assigned to the leaf.
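The branch-expansion process just described can be sketched as follows (a simplified illustration of the idea, not the AQDT-2 implementation; the attribute-selection step is abstracted as a callable, and a rule is represented as a mapping from attributes to their allowed value sets plus a class label):

```python
# A rule is (conditions, cls), where conditions maps an attribute name to the
# set of values it allows; attributes absent from a rule are unrestricted.
def build_structure(rules, attributes, domains, select_attribute):
    if not rules:
        return None                       # no rule applies: indefinite ("?") leaf
    classes = {cls for _, cls in rules}
    if len(classes) == 1:
        return classes.pop()              # leaf: all assigned rules agree
    if not attributes:
        return sorted(classes)            # ambiguous leaf: candidate decisions
    attr = select_attribute(rules, attributes)    # e.g., disjointness + LEF
    branches = {}
    for value in domains[attr]:
        # the branch gets the subset of rules consistent with attr = value
        subset = [(c, cls) for c, cls in rules if attr not in c or value in c[attr]]
        branches[value] = build_structure(subset, [a for a in attributes if a != attr],
                                          domains, select_attribute)
    return (attr, branches)

# Toy example: two one-condition rules of different classes
tree = build_structure([({'x': {1}}, 'P'), ({'x': {2}}, 'N')],
                       ['x'], {'x': [1, 2]}, lambda rs, ats: ats[0])
print(tree)  # -> ('x', {1: 'P', 2: 'N'})
```

The recursion mirrors the text: a branch is expanded only while its assigned rules belong to more than one class.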

Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 were misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 is ended by a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated only for those rules which contain [x6=1] as one of their conditions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.

For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20%, the number of examples, and twice the square root of the number of examples), with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.

The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some misclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatched.
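The windowing procedure described above can be sketched schematically as follows (an illustration of the loop, not C4.5 itself, which is a C program with additional heuristics; the `grow_tree` and `classify` callables are assumed interfaces):

```python
import random

def window_training(examples, grow_tree, classify, initial_size, increment=20, max_rounds=100):
    """examples: list of (features, label) pairs. Grow a tree from a window,
    test it on the full training set, add misclassified examples to the
    window, and repeat until the tree is consistent or rounds run out."""
    window = random.sample(examples, min(initial_size, len(examples)))
    tree = grow_tree(window)
    for _ in range(max_rounds):
        misclassified = [(f, y) for f, y in examples if classify(tree, f) != y]
        if not misclassified:
            return tree                   # consistent with all training examples
        window += misclassified[:increment]
        tree = grow_tree(window)
    return tree

# Degenerate demo: a "tree" that always predicts the majority label of its window
examples = [((i,), 'P') for i in range(10)]
grow = lambda w: max({y for _, y in w}, key=[y for _, y in w].count)
model = window_training(examples, grow, lambda t, f: t, initial_size=2)
print(model)  # -> P
```

The point of windowing is that the final tree can be grown from a small, informative subset rather than from all training examples at once.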

Figure 4-3 A decision tree learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves)

Figure 4-4 shows a decision structure learned under the default settings of the AQDT-2 parameters from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.


Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision structure was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the "?" (indefinite) decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.

Figure 4-4 A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves)

Figure 4-5 A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves)

Figure 4-6 presents a decision structure from Figure 4-5 in which leaves were assigned candidate decisions with decision class probability estimates. Let us consider the node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (11), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be approximated as P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11.
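The reported estimates are consistent with simply normalizing the class frequencies w_i at the node (a sketch only; equation (11) is defined earlier in the dissertation and may include additional weighting):

```python
def class_probabilities(w):
    """Approximate class probabilities at a node from example frequencies w_i."""
    total = sum(w.values())
    return {cls: round(wi / total, 2) for cls, wi in w.items()}

# Frequencies at node x2 from the text: w1=31, w2=11, w3=0, w4=5 (total 47)
print(class_probabilities({'C1': 31, 'C2': 11, 'C3': 0, 'C4': 5}))
# -> {'C1': 0.66, 'C2': 0.23, 'C3': 0.0, 'C4': 0.11}
```

For example, 31/47 = 0.66, which matches the P(C1) value quoted above.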

Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that rules whose combined t-weight represented 10% or less coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
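The truncation criterion can be sketched as follows (an illustration under one reading of the text: a rule is dropped when its t-weight covers 10% or less of its class's training examples; function and variable names are assumed):

```python
def truncate_rules(rules_by_class, class_sizes, noise_level=0.10):
    """Drop rules whose t-weight covers no more than noise_level of the
    training examples of their class."""
    return {cls: [(r, t) for r, t in rules if t > noise_level * class_sizes[cls]]
            for cls, rules in rules_by_class.items()}

# Illustration with the C1 t-weights from Figure 4-2 (t-weights summing to 31):
c1 = {'C1': [('rule%d' % i, t) for i, t in enumerate([18, 3, 2, 2, 2, 2, 2], 1)]}
print(len(truncate_rules(c1, {'C1': 31})['C1']))  # -> 1 (only the t=18 rule survives)
```

Removing low-coverage rules before building the structure is what yields the smaller tree of Figure 4-7.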

Figure 4-6 A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves)

Figure 4-7 A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves)

To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.

Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes used in building the initial decision trees. The visualization diagram uses different shades for different decision classes. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.

Figure 4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data ("?" means the system cannot produce a decision without the missing attribute)

Experiments with Subsystem I: The initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings (two types of decision rules--characteristic or discriminant; three coverage modes--intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths--1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and <Chr, Intr, 1>; see Table 4-2) were selected for experiments with Subsystem II.

These experiments were performed for four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with testing examples that represent the complement of the training example set.


Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each dataset, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the


threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
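The generalization degree described above can be read as a stopping threshold: when the examples of non-majority classes at a node fall at or below the given ratio, expansion stops and the node becomes a leaf (a hedged sketch of this reading, not the exact AQDT-2 rule, which is given in Michalski & Imam, 1994):

```python
def should_generalize(class_counts, generalization_degree=0.10):
    """Stop expanding a node when examples of non-majority classes make up
    no more than the generalization degree of the examples at the node."""
    total = sum(class_counts.values())
    majority = max(class_counts.values())
    return (total - majority) / total <= generalization_degree

print(should_generalize({'C1': 97, 'C2': 3}))   # 3% minority  -> True
print(should_generalize({'C1': 80, 'C2': 20}))  # 20% minority -> False
```

A smaller generalization degree therefore produces larger but more faithful structures, which is consistent with the experimental results reported below.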

Figure 4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem (four panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy)

Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For each dataset, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.

Figure 4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data (two panels: <Disj, Char> and <Intr, Char>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy)

Figure 4-11 A comparison of AQ15c and AQDT-2 against C4.5 on the wind bracing data (three panels: predictive accuracy, complexity, and learning time vs. relative size of training examples (%))

4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1

This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no).


The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus, the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.

Figure 4-12 A visualization diagram of the MONK-1 problem

The AQDT-2 program, running in its default mode with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with 41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem.

Positive rules:
1. [x5=1]
2. [x1=3][x2=3]
3. [x1=2][x2=2]
4. [x1=1][x2=1]

Negative rules:
1. [x1=1][x2=2,3][x5=2..4]
2. [x1=2][x2=1,3][x5=2..4]
3. [x1=3][x2=1,2][x5=2..4]

Figure 4-13 Decision rules learned by AQ15c for the MONK-1 problem

The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value "T" when the value of x1 equals the value of x2, and takes the value "F" otherwise. These rules were:

Pos <= [x5=1] v [x1=x2] and Neg <= [x5≠1] & [x1≠x2]
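The constructed attribute and the resulting compact rules can be illustrated directly (a sketch under assumed encodings: x1 and x2 range over {1, 2, 3} for the three shapes, and x5=1 denotes a red jacket):

```python
def x1_eq_x2(example):
    """The attribute constructed by AQ17-DCI: 'T' iff head-shape equals body-shape."""
    return 'T' if example['x1'] == example['x2'] else 'F'

def classify_monk1(example):
    """Pos <= [x5=1] v [x1=x2]; Neg otherwise."""
    return 'Pos' if example['x5'] == 1 or x1_eq_x2(example) == 'T' else 'Neg'

print(classify_monk1({'x1': 2, 'x2': 2, 'x5': 3}))  # -> Pos (head = body)
print(classify_monk1({'x1': 1, 'x2': 3, 'x5': 1}))  # -> Pos (red jacket)
print(classify_monk1({'x1': 1, 'x2': 3, 'x5': 2}))  # -> Neg
```

With the constructed attribute, the whole MONK-1 concept reduces to a disjunction of two conditions, which is why the resulting decision structure is so small.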

Table 4-3 Evaluation of the attribute selection criteria for the MONK-1 problem

From these rules, the system produced the compact decision structure presented in Figure 4-15b. It should be noted that the decision structures in Figures 4-14, 4-15a, and 4-15b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15a).

Figure 4-14 The decision tree for the MONK-1 problem generated by AQDT-1 (P: Positive, N: Negative; complexity: 13 nodes, 28 leaves)

Figure 4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem: (a) from the AQ15 rules (complexity: 5 nodes, 7 leaves); (b) from the AQ17 rules (complexity: 2 nodes, 3 leaves); P: Positive, N: Negative

Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings (two types of decision rules--characteristic or discriminant; three coverage modes--intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search--1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Ch, Disj, 10> and <Ch, Intr, 1>) were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.

Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.

Figure 4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem (four panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy)

Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each dataset, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.

Figure 4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data (two panels: <Disj, Char> and <Intr, Char>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy)

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each dataset, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.

Figure 4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem (three panels: predictive accuracy, complexity, and learning time vs. relative size of training examples (%))

4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using its original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no). The original problem was to learn a concept from 169 training examples (62 positive and 62 negative). These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.

Figure 4-19 A visualization diagram of the MONK-2 problem


Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Ch, Disj, 10> and <Ch, Intr, 1>). They were selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with testing examples that represent the complement of the training examples.

Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.

Figure 4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem (four panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy)

Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each dataset, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.

Figure 4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data (two panels: <Disj, Char> and <Intr, Char>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy)

Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each dataset, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.

Figure 4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem (three panels: predictive accuracy, complexity, and learning time vs. relative size of training examples (%))

4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded areas and the plus signs in the unshaded areas are considered noisy examples. Noisy examples are examples that are assigned the wrong decision class.

Figure 4-23 A visualization diagram of the MONK-3 problem

Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with testing examples that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.


Experiments with Subsystem II. In these experiments, the parameters of Subsystem I (the learning process) were fixed, and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.
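A pre-pruning threshold of the kind discussed above can be illustrated with a small sketch. This assumes a simple criterion (stop expanding a node once the minority classes account for no more than the threshold of the cases reaching it); the exact AQDT-2 criterion may differ, and `should_prune` is a hypothetical helper.

```python
def should_prune(class_counts, threshold=0.03):
    """Illustrative pre-pruning test: stop expanding a node when the
    minority classes account for no more than `threshold` of the cases
    reaching it. The 3% default mirrors the setting used in the
    experiments reported here."""
    total = sum(class_counts.values())
    if total == 0:
        return True  # nothing reaches this node; make it a leaf
    majority = max(class_counts.values())
    return (total - majority) / total <= threshold

print(should_prune({"pos": 97, "neg": 3}))   # minority share 3% -> prune
print(should_prune({"pos": 90, "neg": 10}))  # minority share 10% -> expand
```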


Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem (panels: <Disj, Char>, <Intr, Char>, <Disj, Disc>, <Intr, Disc>; x-axis: the relative sample sizes (%) of the training data).

Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data (x-axis: the relative sample sizes (%) of the training data).

Comparative Study. Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.1% error rate when testing against 90% of the data, while the same error may represent a 10% error rate when testing against 10% of the data. These curves do not represent the learning curve.
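The arithmetic behind this explanation can be checked directly. Assuming, purely for illustration, a dataset of about 100 examples:

```python
def error_rate(errors, test_size):
    """Error rate as a fraction of the test set."""
    return errors / test_size

total = 100  # assumed dataset size for illustration
# Training on 10% leaves 90 test examples; training on 90% leaves 10.
small_train = error_rate(1, total - 10)   # one error out of 90
large_train = error_rate(1, total - 90)   # the same one error out of 10
print(round(100 * small_train, 1), round(100 * large_train, 1))
```

A single misclassification thus weighs nine times more heavily when the test set shrinks from 90% to 10% of the data, which explains the apparent dips in the curves.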

Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem (predictive accuracy, tree complexity, and learning time vs. the relative size of training examples (%)).

4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3) Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).
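The attribute schema above can be captured in a small sketch that also drops the sample code number, which is an identifier rather than a predictive attribute. The record values below are dummies; this is an illustration, not the actual data-loading code.

```python
ATTRIBUTES = [
    "Sample Code Number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion",
    "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
    "Normal Nucleoli", "Mitoses",
]

def to_example(record):
    """Drop the sample code number and keep the nine scaled (1-10)
    measurements as named attribute-value pairs."""
    return dict(zip(ATTRIBUTES[1:], record[1:]))

row = (1234567, 5, 1, 1, 1, 2, 1, 3, 1, 1)  # dummy record
ex = to_example(row)
print(len(ex), ex["Clump Thickness"])
```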

In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.1% error rate when testing against 90% of the data, while the same error may represent a 10% error rate when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem (predictive accuracy, tree complexity, and learning time vs. the relative size of training examples (%)).

4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples. A random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes. These attributes are: 1) Cap-shape, 2) Cap-surface, 3) Cap-color, 4) Bruises, 5) Odor, 6) Gill-attachment, 7) Gill-spacing, 8) Gill-size, 9) Gill-color, 10) Stalk-shape, 11) Stalk-root, 12) Stalk-surface-above-ring, 13) Stalk-surface-below-ring, 14) Stalk-color-above-ring, 15) Stalk-color-below-ring, 16) Veil-type, 17) Veil-color, 18) Ring-number, 19) Ring-type, 20) Spore-print-color, 21) Population, and 22) Habitat.

To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.

In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning times are about the same.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.1% error rate when testing against 90% of the data, while the same error may represent a 10% error rate when testing against 10% of the data. These curves do not represent the learning curve.

Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem (predictive accuracy, tree complexity, and learning time vs. the relative size of training examples (%)).

4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains

Learning Task-Oriented Decision Structures from Structural Data. This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes: eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was generated such that they could completely describe any car in the train; see Table 4-7. Each train was described by one example of varying length. To recognize the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (i,j), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
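The (i,j) naming scheme described above can be sketched as a small encoding function. The car attribute values below are dummies, and the attribute order is illustrative (the actual eight car attributes are those of Table 4-7).

```python
def encode_train(cars):
    """Flatten a train (a list of car descriptions) into attribute-value
    pairs named 'x<i><j>', where i is the car position (1-4) and j is
    the attribute number (1-8). Trains with different numbers of cars
    naturally yield examples of different length."""
    example = {}
    for i, car in enumerate(cars, start=1):
        for j, value in enumerate(car, start=1):
            example[f"x{i}{j}"] = value
    return example

# A two-car train, each car given by 8 dummy attribute values.
train = [("long", "rect", 2, "none", 3, 2, "circle", 1),
         ("short", "oval", 1, "flat", 2, 1, "hexagon", 2)]
ex = encode_train(train)
print(ex["x12"], ex["x27"], len(ex))
```

Here `x12` is the second attribute (shape) of the first car and `x27` the seventh attribute of the second car, matching the convention that x32 names the shape of the third car.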

Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1-4).

Decision-making situations. In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.

Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).

Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: a) using only descriptions of Car 1; b) using only descriptions of Car 2; c) using only descriptions of Car 3.

4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the second half in the other class).
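The windowing option mentioned above can be sketched as follows. This is an illustrative rendering of the general window-based induction loop (build a tree from a random window, add misclassified examples, retrain), not C4.5's actual implementation; the table-lookup learner is a toy stand-in for the tree inducer.

```python
import random

def window_train(learn, classify, examples, window_size, max_trials=10, seed=0):
    """Sketch of windowing: learn from a random window, test on the
    remaining examples, add the misclassified ones to the window, and
    repeat until no errors remain or the trial limit is reached."""
    rng = random.Random(seed)
    pool = examples[:]
    rng.shuffle(pool)
    window, rest = pool[:window_size], pool[window_size:]
    model = learn(window)
    for _ in range(max_trials):
        misclassified = [(x, y) for x, y in rest if classify(model, x) != y]
        if not misclassified:
            break
        window += misclassified
        rest = [e for e in rest if e not in misclassified]
        model = learn(window)
    return model

# Toy stand-in learner: memorize seen examples, default to "neg".
def learn_table(train):
    return dict(train)

def classify_table(model, x):
    return model.get(x, "neg")

data = [((i,), "pos" if i % 2 else "neg") for i in range(20)]
model = window_train(learn_table, classify_table, data, window_size=5)
errors = sum(classify_table(model, x) != y for x, y in data)
print(errors)
```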


Table 4-8 and Figures 4-30a and b show the results graphically for the Congressional Voting-1984 problem. The results indicate that AQDT-2 generated decision trees that had a higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change in the size of the training example set was smaller.

Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting-1984 data.

Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2: a) accuracy of the decision tree as a function of the size of the set of training examples; b) size of the decision tree as a function of the size of the set of training examples.

4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from these rules. This section also includes some examples describing different decision-making situations and the task-oriented decision structures learned for each situation.

Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5 and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others, than for another type of cover), the best cover is determined according to the best width of the beam search and the best rule type.

Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.

It was clear that AQDT-2 works better with characteristic rules rather than discriminant ones. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.

To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, the predictive accuracy is considered high or low; 2) if the average learning time is within ±0.1 seconds, the learning time is considered the same.
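These summarization heuristics can be stated as a small helper. `verdict` is a hypothetical function name introduced here for illustration; the margins follow the text (±2% for accuracy, ±0.1 seconds for learning time).

```python
def verdict(aq, c45, margin, higher_is_better=True):
    """Heuristic from the text: values within `margin` of each other
    count as 'Same'; otherwise report the system with the better value.
    Set higher_is_better=False for measures like learning time."""
    if abs(aq - c45) <= margin:
        return "Same"
    better_aq = aq > c45 if higher_is_better else aq < c45
    return "AQDT-2" if better_aq else "C4.5"

print(verdict(94.3, 93.1, margin=2.0))                          # accuracies within 2%
print(verdict(96.0, 91.5, margin=2.0))                          # AQDT-2 clearly higher
print(verdict(0.35, 0.15, margin=0.1, higher_is_better=False))  # C4.5 clearly faster
```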

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparing the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performs better; Same/X means similar performance, where AQDT-2 has the advantage if X=A and C4.5 has the advantage if X=C.

Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but accurate decision trees, while C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of decision trees learned by C4.5 grows relatively faster as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules and that C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is supposed to be much less than that of C4.5. However, on some data sets it takes more time, because there are some situations where there is not enough information to reach a decision, and the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not implemented yet.

To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparison between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class. The white areas represent non-positive coverage.

Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem.


Figure 4-32 shows the testing results of the decision rules learned by AQ15c. The marked shaded cells indicate false positive errors (AQ15c classifies the cell as positive while it should be negative), and the marked non-shaded cells indicate false negative errors (AQ15c classifies the cell as negative while it should be positive).

Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31.

Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, one shading indicates portions of the representation space that were classified as positive by both AQ15c and AQDT-2; another shading marks portions of the representation space that were classified as positive by AQ15c but as negative by AQDT-2; a third shading represents portions of the representation space where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).


This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors produced by the AQDT-2 decision tree.

Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem.

Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative errors. Some marked cells indicate portions of the representation space with false positive errors; the other marked cells represent portions of the representation space with false negative errors. Comparing Figures 4-34 and 4-32 shows that more errors occurred because of the over-generalization.


Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree.

Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%.

CHAPTER 5 CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.

The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated online, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: that in order to determine a decision structure from examples it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.

The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply those decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.

5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.

Another advantage of the AQDT-2 method is that the decision structures obtained this way tend to be simpler and have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures have outperformed those generated by the well-known C4.5 decision tree learning program in most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees. The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of the rule learning program, it could potentially be applied also with other decision rule learning systems or with decision rules acquired from an expert.

REFERENCES


Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., DeMarchi, D. and Brancadori, F. (1990), "Integrated Learning in Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (Eds.) (1987), Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Wadsworth Int. Group, Belmont, California.

Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," in M. Bramer (Ed.), Research and Developments in Expert Systems, Cambridge University Press, Cambridge.

Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, Academic Press, New York.

Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?," in Komorowski, J. and Ras, Z.W. (Eds.), Lecture Notes in Artificial Intelligence (689), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.

Imam, I.F. and Michalski, R.S. (1993b), "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. and Zemankova, M. (Eds.), Kluwer Academic Publishers, MA.

Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.

Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.

Imam, I.F. and Michalski, R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, Seattle, Washington, July.

Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.

Kohavi, R. and Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.

Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.

Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.

Michalski RS (1973) AQVAU1-Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition Proceeding of the First International Joint Conference on Pattern Recognition (pp 3-17) Washington DC October 30shyNovember 1

Michalski RS (1978) Designing Extended Entry Decision 1llbles and Optimal Decision frees Using Decision Diagrams Technical Repon No898 Urbana University of illinois March

Michalski RS (1983) A Theory and Methcxlology of Inductive Learning Artificial Intelligence Vol 20 (pp 111-116)

Michalski RS Mozetic I Hong J and Lavrac N (1986) The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains Proceedings ofAAAI-86 (pp 1041-1045) Philadelphia PA

Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, pp. 63-111, San Mateo, CA: Morgan Kaufmann Publishers, June.


Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer-Verlag, from the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.

Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3, pp. 319-342, Kluwer Academic Publishers.

Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4, pp. 227-243, Kluwer Academic Publishers.

Niblett, T. and Bratko, I. (1986), "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.

Quinlan, J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.

Quinlan, J.R. (1983), "Learning Efficient Classification Procedures and Their Application to Chess End Games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.

Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1, pp. 81-106, Kluwer Academic Publishers.

Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27, pp. 221-234.

Quinlan, J.R. (1990), "Probabilistic Decision Trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, pp. 63-111, San Mateo, CA: Morgan Kaufmann Publishers, June.

Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI-90, Stockholm, August.

Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Publishers, San Francisco.

Thrun, S.B., Mitchell, T. and Cheng, J. (Eds.) (1991), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.

Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.

Vita

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall 1991, he joined the graduate program at GMU.

Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, and conference or workshop proceedings, as well as 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition on Machine and Human Intelligence organized by Oxford University. Two solutions obtained by that program ranked second and third in one competition; another two solutions ranked sixth and seventh among 65 entries from around the world in another competition.

Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium FLAIRS-95 and on the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI).

Ibrahim's Ph.D. dissertation, titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
