DERIVING TASK-ORIENTED
DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M Fahmi Imam
MLI 95-7
October 1995
DERIVING TASK-ORIENTED DECISION STRUCTURES FROM DECISION RULES
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at George Mason University
By
Ibrahim M Fahmi Imam
Director Professor Ryszard S Michalski
PRC Chaired Professor of Computer Science and Systems Engineering, School of Information Technology and Engineering
George Mason University
Fall Semester 1995 George Mason University Fairfax Virginia 22030
© 1995 Copyright by Ibrahim F Imam. All rights reserved.
ACKNOWLEDGMENT
I would like to thank Professor Ryszard S Michalski, PRC Chaired Professor of Computer
Science and Engineering and my Dissertation Director, for his support, encouragement and
guidance. I would like to thank my committee members: Professor Larry Kerschberg, Chair of the
Department of Information Systems and Software Engineering; Professor David Rine, Professor
of Information Technology and Engineering; and Professor David Schum, Professor of Excellence
in Information Technology and Engineering, for their encouragement and help with many aspects
of my PhD.
I would like to thank Professor Tomasz Arciszewski, Systems Engineering Department, for providing
me with application problems; Ronny Kohavi, Stanford University, for discussion and for providing
me with some related work on learning decision structures and decision graphs; and Professor
George Tecuci, Computer Science Department, for pointing out some related work.
I would like to thank my colleagues: Nabil Al-Kharouf, for reviewing my dissertation; Eric
Bloedorn, for reviewing an earlier draft of my dissertation and for the use of his program AQ17-DCI in
my experiments; Srinivas Gutta, for providing some applications for my PhD work; Mike Hieb, for
reviewing an earlier draft of my dissertation and helping me find relevant articles; Ken Kaufman,
for reviewing an earlier draft of my thesis; Mark Maloof, for providing me with script files which
made it easier to run AQ15c iteratively; Haleh Vafaie, for working with me on the application and
comparison of different aspects of my work; and Janusz Wnek, for the use of his DIAV program for
explaining my results.
I would like to thank Professor Andrew P Sage, Dean of the School of Information Technology
and Engineering, and Professor Kenneth Bumgarner, Dean of Student Services and Associate Vice
President of George Mason University, for their support, and Professor Murray W Black,
Associate Dean of the School of Information Technology and Engineering, for guidance on
preparing the PhD proposal.
I would like to thank the conference organizers who supported my attendance at their conferences to
present parts of my PhD work. The organizers include Professor Moonis Ali, Professor Frank
Anger, Professor Zbigniew Ras, the American Association for Artificial Intelligence, and others. I
would also like to thank the organizing committee of the Florida Artificial Intelligence Research
Symposium (FLAIRS-95) for organizing my first workshop. Thanks go to Dr Doug Dankel, Dr
Howard Hamilton, Dr John Stewman and Dr Dan Tamir.
I would also like to thank the many individuals who helped me in any way during my PhD. Those
include Dr Ashraf Abdel-Wahab, Dr Jerzy Bala, Dr Alex Brodsky, Dr Richard Carver, Dr
Kenneth De Jong, John Doulamis, Tom Dybala, Dean Ibrahim Farag, Dr Ophir Frieder, Csilla
Frakes, Dr Mohamed Habib, Ali Hadjarian, Dr Hugo De Garis, Zenon Kulpa, Dr Paul Lehner,
Dr George Michaels, Dr Eugene Norris, Dean James Palmer, Mitch Potter, Dr Ahmed Rafea,
Jim Ribeiro, Jayshree Sarma, Dr Arun Sood, Dr Clifton Sutton, Bradley Utz, Dr Tibor Vamos,
Patricia Zahra, Dr Shaker Zahra and Dr Jianping Zhang.
Dedication
To my mother my brothers and my sister
TABLE OF CONTENTS
TITLE Page
ABSTRACT 1
CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6
CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19
CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule-Learning Systems 25
3.3 Generating Decision Structures From Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structures to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs Decision Trees 53
CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average Size, Complex and Noise-Free Problems:
Wind Bracings 60
4.3 Experiments With Small Size, Simple and Noise-Free Problems:
MONK-1 69
4.4 Experiments With Small Size, Complex and Noise-Free Problems:
MONK-2 76
4.5 Experiments With Small Size, Simple and Noisy Problems:
MONK-3 79
4.6 Experiments With Large Size, Complex and Noise-Free Problems:
Diagnosing Breast Cancer 83
4.7 Experiments With Large Size, Complex and Noisy Problems:
Mushroom Classifications 84
4.8 Experiments With Small Size, Structured and Noise-Free Problems:
East-West Trains 85
4.9 Experiments With Small Size, Simple and Noisy Problems:
Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88
CHAPTER 5 CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96
REFERENCES 98
VITA 102
LIST OF TABLES
No TITLE Page
2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees
provided by different attribute selection criteria from four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking, domain and usage conditions of the AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained
by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach
with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90
LIST OF FIGURES
No TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept Voting pattern of
Democratic Representatives 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example, where learning decision structures (trees)
from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the
assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different
decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram shows the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram shows testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing
the generalization degree to 1 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M Fahmi Imam PhD
School of Information Technology and Engineering
George Mason University, Fall 1995
Ryszard S Michalski Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from
decision rules. The philosophy behind this research is that it is more appropriate to learn
knowledge and store it in a declarative form, and then, when a decision-making situation occurs,
to generate from this knowledge the decision structure that is most suitable for the given
decision-making situation. Learning decision structures from decision rules was first introduced by
Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and
Michalski (1993a, b).
This approach separates the function of generating a knowledge base from the function of using
the knowledge base for decision-making. The first function focuses on learning accurate,
consistent and complete concept descriptions expressed in a declarative form. The second function
is performed whenever a new decision-making situation occurs: a task-oriented decision structure
is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted
for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam,
1994).
The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from
decision rules or examples. Each decision-making situation is defined by a set of parameters that
controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show
that the decision structures it learns usually outperform, in terms of accuracy and average size,
those learned from examples by other well-known systems. The results also
show that the system does not work very well with noisy data. The system is illustrated and
compared using applications to artificial problems, such as the three MONK's problems (Thrun,
Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also
applied to real-world problems of learning decision structures in the areas of construction
engineering (for determining the best wind bracing design for tall buildings), medical
diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for
learning classification rules for distinguishing between poisonous and non-poisonous
mushrooms), and political data (for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
1.1 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge, but also
to use this knowledge for decision-making. The main step in the development of systems for
decision-making is the creation of a knowledge structure that characterizes the decision-making
process. The form in which knowledge can be easily obtained may, however, differ from the form
in which it is most readily used for decision-making. It is therefore important to identify the form
of knowledge representation that is most appropriate for learning (e.g., due to ease of its
modification) and the form that is most convenient for decision-making.
A simple and effective tool for describing decision processes is a decision structure, which is a
directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to
arrive at a decision about that object. The nodes of the structure are assigned individual tests
(which may correspond to a single attribute, a function of attributes, or a relation), the branches are
assigned possible test outcomes or ranges of outcomes, and the leaves are assigned a specific
decision, a set of candidate decisions with corresponding probabilities, or an undetermined
decision. A decision structure reduces to a familiar decision tree when each node is assigned a
single attribute and has at most one parent, when the branches from each node are assigned single
values of that attribute, and when the leaves are assigned single definite decisions. Thus the problem
of generating a decision structure is a generalization of the problem of generating a decision tree.
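To make the distinction concrete, here is a minimal sketch (illustrative Python, not part of any system described in this dissertation; all names are invented) of a decision structure in which two branches share a child node, something a decision tree cannot do, and a leaf carries a set of candidate decisions with probabilities rather than a single definite decision:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a decision structure (a DAG, not necessarily a tree)."""
    test: object = None                            # attribute tested; None marks a leaf
    branches: dict = field(default_factory=dict)   # test outcome -> child Node (children may be shared)
    decisions: dict = field(default_factory=dict)  # decision -> probability (used at leaves)

    def classify(self, obj):
        """Follow the test outcomes down to a leaf; return its candidate decisions."""
        if self.test is None:
            return self.decisions
        return self.branches[obj[self.test]].classify(obj)


# A tiny structure over a single test x1: outcomes 1 and 2 share one child
# node (making the graph a DAG), and that leaf holds candidate decisions.
leaf_a = Node(decisions={"A": 1.0})
leaf_bc = Node(decisions={"B": 0.7, "C": 0.3})
structure = Node(test="x1", branches={0: leaf_a, 1: leaf_bc, 2: leaf_bc})
```

Collapsing the shared child into separate copies, restricting tests to single attributes, and forcing each leaf to one definite decision recovers an ordinary decision tree.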
Decision trees are typically generated from a set of examples of decisions. The essential
characteristic of any such method is the attribute selection criterion used for choosing attributes to
be assigned to the nodes of the decision tree being built. Such criteria include the entropy
reduction, the gain and the gain ratio (Quinlan, 1979, 83, 86), the gini index of diversity (Breiman
et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
A decision tree/decision structure representation can be an effective tool for describing a decision
process, as long as all the required tests can be performed and the decision-making situations it
was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine
that answers for all symptoms appear in the decision tree). Problems arise when these assumptions
do not hold. For example, in some situations measuring certain attributes may be difficult or costly
(e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the
tools needed are not available). In such situations it is desirable to reformulate the decision
structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the
root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far
away from the root). If an attribute cannot be measured at all, it is useful to either modify the
structure so that it does not contain that attribute or, when this is impossible, to indicate
alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a
significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient
example, the doctor may request a decision structure expressed in a specific set of symptoms,
biased to classify one or more diseases, or specifying a certain order of testing).
A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite
difficult. This is because a decision structure is a procedural representation that imposes an
evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative
representation, such as a set of decision rules: tests (conditions) of rules can be evaluated in any
order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent
decision structures (trees), which differ in the test ordering. Due to the lack of order constraints,
a declarative representation (rules) is much easier to modify to adapt to different situations than a
procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a
decision, one needs to decide in which order tests are evaluated, and thus needs a decision
structure.
An attractive solution to these opposite requirements is to acquire and store knowledge in a
declarative form and transform it to a decision structure when it is needed for decision-making.
This method allows one to create a decision structure that is most appropriate in a given
decision-making situation. Because the number of decision rules per decision class is usually small (each
rule is a generalization of a set of examples), generating a decision structure from decision rules
can potentially be performed much faster than generation from training examples. Thus this
process could be done on line, without any delay noticeable to the user. Such virtual decision
structures are easy to tailor to any given decision-making situation.
This approach allows one to generate a decision structure that avoids or delays evaluating an
attribute that is difficult to measure in some decision-making situation, or that fits well a particular
frequency distribution of decision classes. In other situations it may be unnecessary to generate a
complete decision structure; it may be sufficient to generate only the part of it that concerns
the decision classes of interest. Thus such an approach has many potential advantages.
This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a
task-oriented decision structure (a decision structure that is adapted to the given decision-making
situation) from decision rules. The decision rules are learned either by the rule learning system AQ15
(Michalski et al., 1986) or by the system AQ17-DCI, which has extensive constructive induction
capabilities (Bloedorn et al., 1993).
To associate the decision rules with a given decision-making task, AQDT-2 provides a set of
features, including: 1) enabling the system to include in the decision structure nodes corresponding
to new attributes constructed during the process of learning the decision rules; 2) controlling the
degree of generalization needed during the development of the decision structure; 3) providing four
new criteria for selecting an attribute to be a node in the decision structure, which allow the system to
generate many different but equivalent decision structures from the same set of rules; 4) generating
unknown nodes in situations where there is insufficient information for generating a complete
decision structure; 5) learning decision structures from discriminant rules as well as
characteristic rules; and 6) providing the most likely decision when the decision process stops
due to inability to evaluate an attribute associated with an intermediate node.
To test the methodology of generating decision structures from decision rules, an extensive set of
planned experiments was designed to test different aspects of the approach. The experiments
include testing different combinations of parameters for each sub-function of the approach,
analyzing the relationship between decision rules and the decision structures learned from them, and
comparing decision trees learned by the AQDT-2 system with those of the well-known C4.5 (Quinlan, 1993)
system for learning decision trees from examples. Different experiments were designed to examine
the new features of the AQDT approach for learning task-oriented decision structures. The
experiments were applied to artificial domains as well as real-world domains, including MONK-1,
MONK-2 and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al.,
1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast
Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The
MONK's problems are concerned with learning classification rules for robot-like figures: MONK-1
requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description
(one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns
learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that
classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind
bracing data involves learning conditions for applying different types of wind bracing for tall
buildings. The Mushrooms data is concerned with learning classification rules for distinguishing
between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning
concept descriptions for recognizing breast cancer. The congressional voting data includes voting
records on different issues. AQDT-2 outperformed C4.5 on average with respect to both
predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy
data or with problems that have many rules covering very few examples.
1.2 The Problem Statement
There are many limitations and problems that accompany the use of decision trees for decision-making.
CHAPTER 2 RELATED RESEARCH
2.1 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which
introduced an algorithm for generating decision trees from decision lists. The method proposed
several attribute selection criteria. These criteria are of increasing power of the main criterion,
the order cost estimate (the nth order cost estimates, n=1, 2, ...). Michalski also analyzed two
specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree
based on properties extracted from the decision diagram. In order to better explain the method,
it is necessary to define some terms. These terms will be used later in the dissertation.
Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all
examples of class C and does not include any examples of other decision classes.
Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically
disjoint; in other words, for any two rules there exists a condition with the same attribute but
with different values in each rule.
Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules
among all possible covers.
Definition 2-4: A diagram for a given cover is a table constructed graphically by representing
in a two-dimensional space all possible combinations of attribute values, locating on the
diagram the condition parts of the given rules, and marking them with the action specified
by each rule.
Michalski's algorithm requires construction of a minimal cover. The minimal cover should be
consistent and complete. The method is based on the fact that if there are n decision classes,
any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal
decision tree (any consistent decision tree must have at least n leaves). Michalski (1978) has
shown that if even one rule is broken by a selected attribute, then instead of having one leaf
(which could potentially represent this rule or the decision class in the tree), there will have to
be at least two leaves representing this rule in the final decision tree.
The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do
not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can
divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules.
In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] &
[x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1], and x3 breaks three rules: the rule [x4=2]
& [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] &
[x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.
Figure 2-1 An example to illustrate how attributes break rules
One of the criteria defined by Michalski is the first degree cost estimate, which assigns to each
attribute an integer equal to the number of rules broken by that attribute. This criterion is also
called the static cost estimate of an attribute, or the criterion of minimizing added leaves
(MAL).
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the
estimated number of additional nodes in the decision tree being generated over a hypothetical
minimal decision tree. When there is a tie between two attributes, the attribute selected is
the one which breaks smaller rules (rules that cover fewer examples, or more specialized
rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).
Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion
(Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but
is more complex: once an attribute is selected as a node in the tree, some rules and/or
parts of the broken rules at each branch are merged into one rule. DMAL ensures that the
value of the total cost estimate of an attribute is decreased by a value equal to the number of
merged rules minus one.
Example: Learn a decision tree from the following decision table.
The minimal cover consists of the following rules:
A1 <- [x2=0] v [x1=0][x2=2]; A2 <- [x2=1] v [x1=2][x2=2]; A3 <- [x1=1][x2=2]
The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5
for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two
leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of
the decision tree is x2. Then three branches are attached to the root node and the decision rules
are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is
generated. For x2=2, another attribute is selected to be a node in the tree; in this case x1 has the
minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.
Figure 2-2 A decision tree learned from the decision table in Table 2-1
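The static cost estimate can be reproduced with a short sketch (an illustrative encoding, not Michalski's implementation): each rule is a mapping from attributes to the set of values its conditions allow, and an attribute breaks a rule whenever the rule does not fix that attribute to a single value. On the minimal cover above this yields exactly the MAL values quoted in the text (2 for x1, 0 for x2, 5 for x3 and x4):

```python
def mal(attribute, rules):
    """First-degree (static) cost estimate: the number of rules broken by
    `attribute`. A rule is broken unless it constrains the attribute to a
    single value (an absent attribute, or an internal disjunction of
    values, means the rule is split across branches)."""
    return sum(1 for rule in rules if len(rule.get(attribute, set())) != 1)


# The minimal cover from the example, one dict per disjoint rule:
#   A1 <- [x2=0] v [x1=0][x2=2]
#   A2 <- [x2=1] v [x1=2][x2=2]
#   A3 <- [x1=1][x2=2]
rules = [
    {"x2": {0}},              # [x2=0]        -> A1
    {"x1": {0}, "x2": {2}},   # [x1=0][x2=2]  -> A1
    {"x2": {1}},              # [x2=1]        -> A2
    {"x1": {2}, "x2": {2}},   # [x1=2][x2=2]  -> A2
    {"x1": {1}, "x2": {2}},   # [x1=1][x2=2]  -> A3
]

scores = {a: mal(a, rules) for a in ("x1", "x2", "x3", "x4")}
```

The attribute with the smallest score (here x2, which every rule fixes to one value) is chosen for the root, matching the tree in Figure 2-2.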
2.2 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating a decision tree that classifies a set
of examples according to the decision classes they belong to. The essential aspect of any
inductive decision tree method is the attribute selection criterion. The attribute selection
criterion measures how good the attributes are for discriminating among the given set of
decision classes. The best attribute according to the selection criterion is chosen to be assigned
to a node in the tree. The first algorithm for generating decision trees from examples was
proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer
strategy for building decision trees. This algorithm has subsequently been modified by
Quinlan (1979) and applied by many researchers to a variety of learning problems.
Attribute selection criteria can be divided into three categories: logic-based,
information-based, and statistics-based. The logic-based criteria for selecting attributes
use logical relationships between the attributes and the decision classes to determine the best
attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves
(Michalski, 1978), which uses conjunction and disjunction operators. The information-based
criteria are based on information theory. These criteria measure the information conveyed
by dividing the training examples into subsets. Examples of such criteria include the
information measure IM, the entropy reduction measure and the gain criteria (Quinlan, 1979,
83), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and
others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The
statistics-based criteria measure the correlation between the decision classes and the attributes.
These criteria use statistical distributions for determining whether or not there is a correlation.
The attribute with the highest correlation is selected to be a node in the tree. Examples of
statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Han, 1984;
Mingers, 1989a).
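As an illustration of the statistics-based family (a generic sketch, not Mingers' exact procedure, and the counts below are invented), the Chi-square score of an attribute can be computed from the contingency table of its values against the decision classes; a perfectly discriminating attribute scores high, while an attribute independent of the class scores zero:

```python
def chi_square(table):
    """Chi-square statistic for a contingency table, where table[i][j] is
    the count of examples with attribute value i and decision class j."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of value and class.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Each attribute value determines the class exactly -> strong association:
perfect = chi_square([[10, 0], [0, 10]])
# Classes are distributed identically under every value -> no association:
independent = chi_square([[5, 5], [5, 5]])
```

The attribute with the highest statistic across the candidate attributes would be chosen for the node (significance thresholds against the chi-square distribution are omitted here).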
Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the
method of learning decision trees to also handle data with noise (by pruning). Handling noise
extended the process of learning decision trees to include the creation of an initial complete
decision tree, and tree pruning, which is done by removing subtrees with small statistical validity
and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used
for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994).
Pruning decision trees improves their simplicity but reduces their predictive accuracy on the
training examples. Quinlan (1990) also proposed a method to handle the unknown
attribute-value problem by exploring probabilities of an example belonging to different classes.
The rest of this section includes a brief description of the attribute selection criterion used by
the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting
an attribute to be a node in the tree. The section also includes a brief description of the
Chi-square method for attribute selection (Mingers, 1989a), a statistics-based
method for selecting an attribute to be a node in the tree.
2.2.1 Building Decision Trees Using Information-based Criteria
This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs for learning decision trees from examples.
Learning decision trees from examples requires a collection of examples. Each example is represented by a fixed number of attribute-value pairs. C4.5 (Quinlan 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan 1979), which is based on Hunt's method for constructing decision trees from a set of cases (Hunt, Marin & Stone 1966).
The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3 called the Gain Criterion. The Gain Criterion uses the frequency of each decision class in the given set of training examples.
Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and partitions the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.
The Gain Criterion: The gain criterion is based on information theory. That is, the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem X1, ..., Xn are given attributes and C1, ..., Ck are decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:
freq(Ci, S) = number of examples in S belonging to Ci (2-1)
Suppose that |S| is the total number of examples in S; the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.
The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2 (freq(Ci, S) / |S|) bits.
The expected information from such a message stating class membership is given by
info(S) = - Σi=1..k (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|) bits (2-2)
info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.
Suppose that we selected an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, info_X(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:
info_X(T) = Σi=1..k (|Ti| / |T|) info(Ti) (2-3)
The information gained by partitioning the training examples T into subsets using the attribute X is given by
gain(X) = info(T) - info_X(T) (2-4)
The attribute to be selected is the attribute with the maximum gain value.
The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains attributes such as social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information that is gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.
Following steps similar to those used to obtain the information conveyed by dividing T into n subsets, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by
split info(T) = - Σi=1..n (|Ti| / |T|) log2 (|Ti| / |T|) (2-5)
The gain ratio is given by
gain ratio(X) = gain(X) / split info(X) (2-6)
and it expresses the proportion of information generated by the split that is useful for classification.
Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the set of training examples.
First, determine the amount of information gained by selecting the attribute "outlook" to be the root of the decision tree. This attribute divides the training examples into three subsets: "sunny", with five examples, two of which belong to the class "Play"; "overcast", with four examples, all of which belong to the class "Play"; and "rain", with five examples, three of which belong to the class "Play". To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes. Nine of these examples belong to the class "Play" and five belong to the class "Don't Play".
info(T) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940 bits
When using "outlook" to divide the training examples, the information becomes
info_outlook(T) = 5/14 [- (2/5) log2 (2/5) - (3/5) log2 (3/5)]
+ 4/14 [- (4/4) log2 (4/4) - (0/4) log2 (0/4)]
+ 5/14 [- (3/5) log2 (3/5) - (2/5) log2 (2/5)] = 0.694 bits
By substituting in equation 2-4, the information gain resulting from using the attribute "outlook" to split the training examples equals 0.246 bits. The information gain for "windy" is 0.048 bits.
Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for "outlook" is determined as follows:
split info(T) = - (5/14) log2 (5/14) - (4/14) log2 (4/14) - (5/14) log2 (5/14) = 1.577 bits
The gain ratio for "outlook" = 0.246 / 1.577 = 0.156
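The computation above can be checked with a short script. This is only a sketch of the formulas, not C4.5 itself; the class counts per "outlook" value are taken from the worked example. Note that with unrounded intermediate values the gain comes out as 0.247 rather than the 0.246 obtained from rounded figures.

```python
# Sketch verifying the worked example: entropy, gain, split info, and gain
# ratio for the 'outlook' attribute of Quinlan's 14-example weather data.
from math import log2

def info(counts):
    """Entropy (equation 2-2) of a distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class distribution (Play, Don't Play) per value of 'outlook'
subsets = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}
T = [9, 5]                      # Play / Don't Play in the full training set
n = sum(T)

info_T = info(T)                                                 # ~0.940 bits
info_x = sum(sum(s) / n * info(s) for s in subsets.values())     # ~0.694 bits
gain = info_T - info_x                                           # ~0.247 bits
split_info = info([sum(s) for s in subsets.values()])            # ~1.577 bits
gain_ratio = gain / split_info                                   # ~0.156

print(round(info_T, 3), round(gain, 3), round(split_info, 3), round(gain_ratio, 3))
```

The same four lines reproduce the gain for "windy" (0.048) if the subsets are replaced by that attribute's class distributions.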
Figure 2-3: A decision tree learned using the gain criterion for selecting attributes
The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.
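A minimal sketch of this binarization step follows. The candidate-threshold scheme (midpoints between consecutive distinct values) and the tiny humidity dataset are illustrative assumptions, not C4.5's exact procedure.

```python
# Sketch of binarizing a continuous attribute as described: try a threshold
# between consecutive observed values and keep the one whose two-way split
# (<= t vs > t) yields the highest information gain. Data is hypothetical.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        t = (pairs[i][0] + pairs[i + 1][0]) / 2        # candidate threshold
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical humidity readings with Play / Don't Play labels
t, g = best_threshold([70, 90, 85, 95, 80], ["P", "D", "D", "D", "P"])
print(t, round(g, 3))   # → 82.5 0.971
```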
Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees. This ratio is defined as (e+1)/(n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
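The Laplace ratio can be illustrated with hypothetical counts. The prune-or-keep comparison below is a simplified reading of how such an estimate might be used, not C4.5's full pruning procedure.

```python
# Sketch of the Laplace ratio (e+1)/(n+2) described in the text: a subtree is
# replaced by a leaf when the leaf's estimated error rate is no worse than the
# weighted estimate over the subtree's leaves. All counts are hypothetical.
def laplace_error(n, e):
    """Estimated error rate at a node with n examples, e of them misclassified."""
    return (e + 1) / (n + 2)

# A subtree with two leaves: (n=6, e=0) and (n=2, e=1)
subtree = (6 * laplace_error(6, 0) + 2 * laplace_error(2, 1)) / 8
# The leaf that would replace it covers all 8 examples, misclassifying 1
leaf = laplace_error(8, 1)

print(round(subtree, 3), round(leaf, 3), leaf <= subtree)   # → 0.219 0.2 True
```

Here the single leaf's estimate (0.200) is below the subtree's combined estimate (0.219), so the subtree would be pruned.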
2.2.2 Building Decision Trees Using Statistics-based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses the Chi-square statistic to measure the association between two attributes. When building decision trees, the method is implemented such that it determines the association between each attribute and the decision classes. The attribute to be selected is the one with the greatest value.
To determine the Chi-square value for an attribute, let aij be the number of examples in class number i where the attribute A takes value number j. In other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by
Chi-square (A) = Σi=1..n Σj=1..m [ (aij - Eij)² / Eij ] (2-7)
where n is the number of decision classes and m is the number of values of the given attribute. Also,
Eij = (TCi × TVj) / T (2-8)
where TCi and TVj are the total number of examples belonging to the decision class Ci and the total number of examples where the attribute A takes value Vj, respectively, and T is the total number of examples.
Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequency of different combinations of values between the decision class and both the "Outlook" and the "Windy" attributes. Table 2-4 shows the expected values, computed from TCi and TVj, of the frequencies in Table 2-3 for different attribute values and decision classes.
To determine the association value between the decision classes and both the attribute "Windy" and the attribute "Outlook", the observed Chi-square values are:
Chi-square (Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/5.1] + [(2-2.9)²/2.9]
= 0.21 + 0.39 + 0.16 + 0.28 = 1.04
Chi-square (Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2] + [(3-1.8)²/1.8] + [(0-1.4)²/1.4] + [(2-1.8)²/1.8]
= 0.45 + 0.75 + 0.01 + 0.8 + 1.4 + 0.02 = 3.43
Applying the same method to the other attributes, the results favor the attribute "Outlook". Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset.
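The same contingency-table computation can be sketched as follows. With unrounded expected counts the scores come out as 0.93 for "Windy" and 3.55 for "Outlook", slightly different from the hand-rounded arithmetic above, but the ranking of the attributes is unchanged.

```python
# Sketch of the Chi-square attribute score (equations 2-7 and 2-8) applied to
# the weather data's contingency tables: observed counts a_ij per
# (class, attribute-value) pair, expected counts E_ij = TC_i * TV_j / T.
def chi_square(observed):
    """observed[i][j] = number of examples of class i with attribute value j."""
    class_totals = [sum(row) for row in observed]          # TC_i
    value_totals = [sum(col) for col in zip(*observed)]    # TV_j
    total = sum(class_totals)                              # T
    score = 0.0
    for i, row in enumerate(observed):
        for j, a in enumerate(row):
            e = class_totals[i] * value_totals[j] / total  # expected count E_ij
            score += (a - e) ** 2 / e
    return score

# Rows: Play, Don't Play; columns: attribute values (counts as in Table 2-3)
windy = [[3, 6], [3, 2]]                 # windy = true / false
outlook = [[2, 4, 3], [3, 0, 2]]         # sunny / overcast / rain

print(round(chi_square(windy), 2), round(chi_square(outlook), 2))   # → 0.93 3.55
```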
Table 2-5 shows a summary of these criteria and their basic evaluation functions.
Table 2-5: Attribute selection criteria and their basic evaluation measures
Info Measure (IM), Gain, G-statistic, and Gain Ratio:
Entropy(S) = - Σi (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)
G-statistic = 2N × IM (N = number of examples)
Chi-square:
Chi-square (A, B) = Σi=1..n Σj=1..m [ (aij - Eij)² / Eij ]
2.2.3 Analysis of Attribute Selection Criteria
This subsection briefly introduces the analysis of different selection criteria that was done by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs. These criteria are the Information Measure (IM), Chi-square, G-statistic, Gini index of diversity, Marshall correction, and Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of using the Chi-square criterion, the value zero adds the maximum association between any two attributes, because the Chi-square value of a zero cell is the expected value of this cell.
Now let us examine results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees for eleven different criteria. In the final results, he compared the total number of nodes and the total error rate provided by each criterion over all given problems. Table 2-8 shows the final results for five selected criteria only.
Table 2-8: Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four problems
This experiment was performed on four real-world data sets. These data are concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and recognizing LCD display digits. The data was divided randomly, 70% for training and 30% for testing. For more details, see Mingers (1989a).
2.3 Learning Decision Structures
Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas by Imam and Michalski (1993b).
In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the contents of the temporary memory with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes containing only rules represent conditions common to all of their children.
The main disadvantages of this approach are that it requires discriminant rules to build such a decision structure, and that such a structure is more complex than the traditional decision trees that are used for decision-making.
Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is "Safe", except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is "Lost", except if x6=1 it is "Safe", except if x7=1 it is "Lost".
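The exception chain just read off Figure 2-4 can be sketched as nested defaults and exceptions. The encoding below is illustrative only, not Gaines' actual EDAG data structure.

```python
# Sketch of evaluating the exception chain of Figure 2-4 as the text reads it:
# "Safe, except if x1=1 & x2=1 & x3=1 and (x4=3 or x5=1) then Lost,
#  except if x6=1 then Safe, except if x7=1 then Lost."
def classify(x):
    decision = "Safe"                                   # default conclusion
    if x["x1"] == 1 and x["x2"] == 1 and x["x3"] == 1 and (x["x4"] == 3 or x["x5"] == 1):
        decision = "Lost"                               # first exception
        if x["x6"] == 1:
            decision = "Safe"                           # exception to the exception
            if x["x7"] == 1:
                decision = "Lost"                       # innermost exception
    return decision

print(classify({"x1": 1, "x2": 1, "x3": 1, "x4": 3, "x5": 2, "x6": 1, "x7": 1}))  # → Lost
```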
The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path. In other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once. However, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.
Safe <= [x1=2]
Safe <= [x2=2]
Safe <= [x3=2]
Safe <= [x4=1] & [x5=2]
Safe <= [x4=1] & [x5=3]
Safe <= [x6=1] & [x7=2]
Safe <= [x6=1] & [x7=3]
Safe <= [x4=2] & [x5=2]
Safe <= [x4=2] & [x5=3]
Lost <= [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a combination of that attribute's values and classes. For each subset, the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1) and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains examples where A takes value 0 and belong to class C0, or where A takes value 1 and belong to class C1. The second subset is the set of examples where A takes value 0 and belong to class C1, or where A takes value 1 and belong to class C0. The number of nodes of the first level (above the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced exponentially to one.
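The two-subset split in this example can be rendered literally as follows. This covers only the binary-attribute, two-class illustration in the text, not Kohavi's general bottom-up algorithm, and the data is hypothetical.

```python
# Sketch of the two-value, two-class split described in the text: after
# removing attribute A, each example joins the subset whose value->class
# mapping it is consistent with.
def split_by_mapping(examples, attr):
    # subset1 represents the mapping {0 -> C0, 1 -> C1},
    # subset2 represents the mapping {0 -> C1, 1 -> C0}
    subset1, subset2 = [], []
    for ex in examples:
        v, c = ex[attr], ex["class"]
        if (v, c) in {(0, "C0"), (1, "C1")}:
            subset1.append(ex)
        else:
            subset2.append(ex)
    return subset1, subset2

data = [{"A": 0, "class": "C0"}, {"A": 1, "class": "C1"},
        {"A": 0, "class": "C1"}, {"A": 1, "class": "C0"}]
s1, s2 = split_by_mapping(data, "A")
print(len(s1), len(s2))   # → 2 2
```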
It is easy for the reader to figure out some major disadvantages of such an approach. The average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., strong patterns) or logical relationship in the data. The time used to learn such a decision structure is relatively very high compared to systems for learning decision trees from examples. Finally, it could be better to search for an attribute that reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system, which can be found in (Kohavi 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection which minimizes the width of the penultimate level of the graph.
Table 2-9 shows a comparison between the proposed approach and these two approaches. The EDAG and HOODG systems are unreleased prototype systems.
Table 2-9 (fragment): decision structures are easy to understand | decision structures are difficult to read | decision structures are easy to understand
CHAPTER 3 DESCRIPTION OF THE APPROACH
3.1 General Methodology
In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed, when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.
The Learning Task
Given:
A set of training examples describing the concept to be learned
A learning goal, which specifies the decision classes to be learned from the training examples
Background knowledge to control the learning process
Determine:
A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal

The Decision-making Task
Given:
A set of decision rules in a conjunctive form
A description of the new decision-making situation (e.g., attribute costs and order, preference, importance or frequency of decision classes, etc.)
One or more examples that need to be tested under the given decision-making situation
A set of parameters to control the learning process
Determine:
A decision structure that suits the given decision-making situation
The decision rules used here are learned by either the AQ15 (Michalski et al. 1986) or AQ17 (Bloedorn et al. 1993) learning systems. The reasons for selecting decision rules as a form of
declarative knowledge are that they do not impose any order on the evaluation of the attributes, and, due to the lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on-line.
Such virtual decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows an architecture of the proposed methodology.
Figure 3-1: Architecture of the AQDT approach (learning knowledge from a database, and the decision-making process)
It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values; some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs
The decision rules are generated from examples by an AQ-type inductive learning system, specifically by AQ15 (Michalski et al. 1986) or AQ17-DCI (Bloedorn et al. 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski 1983). The simplest algorithm based on this methodology, called AQ, starts with a "seed" example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the "star" of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description that covers the largest number of positive examples (to minimize the total number of rules needed) and, with the second priority, that involves the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).
If the selected description does not cover all examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few examples or with many examples, and can optimize the description according to a variety of easily-modifiable hypothesis quality criteria.
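A much-simplified sketch of this covering loop follows, under the assumption that a "star" can be approximated by enumerating all condition subsets of the seed that cover no negative example (the real AQ algorithm generates stars far more efficiently); the toy attribute names and data are hypothetical.

```python
# Simplified sketch of the AQ covering loop: take a seed positive example,
# form its "star" (here: all condition subsets of the seed covering no
# negative example), pick the best description (most positives covered, then
# fewest conditions), and repeat with a new seed from the uncovered examples.
from itertools import combinations

def covers(rule, example):
    return all(example[a] == v for a, v in rule)

def star(seed, negatives):
    conds = tuple(seed.items())
    for size in range(len(conds) + 1):          # most general rules first
        for rule in combinations(conds, size):
            if not any(covers(rule, n) for n in negatives):
                yield rule

def aq_cover(positives, negatives):
    rules, uncovered = [], list(positives)
    while uncovered:
        seed = uncovered[0]
        best = max(star(seed, negatives),
                   key=lambda r: (sum(covers(r, p) for p in uncovered), -len(r)))
        rules.append(best)
        uncovered = [p for p in uncovered if not covers(best, p)]
    return rules

pos = [{"color": "red", "size": "big"}, {"color": "red", "size": "small"}]
neg = [{"color": "blue", "size": "big"}]
print(aq_cover(pos, neg))   # → [(('color', 'red'),)]
```

A single one-condition rule covers both positive examples here, which is exactly the default preference described above.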
The learned descriptions are represented in the form of a set of decision rules expressed in an attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.
AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top". A characteristic description of the tables would also include properties such as "have four legs", "have no back", "have four corners", etc. Discriminant descriptions are usually much shorter than characteristic descriptions.
Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different classes are logically disjoint. The DC mode descriptions are usually more complex, both in the number of rules and the number of conditions. There is also a DL mode (a Decision List mode, also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order. If ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.
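The DL-mode evaluation order can be sketched as follows; the ruleset encoding and the example data are hypothetical.

```python
# Sketch of DL (decision-list) evaluation: rulesets are checked in a fixed
# linear order and the first one satisfied decides the class; IC/DC rulesets
# could instead be checked in any order.
def satisfies(ruleset, example):
    """A ruleset (disjunction of conjunctive rules) is satisfied if any rule is."""
    return any(all(example.get(a) == v for a, v in rule) for rule in ruleset)

def classify_dl(ordered_rulesets, example, default="unknown"):
    for decision, ruleset in ordered_rulesets:   # fixed evaluation order
        if satisfies(ruleset, example):
            return decision
    return default

rulesets = [
    ("ClassA", [[("x", 1), ("y", 1)]]),          # checked first
    ("ClassB", [[("x", 1)]]),                    # checked only if ClassA fails
]
print(classify_dl(rulesets, {"x": 1, "y": 2}))   # → ClassB
```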
Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes, and selects from them those most promising based on an attribute quality criterion.
To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the US Congress. Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational statement. For example, the condition [State = northeast v northwest] states that the attribute "State" (of the Representative) should take the value "northeast" or "northwest" to satisfy the condition.
R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] & [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]
Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"
The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record by a Democratic representative:
Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State From = northeast, State Population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler corp = not registered
By expressing elementary statements in the example as conditions, and linking the conditions by conjunction, examples can be re-expressed as decision rules. Thus, decision rules and examples formally differ only in the degree of generality.
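A condition with an internal disjunction can be matched by treating each condition as a set of allowed values; the fragment of rule R2 below is a simplified, hypothetical encoding (a range condition could be represented the same way, as a set or interval of values).

```python
# Sketch of matching VL1-style conditions with internal disjunction, like
# [State = northeast v northwest], against an example given as
# attribute-value pairs (attribute names follow Figure 3-2).
def matches(rule, example):
    """rule: list of (attribute, allowed-values); all conditions must hold."""
    return all(example.get(attr) in allowed for attr, allowed in rule)

r2_fragment = [
    ("Draft", {"yes", "not registered"}),            # internal disjunction
    ("State", {"northeast", "northwest"}),
]
example = {"Draft": "no", "State": "northeast", "Income": "low"}
print(matches(r2_fragment, example))   # → False (the Draft condition fails)
```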
3.3 Generating Decision Structures/Trees from Decision Rules
This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski 1993a,b). Also, a description of the AQDT-2 method for learning task-oriented decision structures from decision rules is included, and finally the methodology is illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning due to their simplicity. Decision trees built this way can be quite efficient, as long as they are used in decision-making situations for which they are optimized and these situations remain relatively stable. Problems arise when these situations significantly change and the assumptions under which the tree was built do not hold anymore. For example, in some situations it may be difficult to determine the value of the attribute assigned to some node. One would like to avoid measuring this attribute and still be able to classify the example, if this is potentially possible (Quinlan 1990). If the cost of measuring various attributes changes, it is desirable to restructure the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also desirable if there is a significant change in the frequency of occurrence of examples from different classes. A restructuring of a decision tree to suit the above requirements is, however, difficult to do. The reason for this is that decision trees are a form of decision structure representation that imposes constraints on the evaluation order of the attributes that are not logically necessary.
One problem in developing a method for generating decision structures from decision rules is to design an attribute selection criterion that is based on the properties of the rules rather than of the training examples. A decision rule normally describes a number of possible examples; only some of them are examples that have actually been observed, i.e., training examples. An attribute selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based on counting the numbers of training examples covered by each attribute-value and the frequency of decision classes in the training examples, as is done in learning decision trees from examples, because the training examples are assumed to be unavailable.
Another problem in learning decision trees from decision rules stems from the fact that decision rules constitute a more powerful knowledge representation than decision trees. They can directly represent a description in an arbitrary disjunctive normal form, while decision trees can directly represent only descriptions in the disjoint disjunctive normal form. In such descriptions, all conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary decision rules into a decision tree, one faces an additional problem of handling logically intersecting rules.
The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2 system is based on the earlier work by Michalski (1978), which introduced a general method for generating decision trees from decision rules. The method aimed at producing decision trees with the minimum number of nodes or the minimum cost (where the cost was defined as the total cost of classifying unknown examples, given the cost of measuring individual attributes and the expected probability distribution of examples of different decision classes). More explanations are provided in the following section.
3.3.1 The AQDT-2 Attribute Selection Method
This section describes the AQDT-2 method for building a decision structure from decision rules. The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples. The major difference is that it assigns tests (attributes) to the nodes using criteria based on the properties of the decision rules (including statistics about the examples covered by each rule, in the case of learning rules from examples), rather than statistics characterizing the frequency of training examples per decision class, per attribute-value, or per conjunction of both. Other differences are that the branches may be assigned an internal disjunction of values (not only a single value, as in a typical decision tree), and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be attributes, or names standing for logical or mathematical expressions that involve several attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably (to distinguish between an attribute and a name standing for an expression, the latter is called a "constructed attribute").
At each step, the method chooses from the available set of tests the test that has the highest utility (see below) for the given set of decision rules. This test is assigned to the node. The branches stemming from this node are assigned test values or disjoint groups of values (in the form of a logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each branch is associated with a reduced set of rules, determined by removing conditions in which the selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset indicate the same decision class, a leaf node is created and assigned this decision class. The process continues until all nodes are leaf nodes. If it is not possible to reduce the rule set further because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of candidate decisions with associated probabilities (see Section 4.2).
The test (attribute) utility is a combination of one or more of the following elementary criteria: 1) cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which captures the effectiveness of the test in discriminating among decision rules for different decision classes; 3) importance, which determines the importance of a test in the rules; 4) value distribution, which characterizes the distribution of the test importance over its set of values; and 5) dominance, which measures the test's presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.
Disjointness: The disjointness of a test is defined as the sum of the class disjointnesses, i.e., the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and decision rule sets for these classes have been determined. Given a test A, let V1, V2, ..., Vm denote the sets of values (outcomes) of A that are present in the rule sets for classes C1, C2, ..., Cm, respectively. If the ruleset for some class, say Ck, contains a rule that does not involve test A, then Vk is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness D(A, Ci) of test A for the ruleset of class Ci is the sum of the degrees of disjointness D(A, Ci, Cj) between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by

    D(A, Ci, Cj) =
        0, if Vi = Vj
        1, if Vi ⊂ Vj or Vi ⊃ Vj
        2, if Vi ∩ Vj ≠ φ, and Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj        (3-1)
        3, if Vi ∩ Vj = φ

where φ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) may seem to give an improved criterion; however, that variant would not clearly distinguish between the two cases (i.e., in both situations the disjointness would be the same). The current equation is better because it gives higher scores to attributes that separate different subsets of the two decision classes than to attributes that separate only a subset of one decision class.
Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the sum of the degrees of class disjointness over all decision classes:
    Disjointness(A) = Σ (i=1..m) D(A, Ci),  where  D(A, Ci) = Σ (j=1..m, j≠i) D(A, Ci, Cj)        (3-2)
The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the one with the smaller number of values is selected.
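As an illustration, the disjointness criterion of equations (3-1) and (3-2) can be sketched in Python as follows. The representation of rules as per-class lists of attribute-to-value-set mappings, and all function names, are assumptions of this sketch, not part of the AQDT-2 code itself.

```python
# A minimal sketch of the disjointness criterion (equations 3-1 and 3-2).
# Rules are assumed to be dictionaries mapping attribute names to value sets;
# a class's ruleset is a list of such dictionaries. Illustrative names only.

def value_set(ruleset, attr, domain):
    """Values of `attr` appearing in a class's rules; a rule that omits
    `attr` contributes the whole domain (see the text above Definition 3-1)."""
    vals = set()
    for rule in ruleset:
        vals |= set(rule.get(attr, domain))
    return vals

def pair_disjointness(vi, vj):
    """Degree of disjointness D(A, Ci, Cj) per equation (3-1)."""
    if vi == vj:
        return 0
    if vi < vj or vi > vj:           # proper subset / superset
        return 1
    if vi & vj:                      # intersect, neither contains the other
        return 2
    return 3                         # disjoint value sets

def disjointness(rulesets, attr, domain):
    """Disjointness(A) per equation (3-2), summed over ordered class pairs."""
    vs = [value_set(rs, attr, domain) for rs in rulesets]
    return sum(pair_disjointness(vs[i], vs[j])
               for i in range(len(vs)) for j in range(len(vs)) if i != j)
```

For two classes (m = 2) with fully disjoint value sets, this yields the maximum value 3m(m-1) = 6, matching the range stated above.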
Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined from the root of the tree to any leaf node in order to reach a decision.
Definition: A decision structure is a one-node-per-level decision structure if at each level there is only one node and zero or more leaves. Such a decision structure can be generated by combining into one branch all branches whose associated sets of decision rules belong to more than one decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.
Proof: Consider all possible distributions of an attribute's values within two decision classes Ci and Cj. There are three cases: 1) subset (equivalently, superset); 2) non-empty intersection, but not a subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the same set of values in both classes is trivial). Assume that branches leading to subsets with the same decision class are combined into one branch. In the first case, there will be only two branches: the first leads to a leaf node, and the other leads to an intermediate node where another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three branches are created: two branches lead to leaf nodes, where all values at each branch belong to exactly one (and a different) decision class, and the third branch leads to an intermediate node where another attribute must be selected to further classify the decision classes. The minimum ANT in this case is 6/4. In the third case, only two branches are generated, each leading to a leaf node with a different decision class. In this case the minimum ANT is 1.
Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that when more than one attribute-value labels branches leading to leaves of the same decision class, these branches are combined into one branch in the decision structure. The symbol "1" means that at least one more attribute is needed to classify the two decision classes; in such cases there will be at least two additional paths.
D(A, Ci) = 1, D(A, Cj) = 1        D(A, Ci) = 2, D(A, Cj) = 2        D(A, Ci) = 3, D(A, Cj) = 3
Figure 3-3: Venn diagrams of the possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is shown in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks highest the attributes that reduce the average number of tests required for decision-making. The theorem can be proved similarly in the general case.
ANT = 3/2        ANT = 5/3        ANT = 1
("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify the decision classes.
Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, this means that there are more decision classes where D(A, Ci) < D(B, Ci) than classes where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence, there are more pairs of decision classes where D(A, Ci, Cj) < D(B, Ci, Cj) than pairs where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute are 0, 2, 4, or 6. For all positive values of D(B) = 2, 4, or 6, it is clear that attribute B has a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A. Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples covered by the rules involving this test. Decision rules learned by an AQ learning program are accompanied by information on their strength. Rule strength is characterized by the t-weight and the u-weight. The t-weight (total weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the t-weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated with class Ci denoted by ri, the importance score is defined as follows.
Definition 3-3: The importance score IS(Aj) of the test Aj is determined by

    IS(Aj) = Σ (i=1..m) IS(Aj, Ci)                                   (3-3.1)

where

    IS(Aj, Ci) = Σ (k=1..ri) Rik(Aj)                                 (3-3.2)

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by

    Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik
              0, otherwise                                           (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
The importance score method has been separately compared, as a feature selection method, with a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method produced equal or higher accuracy on three real-world problems than that reported for the GA method, while selecting fewer attributes. In addition, the IS method was significantly faster than the GA.
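The aggregation in Definition 3-3 can be sketched as follows; the representation of each rule as a (conditions, t-weight) pair is an assumption of this sketch, with hypothetical names, not the AQDT-2 data structures.

```python
# An illustrative sketch of the importance score (Definition 3-3), assuming
# each rule is a (conditions, t_weight) pair where `conditions` maps
# attribute names to value sets. Names are hypothetical, not from AQDT-2.

def importance_score(rulesets, attr):
    """IS(Aj): sum of t-weights of all rules whose condition part uses `attr`."""
    return sum(t_weight
               for ruleset in rulesets          # one ruleset per class (eq. 3-3.1)
               for conditions, t_weight in ruleset   # rules of that class (eq. 3-3.2)
               if attr in conditions)           # eq. 3-4: count only rules using Aj
```

A test that appears in rules covering many training examples thus scores high, even though the examples themselves are no longer available.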
Value distribution: The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: The value distribution VD(Aj) of a test Aj is defined by

    VD(Aj) = IS(Aj) / vj                                             (3-5)

where vj is the number of legal values of Aj.
Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore,
for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is multiplied out into two condition parts, [x3=1] & [x4=1] and [x3=3] & [x4=1].
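The multiplying-out step can be sketched as follows, reusing the value-set rule representation assumed in the earlier sketches; the function names are illustrative, not from AQDT-2.

```python
# A sketch of the "multiplying out" step used by the dominance criterion:
# a rule whose conditions contain internal disjunctions is expanded into
# single-value rules before counting. Representation assumed as elsewhere:
# a condition part is a dict mapping attribute names to value sets.
from itertools import product

def multiply_out(conditions):
    """Expand e.g. [x3=1 v 3]&[x4=1] into [x3=1]&[x4=1] and [x3=3]&[x4=1]."""
    attrs = sorted(conditions)
    return [dict(zip(attrs, combo))
            for combo in product(*(sorted(conditions[a]) for a in attrs))]

def dominance(rulesets, attr):
    """Count expanded rules that reference `attr` in their condition part."""
    return sum(len(multiply_out(cond))
               for ruleset in rulesets
               for cond in ruleset
               if attr in cond)
```

On the example above, the two-valued condition on x3 expands into exactly two single-valued condition parts.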
The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). A LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed as a percentage. The criteria are applied to tests in the order defined by the LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (measured from the top value). The default LEF is

    <Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>      (3-6)

where t1, t2, t3, t4, and t5 are tolerance thresholds (in percentages); their default values are 0. The default value of the cost of each test is 1.
The above LEF ranks attributes as follows. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the next (importance) criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the value distribution (normalized IS) criterion is used, and then, similarly, the dominance criterion. If there is still a tie, the method selects among the tied attributes randomly.
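The ranking procedure just described can be sketched generically as follows. The interface (score functions to be maximized, tolerances as fractions of the top score) is an assumption of this sketch; cost, which is to be minimized, can be plugged in as a negated score.

```python
# A minimal sketch of lexicographic evaluation with tolerances (LEF).
# `criteria` is a list of (score_fn, tolerance) pairs applied in order;
# scores are maximized, and tolerance is a fraction of the best score.
# Illustrative only, not the AQDT-2 implementation.

def lef_select(attrs, criteria):
    """Keep attributes scoring within `tolerance` of the best on each
    criterion in turn; stop as soon as a single candidate remains."""
    candidates = list(attrs)
    for score_fn, tol in criteria:
        if len(candidates) == 1:
            break
        scores = {a: score_fn(a) for a in candidates}
        best = max(scores.values())
        threshold = best - abs(best) * tol
        candidates = [a for a in candidates if scores[a] >= threshold]
    return candidates[0]   # ties after all criteria: take the first
```

With all tolerances at 0 (the default), only attributes achieving the top score on a criterion pass to the next one, mirroring the description above.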
If there is a non-uniform frequency distribution of examples of different classes, then the selection criterion uses a modified definition of the disjointness: namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified into a given class:

    Disjointness(A) = Σ (i=1..m) D(A, Ci) · Frq(Ci)                  (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF

    <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>           (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.
3.3.2 The AQDT-2 algorithm
The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting, at each step, the best test according to the ranking criteria described above and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is in turn connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first in descending order of the number of rules that contain the attribute, and second in ascending order of the number of the attribute's legal values.
The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi performed while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.
To generate decision structures from rules, the AQDT-2 method prefers disjoint rule descriptions, either characteristic or discriminant (given by an expert or learned by a system); disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The AQDT-2 algorithm is as follows:
The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent it.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it the group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing a condition [A = i v ...]. If a branch is assigned values i v j, then associate with it all rules containing a condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] == [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with a given branch constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree end in leaf nodes, stop; otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
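The control flow of Steps 1 through 4 can be compressed into the following sketch. The rule representation ((conditions, class) pairs with value-set conditions) and the `select_attribute` callback standing in for the LEF measure are assumptions of this sketch; it illustrates the recursion, not the actual AQDT-2 implementation.

```python
# A compressed sketch of Steps 1-4 of the AQDT-2 algorithm (standard mode).
# Rules are assumed to be (conditions, class) pairs, where `conditions`
# maps attribute names to value sets; illustrative names throughout.

def build_tree(rules, domains, select_attribute):
    classes = {cls for _, cls in rules}
    if len(classes) == 1:                      # Step 4: uniform context -> leaf
        return classes.pop()
    attr = select_attribute(rules, domains)    # Step 1: best test (e.g. by LEF)
    node = {}                                  # Step 2: one branch per value
    for value in domains[attr]:
        branch_rules = []
        for conditions, cls in rules:          # Step 3: reduce the ruleset
            if attr not in conditions:         # consensus law: the rule passes
                branch_rules.append((conditions, cls))   # to every branch
            elif value in conditions[attr]:    # condition satisfied: drop it
                reduced = {a: v for a, v in conditions.items() if a != attr}
                branch_rules.append((reduced, cls))
        if branch_rules:
            node[value] = build_tree(branch_rules, domains, select_attribute)
    return (attr, node)
```

For disjoint input rules, each recursive call works on a strictly smaller ruleset context, so the recursion terminates in leaves as described in Step 4.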
To select an attribute to become a node of the decision tree (Steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF; it evaluates each attribute's disjointness for each decision class against the other decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute-value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

    r = Σ (i=1..m) Ri        (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as

    Cmpx(Iter1) = O(r · s)

In the second iteration, the disjointness is calculated between the decision classes for all attributes. The complexity of the second iteration can be given by

    Cmpx(Iter2) = O(n · m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

    l = max {m, r}                                                   (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node of the decision tree, the Node Complexity NC(AQDT), is given by

    NC(AQDT) = O(l · n)

Usually l equals the number of rules associated with the given node; thus the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The Level Complexity of the AQDT algorithm, LC(AQDT), can be given by

    LC(AQDT) < O(l · n)
which is less than the complexity of generating the root of the decision tree. To see this, note that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be at least twice the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm to be (l · s · o), where o is the number of non-leaf nodes at the given level; in such cases either (l · o ≤ r) or (l · s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by

    LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes (a: per one level; b: per one path)
Note also that after an attribute is selected for the root of the decision structure, this attribute and all conditions containing it are removed from the data structures of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch will not be tested again.

Since the disjointness criterion selects the attribute that minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels of a decision tree should be less than or equal to the minimum of the number of attributes and the number of rules. Let k be the number of levels in a given decision tree:

    k ≤ min {n, r}                                                   (3-10)
Two cases represent the most complex situations, shown in Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels is a function of the logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by

    Complexity(AQDT) = O(l · n · log r)                              (3-11)

The other situation occurs when the generated decision tree has the maximum number of levels. The maximum possible number of levels of a decision tree is one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, such a decision tree is unlikely to be generated, because it has the maximum average number of tests (ANT) that can be obtained from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In that case, any disjoint decision rules should have a maximum length less than or equal to the floor of the logarithm of the number of attributes. Thus the level complexity for this decision tree is estimated as

    LC(AQDT) = O(l · log n)

The maximum number of levels in such a decision tree is k-1. Thus the complexity of the AQDT algorithm in such cases is given by

    Complexity(AQDT) = O(l · k · log n)                              (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11), and (3-12), the complexity of the AQDT algorithm is bounded by

    Cmplx(AQDT) = O(r · k · log l)                                   (3-13)
3.3.3 An example illustrating the algorithm
The following simple example illustrates how AQDT-2 is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of a tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of the tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the selection process

T1 <= [x1=2] & [x2=2]  v  [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4]  v  [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1]  v  [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software
These rules can be interpreted as follows:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phase, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phase, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phase, and you need a semi-automated tool.
Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0%.
From the rules in Figure 3-6, we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: 1) determine for each attribute the sets of values that the attribute takes in the individual decision rules, and 2) remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). The value set {1, 2} is removed, as it subsumes {2} and {1}; in this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {2}, {3, 4}, {1}, and {1, 2, 3, 4}; in this case, branches are assigned the value sets {1}, {2}, and {3, 4}.
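The value-set grouping described above can be sketched as follows, again using the assumed (conditions, class) rule representation; names are illustrative, not from AQDT-2.

```python
# A sketch of the compact-mode value grouping: collect the value sets an
# attribute takes in individual rules (a rule omitting the attribute
# contributes the full domain), then drop every set that subsumes, i.e. is
# a proper superset of, another. Representation assumed as elsewhere here.

def branch_value_sets(rules, attr, domain):
    """Distinct value sets of `attr` across rules, minus subsuming sets."""
    sets = {frozenset(cond.get(attr, domain)) for cond, _cls in rules}
    return sorted((s for s in sets
                   if not any(other < s for other in sets)),  # drop supersets
                  key=sorted)
```

On the x1 and x2 examples above this reproduces the groupings derived in the text: individual values {1}, {2}, {3}, {4} for x1, and the sets {1}, {2}, {3, 4} for x2.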
Attribute x1 ranks highest (it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6; it can be used in making decisions about which tools to use for testing a given piece of software.
Figure 3-7: A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)
Figure 3-8-a shows a diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3, and 4). Rules are represented by collections of cells at the intersection of the rows and columns corresponding to the conditions in the rules.
The shaded areas correspond to decision rules; rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31, and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].
Figure 3-8: (a) Decision rules; (b) Derived decision tree
Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2, and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools; in other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2: for the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, in which the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.
Figure 3-9: Decision trees learned (a) ignoring the supporting metric, and (b) ignoring the type of the tool
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm then selects x4 as the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.
Figure 3-10: A decision tree learned without the cost attribute
3.4 Tailoring Decision Structures to the Decision-Making Situation
Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990); as such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. The decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm has been implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.
Decision-making situations can vary in several respects. In some situations complete information about a data item is available (i.e., values of all attributes are specified); in others the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.
3.4.1 Learning Cost-Dependent Decision Structures
As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0%. This means that only the least expensive attributes pass to the next step of evaluation, involving other elementary criteria. If an attribute has a high cost or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.
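The default cost-first LEF step can be illustrated with a small sketch. The function name and the cost normalization below are illustrative assumptions, not part of AQDT-2's actual code:

```python
# Hypothetical sketch of the cost-first LEF step: keep only the attributes
# whose measurement cost is within a tolerance of the cheapest one.
# With tolerance 0 (the default), only the least expensive attributes pass.
def lef_cost_filter(costs, tolerance=0.0):
    """costs: dict attribute -> cost; returns attributes passing the filter."""
    finite = {a: c for a, c in costs.items() if c != float("inf")}
    if not finite:
        return []                                # every attribute is unmeasurable
    best = min(finite.values())
    span = (max(finite.values()) - best) or 1.0  # avoid division by zero
    return [a for a, c in finite.items()
            if (c - best) / span <= tolerance]
```

For instance, if x1 cannot be measured (infinite cost) and x4 and x5 are the cheapest remaining attributes, only x4 and x5 are passed to the next elementary criterion.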
3.4.2 Assigning Decisions Under Insufficient Information
In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution for different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have
P(Ci | b1, ..., bk) = P(Ci) P(b1, ..., bk | Ci) / P(b1, ..., bk)   (3-9)
where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk, given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from different classes, we have
P(Ci) = twi / (tw1 + ... + twm)   (3-10)
P(b1, ..., bk | Ci) = wi / twi   (3-11)
P(b1, ..., bk) = (w1 + ... + wm) / (tw1 + ... + twm)   (3-12)
By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain
P(Ci | b1, ..., bk) = wi / (w1 + ... + wm)   (3-13)
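The resulting estimate (3-13) depends only on the counts wi of training examples reaching the node, since the twi terms cancel out. A minimal sketch (hypothetical helper name, not the AQDT-2 implementation):

```python
# Minimal sketch of the estimate in equation (3-13): the probability of
# class Ci at a node is wi / (w1 + ... + wm), where wi is the number of
# training examples of Ci that passed the tests leading to the node.
def node_class_probabilities(w):
    """w: dict mapping class -> count of its training examples at the node."""
    total = sum(w.values())
    return {c: (wi / total if total else 0.0) for c, wi in w.items()}
```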
A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.
3.4.3 Coping with Noise in Training Data
The proposed methodology can be easily extended to handle the problem of learning from noisy training data. This is done based on ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
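The truncation step described above can be sketched as follows. The rule representation and the threshold convention are assumptions for illustration only:

```python
# Illustrative sketch of t-weight truncation: remove every rule whose
# t-weight is at or below the expected-noise fraction of its class's
# training examples.
def truncate_rules(rules, class_sizes, noise_level=0.10):
    """rules: list of (decision_class, t_weight) pairs;
    class_sizes: dict decision_class -> number of training examples."""
    return [(c, t) for c, t in rules
            if t > noise_level * class_sizes[c]]
```

For example, with 31 training examples of a class and a 10% expected noise level, rules of that class with t-weight of 3.1 or less are removed.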
3.5 Analysis of the AQDT-2 Attribute Selection Criteria
This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes. The best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes. The problem has ambiguous examples (i.e., examples belonging to more than one decision class).
In the first problem, the following are the disjoint rules learned by AQ15c from the given data:
Play <- [outlook = overcast]
Play <- [outlook = sunny] & [humidity <= 75]
Play <- [outlook = rain] & [windy = false]
For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.
Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute to be the root of the tree. The goal of this experiment is first to
select the correct attribute and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all other criteria.
Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.
The table shows that the disjointness criterion outperformed all other criteria in Table 2-6, including the gain ratio criterion, when applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion would perform better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules,
which cannot be measured from examples. When these criteria are applied to the learned rules, they provide very good results.
The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the most balanced appearance of its values in different rules. The dominance criterion prefers attributes that occur in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.
In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
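As an illustration, one plausible reading of the disjointness criterion compares, for a given attribute, the sets of its values appearing in each class's ruleset. The 0-3 pairwise scoring below is an assumption consistent with the description above, not necessarily the exact AQDT-2 formula:

```python
# Hedged sketch of the disjointness criterion: for each pair of decision
# classes, compare the sets of the attribute's values appearing in their
# rulesets, and reward disjointness on a 0-3 scale.
def disjointness(value_sets):
    """value_sets: dict decision_class -> set of the attribute's values
    appearing in that class's rules."""
    classes = list(value_sets)
    score = 0
    for i, ci in enumerate(classes):
        for cj in classes[i + 1:]:
            vi, vj = value_sets[ci], value_sets[cj]
            if not vi & vj:
                score += 3    # value sets are completely disjoint
            elif vi != vj and not (vi <= vj or vj <= vi):
                score += 2    # they overlap but neither contains the other
            elif vi != vj:
                score += 1    # one strictly contains the other
            # identical value sets contribute 0
    return score
```

Under this reading, an attribute whose value sets separate the classes cleanly receives the maximum score, matching the intuition that it best discriminates between the decision classes.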
3.6 Decision Structures vs. Decision Trees
This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.
Table 3-7 A comparison between decision structures and decision trees
Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11. Both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes. This decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.
a) Using the disjointness criterion (P = Positive, N = Negative; no. of nodes: 5)
b) Using the importance score criterion (P = Positive, N = Negative; no. of nodes: 7, no. of leaves: 9)
Figure 3- Decision structures learned by AQDT-2 using different criteria
To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the "Imam's example", that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example rests on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.
The concept to learn is: P if x1 = x2, and N otherwise.
The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "-" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select
either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.
a) Training examples b) The optimal decision tree
Figure 3-12 The Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples
AQ15c learned the following rules from this data
P <- [x1=1][x2=1] or [x1=2][x2=2]
N <- [x1=1][x2=2] or [x1=2][x2=1]
From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
An example of a problem in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:
P <- [x1=2] or [x2=2]
N <- [x1=1][x2=1 v 3] or [x1=3][x2=1 v 3]
Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10/9); 2) comparing the average number of tests required to make a decision (13n for the decision tree and 85 for
decision rules); and 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).
When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using the new attribute "x1|x2=2" (i.e., x1 = 2 or x2 = 2) with values 0 for "no" and 1 for "yes".
a) The training data b) The correct decision tree
Figure 3-13 An example where decision rules are simpler than decision trees
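The constructive-induction step above can be illustrated with a tiny sketch. The encoding of the derived attribute is an assumption read off the rules for this example:

```python
# Sketch of the derived attribute "x1|x2=2" (read here as "x1 = 2 or
# x2 = 2"); with it, a single test decides the class for this example.
def derived_attr(example):
    """Returns 1 ("yes") if x1 = 2 or x2 = 2, else 0 ("no")."""
    return 1 if example["x1"] == 2 or example["x2"] == 2 else 0
```

Class P then corresponds to value 1 of the derived attribute and class N to value 0, so the resulting tree has one internal node and two leaves, i.e., three nodes in total.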
CHAPTER 4 Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.
The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.
In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.
Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (best path from top to bottom), in terms of accuracy, time and complexity, were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.
Figure 4-1 Design of a complete experiment
For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed where the learning system AQ17 was used instead of AQ15c. Analysis of some experiments included visualization of the training examples and the concept learned by each learning program (AQ15c, AQDT-2 and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.
For each problem (i.e., database), 9 different relative sample sizes of training examples are selected (10%, ..., 90%); 100 random samples of each size are drawn from the original data for training; the 100 samples which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).
162 different parametrical experiments per training dataset (18 x 9)
16,200 experiments per sample size (9 samples)
145,800 experiments per first portion of a problem (AQ15c followed by AQDT-2)
199,800 experiments per problem (first portion + C4.5 + constructive induction)
999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
73 days (estimated running time)
The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection after that describes a partial or full experimental analysis of one of the other problems.
4.2 Experiments with Average-Size, Complex and Noise-Free Problems: Wind Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples, and 115 (34%) were used for testing the obtained decision
structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.
Decision class C1:
1 [x1=1][x6=1][x2=1 v 2][x3=1 v 2][x4=1 v 3][x5=1 v 2][x7=1 v 3] (t: 18, u: 18)
2 [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1 v 3][x7=1 v 3 v 4] (t: 3, u: 3)
3 [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2 v 3] (t: 2, u: 2)
4 [x1=1][x6=1][x2=2][x3=1 v 2][x4=3][x5=1 v 2][x7=4] (t: 2, u: 2)
5 [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1 v 2] (t: 2, u: 2)
6 [x1=1][x3=1][x6=1][x2=2][x4=1 v 3][x7=1 v 3][x5=3] (t: 2, u: 2)
7 [x1=2][x5=2][x2=1][x6=1][x3=1 v 2][x4=3][x7=4] (t: 2, u: 2)
Decision class C2:
1 [x1=2..4][x2=1 v 2][x3=1 v 2][x4=3][x5=2 v 3][x6=1][x7=2 v 3] (t: 28, u: 19)
2 [x1=2..4][x2=2][x3=1 v 2][x4=3][x5=1 v 2][x6=1][x7=3 v 4] (t: 17, u: 6)
3 [x1=2..4][x2=1 v 2][x3=1 v 2][x4=3][x5=1][x6=1][x7=3 v 4] (t: 10, u: 4)
4 [x1=1 v 3 v 5][x2=1 v 2][x3=1 v 2][x4=3][x5=3][x6=1][x7=2 v 4] (t: 10, u: 2)
5 [x1=3 v 5][x2=1 v 2][x3=1 v 2][x4=3][x5=2 v 3][x6=1][x7=1 v 4] (t: 9, u: 4)
6 [x1=2][x2=1 v 2][x3=1 v 2][x5=1 v 2 v 3][x4=1][x6=1][x7=1] (t: 7, u: 6)
7 [x1=3 v 4][x2=2][x3=2][x4=1 v 3][x5=1 v 3][x6=1][x7=1 v 2] (t: 6, u: 4)
8 [x1=3 v 5][x2=2][x3=1][x7=1][x4=1 v 2][x5=1 v 2 v 3][x6=1 v 3] (t: 5, u: 5)
9 [x1=1][x2=1][x6=1][x3=1 v 2][x4=3][x5=1 v 2][x7=4] (t: 4, u: 4)
10 [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1 v 2][x7=1 v 3] (t: 4, u: 4)
11 [x1=1 v 2][x2=1][x6=1][x3=1 v 2][x4=1 v 3][x5=3][x7=1 v 4] (t: 4, u: 2)
Decision class C3:
1 [x1=2..5][x2=1 v 2][x3=1 v 2][x7=1..4][x4=1 v 2][x5=1 v 3][x6=2..4] (t: 41, u: 32)
2 [x1=1..4][x2=1 v 2][x3=1 v 2][x4=2][x5=2][x6=2 v 3][x7=2 v 4] (t: 27, u: 20)
3 [x1=1 v 3][x2=1][x3=1 v 2][x7=1..4][x4=2][x5=1 v 2][x6=2 v 3] (t: 19, u: 6)
4 [x1=1 v 2 v 4][x2=1 v 2][x3=1 v 2][x4=2][x5=2 v 3][x6=3 v 4][x7=1] (t: 13, u: 8)
5 [x1=5][x2=2][x4=2][x5=2][x3=1 v 2][x6=3][x7=2 v 4] (t: 5, u: 5)
Decision class C4:
1 [x1=5][x2=2][x3=2][x4=1 v 3][x5=1][x6=1][x7=1..4] (t: 4, u: 4)
2 [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t: 1, u: 1)
Figure 4-2 Decision rules determined by AQ15c from the wind bracing data
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by the values of x6 (in general, they could be groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for each branch until all rules assigned to the branch are of the same class. That class is then assigned to the leaf.
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).
Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated only for those rules which contain [x6=1] as one of their conditions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.
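The overall rule-to-structure loop described above can be sketched roughly as follows. The rule representation, the helper names, and the handling of rules that omit the tested attribute (treated as matching all of its values, per the convention stated earlier) are simplifying assumptions, not AQDT-2's actual interface:

```python
# Rough sketch of the rule-to-structure loop: rank the attributes, branch
# on the chosen attribute's values, assign each branch the rules consistent
# with it, and recurse until a branch's rules all belong to one class.
# Rules are (decision_class, {attribute: set_of_values}) pairs.
def build_structure(rules, attrs, rank):
    classes = {c for c, _ in rules}
    if len(classes) <= 1:
        return classes.pop() if classes else None    # leaf (or dead branch)
    best = rank(rules, attrs)                        # e.g. by disjointness
    values = set().union(*(r.get(best, set()) for _, r in rules))
    node = {}
    for v in values:
        # a rule without attribute `best` matches every one of its values
        subset = [(c, r) for c, r in rules if v in r.get(best, {v})]
        node[v] = build_structure(subset,
                                  [a for a in attrs if a != best], rank)
    return (best, node)
```

For instance, the two rules P <- [x1=1] and N <- [x1=2] yield the one-test structure ("x1", {1: "P", 2: "N"}).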
For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window
setting (the maximum of 20% of the number of examples and twice the square root of the number of examples) and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some of the misclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatches.
Complexity: no. of nodes: 17, no. of leaves: 43
Figure 4-3 A decision tree learned by C4.5 for the wind bracing data
Figure 4-4 shows a decision structure learned from the AQ15c rules in the default setting of the AQDT-2 parameters. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the "?" (indefinite) decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distribution.
Complexity: no. of nodes: 5, no. of leaves: 9
Figure 4-4 A decision structure learned from the AQ15c wind bracing rules
Complexity: no. of nodes: 6, no. of leaves: 8
Figure 4-5 A decision structure that does not contain attribute x1
Figure 4-6 presents the decision structure from Figure 4-5 in which leaves were assigned candidate decisions with decision class probability estimates. Let us consider the node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (3-13), the probability estimates for classes C1, C2, C3 and C4 under the node x2 can be approximated as P(C1)=66%, P(C2)=23%, P(C3)=0% and P(C4)=11%.
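These figures can be checked directly against equation (3-13), which uses only the per-class counts w at the node (a hypothetical check, not part of the system):

```python
# Quick numeric check of the probabilities quoted above, using equation
# (3-13): P(Ci | node) = wi / (w1 + ... + wm). The counts are taken from
# the text.
w = {"C1": 31, "C2": 11, "C3": 0, "C4": 5}
total = sum(w.values())                     # 47 examples reached the node
percent = {c: round(100 * wi / total) for c, wi in w.items()}
# percent == {"C1": 66, "C2": 23, "C3": 0, "C4": 11}
```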
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that the rules whose combined t-
weight represented 10% or less coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
Complexity: no. of nodes: 5, no. of leaves: 7
Figure 4-6 A decision structure, without x1, with candidate decisions assigned to leaves
Complexity: no. of nodes: 3, no. of leaves: 5
Figure 4-7 A decision structure learned from the wind bracing rules after pruning all rules covering less than 10% of the training examples per decision class
To demonstrate changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the costs of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, then when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram indicates different decision classes with different shades. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4 or x7).
Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.
(A marked cell means the system cannot produce a decision without the missing attribute.)
Figure 4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings of AQ15c (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint or ordered, i.e., decision lists; and three beam search widths: 1, 5 and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Dij, 10> and <Chr, Int, 1>; see Table 4-2) were selected for the experiments with Subsystem II.
These experiments were performed for four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value in this table is an average predictive accuracy over 100 runs of either one of the two programs on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with testing examples that represent the complement of the training example set.
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I are fixed and selected parameters of Subsystem II are modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each dataset, the result reported from each experiment is calculated as the average of 100 runs with different training data, for 9 different sample sizes. The parameters changed in this experiment were the
threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
[Plots of the predictive accuracy of AQDT-2 and AQ15c vs. the relative sample size (%) of the training data, for the settings <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1> and <Intr, Disc, 1>.]
Figure 4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem
Figure 4-10 shows the changes in the predictive accuracy of the decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained in the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For
each dataset, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.
[Plots of predictive accuracy vs. the relative sample size (%) of the training data for different parameter settings.]
Figure 4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data
[Plots of predictive accuracy vs. the relative size (%) of the training examples.]
Figure 4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data
4.3 Experiments with Small-Size, Simple and Noise-Free Problems: MONK
This subsection describes an experimental analysis of the AWT approach on the MONK-1
problem The MONKs problems (Thrun Mitchell amp Cheng 1991) involve learning classification
rules for robot-like figures MONK-1 requires learning a DNF-type description The data consists
of two decision classes Positive and Negative and six attributes xl head-shape (values are
octagonal square or round) x2 body-shape (values are octagonal square or round) x3 isshy
smiling (values are yes or no) x4 holding (values are sword flag or balloon) x5 jacket-color
(values are red yellow green or blue) and x6 has-tie (values are yes or no)
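The attribute space above can be enumerated directly. The sketch below assumes the value coding implied by the rules quoted later in this chapter (values numbered 1..k per attribute, with jacket-color red = 1), and uses the well-known MONK-1 target concept, (head-shape = body-shape) or (jacket-color = red):

```python
from itertools import product

# Attribute domain sizes for x1..x6; values are coded 1..k (an assumed
# encoding consistent with the rules in Figure 4-13, e.g. red = 1).
domain_sizes = [3, 3, 2, 3, 4, 2]

def monk1_class(x1, x2, x3, x4, x5, x6):
    """MONK-1 target: (head-shape = body-shape) or (jacket-color = red)."""
    return "Positive" if (x1 == x2 or x5 == 1) else "Negative"

space = list(product(*(range(1, k + 1) for k in domain_sizes)))
labels = [monk1_class(*ex) for ex in space]
print(len(space), labels.count("Positive"))  # 432 216
```

The enumeration confirms the figure of 432 possible examples cited below; exactly half of them are positive under this concept.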
The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus, the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained by the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.
[Figure 4-12. A visualization diagram of the MONK-1 problem (attributes x4, x5, x6 along the axes).]
The AQDT-2 program, running in its default mode and with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with 41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem.
Positive rules:                      Negative rules:
1. [x5 = 1]                          1. [x1 = 1][x2 = 2, 3][x5 = 2..4]
2. [x1 = 3][x2 = 3]                  2. [x1 = 2][x2 = 1, 3][x5 = 2..4]
3. [x1 = 2][x2 = 2]                  3. [x1 = 3][x2 = 1, 2][x5 = 2..4]
4. [x1 = 1][x2 = 1]
Figure 4-13. Decision rules learned by AQ15c for the MONK-1 problem.
The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These rules were:
Pos <= [x5 = 1] v [x1 = x2] and Neg <= [x5 ≠ 1] & [x1 ≠ x2]
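The logical equivalence between the original AQ15c rules (Figure 4-13) and the compact constructive-induction form can be checked by exhaustive enumeration. A minimal sketch (function names are illustrative, not AQ output):

```python
from itertools import product

def aq15c_positive(x1, x2, x3, x4, x5, x6):
    # The four positive rules of Figure 4-13.
    return (x5 == 1 or (x1 == 3 and x2 == 3)
            or (x1 == 2 and x2 == 2) or (x1 == 1 and x2 == 1))

def aq17_positive(x1, x2, x3, x4, x5, x6):
    # Compact form using the constructed attribute t = (x1 == x2):
    # Pos <= [x5 = 1] v [t = T]
    return x5 == 1 or x1 == x2

space = product(range(1, 4), range(1, 4), range(1, 3),
                range(1, 4), range(1, 5), range(1, 3))
print(all(aq15c_positive(*e) == aq17_positive(*e) for e in space))  # True
```

Since the two predicates agree on all 432 examples, any decision structure derived from either rule set can represent the target concept exactly, which is why the structures of Figures 4-14 and 4-15 all achieve 100% accuracy.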
[Table 4-3. Comparison of the attribute selection criteria for the MONK-1 problem.]
From these rules, the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15-a).
[Figure 4-14. The decision tree for the MONK-1 problem (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative).]
[Figure 4-15. Compact decision structures generated by AQDT-2 for the MONK-1 problem: (a) from AQ15 rules (5 nodes, 7 leaves); (b) from AQ17 rules (2 nodes, 3 leaves).]
Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings: two types of decision rules (characteristic or discriminant), three coverage modes (intersecting, disjoint, or ordered, i.e., decision lists), and three widths of the beam search (1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy, <Ch, Disj, 10> and <Ch, Int, 1>, were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.
Each value in that table is an average of the predictive accuracy obtained from running each of the two programs 100 times on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> denotes disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.
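The evaluation protocol just described (100 random train/complement splits, accuracy averaged over runs) can be sketched as follows; the `majority` learner is only a hypothetical placeholder standing in for AQ15c or AQDT-2:

```python
import random

def average_accuracy(examples, labels, frac, learn, runs=100, seed=0):
    """Average predictive accuracy over `runs` random splits: train on a
    random `frac` sample, test on its complement (a sketch of the
    evaluation protocol used throughout this chapter)."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    total = 0.0
    for _ in range(runs):
        rng.shuffle(idx)
        k = max(1, int(frac * len(idx)))
        train, test = idx[:k], idx[k:]
        model = learn([examples[i] for i in train], [labels[i] for i in train])
        total += sum(model(examples[i]) == labels[i] for i in test) / len(test)
    return total / runs

# Placeholder learner: always predict the training sample's majority class.
def majority(X, y):
    m = max(set(y), key=y.count)
    return lambda example: m

X = [[i] for i in range(20)]
y = ["P"] * 15 + ["N"] * 5
print(round(average_accuracy(X, y, frac=0.5, learn=majority, runs=100), 2))
```

Because the test set is always the complement of the training sample, larger training fractions leave fewer test examples, a point that matters when interpreting the accuracy curves later in this chapter.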
[Figure 4-16. The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem. Panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>; each plots predictive accuracy against the relative sample size (%) of the training data.]
Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed, and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average over 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained with the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree
is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
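One plausible reading of pre-pruning with a percentage threshold is to discard rules whose training coverage falls below that fraction before building the decision structure. The sketch below is an illustration under that assumption, with hypothetical rules and coverage counts; the exact AQDT-2 pruning procedure may differ:

```python
def preprune(rules, n_train, threshold=0.03):
    """Drop rules whose training coverage is below `threshold` of the
    training examples (default 3%, the chapter's default pre-pruning
    threshold). A sketch only, not the exact AQDT-2 procedure."""
    return [r for r in rules if r["covered"] / n_train >= threshold]

# Hypothetical rules with their training-example coverage counts.
rules = [
    {"cond": "[x5 = 1]", "covered": 31},
    {"cond": "[x1 = 3][x2 = 3]", "covered": 12},
    {"cond": "[x1 = 2][x2 = 2]", "covered": 2},   # below 3% of 124 examples
]
print([r["cond"] for r in preprune(rules, n_train=124)])
# ['[x5 = 1]', '[x1 = 3][x2 = 3]']
```

Raising the threshold prunes more lightly-covered rules; as the results above indicate, on MONK-1 this extra pruning did not pay off.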
[Figure 4-17. Analyzing different parameter settings of AQDT-2 with the MONK-1 data. Panels: MONK-1 <Disj, Char> and <Intr, Char>; each plots predictive accuracy against the relative sample size (%) of the training data.]
Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.
[Figure 4-18. Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem. The panels plot predictive accuracy, tree complexity, and learning time against the relative size of training examples (%).]
4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2
The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using its original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.
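This chapter does not restate the MONK-2 target; in the cited benchmark description (Thrun, Mitchell & Cheng, 1991) it is "exactly two of the six attributes take their first value". The sketch below, using the same assumed 1..k value coding as before, shows why this counting concept has no compact DNF over the original attributes:

```python
from itertools import product

def monk2_positive(example):
    # MONK-2 target (per Thrun, Mitchell & Cheng, 1991): exactly two of
    # the six attributes take their first value -- a counting concept,
    # not expressible as a small DNF over the original attributes.
    return sum(v == 1 for v in example) == 2

space = list(product(range(1, 4), range(1, 4), range(1, 3),
                     range(1, 4), range(1, 5), range(1, 3)))
print(len(space), sum(monk2_positive(e) for e in space))  # 432 142
```

Any DNF over the original attributes must enumerate the many attribute-pair combinations separately, which is what makes MONK-2 "complex" for DNF-oriented learners like AQ15c.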
[Figure 4-19. A visualization diagram of the MONK-2 problem (attributes x4, x5, x6 along the axes).]
Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems, <Ch, Disj, 10> and <Ch, Int, 1>. They were selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average of the predictive accuracy obtained from running each of the two programs 100 times on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing set that represents the complement of the training examples. Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> denotes disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
[Figure 4-20. The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem. Panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>; each plots predictive accuracy against the relative sample size (%) of the training data.]
Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained with the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
[Figure 4-21. Analyzing different parameter settings of AQDT-2 with the MONK-2 data. Panels: MONK-2 <Disj, Char> and <Intr, Char>; each plots predictive accuracy against the relative sample size (%) of the training data.]
Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.
[Figure 4-22. Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem. The panels plot predictive accuracy, tree complexity, and learning time against the relative size of training examples (%).]
4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3
MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples. Noisy examples are examples that are assigned the wrong decision class.
[Figure 4-23. A visualization diagram of the MONK-3 problem.]
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average of the predictive accuracy obtained from running each of the two programs 100 times on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed, and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.
[Figure 4-24. The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem. Panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>; each plots predictive accuracy against the relative sample size (%) of the training data.]
[Figure 4-25. Analyzing different parameter settings of AQDT-2 with the MONK-3 data. Panels: MONK-3 <Disj, Char> and <Intr, Char>; each plots predictive accuracy against the relative sample size (%) of the training data.]
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent
a 1.01% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
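This effect can be checked with a small back-of-the-envelope computation, assuming the 432-example MONK space (training fraction and rounding are illustrative):

```python
# One misclassified test example maps to very different error rates
# depending on the size of the held-out complement (432 = the full
# MONK example space).
total = 432
for train_frac in (0.10, 0.90):
    test_size = round(total * (1 - train_frac))
    print(f"train {train_frac:.0%}: 1 error among {test_size} "
          f"test examples = {1 / test_size:.2%}")
```

With a 10% training sample, the complement holds 389 test examples and one error costs about 0.26%; with a 90% training sample, only 43 examples remain and the same single error costs about 2.33%.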
[Figure 4-26. Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem. The panels plot predictive accuracy, tree complexity, and learning time against the relative size of training examples (%).]
4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer
The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number; 2) Clump Thickness; 3) Uniformity of Cell Size; 4) Uniformity of Cell Shape; 5) Marginal Adhesion; 6) Single Epithelial Cell Size; 7) Bare Nuclei; 8) Bland Chromatin; 9) Normal Nucleoli; and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).
In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.
The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error
rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
[Figure 4-27. Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem. The panels plot predictive accuracy, tree complexity, and learning time against the relative size of training examples (%).]
4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classifications
Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes. These attributes are: 1) Cap-shape; 2) Cap-surface; 3) Cap-color; 4) Bruises; 5) Odor; 6) Gill-attachment; 7) Gill-spacing; 8) Gill-size; 9) Gill-color; 10) Stalk-shape; 11) Stalk-root; 12) Stalk-surface-above-ring; 13) Stalk-surface-below-ring; 14) Stalk-color-above-ring; 15) Stalk-color-below-ring; 16) Veil-type; 17) Veil-color; 18) Ring-number; 19) Ring-type; 20) Spore-print-color; 21) Population; and 22) Habitat.
To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.
In this problem, C4.5 produced better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning time is about the same.
The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
[Figure 4-28. Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem. The panels plot predictive accuracy, tree complexity, and learning time against the relative size of training examples (%).]
4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains
Learning task-oriented decision structures from structural data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes, eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.
The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To
describe the train problem in a format suitable for AQDT-2, a set of eight attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To recognize the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij): the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
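The two-digit coding above can be sketched as a flattening step from structured car descriptions to named attribute-value pairs; the car data below is hypothetical and the attribute order is only illustrative:

```python
def flatten_train(cars):
    """Encode a train (a list of car descriptions) as attribute-value
    pairs named xij, where i is the car position (1-4) and j is the
    attribute number (1-8), following the two-digit coding above."""
    pairs = {}
    for i, car in enumerate(cars, start=1):
        for j, value in enumerate(car, start=1):
            pairs[f"x{i}{j}"] = value
    return pairs

# A hypothetical two-car train; each tuple holds 8 attribute values.
train = [(2, 1, 3, 1, 2, 1, 1, 2),
         (3, 2, 1, 2, 1, 3, 2, 1)]
encoded = flatten_train(train)
print(len(encoded), encoded["x12"], encoded["x22"])  # 16 1 2
```

Because shorter trains simply produce fewer pairs, this encoding yields the variable-length examples that AQDT-2 accepts.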
[Table 4-7. The set of attributes and their values used in the trains problem; i stands for the car number (1-4).]
Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only the attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.
Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more (14 out of 14). In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).
[Figure 4-29. Decision structures learned by AQDT-2 for different decision-making situations: (a) using only descriptions of Car 1 (4 nodes, 9 leaves); (b) using only descriptions of Car 2; (c) using only descriptions of Car 3 (6 leaves).]
4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984)
In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the training example sets were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of examples (216 examples in total; half of the examples were in one class and the second half in the other class).
Table 4-8 and Figures 4-30-a and 4-30-b show the results graphically for the Congressional Voting-1984 problem. The results indicate that the decision trees generated by AQDT-2 had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change of the size of the training example set was smaller.
[Table 4-8. A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.]
[Figure 4-30. Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2: (a) accuracy of the decision tree as a function of the size of the set of training examples; (b) size of the decision tree as a function of the size of the set of training examples.]
4.10 Analysis of the Results
This section includes an analysis of the results presented in Sections 4-2 to 4-9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to
illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from these rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others, compared with another type of cover), then the best cover is determined according to the best width of the beam search and the best rule type.
Table 4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics
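The first heuristic above (prefer the smaller beam width when accuracies are within 2 percentage points) can be sketched directly; the accuracy figures below are illustrative, not taken from the tables:

```python
def best_beam_width(acc_by_width, tolerance=2.0):
    """Pick the smallest beam width whose accuracy is within `tolerance`
    percentage points of the best width (the first heuristic above)."""
    best = max(acc_by_width.values())
    for width in sorted(acc_by_width):
        if best - acc_by_width[width] <= tolerance:
            return width

print(best_beam_width({1: 93.1, 5: 94.5, 10: 94.8}))  # 1
print(best_beam_width({1: 90.0, 5: 94.5, 10: 94.8}))  # 5
```

Scanning widths in increasing order guarantees the smallest acceptable one is returned, which matches the preference for simpler search settings.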
It was clear that AQDT-2 works better with characteristic rules than with discriminant rules. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, the predictive accuracy is considered higher or lower; 2) if the average learning time is within ±0.1 seconds, the learning time is considered the same.
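These two tie-breaking rules can be sketched as a small summarization function (the numbers in the usage example are illustrative, not values from Table 4-10):

```python
def summarize(acc_aqdt, acc_c45, time_aqdt, time_c45):
    """Apply the two summarization heuristics: accuracies within +/-2
    percentage points and learning times within +/-0.1 s count as 'Same'."""
    def verdict(a, c, tol, bigger_is_better):
        if abs(a - c) <= tol:
            return "Same"
        aqdt_wins = (a > c) if bigger_is_better else (a < c)
        return "AQDT-2" if aqdt_wins else "C4.5"
    return {"accuracy": verdict(acc_aqdt, acc_c45, 2.0, True),
            "time": verdict(time_aqdt, time_c45, 0.1, False)}

print(summarize(95.1, 93.8, 0.25, 0.31))
# {'accuracy': 'Same', 'time': 'Same'}
print(summarize(96.0, 92.5, 0.2, 0.6))
# {'accuracy': 'AQDT-2', 'time': 'AQDT-2'}
```

Note that higher accuracy is better while lower learning time is better, so the comparison direction flips between the two criteria.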
Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparing the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system that performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).
[Table 4-10. Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system that performed better; Same-X means similar performance, where AQDT-2 is better if X=A and C4.5 is better if X=C.]
Some conclusions can be drawn from these comparisons. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but accurate decision trees, whereas C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of the decision trees learned by C4.5 grows relatively faster as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons are that AQDT-2 over-generalizes the decision rules, while C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be much less than that of C4.5. However, on some data sets it takes more time, because in situations where there is not enough information to reach a decision, the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not yet implemented.
To explain the relationship between the input to and the output from AQDT-2, and to clarify some of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.
Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2
system. The experiment contains 169 training examples covering both the positive and negative decision
classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The
shaded areas represent decision rules of the positive decision class. The white areas represent
non-positive coverage.
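The representation space displayed in these diagrams can be enumerated directly. The sketch below assumes the standard published definition of the MONK's benchmark (Thrun, Mitchell & Cheng, 1991): six attributes with domain sizes 3, 3, 2, 3, 4 and 2, and the MONK-2 concept "exactly two attributes take their first value"; the attribute coding and function names are illustrative, not taken from AQ15c.

```python
from itertools import product

# Domain sizes of the six MONK attributes (values coded 1..n).
DOMAINS = [3, 3, 2, 3, 4, 2]

def monk2(example):
    """MONK-2 target concept: exactly two of the six attributes
    take their first value."""
    return sum(1 for v in example if v == 1) == 2

# Every cell of the representation space shown in the diagrams.
space = list(product(*[range(1, n + 1) for n in DOMAINS]))
positives = [e for e in space if monk2(e)]
# The full space has 3*3*2*3*4*2 = 432 cells; the experiment above
# used 169 of them as training examples.
```

Counting `positives` shows why MONK-2 is hard for attribute-by-attribute splitting: the positive region is scattered across the space rather than concentrated in a few rectangles.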
Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. The marked shaded
cells indicate false positive errors (AQ15c classifies such a cell as positive while it should be
negative), and the marked non-shaded cells indicate false negative errors (AQ15c classifies such a
cell as negative while it should be positive).
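The error categories used in these diagrams reduce to a simple count over predictions. The helper below is illustrative (it is not part of AQ15c or the diagramming tool); it shows how the false positive and false negative totals behind Figures 4-32 and 4-34 are defined.

```python
def error_breakdown(predicted, actual):
    """Count false positives and false negatives for a binary concept.
    `predicted` and `actual` are parallel sequences of truth values:
    a false positive is predicted-positive but actually negative,
    a false negative is predicted-negative but actually positive."""
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return fp, fn
```

Applied to the whole representation space, these two counts are exactly what the shaded-versus-marked cells of the error diagrams display.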
Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this
diagram, one shading indicates the portions of the representation space that were classified as
positive by both AQ15c and AQDT-2; another marks the portions of the representation
space that were classified as positive by AQ15c but as negative by AQDT-2; and a third
represents the portions of the representation space where AQDT-2 over-generalized decision rules
belonging to the positive decision class. The decision tree shown in this diagram was learned with
the default settings (i.e., with a 10% generalization threshold).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike the
MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy.
This can be seen in Figure 4-34, which shows the errors made by the AQDT-2 decision tree.
Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem
Figure 4-34 is similar to Figure 4-33, but it also illustrates the false positive and false negative
errors. One marking indicates the portions of the representation space with false positive errors, and
another marks the portions of the representation space with false negative errors. Comparing
Figures 4-34 and 4-32, more errors occurred because of the over-generalization.
Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree
Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2
after reducing the generalization degree to 1%
CHAPTER 5 CONCLUSIONS
5.1 Summary
This thesis introduces the concept of a decision structure and describes a methodology for
efficiently determining a single-parent structure from a set of decision rules. A decision structure is
an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a
given object or situation. Having higher expressive power than the familiar decision tree, a
decision structure is able to represent some decision processes in a much simpler way than a
decision tree.
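The definition above can be sketched as a small data structure: nodes carry tests, branches carry outcomes, and a node may be reached from several parents (making the graph a DAG rather than a tree). This is a hypothetical illustration; AQDT-2's actual internal representation is not described here.

```python
class Node:
    """A node of a decision structure (illustrative sketch)."""
    def __init__(self, test=None, decision=None):
        self.test = test          # function mapping an example to a branch label
        self.branches = {}        # branch label -> child Node; children may be
                                  # shared between parents, which makes the
                                  # structure an acyclic graph, not just a tree
        self.decision = decision  # decision assigned at a leaf, else None

def classify(structure, example):
    """Follow test outcomes from the root until a leaf assigns a decision."""
    node = structure
    while node.decision is None:
        node = node.branches[node.test(example)]
    return node.decision
```

Because two branches can point at the same child, a symmetric concept that forces a decision tree to duplicate whole subtrees needs each shared subgraph only once in a decision structure.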
The proposed methodology advocates storing the decision knowledge in the declarative form of
decision rules, which are determined by induction from examples or by an expert. A decision
structure is generated on-line, whenever it is needed, and in the form most suitable for the given
decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this
methodology that, in order to determine a decision structure from examples, it is necessary to go
through two levels of processing, while there exist methods that produce decision trees efficiently
and directly from examples. Putting aside the issue that decision structures are more general than
decision trees, it is argued here that this methodology has many advantages that fully justify it. The
main advantages include: 1) decision structures produced by the method in the experiments
conducted had higher predictive accuracy and were simpler (sometimes significantly so) than
decision trees produced from the same data; 2) decision structures produced from rules can be
easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive
attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in
the declarative form of modular decision rules, the methodology makes it easy to modify decision
knowledge to account for new facts or changing conditions; 4) the process of deriving a decision
structure from a set of rules is very fast and efficient, because the number of rules per class is
usually much smaller than the number of examples per class; and 5) the presented method produces
decision structures whose nodes can be original attributes or constructed attributes that extend the
original knowledge representation (this is due to the application of the constructive induction programs
AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate
decision rules first and then create decision structures from them. In the AQDT-2 method, this first
phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based
methods were computationally complex, the most recent implementation is very fast
(Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further.
First of all, there is a need for further testing of the method. Although the experiments conducted so
far have produced more accurate and simpler decision structures than the decision trees obtained in a
standard way from the same input data, more experiments are necessary to arrive at conclusive
results. A mathematical analysis of the method has not been performed and is highly desirable.
The current method generates only single-parent decision structures (every node has only one
parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in
which a node can have several parents) will make it more powerful. It will enable the method to
represent much more simply those decision processes that are difficult to represent by a decision tree
(e.g., a symmetric logical function). The decision structures produced by the method are usually
more general than the decision rules from which they were created (they may assign decisions to
cases that the rules could not classify). Further research is needed to determine the relationship
between the certainty of decision rules and the certainty of decision structures derived from them.
The AQ-based program allows a user to generate both characteristic and discriminant decision rules
(Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating
decision structures from different types of rules.
5.2 Contributions
The dissertation introduces the concept of a decision structure and describes a methodology for
efficiently determining a single-parent structure from a set of decision rules. A major advantage of
the proposed method is that it allows one to efficiently determine a decision structure that is
optimized for any given decision-making situation. For example, when some attribute is difficult
to measure, the method creates a decision structure that shows the situations in which measuring
this attribute can be avoided. The method is quite efficient, and the time for determining a decision
structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to
experiment with different criteria for structure generation in order to obtain the most desirable
structure.
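One way to read "optimized for a decision-making situation" is as a situation-dependent attribute ranking: attributes that are too costly to measure in the current situation are pushed to the bottom, so they end up low in the structure or are avoided entirely. The sketch below is hypothetical (the function, its parameters, and the scoring scheme are illustrative, not AQDT-2's actual criteria).

```python
def rank_attributes(scores, costs, avoid_above=None):
    """Rank attributes for node selection, situation-dependently.
    `scores` maps attribute -> quality score (higher is better);
    `costs` maps attribute -> measurement cost.  Attributes whose
    cost exceeds `avoid_above` are demoted below all affordable
    ones, regardless of score."""
    def key(attr):
        expensive = avoid_above is not None and costs[attr] > avoid_above
        # Sort affordable attributes first, best score first within a group.
        return (expensive, -scores[attr])
    return sorted(scores, key=key)
```

With a cost threshold supplied, a high-scoring but expensive attribute drops to the end of the ranking; without one, ranking is by score alone.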
Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be
simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e.,
directly from examples. In the experiments involving artificial problems and real-world problems,
AQDT-2-generated decision structures outperformed those generated by the well-known C4.5
decision tree learning program on most problems, both in terms of average predictive accuracy and
average simplicity of the generated decision trees.
The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the
method is independent of the source of the rules, it could potentially be applied also with other decision rule learning
systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski T., Bloedorn E., Michalski R., Mustafa M. and Wnek J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.
Bergadano F., Giordana A., Saitta L., De Marchi D. and Brancadori F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.
Bergadano F., Matwin S., Michalski R.S. and Zhang J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.
Bloedorn E., Wnek J., Michalski R.S. and Kaufman K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.
Bohanec M. and Bratko I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning, Vol. 15, No. 3, Kluwer Academic Publishers.
Bratko I. and Lavrac N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.
Bratko I. and Kononenko I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.
Breiman L., Friedman J.H., Olshen R.A. and Stone C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.
Clark P. and Niblett T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.
Cestnik B. and Bratko I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.
Cestnik B. and Karalic A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.
Gaines B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.
Hart A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.
Hunt E., Marin J. and Stone P. (1966), Experiments in Induction, New York: Academic Press.
Imam I.F. and Michalski R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski J. and Ras Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.
Imam I.F. and Michalski R.S. (1993b), "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg L., Ras Z. and Zemankova M. (Eds.), Kluwer Academic Publishers, MA.
Imam I.F., Michalski R.S. and Kerschberg L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington DC, July 11-12.
Imam I.F. and Vafaie H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam I.F. and Michalski R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.
Kohavi R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.
Kohavi R. and Li C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian O.L. and Wolberg W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie D., Muggleton S., Page D. and Srinivasan A. (1994), International East-West Challenge, Oxford University, UK.
Michalski R.S. (1973), "AQVAL/1--Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington DC, October 30-November 1.
Michalski R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.
Michalski R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).
Michalski R.S., Mozetic I., Hong J. and Lavrac N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.
Michalski R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Michalski R.S. and Imam I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer-Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.
Mingers J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.
Niblett T. and Bratko I. (1986), "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.
Quinlan J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan J.R. (1983), "Learning Efficient Classification Procedures and Their Application to Chess End Games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.
Quinlan J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.
Quinlan J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).
Quinlan J.R. (1990), "Probabilistic Decision Trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Quinlan J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.
Smyth P., Goodman R.M. and Higgins C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI-90, Stockholm, August.
Sokal R. and Rohlf F. (1981), Biometry, Freeman Pub., San Francisco.
Thrun S.B., Mitchell T. and Cheng J. (1991) (Eds.), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.
Wnek J. and Michalski R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
Vita
Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in
Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He
received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, and conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence; two solutions obtained by that program ranked second and third in one competition.
Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the First International Workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium FLAIRS-95 and on the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the First and Second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive
systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
TABLE OF CONTENTS
TITLE Page
ABSTRACT 1
CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6
CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19
CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
3.3 Generating Decision Structures From Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structures to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53
CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments With Large-Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large-Size, Complex, and Noisy Problems: Mushroom Classifications 84
4.8 Experiments With Small-Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88
CHAPTER 5
CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96
REFERENCES 98
VITA 102
LIST OF TABLES
No. TITLE Page
2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria on four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing a software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of the AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and conditions of use of the AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90
LIST OF FIGURES
No. TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing a software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 and AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1% 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M. Fahmi Imam, Ph.D.
School of Information Technology and Engineering
George Mason University, Fall 1995
Ryszard S. Michalski, Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from
decision rules. The philosophy behind this research is that it is more appropriate to learn
knowledge and store it in a declarative form, and then, when a decision-making situation occurs,
to generate from this knowledge the decision structure that is most suitable for the given
decision-making situation. Learning decision structures from decision rules was first introduced by
Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski
(1993a, b).
This approach separates the function of generating a knowledge base from the function of using
the knowledge base for decision-making. The first function focuses on learning an accurate,
consistent, and complete concept description expressed in a declarative form. The second function
is performed whenever a new decision-making situation occurs: a task-oriented decision structure
is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted
for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam,
1994).
The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from
decision rules or examples. Each decision-making situation is defined by a set of parameters that
controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show
that the decision structures learned by it usually outperform, in terms of accuracy and average size,
the decision structures learned from examples by other well-known systems. The results
also show that the system does not work very well with noisy data. The system is illustrated and
compared using applications to artificial problems, such as the three MONK's problems (Thrun,
Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also
applied to real-world problems of learning decision structures in the areas of construction
engineering (for determining the best wind bracing design for tall buildings), medical
diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for
learning classification rules for distinguishing between poisonous and non-poisonous
mushrooms), and political data (for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
1.1 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making
process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of modification) and the form that is most convenient for decision-making.
A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation); the branches are assigned possible test outcomes or ranges of outcomes; and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when the leaves are assigned single definite decisions. Thus the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing the attributes to
be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain, and the gain ratio (Quinlan 1979, 1983, 1986), the gini index of diversity (Breiman et al. 1984), and others (Cestnik & Bratko 1991; Cestnik & Karalic 1991; Mingers 1989a).
A decision tree/decision structure representation can be an effective tool for describing a decision process as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions
do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful to either modify the structure so that it does not contain that attribute or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).
A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation such as a set of decision rules: the tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a large number of logically equivalent decision structures (trees) which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify to adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order the tests are evaluated, and thus needs a decision structure.
An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form and to transform it into a decision structure when it is needed for decision-making.
This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generation from training examples. Thus this process could be done online, without any delay noticeable to the user. Such virtual decision structures are easy to tailor to any given decision-making situation.
This approach allows one to generate a decision structure that avoids or delays evaluating an
attribute that is difficult to measure in some decision-making situation or that fits well a particular
frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part that concerns the decision classes of interest. Thus such an approach has many potential advantages.
This dissertation presents a new system called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned either by the rule learning system AQ15 (Michalski et al. 1986) or by the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al. 1993).
To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating unknown nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to inability to evaluate an attribute associated with an intermediate node.
To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments was designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structure learned from them, and comparing decision trees learned by the AQDT-2 system with those learned by C4.5 (Quinlan 1993), the well-known system for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2, and MONK-3 (Thrun, Mitchell & Cheng 1991), East-West trains (Michie et al. 1994), Engineering Design-wind bracings (Arciszewski et al. 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg 1990), and the Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.
1.2 The Problem Statement
There are many limitations and problems that accompany the use of decision trees for decision-making.
CHAPTER 2 RELATED RESEARCH
2.1 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. The method proposed several attribute selection criteria of increasing power, based on the main criterion, the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.
Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.
Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint; in other words, for any two rules there exists a condition with the same attribute but with different values in each rule.
Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.
Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram the condition parts of the given rules, and marking them with the action specified by each rule.
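Definition 2-2 is easy to operationalize. In the sketch below (the rule representation is an assumption of mine, not taken from the text) a rule is a mapping from attributes to sets of allowed values, and attributes absent from the mapping are unconstrained:

```python
def disjoint(rule_a, rule_b, domains):
    """Two rules are logically disjoint if some attribute is restricted
    to non-overlapping value sets in the two rules (Definition 2-2)."""
    for attr, dom in domains.items():
        if not (rule_a.get(attr, dom) & rule_b.get(attr, dom)):
            return True
    return False

def is_disjoint_cover(rules, domains):
    """True if all rules in the cover are pairwise logically disjoint."""
    return all(disjoint(a, b, domains)
               for i, a in enumerate(rules) for b in rules[i + 1:])
```

For instance, the minimal cover used in the example later in this section ([x2=0], [x1=0][x2=2], [x2=1], [x1=2][x2=2], [x1=1][x2=2]) is pairwise disjoint: any two of its rules differ either on x2 or, within the x2=2 group, on x1.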
Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal
decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has shown that if only one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree) there will have to be at least two leaves representing this rule in the final decision tree.
The attribute selection criterion MAL, introduced in (Michalski 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1], and x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.
Figure 2-1 An example to illustrate how attributes break rules
One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated, relative to a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute to be selected is the one which breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).
Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.
Example: Learn a decision tree from the following decision table (Table 2-1). The minimal cover consists of the following rules:
A1 <:: [x2=0] v [x1=0][x2=2]    A2 <:: [x2=1] v [x1=2][x2=2]    A3 <:: [x1=1][x2=2]
The evaluations of the MAL criterion for these attributes are 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of
the decision tree is x2. Then three branches are attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.
Figure 2-2 A decision tree learned from the decision table in Table 2-1
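The static cost estimate in this example can be sketched in a few lines. This is an illustrative reimplementation, not Michalski's code, and it reuses an assumed rule representation: a rule maps attributes to sets of allowed values, with absent attributes unconstrained:

```python
def mal(attribute, rules, domains):
    """MAL static cost estimate: the number of rules broken by the
    attribute. A rule is broken if it admits more than one value of the
    attribute, so its examples would be split across branches."""
    return sum(1 for rule in rules
               if len(rule.get(attribute, domains[attribute])) > 1)
```

Applied to the minimal cover above (with x3 and x4 unconstrained in every rule), this reproduces the evaluations quoted in the text: 2 for x1, 0 for x2, and 5 for both x3 and x4, so x2 is chosen as the root.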
2.2 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating decision trees that classify a set of examples according to the decision classes they belong to. The essential aspect of any inductive decision tree method is the attribute selection criterion. The attribute selection criterion measures how good the attributes are for discriminating among the given set of decision classes. The best attribute according to the selection criterion is chosen to be assigned to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has been subsequently modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.
Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria for selecting attributes use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL (minimizing added leaves) criterion (Michalski 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory. These criteria measure the information conveyed
by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure, and the gain criteria (Quinlan 1979, 1983), the gini index of diversity (Breiman et al. 1984), the gain-ratio measure (Quinlan 1986), and others (Clark & Niblett 1987; Bratko & Lavrac 1987; Cestnik & Karalic 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions for determining whether or not there is a correlation. The attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf 1981; Hart 1984; Mingers 1989a).
Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial complete decision tree, followed by tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko 1994). Pruning decision trees improves their simplicity but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring the probabilities of an example belonging to different classes.
The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan 1993); C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers 1989a), a statistics-based method for selecting an attribute to be a node in the tree.
2.2.1 Building Decision Trees Using Information-Based Criteria
This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered one of the most stable, accurate, and fastest programs for learning decision trees from examples.
Learning decision trees from examples requires a collection of examples, each represented by a fixed number of attribute-value pairs. C4.5 (Quinlan 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan 1979), which is based on Hunt's method for constructing a decision tree from a set of cases (Hunt, Marin & Stone 1966).
The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion calculates the gain in classification information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the Gain Criterion. The Gain Criterion uses the frequency of each decision class in the given set of training examples.
Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and divides the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class; otherwise, the system searches for another attribute to be a node in the tree.
The Gain Criterion: The gain criterion is based on information theory; that is, the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem X1, ..., Xn are given attributes and C1, ..., Ck are decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:
freq(Ci, S) = number of examples in S belonging to Ci (2-1)
Suppose that |S| is the total number of examples in S; the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.
The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2(freq(Ci, S) / |S|) bits.
The expected information from such a message stating class membership is given by
info(S) = - Σi=1..k (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|) bits (2-2)
info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.
Suppose that we select an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, infoX(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:
infoX(T) = Σi=1..k (|Ti| / |T|) info(Ti) (2-3)
The information gained by partitioning the training examples T into subsets using the attribute X is given by
gain(X) = info(T) - infoX(T) (2-4)
The attribute to be selected is the attribute with the maximum gain value.
The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains attributes such as a social security
number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.
Following steps similar to those used to obtain the information conveyed by dividing T into n subsets, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by
split info(T) = - Σi=1..n (|Ti| / |T|) log2(|Ti| / |T|) (2-5)
The gain ratio is given by
gain ratio(X) = gain(X) / split info(X) (2-6)
and it expresses the proportion of information generated by the split that is useful for classification.
Example: Consider the following example presented by Quinlan (1993); Table 2-2 shows the set of training examples.
First, determine the amount of information gained by selecting the attribute outlook to be the root of the decision tree. This attribute divides the training examples into three subsets: sunny, with five examples, two of which belong to the class Play; overcast, with four examples, all of which belong to the class Play; and rain, with five examples, three of
which belong to the class Play. To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class Play and five belong to the class Don't Play.
Info(T) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94 bits
When using outlook to divide the training examples the information becomes
infooutlook(T) = 5/14 (-2/5 log2(2/5) - 3/5 log2(3/5))
+ 4/14 (-4/4 log2(4/4) - 0/4 log2(0/4))
+ 5/14 (-3/5 log2(3/5) - 2/5 log2(2/5)) = 0.694 bits
By substituting in equation 2-4, the information gain that results from using the attribute outlook to split the training examples is equal to 0.246. The information gain for windy is 0.048.
Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for outlook is determined as follows:
split info(T) = - 5/14 log2(5/14) - 4/14 log2(4/14) - 5/14 log2(5/14) = 1.577 bits
The gain ratio for outlook = 0.246 / 1.577 = 0.156
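The calculation above is easy to reproduce programmatically. The sketch below (the function names are mine) works from per-branch class counts and recovers the figures quoted for outlook:

```python
from math import log2

def entropy(counts):
    """info(S): expected bits needed to identify the class of an example,
    given the class frequencies in S (equation 2-2)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_and_ratio(class_counts, branches):
    """class_counts: class frequencies in T; branches: per-branch class
    counts after splitting T on a candidate attribute (equations 2-3 to 2-6)."""
    total = sum(class_counts)
    info_t = entropy(class_counts)
    info_x = sum(sum(b) / total * entropy(b) for b in branches)
    split_info = entropy([sum(b) for b in branches])
    gain = info_t - info_x
    return gain, gain / split_info

# 9 'Play' vs 5 "Don't Play"; outlook splits them into
# sunny (2, 3), overcast (4, 0), rain (3, 2).
gain, ratio = gain_and_ratio([9, 5], [[2, 3], [4, 0], [3, 2]])
```

Note that `split_info` is just the entropy of the branch sizes, which is why a many-valued attribute such as a social security number is penalized: its split information is huge even when its gain is.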
Figure 2-3 A decision tree learned using the gain criterion for selecting attributes
The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals; in other words, for each continuous attribute, C4.5 generates two branches, one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.
Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by leaves. The C4.5 system uses the Laplace ratio for estimating the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
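As a quick sketch of that estimate (hedged: C4.5's full pruning machinery involves more than this single ratio):

```python
def laplace_error(n, e):
    """Laplace error ratio for a leaf covering n training examples,
    e of them misclassified: (e + 1) / (n + 2)."""
    return (e + 1) / (n + 2)
```

A leaf that classifies 10 examples with no errors still gets a nonzero estimated error of 1/12, which is what makes the ratio usable for comparing a subtree against the single leaf that would replace it.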
2.2.2 Building Decision Trees Using Statistics-Based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses Chi-square statistics to measure the association between two attributes. When building decision trees, the method is implemented so that it determines the association between each attribute and the decision classes. The attribute to be selected is the one with the greatest value.
To determine the Chi-square value for an attribute, let aij be the number of examples in class number i for which the attribute A takes value number j; in other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by
Chi-square(A) = Σi=1..n Σj=1..m [(aij - Eij)² / Eij] (2-7)
where n is the number of decision classes and m is the number of values of the given attribute. Also,
Eij = (TCi × TVj) / T (2-8)
where TCi and TVj are the total number of examples belonging to decision class Ci and the total number of examples for which attribute A takes value vj, respectively, and T is the total number of examples.
Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of the different combinations of values between the decision classes and both the Outlook and the Windy attributes. Table 2-4 shows the expected values Eij, computed from the totals TCi and TVj, for the frequencies in Table 2-3.
To determine the association between the decision classes and both the attribute Windy and the attribute Outlook, the observed Chi-square values are
Chi-square(Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/5.1] + [(2-2.9)²/2.9] = 0.21 + 0.39 + 0.25 + 0.25 = 1.1
Chi-square(Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2] + [(3-1.8)²/1.8] + [(0-1.4)²/1.4] + [(2-1.8)²/1.8] = 0.45 + 0.75 + 0.2 + 0.8 + 1.4 + 0.02 = 3.62
Applying the same method to the other attributes, the results favor the attribute Outlook. Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset.
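The whole computation can be sketched as a generic implementation of equations 2-7 and 2-8. Note that with unrounded expected values the results come out slightly different from the rounded hand calculation above (roughly 0.93 for Windy and 3.55 for Outlook), but Outlook still wins by a wide margin:

```python
def chi_square(table):
    """table[i][j]: number of examples in decision class i taking the
    j-th value of the attribute (equations 2-7 and 2-8)."""
    total = sum(map(sum, table))
    class_totals = [sum(row) for row in table]        # TCi
    value_totals = [sum(col) for col in zip(*table)]  # TVj
    return sum((a - e) ** 2 / e
               for i, row in enumerate(table)
               for j, a in enumerate(row)
               for e in [class_totals[i] * value_totals[j] / total])

# Rows: Play, Don't Play. Windy columns: true, false;
# Outlook columns: sunny, overcast, rain.
windy = chi_square([[3, 6], [3, 2]])
outlook = chi_square([[2, 4, 3], [3, 0, 2]])
```

The attribute with the largest value is selected, so Outlook becomes the node, matching the conclusion in the text.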
Table 2-5 shows a summary of these criteria and their basic evaluation function
Table 2-5 Attribute selection criteria and their basic evaluation measure
Info Measure (IM), Gain, G-statistic, and Gain Ratio:
Entropy(S) = - Σi (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)
G-statistic = 2N · IM (N = number of examples)
Chi-square(A, B) = Σi=1..n Σj=1..m [(aij - Eij)² / Eij]
2.2.3 Analysis of Attribute Selection Criteria
This subsection briefly introduces the analysis of different selection criteria that was done by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, G-statistic, Gini index of diversity, Marshall correction, and Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, a zero cell adds the maximum association between any two attributes, because the Chi-square value of a zero cell is the expected value of that cell.
Now let us consider results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees under eleven different criteria. In the final results, he compared the total number of nodes and the total error rate produced by each criterion over all the given problems. Table 2-8 shows the final results for five selected criteria only.
Table 2-8 Results comparing the total accuracy and size of decision trees for different attribute selection criteria over four domains
This experiment was performed on four real-world data sets, concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details, see Mingers (1989a).
2.3 Learning Decision Structures
Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas of Imam and Michalski (1993b).
In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes containing rules only represent conditions common to all of their children.
The main disadvantages of this approach is that it requires discriminant rules to build such a
decision structure Also such a structure is more complex than the traditional decision trees
that are used for decision-making
Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure.
The decision structure can be read as follows: it is Safe, except if x1=1 & x2=1 & x3=1 and
either x4=3 or x5=1, then it is Lost; except if x6=1, it is Safe; except if x7=1, it is Lost.
The second approach, introduced by Ronny Kohavi, learns decision structures from examples
using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing
Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a
decision graph where each attribute occurs at most once along any computational path. In other
words, for each path from the root to the leaves of the decision structure, an attribute may occur
as a node at most once. However, there may be more than one node with the same attribute in
the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are
partitioned into a sequence of pairwise disjoint sets, such that outgoing edges from each level
terminate at the next level. An oblivious decision graph is a decision graph where all nodes at
a given level are labeled by the same attribute.
Safe <- [x1=2]
Safe <- [x2=2]
Safe <- [x3=2]
Safe <- [x4=1] & [x5=2]
Safe <- [x4=1] & [x5=3]
Safe <- [x6=1] & [x7=2]
Safe <- [x6=1] & [x7=3]
Safe <- [x4=2] & [x5=2]
Safe <- [x4=2] & [x5=3]
Lost <- [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a
nondeterministic method to select an attribute to build a new level of the decision structure. This
attribute is removed from the data, and the data is divided into subsets, each corresponding to a
possible combination of that attribute's values. For each subset, the process is repeated until all
the examples of a given subset belong to one decision class. For example, suppose the selected
attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The
data is divided into two subsets: the first subset contains examples where A takes value 0 and
belong to class C0, or takes value 1 and belong to class C1. The second subset of examples is
the set where A takes value 0 and belong to class C1, or takes value 1 and belong to class C0.
The number of nodes of the first level (after the leaf nodes) is expected to be less than or
equal to k^n, where k is the number of decision classes and n is the number of values of the
selected attribute, and it can increase exponentially before that number is reduced
exponentially to one.
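The two-subset division described in the example above can be sketched as follows. This is a hypothetical illustration of the partitioning step, not Kohavi's implementation; the attribute and class names are invented.

```python
def split_two_way(examples, attr):
    """Divide (features, cls) examples, for a binary attribute and two classes,
    into the two subsets described above: those consistent with the mapping
    {0 -> "C0", 1 -> "C1"} and those consistent with {0 -> "C1", 1 -> "C0"}."""
    subset_a = [ex for ex in examples
                if (ex[0][attr] == 0) == (ex[1] == "C0")]
    subset_b = [ex for ex in examples if ex not in subset_a]
    return subset_a, subset_b

examples = [({"A": 0}, "C0"), ({"A": 1}, "C1"),
            ({"A": 0}, "C1"), ({"A": 1}, "C0")]
sub_a, sub_b = split_two_way(examples, "A")
# sub_a holds the (A=0, C0) and (A=1, C1) examples; sub_b the other two
```

In general, with k classes and n attribute values there are up to k^n such value-to-class mappings, which matches the bound stated above.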
It is easy for the reader to figure out some major disadvantages of such an approach. The
average size of such decision structures is estimated to be very large, especially when there
is no similarity (i.e., strong patterns) or logical relationship in the data. The time used to learn
such a decision structure is relatively very high compared to systems for learning decision trees
from examples. Finally, it could be better to search for an attribute which reduces the number
of generated subsets of the data, instead of nondeterministically selecting an attribute to build a
new level of the decision structure. Kohavi provided a comparison between the C4.5 system
and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a
deterministic method for attribute selection which minimizes the width of the penultimate level
of the graph.
Table 2-9 shows a comparison between the proposed approach and those two approaches. The
EDAG and HOODG systems are unreleased prototype systems.
(From Table 2-9: decision structures produced by the proposed approach and by HOODG are easy to understand, while EDAGs are difficult to read.)
CHAPTER 3 DESCRIPTION OF THE APPROACH
3.1 General Methodology
In the proposed approach, the function of learning or discovery is separated from the function of
using the discovered knowledge for decision-making. The first function is performed by an
inductive learning program that searches for knowledge relevant to a given class of decisions
and stores the learned knowledge in the form of decision rules. The second function is performed
when there is a need for assigning a decision to new data points in the database (e.g., a
classification decision) by a program that transforms the obtained knowledge into a decision
structure optimized according to the given decision-making situation.
The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-Making Task
Given:
- A set of decision rules in a conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order, preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.
The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17
(Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of
declarative knowledge are that they do not impose any order on the evaluation of the attributes,
and, due to the lack of order constraints, decision rules can be evaluated in many different
ways, which increases the flexibility of adapting them to the different tasks of decision-making
(Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller
than the number of examples per class, generating a decision structure from decision rules can
potentially be done on-line.
Such virtual decision structures can be tailored to any given decision-making situation The
needed decision rules have to be generated only once and then they can be used many times for
generating decision structures according to changing requirements of decision-making tasks The
method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures
from decision rules Decision structures represent a procedural form of knowledge which makes
them easy to implement but also harder to change Consequently decision structures can be quite
effective and useful as long as they are used in decision-making situations for which they are
optimized and the attributes specified by the decision structure can be measured without much
cost Figure 3-1 shows an architecture of the proposed methodology
Figure 3-1: Architecture of the AQDT approach (learning knowledge from the database, and the decision-making process)
It is assumed that the database is not static but is regularly updated A decision-making problem
arises when there is a case or a set of cases to which the system has to assign a decision based on
the knowledge discovered. Each decision-making situation is defined by a set of attribute-values.
Some attribute-values may be missing or unknown A new decision structure is obtained such
that it suits the given decision-making problem The learned decision structure associates the
new set of cases with the proper decisions
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs
The decision rules are generated from examples by an AQ-type inductive learning system,
specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of
this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions, using
the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology,
called AQ, starts with a "seed" example of a given decision class and generates a set of the
most general conjunctive descriptions of the seed (alternative decision rules for the seed
example). Such a set is called the "star" of the seed example. The algorithm selects from the star
a description that optimizes a criterion reflecting the needs of the problem domain If the
criterion is not defined the program uses a default criterion that selects the description that
covers the largest number of positive examples (to minimize the total number of rules needed)
and with the second priority that involves the smallest number of attributes (to minimize the
number of attributes needed for arriving at a decision)
If the selected description does not cover all examples of a given decision class a new seed is
selected from uncovered examples and the process continues until a complete class description
is generated The algorithm can work with few examples or with many examples and can
optimize the description according to a variety of easily-modifiable hypothesis quality criteria
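The covering loop just described can be roughly sketched as follows. This is a simplified illustration of the general AQ idea, not the AQ15 implementation: the star is naively approximated by dropping conditions from the seed, and the data are assumed to be consistent (no example is both positive and negative).

```python
from itertools import combinations

def covers(rule, example):
    """A rule is a frozenset of (attribute, value) conditions."""
    return all(example.get(a) == v for a, v in rule)

def learn_class(pos, neg):
    """Cover all positive examples: pick an uncovered seed, generalize it as
    far as possible without covering any negative, keep the best rule."""
    rules, uncovered = [], list(pos)
    while uncovered:
        seed = sorted(uncovered[0].items())
        # naive "star": every subset of the seed's conditions that excludes
        # all negative examples
        star = [frozenset(sub)
                for r in range(1, len(seed) + 1)
                for sub in combinations(seed, r)
                if not any(covers(frozenset(sub), n) for n in neg)]
        # default criterion: most positives covered, then fewest conditions
        best = max(star, key=lambda rl: (sum(covers(rl, p) for p in pos),
                                         -len(rl)))
        rules.append(best)
        uncovered = [p for p in uncovered if not covers(best, p)]
    return rules
```

For instance, with positives {x=1, y=1}, {x=1, y=2} and negative {x=2, y=1}, the sketch keeps the single rule [x=1], which covers both positives while excluding the negative.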
The learned descriptions are represented in the form of a set of decision rules expressed in an
attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski, 1973). A
distinctive feature of this representation is that it employs, in addition to standard logic operators,
the internal disjunction operator (a disjunction of values of the same attribute in a condition) and
the range operator (to express conditions involving a range of discrete or continuous values).
These operators help to simplify rules involving multi-valued discrete attributes; the second
operator is also used for creating logical expressions involving continuous attributes.
AQ15 can generate decision rules that represent either characteristic or discriminant concept
descriptions depending on the settings of its parameters (Michalski 1983) A characteristic
description states properties that are true for all objects in the concept The simplest
characteristic concept description is in the form of a single conjunctive rule (in general it can be
a set of such rules) The most desirable is the maximal characteristic description that is a rule
with the longest condition part ie stating as many common properties of objects of the given
class as can be determined A discriminant description states properties that discriminate a given
concept from a fixed set of other concepts The most desirable is the minimal discriminant
description, that is, a rule with the shortest condition part. For example, to distinguish a given
set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top".
A characteristic description of the tables would also include properties such as "have four legs",
"have no back", "have four corners", etc. Discriminant descriptions are usually much shorter than
characteristic descriptions
Another option provided in AQ15 controls the relationship among the generated descriptions
(rulesets or covers) of different decision classes In the IC (Intersecting Covers) mode
rulesets of different classes may logically intersect over areas of the description space in which
there are no training examples In the DC (Disjoint Covers) mode descriptions of different
classes are logically disjoint The DC mode descriptions are usually more complex both in the
number of rules and the number of conditions. There is also a DL mode (a Decision List mode,
also called VL mode, for variable-valued logic mode), in which the program generates rulesets
that are linearly ordered. To assign a decision to an example using such rulesets, the program
evaluates them in order. If ruleset i is satisfied by the example, then the decision is made;
otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes,
rulesets can be evaluated in any order.
Alternatively, the system can use rules from the AQ17-DCI program for Data-driven
Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for
generating additional attributes. These attributes are various logical or mathematical
combinations of the original attributes. The program generates a large number of potential new
attributes and selects from them the most promising ones, based on an attribute quality criterion.
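The idea of generating and screening combination attributes can be sketched as follows. This is only illustrative: the attribute constructors (sums and products of pairs) and the crude mean-separation score are stand-ins, not the actual AQ17-DCI operators or quality criterion.

```python
from itertools import combinations
from statistics import mean

def construct_attributes(rows, classes, top_k=2):
    """rows: list of {attr: numeric value}; classes: 0/1 class label per row.
    Build candidate attributes as sums and products of attribute pairs and
    rank them by how far apart the two class means are."""
    names = sorted(rows[0])
    candidates = {}
    for a, b in combinations(names, 2):
        candidates[f"{a}+{b}"] = [r[a] + r[b] for r in rows]
        candidates[f"{a}*{b}"] = [r[a] * r[b] for r in rows]

    def separation(vals):
        # crude quality score: distance between the two class means
        v0 = [v for v, c in zip(vals, classes) if c == 0]
        v1 = [v for v, c in zip(vals, classes) if c == 1]
        return abs(mean(v1) - mean(v0))

    return sorted(candidates, key=lambda n: separation(candidates[n]),
                  reverse=True)[:top_k]
```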
To illustrate the format of rules generated by AQ15 (or AQ17-DCI) an exemplary ruleset is
shown in Figure 3-2 The ruleset (that can be re-represented as a disjunctive normal form
expression) describes a voting record of Democratic Representatives in the US Congress
Each rule is a conjunction of elementary conditions Each condition expresses a simple relational
statement For example the condition [State = northeast v northwest] states that the attribute
State (of the Representative) should take the value northeast or northwest to satisfy the
condition
R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]
Figure 3-2 A ruleset generated by AQ15 for the concept Voting pattern of Democratic Representatives
The above rules were generated from examples of the voting records. For illustration, below is
an example of a voting record of a Democratic representative:
Draft registration=no, Ban of aid to Nicaragua=no, Cut expenditure on MX missiles=yes,
Federal subsidy to nuclear power stations=yes, Subsidy to national parks in Alaska=yes,
Fair housing bill=yes, Limit on PAC contributions=yes, Limit on food stamp program=no,
Federal help to education=no, State from=northeast, State population=large,
Occupation=unknown, Cut in social security spending=no, Federal help to Chrysler
corp.=not registered
By expressing elementary statements in the example as conditions, and linking the conditions by
conjunction, the examples can be re-expressed as decision rules. Thus, decision rules and
examples formally differ only in the degree of generality.
3.3 Generating Decision Structures/Trees from Decision Rules
This section describes the AQDT-1 system for learning decision structures (trees) from decision
rules (Imam & Michalski, 1993a,b). Also, a description of the AQDT-2 method for learning task-
oriented decision structures from decision rules is included, and finally the methodology is
illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning
due to their simplicity Decision trees built this way can be quite efficient as long as they are
used in decision-making situations for which they are optimized and these situations remain
relatively stable Problems arise when these situations significantly change and the assumptions
under which the tree was built do not hold anymore For example in some situations it may be
difficult to determine the value of the attribute assigned to some node One would like to avoid
measuring this attribute and still be able to classify the example if this is potentially possible
(Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure
the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also
desirable if there is a significant change in the frequency of occurrence of examples from
different classes A restructuring of a decision tree to suit the above requirements is however
difficult to do. The reason for this is that a decision tree is a form of decision structure
representation that imposes constraints on the evaluation order of the attributes which are not
logically necessary.
One problem in developing a method for generating decision structures from decision rules is to
design an attribute selection criterion that is based on the properties of the rules rather than of
the training examples A decision rule normally describes a number of possible examples Only
some of them are examples that have actually been observed ie training examples An attribute
selection criterion is needed to analyze the role of each attribute in the rules It cannot be based
on counting the numbers of training examples covered by each attribute-value and the
frequency of decision classes in the training examples as is done in learning decision trees from
examples because the training examples are assumed to be unavailable
Another problem in learning decision trees from decision rules stems from the fact that decision
rules constitute a more powerful knowledge representation than decision trees They can directly
represent a description in an arbitrary disjunctive normal form while decision trees can represent
directly only descriptions in the disjoint disjunctive normal form In such descriptions all
conjunctions are mutually logically disjoint Therefore when transforming a set of arbitrary
decision rules into a decision tree one faces an additional problem of handling logically
intersecting rules
The solution to both problems (attribute selection and logically intersected rules) in the AQDT-2
system is based on the earlier work by Michalski (1978) which introduced a general method for
generating decision trees from decision rules The method aimed at producing decision trees
with the minimum number of nodes or the minimum cost (where the cost was defined as the
total cost of classifying unknown examples given the cost of measuring individual attributes and
the expected probability distribution of examples of different decision classes) More
explanations are provided in the following section
3.3.1 The AQDT-2 attribute selection method
This section describes the AQDT-2 method for building a decision structure from decision rules.
The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples The major difference is that it assigns tests
(attributes) to the nodes using criteria based on the properties of the decision rules (this includes
statistics about the examples covered by each rule in the case of learning rules from examples)
rather than statistics characterizing the frequency of training examples per decision classes per
attribute-values or per conjunctions of both Other differences are that the branches may be
assigned an internal disjunction of values (not only a single value as in a typical decision tree)
and leaves may be assigned a set of alternative decisions with probabilities Also the tests can be
attributes or names standing for logical or mathematical expressions that involve several
attributes or variables In the following we use the terms test and attribute interchangeably
(to distinguish between an attribute and a name standing for an expression the latter is called a
constructed attribute)
At each step the method chooses the test from an available set of tests that has the highest utility
(see below) for the given set of decision rules This test is assigned to the node The branches
stemming from this node are assigned test values or disjoint groups of values (in the form of
logical disjunction if such occur in the rules subsumed groups of values are removed) Each
branch is associated with a reduced set of rules determined by removing conditions in which the
selected attribute assumes value(s) assigned to this branch If all rules in the reduced ruleset
indicate the same decision class a leaf node is created and assigned this decision class The
process continues until all nodes are leaf nodes. If it is not possible to further reduce the ruleset,
because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of
candidate decisions with associated probabilities (see Section 4.2).
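The construction loop above can be condensed into a sketch like the following, with the test-selection criterion abstracted into a parameter. This is a hypothetical simplification: branch grouping, "or" branches, and probability-labeled leaves are omitted, and the rules are assumed to be consistent.

```python
def build(rules, attrs, select_test):
    """rules: list of (conditions, cls), with conditions a dict mapping an
    attribute to its set of allowed values; a rule that omits an attribute
    is passed down every branch of a node testing that attribute."""
    classes = {cls for _, cls in rules}
    if len(classes) == 1:
        return classes.pop()                         # leaf node
    attr = select_test(rules, attrs)                 # highest-utility test
    values = set().union(*(c.get(attr, set()) for c, _ in rules))
    branches = {}
    for v in values:
        # reduce the ruleset: keep rules compatible with attr = v and drop
        # the selected attribute from their condition parts
        reduced = [({a: s for a, s in c.items() if a != attr}, cls)
                   for c, cls in rules if v in c.get(attr, {v})]
        branches[v] = build(reduced, [a for a in attrs if a != attr],
                            select_test)
    return {"test": attr, "branches": branches}
```

For instance, with rules [({'x': {1}}, 'A'), ({'x': {2}}, 'B')] and a selector that always returns 'x', the sketch yields a single node with branches 1 -> A and 2 -> B.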
The test (attribute) utility is a combination of one or more of the following elementary criteria 1)
cost which indicates the cost of using each attribute for making decision 2) disjointness which
captures the effectiveness of the test in discriminating among decision rules for different decision
classes; 3) importance, which determines the importance of a test in the rules; 4) value
distribution, which characterizes the distribution of the test importance over its values; and 5)
dominance, which measures the test presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.
Disjointness: The disjointness of a test is defined as the sum of the class disjointness, i.e., the
disjointness of the test for each decision class. Suppose decision classes are C1, C2, ..., Cm, and
decision rulesets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote
the sets of the values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm,
respectively. If a ruleset for some class, say Ct, contains a rule that does not involve test A, then
Vt is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is
the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for
Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is
defined by:

                0,  if Vi ⊆ Vj
D(A, Ci, Cj) =  1,  if Vi ⊃ Vj                                               (3-1)
                2,  if Vi ∩ Vj ≠ φ, and Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj
                3,  if Vi ∩ Vj = φ
where φ denotes the empty set. Note that exchanging the second and third conditions of equation
(3-1) may seem to give an improved criterion; however, it would not clearly distinguish between
the two cases (i.e., for both situations the disjointness would be similar). The current equation is
better because it gives higher scores to attributes that classify different subsets of the two decision
classes than to attributes that classify only a subset of one decision class.
Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the
sum of the degrees of class disjointness of each decision class:
                  m                                  m
Disjointness(A) = Σ D(A, Ci),   where   D(A, Ci) =   Σ   D(A, Ci, Cj)        (3-2)
                 i=1                              j=1, j≠i
The disjointness of a test ranges from 0, when the test values in the rulesets of different classes
are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the
test values. If two tests have the same disjointness value, the test selected is the one with the
smaller number of values.
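Definitions 3-1 and 3-2 can be sketched directly with set operations. This sketch assumes the four-case reading of the degree of disjointness: 0 when Vi ⊆ Vj, 1 when Vi is a proper superset of Vj, 2 when the sets partially overlap, and 3 when they are disjoint.

```python
def degree(vi, vj):
    """D(A, Ci, Cj) for the value sets Vi, Vj of test A in the two rulesets."""
    if vi <= vj:                  # Vi equal to, or a subset of, Vj
        return 0
    if vi > vj:                   # Vi a proper superset of Vj
        return 1
    if vi & vj:                   # partial overlap
        return 2
    return 3                      # disjoint value sets

def disjointness(value_sets):
    """Disjointness(A): sum of D(A, Ci, Cj) over ordered class pairs."""
    return sum(degree(value_sets[ci], value_sets[cj])
               for ci in value_sets for cj in value_sets if ci != cj)
```

With two classes, the possible totals are 0 (equal sets), 1 (subset), 4 (partial overlap), and 6 (disjoint sets), and the maximum over m classes is 3m(m-1).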
Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree
is defined as the average number of tests (attributes) to be examined from the root of the tree to
any leaf node in order to reach a decision.
Definition: A decision structure is a one-node-per-level decision structure if at each level there is
only one node and zero or more leaves.
Such a decision structure can be generated by combining into one branch all branches whose
associated sets of decision rules belong to more than one decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two
decision classes. The disjointness criterion ranks first the attributes that add the minimum number
of tests to the decision tree.
Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci
and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not
subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the
same set of values in both classes is a trivial one). Assume that branches leading to one subset
with the same decision class are combined into one branch. In the first case, there will be two
branches only. The first leads to a leaf node, and the other leads to an intermediate node where
another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three
branches should be created. Two branches lead to leaf nodes, where all values at each branch
belong to only one, and a different, decision class. The third branch leads to an intermediate node
where another attribute should be selected that further classifies the decision classes. The
minimum ANT in this case is 6/4. In the third case, only two branches will be generated, where
each leads to a leaf node with a different decision class. In this case, the minimum ANT is 1.
Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that
in the case of having more than one attribute-value at branches leading to leaves belonging to one
decision class, they will be combined into one branch in the decision structure. The symbol "1"
means that another attribute is needed to classify the two decision classes; in such cases there will
be at least two additional paths.
D(A, Ci) = 0, D(A, Cj) = 1        D(A, Ci) = 2, D(A, Cj) = 2        D(A, Ci) = 3, D(A, Cj) = 3
Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is determined
in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks
highly the attributes that reduce the average number of tests required for decision-making. The
theorem can be proved similarly in the general case.
ANT = 3/2        ANT = 5/3        ANT = 1
("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify
the decision classes.
Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes,
A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship
between any two decision classes, this means that there are more decision classes where D(A,
Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness
for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj)
than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci,
Cj), then B classifies the decision classes better than A.
For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute
are 0, 1, 4, or 6. For all positive values of D(B) = 1, 4, or 6, it is clear that attribute B should have
a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) <
D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.
Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is
a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the
importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained
rules, each test is assigned a score that represents the total number of training examples that
are covered by the rules involving this test. Decision rules learned by an AQ learning program
are accompanied by information on their strength. Rule strength is characterized by t-weight
and u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of
that class covered by the rule. The importance score of a test is the aggregation of the total-
weights of all rules that contain that test in their condition part. Given a set of decision rules for
m decision classes, C1, ..., Cm, involving n tests, A1, ..., An, and with the number of rules
associated with class Ci denoted by ri, the importance score is defined as follows:
Definition 3-3: The importance score, IS(Aj), of the test Aj, is determined by:

           m
IS(Aj) =   Σ  IS(Aj, Ci)                                                   (3-3.1)
          i=1

where
                ri
IS(Aj, Ci) =    Σ  Rik(Aj)                                                 (3-3.2)
               k=1

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by:

            t-weight of Rik,  if Aj belongs to rule Rik
Rik(Aj) =                                                                  (3-4)
            0,                otherwise

where i = 1, ..., m; k = 1, ..., ri; and j = 1, ..., n.
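Definition 3-3 reduces, in code, to summing t-weights over the rules that mention a test. A small sketch follows; the rule encoding is invented for illustration.

```python
def importance_scores(rules):
    """rules: (attrs_in_condition, t_weight, cls) triples; returns IS per test,
    i.e., the sum of t-weights of all rules whose condition mentions the test."""
    scores = {}
    for attrs, t_weight, _cls in rules:
        for a in attrs:
            scores[a] = scores.get(a, 0) + t_weight
    return scores

rules = [({"x1", "x2"}, 10, "C1"), ({"x2"}, 5, "C1"), ({"x3"}, 7, "C2")]
# importance_scores(rules) -> {"x1": 10, "x2": 15, "x3": 7}
```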
The importance score method has been separately compared, as a feature selection method, with
a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method
produced an equal or higher accuracy on three real-world problems than that reported for the
GA method, while selecting fewer attributes. In addition, the IS method was significantly faster
than the GA.
Value distribution: The third elementary criterion, value distribution, concerns the number of
legal values of tests. Given two tests with equal importance scores, this criterion prefers the test
with the smaller number of legal values. Experiments have shown that this criterion is especially
useful when using discriminant decision rules.
Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by:

VD(Aj) = IS(Aj) / vj                                                       (3-5)

where vj is the number of legal values of Aj.
Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large
numbers of rules, as this indicates their high relevance for discriminating among the rulesets of
the given decision classes. Since some conditions in the rules have values linked by internal
disjunction, counting such rules directly would not reflect their relevance properly. Therefore,
for computing the dominance, the rules are counted as if they were converted to rules that do not
have internal disjunction. Such a conversion is done by multiplying out the condition parts of
the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is
multiplied out to two rules with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
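The multiplying-out step can be sketched with a Cartesian product over the value lists. This is an illustrative encoding, not the AQDT-2 internals.

```python
from itertools import product

def multiply_out(condition):
    """condition: {attr: list of internally disjoined values} -> list of
    disjunction-free conditions {attr: single value}."""
    attrs = sorted(condition)
    return [dict(zip(attrs, combo))
            for combo in product(*(condition[attr] for attr in attrs))]

def dominance(conditions, attr):
    """Count multiplied-out rules whose condition part mentions attr."""
    return sum(attr in cond
               for c in conditions for cond in multiply_out(c))

multiply_out({"x3": [1, 3], "x4": [1]})
# -> [{"x3": 1, "x4": 1}, {"x3": 3, "x4": 1}]
```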
The above criteria are combined into one general test measure using the lexicographic
evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of
the above elementary criteria, each associated with a "tolerance threshold" in percentage. The
criteria are applied to tests in the order defined by LEF. A test passes to the next criterion only if
it scores on the previous criterion within the range defined by the tolerance (from the top value).
The default LEF is:

<Cost, t1; Disjointness, t2; Importance, t3; Value distr., t4; Dominance, t5>        (3-6)

where t1, t2, t3, t4, and t5 are tolerance thresholds (in percentages); their default values are 0%.
The default value of the cost of each test is 1.
The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost.
If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two
or more attributes still share the same top score, or their scores differ by less than the assumed
tolerance threshold t2, the method evaluates these attributes using the second (importance)
criterion. If again two or more attributes share the same top score, or their scores differ by less
than the tolerance threshold t3, then the third criterion, the normalized IS (value distribution), is
used, and then similarly the fourth criterion (dominance). If there is still a tie, the method selects
the best attribute randomly.
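The LEF ranking procedure described above can be sketched generically: each criterion is a scoring function paired with a tolerance, and only candidates within the tolerance of the stage's best score survive to the next stage. Cost is handled by negating it, since it is to be minimized. This is a simplified illustration, not the AQDT-2 implementation.

```python
def lef_select(candidates, criteria):
    """criteria: list of (score_fn, tolerance) pairs, applied in order; higher
    scores are better, and a candidate survives a stage if it scores within
    tolerance (a fraction of the best score) of the stage's top scorer."""
    pool = list(candidates)
    for score, tol in criteria:
        best = max(score(c) for c in pool)
        pool = [c for c in pool if score(c) >= best - abs(best) * tol]
        if len(pool) == 1:
            break
    return pool[0]          # remaining ties broken arbitrarily

# e.g., rank by (negated) cost first, then by disjointness, zero tolerance
stats = {"A": (1, 6), "B": (1, 4), "C": (2, 9)}   # (cost, disjointness)
winner = lef_select(stats, [(lambda a: -stats[a][0], 0.0),
                            (lambda a: stats[a][1], 0.0)])
# -> "A": A and B tie on cost, and A has the higher disjointness
```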
If there is a non-uniform frequency distribution of examples of different classes, then the
selection criterion uses a modified definition of the disjointness. Namely, the previously defined
disjointness for each class is multiplied by the frequency of the class occurrence. The class
occurrence is the expected number of future examples that are to be classified to a given class:

                  m
Disjointness(A) = Σ D(A, Ci) · Frq(Ci)                                     (3-7)
                 i=1

where Frq(Ci) is the expected frequency of examples of class Ci, and it is assumed to be given
by the user. The attribute ranking criterion in this case is defined by the LEF:
<Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>        (3-8)

where the Cost denotes the evaluation cost of an attribute and is to be minimized, while the other
elementary criteria are treated the same way as in the default version.
3.3.2 The AQDT-2 algorithm
The AQDT-2 algorithm constructs a decision structure from decision rules by recursively
selecting, at each step, the best test according to the ranking criteria described above and
assigning it to a new node. The process stops when the algorithm creates terminal branches that
are assigned decision classes. To facilitate this process, the system creates a special data
structure for each concept description (ruleset). This structure has fields such as the number of
rules, the number of decision classes, and the number of attributes present in the rules. A set of
pointers connects this data structure to a set of data structures, each representing one decision
class. The decision class structure contains fields with information on the number of rules
belonging to that class, the frequency of the decision class, etc. It is also connected to a set of
data structures representing the decision rules within each decision class. The system
independently creates a set of data structures, each corresponding to one attribute. Each attribute
description contains the attribute's name, domain, type, the number of legal values, a list of the
values, the number of rules that contain that attribute, and the values of that attribute for each
rule. The attributes are arranged in an array in lexicographic order: first, in descending order of
the number of rules that contain that attribute, and second, in ascending order of the number of
the attribute's legal values.
The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:
A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.
B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.
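To illustrate how "or" branches are used at classification time, here is a hypothetical traversal sketch; the (attribute, branches) node representation below is an assumption made for illustration, not AQDT-2's internal format.

```python
def classify(example, node):
    """Traverse a compact decision structure.

    Internal nodes are (attribute, branches) pairs, where branches is a list
    of (value_set, subtree) pairs; an "or" branch is simply a value set with
    more than one element. Leaves are decision-class names.
    """
    while isinstance(node, tuple):
        attribute, branches = node
        value = example[attribute]
        # follow the branch whose value set contains the observed value
        node = next(sub for vset, sub in branches if value in vset)
    return node
```

For example, a branch labeled {1, 2} is taken whenever the attribute's value is 1 or 2, exactly as described in point A) above.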
To generate decision structures from rules, the AQDT-2 method prefers either characteristic or discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The AQDT-2 algorithm is:
The AQDT-2 Algorithm
Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent this attribute.
Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.
Step 3: For each branch, associate with it the group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing the condition [A = i v ...]. If a branch is assigned values i v j, then associate with it all rules containing the condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] == [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with the given branch constitute the ruleset context for this branch.
Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree have leaf nodes, stop. Otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
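In standard mode, the four steps above can be sketched as a short recursive procedure. This is an illustrative reconstruction, not the AQDT-2 source: rules are modeled as (class, {attribute: value_set}) pairs, and `rank` stands for any attribute-ranking function, such as the LEF described earlier.

```python
def build_tree(rules, domains, rank):
    """Build a standard-mode decision tree from disjoint decision rules.

    rules:   list of (decision_class, {attribute: set_of_allowed_values})
    domains: {attribute: list_of_legal_values}
    rank:    function(attributes, rules) -> best attribute (Step 1)
    """
    classes = {c for c, _ in rules}
    if len(classes) == 1:                      # Step 4: one class -> leaf node
        return classes.pop()
    attributes = {a for _, conds in rules for a in conds}
    best = rank(attributes, rules)             # Step 1: select an attribute
    node = {}                                  # Step 2: one branch per legal value
    for value in domains[best]:
        branch_rules = []
        for cls, conds in rules:               # Step 3: distribute the rules
            if best not in conds:              # a rule without A goes to all branches
                branch_rules.append((cls, conds))
            elif value in conds[best]:         # condition satisfied: remove it
                rest = {a: v for a, v in conds.items() if a != best}
                branch_rules.append((cls, rest))
        if branch_rules:
            node[value] = build_tree(branch_rules, domains, rank)
    return (best, node)
```

Given the rules P <= [x1=1][x2=1] v [x1=2][x2=2] and N <= [x1=1][x2=2] v [x1=2][x2=1], this sketch reproduces the two-level tree one would expect.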
To select an attribute to be a node in the decision tree (Steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all the decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF. The second iteration evaluates each attribute's disjointness for each decision class against the other decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute-value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

r = Σ (i=1..m) Ri   (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as:

Cmpx(Iter1) = O(r * s)

In the second iteration, the disjointness is calculated between the decision classes for all attributes. The complexity of the second iteration is given by:

Cmpx(Iter2) = O(n * m)
Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

l = max{m, r}   (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node in the decision tree, say the Node Complexity NC(AQDT), is given by:

NC(AQDT) = O(l * n)

Usually l equals the number of rules associated with the given node. Thus the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.

At each level of the decision tree, these two iterations are repeated for each non-leaf node. The Level Complexity of the AQDT algorithm, LC(AQDT), is given by:

LC(AQDT) < O(l * n)
which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm to be equal to (l * s * q), where q is the number of non-leaf nodes at the given level. In such cases, either (l * q ≤ r) or (l * s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by:

LC(AQDT) = O(2 * s * r/2) < O(n * l) = NC(AQDT)

a) per one level   b) per one path
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes
Note also that after an attribute is selected to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structure of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch are not tested again.

Since the disjointness criterion selects the attribute which minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels in a decision tree should be less than or equal to the minimum of the number of attributes and the number of rules. Let k be the number of levels in a given decision tree:

k ≤ min{n, r}   (3-10)
Two cases represent the most complex situations: Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels is a function of the logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by:

Complexity(AQDT) = O(l * n * log r)   (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels in a decision tree equals one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, it is unlikely to obtain such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In this case, any disjoint decision rule should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus the level complexity of this decision tree is estimated as:

LC(AQDT) = O(l * log n)

The maximum number of levels in such a decision tree is k-1. Thus the complexity of the AQDT algorithm in such cases is given by:

Complexity(AQDT) = O(l * k * log n)   (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT algorithm is determined by:

Cmplx(AQDT) = O(r * k * log l)   (3-13)
3.3.3 An example illustrating the algorithm
The following simple example illustrates how AQDT is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the process

T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software
These rules can be interpreted as:
Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.
Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phases, and you need an automated tool.
Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.
Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phases, and you need a manual tool.
Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.
Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phases, and you need a semi-automated tool.
Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b ...], where a, b, ... are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume that the tolerances for each elementary criterion equal 0%.

From the rules in Figure 3-6, we can also determine the disjoint groupings of attribute-values used for the compact mode of the algorithm. This is done as follows: 1) determine for each attribute the sets of values that the attribute takes in the individual decision rules, and 2) remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1} and {4} (Figure 3-6). Value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4} and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2} and {3, 4}.
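The subsumption step just described can be sketched as follows (a hypothetical helper written for illustration, not AQDT-2's code):

```python
def branch_value_sets(value_sets):
    """Compute compact-mode branch labels for one attribute.

    value_sets: the value sets the attribute takes in individual decision
    rules. Any set that strictly contains another surviving set is removed
    (it "subsumes" the smaller sets); the remaining sets become the labels
    of the branches stemming from the node.
    """
    sets = {frozenset(s) for s in value_sets}
    kept = [s for s in sets if not any(t < s for t in sets)]  # t < s: strict subset
    return sorted(kept, key=sorted)
```

On the x1 example above, {1, 2} is dropped because {1} and {2} are strict subsets of it, leaving the four singleton branches; on x2, {1, 2, 3, 4} is dropped, leaving {1}, {2} and {3, 4}.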
Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 is ended by a leaf, T3. Rules containing the other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools can be used for testing a given piece of software.

Complexity: No. of nodes: 4; No. of leaves: 7
Figure 3-7: A decision structure learned for classifying software testing tools
Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3 and 4). Rules are represented by collections of cells at the intersection of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules; rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].
a) Decision rules   b) Derived decision tree
Figure 3-8: Diagrammatic visualization of the decision rules and the derived decision tree

Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools; in other words, suppose that we would like to select the best tools independently of the metrics they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.
a) Ignoring the supporting metric   b) Ignoring the type of the tool
Figure 3-9: Decision trees learned ignoring the support metric and the type of the testing tool
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm then selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10: A decision tree learned without the cost attribute
3.4 Tailoring Decision Structures to the Decision-Making Situation
Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This chapter presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm was developed and implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all the other programs applied to the same data.
Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.
3.4.1 Learning Cost-Dependent Decision Structures
As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test's cost is the first criterion and its tolerance is 0%. This means that only the least expensive attributes pass to the next step of evaluation involving the other elementary criteria. If an attribute has high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.
3.4.2 Assigning Decisions Under Insufficient Information
In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution for the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have:

P(Ci | b1, ..., bk) = P(Ci) * P(b1, ..., bk | Ci) / P(b1, ..., bk)   (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from the different classes, we have:
P(Ci) = twi / Σ (j=1..m) twj   (3-10)

P(b1, ..., bk | Ci) = wi / twi   (3-11)

P(b1, ..., bk) = Σ (j=1..m) wj / Σ (j=1..m) twj   (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain:

P(Ci | b1, ..., bk) = wi / Σ (j=1..m) wj   (3-13)
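As a small numeric sketch, the estimate in (3-13) reduces to normalizing the counts of training examples that reached the node; the dictionary representation below is assumed for illustration.

```python
def class_distribution(passed_counts):
    """Estimate P(Ci | b1, ..., bk) per equation (3-13).

    passed_counts: {class_name: wi}, where wi is the number of training
    examples of class Ci that passed the tests leading to the node.
    """
    total = sum(passed_counts.values())
    return {c: w / total for c, w in passed_counts.items()}

def most_probable_decision(passed_counts):
    """The decision chosen when no further attribute can be measured."""
    dist = class_distribution(passed_counts)
    return max(dist, key=dist.get)
```

For example, if 3 training examples of T1 and 1 of T2 reached a node, the node's distribution is {T1: 0.75, T2: 0.25} and T1 is the most probable decision.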
A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.
3.4.3 Coping with noise in training data
The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
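A minimal sketch of the truncation step, assuming each rule carries its t-weight (the number of training examples it covers); the triple representation is an assumption made for illustration.

```python
def truncate_rules(rules, t_threshold):
    """Remove rules whose t-weight falls below the noise-level threshold.

    rules: list of (decision_class, conditions, t_weight) triples.
    t_threshold: minimum t-weight a rule must have to be kept; it reflects
    the expected noise level in the training data.
    """
    return [r for r in rules if r[2] >= t_threshold]
```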
3.5 Analysis of the AQDT-2 Attribute Selection Criteria
This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion ranks that attribute first. The first problem was introduced by Quinlan in 1993; it has four attributes (see Table 2-2) and two decision classes, and the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989; it has two attributes and two decision classes, and contains ambiguous examples (i.e., examples belonging to more than one decision class).

In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity ≤ 75]
Play <= [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.
Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute for the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y); however, the gain ratio criterion gave X a higher score than all the other criteria did.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered to be examples that belong to C only.
The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when applied to both the original examples and the rules learned from these examples. It is clear that neither the importance score nor the value distribution criterion performs well when evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria are applied to the learned rules, they provide very good results.
The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in the rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the most balanced appearance of its values in different rules. The dominance criterion prefers attributes that occur in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
3.6 Decision Structures vs. Decision Trees
This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex, and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.

Table 3-7: A comparison between decision structures and decision trees
Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11. Both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes; this decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.

a) Using the Disjointness criterion (P: Positive, N: Negative)
b) Using the Importance score criterion (P: Positive, N: Negative)
Figure 3-11: Decision structures learned by AQDT-2 using different criteria
To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the Imam's example, that represents a class of problems for which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example rests on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class P and "-" means the example belongs to class N. Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples for the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.
a) Training examples   b) The optimal decision tree
Figure 3-12: The Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples
AQ15c learned the following rules from this data:

P <= [x1=1] & [x2=1] v [x1=2] & [x2=2]
N <= [x1=1] & [x2=2] v [x1=2] & [x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
An example of a problem in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <= [x1=2] & [x2=2]
N <= [x1=1] & [x2=1 v 3] v [x1=3] & [x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem, learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples; 2) comparing the average number of tests required to make a decision with the decision tree against that required with the decision rules; and 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined using the new attribute "x1=2 & x2=2", with values 0 for "no" and 1 for "yes".

a) The training data   b) The correct decision tree
Figure 3-13: An example where decision rules are simpler than decision trees
CHAPTER 4 Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different
problems using different sizes of training data and applying different settings of the systems
parameters For comparison it also presents results from applying a well-known decision tree
learning system (C45) to the same problems This section also includes some analysis and
visualization of the learned concepts by AQI5c and AQDJ2
The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3,
Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer,
Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned
with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type
description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily
described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from
noisy data. The Engineering Design dataset involves learning conditions for applying different
types of wind bracings in tall buildings. Mushrooms is concerned with learning classification
rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer
involves learning concept descriptions for recognizing breast cancer. Congressional Voting
Records describes the voting records of Republican and Democratic US senators in 1984. The
East-West Trains problem characterizes eastbound and westbound trains using a structural representation.
In order to determine the learning curve, the system was run with different relative sizes of the
training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each
problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training,
that is, for learning a concept description. The remaining examples in each case were used for
testing the obtained descriptions, to determine their predictive accuracy.
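The sampling protocol just described can be sketched as follows; a minimal illustration (the function name and structure are assumptions for this sketch, not part of AQ15c or AQDT-2):

```python
import random

def learning_curve_splits(examples, sizes=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
                          samples_per_size=100, seed=0):
    """Yield (training, testing) splits: for each relative size, draw
    `samples_per_size` random training samples; the examples remaining
    from each draw serve as the complementary testing set."""
    rng = random.Random(seed)
    for size in sizes:
        k = round(size * len(examples))
        for _ in range(samples_per_size):
            shuffled = examples[:]            # copy; leave the original intact
            rng.shuffle(shuffled)
            yield shuffled[:k], shuffled[k:]  # train, complementary test
```

With 9 sizes and 100 samples per size, this yields the 900 training samples and 900 complementary testing sets used per problem.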
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided
into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind
Bracing problem) was used to test and analyze the approach. The second set of problems
(Mushrooms, Breast Cancer, Congressional Voting, and East-West Trains) was used for
additional testing and comparison with other systems.
Figure 4-1 shows a description of the planned experiments done on the first set of problems. The
best settings (the best path from top to bottom), in terms of accuracy, time, and complexity, were used as
default settings for experiments on the second set of problems. Each path from the top of the graph
to the bottom represents a single experiment. For each path, the experiment was repeated over 900
times with different sets of different sizes of training examples.
Figure 4-1: Design of a complete experiment
For each of these experiments, the testing examples were selected as the complementary set of the
training examples. Other experiments were performed in which the learning system AQ17 was used
instead of AQ15c. Analysis of some experiments included visualization of the training examples
and the concept learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different
decision structures learned for different decision-making situations were visualized, as were different
but equivalent decision structures learned for a given set of training examples.
For each problem (i.e., database):
- 9 different relative sample sizes of training examples are selected (10%, ..., 90%); 100 random samples of each size are drawn from the original data for training; the 100 sets which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing)
- 162 different parameter experiments per training dataset (18 x 9)
- 16,200 experiments per sample size (9 sample sizes)
- 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2)
- 199,800 experiments per problem (first portion + C4.5 + constructive induction)
- 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
- 73 days (estimated running time)
The following subsection includes a complete experimental analysis of the wind bracing problem.
Each subsection following that describes a partial or full experimental analysis of one of the other
problems.
4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for
determining the structural quality of a tall building design. The quality of the design is partitioned
into four classes: high (C1), medium (C2), low (C3), and infeasible (C4). Each example is
characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3),
number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of
horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly
selected to serve as training examples and 115 (34%) were used for testing the obtained decision
structure. In the first phase, the training examples were used to determine a set of decision rules. This
was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules
obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents values
of the four elementary criteria for each attribute occurring in the rules, for the step of determining
the root of the decision structure. For each class, the row marked "values" lists the values occurring in
the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the
ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b
v ...], where a, b, ... are all the legal values of A.
Decision class C1:
1. [x1=1][x6=1][x2=1 v 2][x3=1 v 2][x4=1 v 3][x5=1 v 2][x7=1 v 3] (t: 18, u: 18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1 v 3][x7=1 v 3 v 4] (t: 3, u: 3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2 v 3] (t: 2, u: 2)
4. [x1=1][x6=1][x2=2][x3=1 v 2][x4=3][x5=1 v 2][x7=4] (t: 2, u: 2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1 v 2] (t: 2, u: 2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1 v 3][x7=1 v 3][x5=3] (t: 2, u: 2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1 v 2][x4=3][x7=4] (t: 2, u: 2)

Decision class C2:
1. [x1=2..4][x2=1 v 2][x3=1 v 2][x4=3][x5=2 v 3][x6=1][x7=2 v 3] (t: 28, u: 19)
2. [x1=2..4][x2=2][x3=1 v 2][x4=3][x5=1 v 2][x6=1][x7=3 v 4] (t: 17, u: 6)
3. [x1=2..4][x2=1 v 2][x3=1 v 2][x4=3][x5=1][x6=1][x7=3 v 4] (t: 10, u: 4)
4. [x1=1 v 3 v 5][x2=1 v 2][x3=1 v 2][x4=3][x5=3][x6=1][x7=2 v 4] (t: 10, u: 2)
5. [x1=3 v 5][x2=1 v 2][x3=1 v 2][x4=3][x5=2 v 3][x6=1][x7=1 v 4] (t: 9, u: 4)
6. [x1=2][x2=1 v 2][x3=1 v 2][x5=1 v 2 v 3][x4=1][x6=1][x7=1] (t: 7, u: 6)
7. [x1=3 v 4][x2=2][x3=2][x4=1 v 3][x5=1 v 3][x6=1][x7=1 v 2] (t: 6, u: 4)
8. [x1=3 v 5][x2=2][x3=1][x7=1][x4=1 v 2][x5=1 v 2 v 3][x6=1 v 3] (t: 5, u: 5)
9. [x1=1][x2=1][x6=1][x3=1 v 2][x4=3][x5=1 v 2][x7=4] (t: 4, u: 4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1 v 2][x7=1 v 3] (t: 4, u: 4)
11. [x1=1 v 2][x2=1][x6=1][x3=1 v 2][x4=1 v 3][x5=3][x7=1 v 4] (t: 4, u: 2)

Decision class C3:
1. [x1=2..5][x2=1 v 2][x3=1 v 2][x7=1..4][x4=1 v 2][x5=1 v 3][x6=2 v 4] (t: 41, u: 32)
2. [x1=1..4][x2=1 v 2][x3=1 v 2][x4=2][x5=2][x6=2 v 3][x7=2 v 4] (t: 27, u: 20)
3. [x1=1 v 3][x2=1][x3=1 v 2][x7=1..4][x4=2][x5=1 v 2][x6=2 v 3] (t: 19, u: 6)
4. [x1=1 v 2 v 4][x2=1 v 2][x3=1 v 2][x4=2][x5=2 v 3][x6=3 v 4][x7=1] (t: 13, u: 8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1 v 2][x6=3][x7=2 v 4] (t: 5, u: 5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1 v 3][x5=1][x6=1][x7=1 v 4] (t: 4, u: 4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t: 1, u: 1)
Figure 4-2: Decision rules determined by AQ15c from the wind bracing data
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single
highest, and all other attributes are beyond the tolerance threshold, no other attributes are
considered). Branches stemming from the root are marked by values of x6 (in general, these could be
groups of values) according to the way they occur in the decision rules; groups subsumed by
other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the
rules containing these values. The process repeats for a branch until all rules assigned to each
branch are of the same class. That class is then assigned to the leaf.
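The select-branch-and-assign loop just described can be sketched as follows. This is a minimal illustration under simplifying assumptions: rules are reduced to (class, condition) pairs, and `score` is a placeholder for AQDT-2's LEF-based attribute ranking (the disjointness and other elementary criteria are defined elsewhere in the dissertation):

```python
def build_structure(rules, attrs, score):
    """Recursively build a decision structure from decision rules.
    rules : list of (class_label, {attribute: set_of_allowed_values}) pairs
    attrs : attributes still available for testing
    score : attribute-ranking function standing in for the LEF criteria"""
    classes = {c for c, _ in rules}
    if len(classes) == 1:              # all rules agree: end the branch with a leaf
        return classes.pop()
    if not attrs:                      # no test left: return candidate decisions
        return sorted(classes)
    best = max(attrs, key=lambda a: score(a, rules))
    node = {}
    # branch on the values that actually occur in the rules for `best`
    values = set().union(*(cond.get(best, set()) for _, cond in rules))
    for v in values:
        # a rule that does not mention `best` is assumed to allow every value
        subset = [(c, cond) for c, cond in rules if v in cond.get(best, {v})]
        node[v] = build_structure(subset, [a for a in attrs if a != best], score)
    return (best, node)
```

For instance, on the toy rules P <= [x1=2][x2=2], N <= [x1=1], N <= [x1=3] with a trivial scoring function, the sketch yields a single test on x1.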
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2
(using the default LEF). The structure was evaluated on the testing examples. The prediction
accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).
Since all rules containing [x6=4] belong to class C3, the branch marked by 4 is ended by a leaf, C3.
Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are
recalculated only for those rules which contain [x6=1] as one of their conjunctions. In this
example, x1 has the highest importance score, so it was selected to be a node in the structure. This
process is repeated for each subset of rules until the decision structure is completed.
For comparison, the program C4.5 for learning decision trees from examples was also applied to
this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window
setting (the maximum of 20% of the number of examples and twice the square root of the number of
examples) and with the number of trials set to one. C4.5 was chosen for the comparative studies
because it is one of the most accurate and efficient systems for learning decision trees from examples
and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a
randomly selected subset of the training examples). It starts with a randomly selected window of
examples, generates a trial tree, tests this tree against the remaining examples, adds some
unclassified examples to the original ones, and continues until either all training examples are
classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was
learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97
examples were classified correctly and 18 were mismatched.
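The windowing loop can be sketched schematically. This is an illustration of the mechanism only, not C4.5's actual implementation: `build_tree` and `misclassifies` are placeholder callables, and C4.5's extra stopping rule (quit when no better tree can be produced) is omitted:

```python
import random

def window_training(examples, build_tree, misclassifies, init_size, seed=0):
    """C4.5-style windowing sketch: build a trial tree from a random window
    of training examples, test it on the remaining examples, add the
    misclassified ones to the window, and repeat until the tree fits."""
    rng = random.Random(seed)
    window = rng.sample(examples, init_size)
    while True:
        tree = build_tree(window)
        misses = [e for e in examples
                  if e not in window and misclassifies(tree, e)]
        if not misses:       # all training examples classified correctly
            return tree
        window += misses     # grow the window with the misclassified examples
```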
Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves)
Figure 4-4 shows a decision structure learned, with the default settings of the AQDT-2 parameters, from the
AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing
examples resulted in 102 examples matched correctly and 13 examples mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition
that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision
cannot be made without knowing x1. This incomplete decision structure was tested on the 115 testing
examples, from which the value of x1 was removed. The decision structure classified 71 examples
correctly, 14 incorrectly, and 30 were assigned the "?" (indefinite) decision. The "?" leaves can be
replaced by sets of candidate decisions with their corresponding probability distributions.
Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves)
Figure 4-5: A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves)
Figure 4-6 presents the decision structure from Figure 4-5 in which leaves were assigned candidate
decisions with decision class probability estimates. Let us consider node x2. The example
frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using
equation (11), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be
approximated as P(C1) = .66, P(C2) = .23, P(C3) = 0, and P(C4) = .11.
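The arithmetic behind these estimates can be reproduced directly. Equation (11) is not restated in this chapter, but the quoted figures are consistent with each class's example weight divided by the total weight at the node, as this sketch assumes:

```python
def class_probabilities(weights):
    """Approximate class-probability estimates at a node: each class's
    example weight divided by the total weight over all classes."""
    total = sum(weights.values())
    return {c: round(w / total, 2) for c, w in weights.items()}

# Example frequencies quoted for node x2 of the wind bracing structure:
probs = class_probabilities({"C1": 31, "C2": 11, "C3": 0, "C4": 5})
# -> {"C1": 0.66, "C2": 0.23, "C3": 0.0, "C4": 0.11}
```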
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were
truncated under the assumption of a 10% noise level (this means that rules whose combined t-weight
represented 10% or less of the coverage of the training examples in a given class were removed).
The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89%
for the decision structure in Figure 4-4).
Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves)
Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves)
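The truncation can be sketched as a per-class filter over (class, condition, t-weight) triples. This is a hedged approximation of AQDT-2's pruning, assuming the lightest rules are dropped first until the 10% budget for a class is used up (tie order among equal weights is arbitrary here):

```python
from collections import defaultdict

def truncate_rules(rules, threshold=0.10):
    """Remove, per decision class, the lightest rules whose combined t-weight
    is at most `threshold` (10% by default) of the class's total t-weight."""
    by_class = defaultdict(list)
    for cls, cond, t in rules:
        by_class[cls].append((cond, t))
    kept = []
    for cls, rs in by_class.items():
        budget = threshold * sum(t for _, t in rs)
        removed = 0
        for cond, t in sorted(rs, key=lambda r: r[1]):  # lightest rules first
            if removed + t <= budget:
                removed += t          # truncated: treated as noise
            else:
                kept.append((cls, cond, t))
    return kept
```

Applied to the class C1 rules of Figure 4-2 (t-weights 18, 3, 2, 2, 2, 2, 2; total 31), the 10% budget is 3.1, so a single rule of weight 2 is truncated.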
To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making
situations, four attributes were selected for visualizing the change in the learned concept
after changing the cost of different attributes. Starting with the decision structure in Figure 4-4: in
the first situation, x5 was given a high cost; AQDT-2 generated a decision structure with four nodes
and six leaves, whose predictive accuracy was 86.1%. In the second decision-making
situation, x1 was given a high cost; AQDT-2 learned a decision structure with five
nodes and seven leaves, whose predictive accuracy was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal
situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified
by using only the four attributes which were used in building the initial decision trees. The visualization
diagram uses different shades for different decision classes. Another shade is used to illustrate
cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7).
Also, white cells indicate that an accurate decision cannot be derived from the rules without
knowing the value of the removed attribute. In such cases, multiple decisions can be provided with
their appropriate probabilities.
Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data ("?" means the system cannot produce a decision without the missing attribute)
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c
for a set of learning problems with 18 different parameter settings for AQ15c (two types of
decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or
ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that
gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>; see Table
4-2) were selected for experiments with Subsystem II.
These experiments were performed for four learning problems (the three MONK's problems [Thrun,
Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two
parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2.
Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the
predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value
in this table is an average predictive accuracy over 100 runs of either program
on 100 distinct, randomly selected training sets of the given size. Each of these runs was
tested with testing examples that represent the complement of the training example set.
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c
and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting
covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means
discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I are fixed
and selected parameters of Subsystem II are modified. The experiments were performed on
characteristic decision rules that were learned in intersecting or disjoint modes. For each data
set, the results reported from each experiment are calculated as the average of 100 runs on different
training data for 9 different sample sizes. The parameters changed in this experiment were the
threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2
algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the numbers of examples
covered by rules belonging to different decision classes at a given node of the decision
structure/tree.
Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy (%))
Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2
with different parameter settings. The "default" curve means predictive accuracy obtained with the
default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization
degree is 10%. The results show that with the wind bracing data it is better to reduce the
generalization degree to 3%. However, changing the pre-pruning degree did not improve the
predictive accuracy.
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems
were set to their default parameters. All the results reported here are the average of 100 runs. For
each data set, we report the predictive accuracy, the complexity of the learned decision trees, and
the time taken for learning. Figure 4-11 shows a simple summary of these experiments.
Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data (x-axis: relative sample size (%) of the training data; y-axis: predictive accuracy (%))
Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data (x-axis: relative size of training examples (%))
4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1
This subsection describes an experimental analysis of the AQDT approach on the MONK-1
problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification
rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists
of two decision classes, Positive and Negative, and six attributes: x1: head-shape (values are
octagonal, square, or round), x2: body-shape (values are octagonal, square, or round), x3: is-smiling
(values are yes or no), x4: holding (values are sword, flag, or balloon), x5: jacket-color
(values are red, yellow, green, or blue), and x6: has-tie (values are yes or no).
The original problem was to learn a concept from 124 training examples (62 positive and 62
negative). These training examples constitute 29% of all possible examples (432); thus the
density of the training examples is relatively high. Figure 4-12 shows a visualization diagram,
obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and
negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c
for the MONK-1 problem. Table 4-3 shows a comparison of the evaluations of the AQDT-2
criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned
when using different criteria.
Figure 4-12: A visualization diagram of the MONK-1 problem
The AQDT-2 program, running in its default mode with the optimality criterion set to minimize
the number of nodes (i.e., with the disjointness criterion ranked first), produced a decision tree with
41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also
applied to this same problem.
Positive rules:
1. [x5 = 1]
2. [x1 = 3][x2 = 3]
3. [x1 = 2][x2 = 2]
4. [x1 = 1][x2 = 1]

Negative rules:
1. [x1 = 1][x2 = 2 v 3][x5 = 2..4]
2. [x1 = 2][x2 = 1 v 3][x5 = 2..4]
3. [x1 = 3][x2 = 1 v 2][x5 = 2..4]

Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem
The C4.5 program did not produce a consistent and complete decision tree when run with its
default window size (the maximum of 20% and twice the square root of the number of examples), nor with a
100% window size. After 10 trials with different window sizes, we succeeded in making C4.5
produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is
presented in Figure 4-14. Also in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was
used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that
takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These
rules were:
Pos <= [x5=1] v [x1=x2] and Neg <= [x5≠1] & [x1≠x2]
Table 4-3: Evaluation of the attribute selection criteria for the MONK-1 problem
From these rules the system produced the compact decision structure presented in Figure 4-15-b.
It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically
equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they
represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler
decision structure was produced (Figure 4-15-a).
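The compact structure derived from the AQ17-DCI rules amounts to a two-test classifier. The sketch below encodes it directly, assuming the numeric value codings implied by the attribute description above (x5 = 1 for red); it is an illustration of the target concept, not AQDT-2's actual output format:

```python
def classify_monk1(head_shape, body_shape, jacket_color):
    """MONK-1 target concept as captured by the AQ17-DCI rules:
    Positive iff jacket-color is red (value 1) or head-shape equals
    body-shape; Negative otherwise."""
    if jacket_color == 1 or head_shape == body_shape:
        return "Positive"
    return "Negative"
```

Because the constructed attribute [x1=x2] collapses the nine head/body combinations into one binary test, the structure needs only two internal nodes, matching the complexity reported for Figure 4-15-b.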
Figure 4-14: The decision tree for the MONK-1 problem generated by AQDT-2 (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative)
Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: a) a compact decision structure for the AQ15 rules (complexity: 5 nodes, 7 leaves); b) a compact decision structure for the AQ17 rules (complexity: 2 nodes, 3 leaves). P = Positive, N = Negative.
Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments
involved running AQ15c for a set of learning problems with 18 different parameter settings for
AQ15c (two types of decision rules: characteristic or discriminant; three coverage modes:
intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and
10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and
<Char, Intr, 1>) were selected for experiments with Subsystem II. These experiments were
performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by
AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from
the decision rules.
Each value in that table is an average predictive accuracy over 100 runs of either program
on 100 distinct, randomly selected training sets of the given size. Each of these
runs was tested with a testing example set that represented the complement of the training example
set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between
AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers,
<Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant
rules.
Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data)
Experiments with Subsystem II: The same experiments were performed on the MONK-1
problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were
modified. The experiments were performed on characteristic decision rules that were learned in
intersecting or disjoint modes. For each data set, the results reported from each experiment
were calculated as the average of 100 runs on different training data for 9 different sample sizes.
The parameters changed in this experiment were the threshold of pre-pruning of the decision
rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure
4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with
different parameter settings. The "default" curve means predictive accuracy obtained with the default
setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree
is 10%. The results show that with the MONK-1 data it is slightly better to reduce the
generalization degree to 3%. However, increasing the pre-pruning degree did not improve the
predictive accuracy.
Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data (panels: <Disj, Char> and <Intr, Char>; x-axis: relative sample size (%) of the training data)
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to
their default parameters. The experiments were divided into two parts. All the results reported here
are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity
of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary
of these experiments.
Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem (x-axis: relative size of training examples (%))
4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2
The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be
easily described as a DNF expression using the original attributes). The problem is described in a
similar way to the MONK-1 problem. The data consists of two decision classes, Positive and
Negative, and six attributes: x1: head-shape (values are octagonal, square, or round), x2: body-shape
(values are octagonal, square, or round), x3: is-smiling (values are yes or no), x4: holding
(values are sword, flag, or balloon), x5: jacket-color (values are red, yellow, green, or blue), and
x6: has-tie (values are yes or no). The original problem was to learn a concept from 169 training
examples (62 positive and 62 negative). These training examples constitute 40% of all the possible
examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and
negative) and the concept to be learned.
Figure 4-19: A visualization diagram of the MONK-2 problem
Experiments with Subsystem I: The two settings that gave the best results in terms of predictive
accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>). They were
selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules
learned by AQ15c from examples and the predictive accuracy of decision structures learned by
AQDT-2 from these decision rules. Each value in that table is an average predictive accuracy
over 100 runs of either program on 100 distinct, randomly selected training sets of the given
size. Each of these runs was tested with a testing set that represents the complement of the training
examples.
Figure 4-20 shows diagrams illustrating the difference in predictive accuracy between AQ15c
and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj>
means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and
the number is the width of the beam search.
Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: relative sample size (%) of the training data)
Experiments with Subsystem II: Again, the parameters of Subsystem I are fixed and
selected parameters of Subsystem II are modified. For each data set, the results reported from each
experiment were calculated as the average of 100 runs on different training data for 9 different
sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of
the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam,
1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by
AQDT-2 with different parameter settings. The "default" curve means predictive accuracy obtained in
the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default
generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to
reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not
improve the predictive accuracy.
Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data (panels: <Disj, Char> and <Intr, Char>; x-axis: relative sample size (%) of the training data)
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to
their default parameters. The experiments were divided into two parts. All the results reported here
are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity
of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary
of these experiments.
Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem (x-axis: relative size of training examples (%))
4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3
MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a
similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same
80
domains and the same decision classes as the ftrst two MONKs problems Figure 4-23 shows a
visualization diagram of the training examples (positive and negative) and the concept to be
learned The minus signs in the shaded area and plus sign in the unshaded area are considered
noisy examples Noisy examples are examples that assigned the wrong decision class
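This definition of a noisy example can be stated operationally: given the target concept as a predicate, an example is noisy when its recorded class disagrees with the class the concept assigns. A minimal sketch in Python (the concept function below is a stand-in for illustration, not the actual MONK-3 definition):

```python
def is_noisy(example, label, target_concept):
    """An example is noisy when its recorded class label disagrees
    with the class assigned by the target concept."""
    return target_concept(example) != label

# Stand-in concept (illustrative only): positive when attribute x1 equals 1.
concept = lambda ex: ex["x1"] == 1

training_data = [
    ({"x1": 1}, True),    # consistent with the concept
    ({"x1": 0}, True),    # noisy: labeled positive, concept says negative
    ({"x1": 0}, False),   # consistent
]
noisy = [ex for ex, label in training_data if is_noisy(ex, label, concept)]
```

Examples flagged this way correspond to the minus signs inside the shaded area and the plus signs outside it in the visualization diagram.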
[Figure 4-23: A visualization diagram of the MONK-3 problem.]
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by
AQ15c from examples, and the predictive accuracy of decision structures learned by AQDT-2 from
these decision rules. Each value in that table is the average predictive accuracy over 100 runs of
the two programs on 100 distinct, randomly selected training data sets of the given size. Each of
these runs was tested with a testing set that was the complement of the training example set.
Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and
AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments the parameters of Subsystem I (the
learning process) were fixed, and selected parameters of Subsystem II (the decision-making
process) were changed. The results reported from each experiment were calculated as the average of
100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in
the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings.
The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results
show that with the MONK-3 data it is usually better to reduce the generalization degree. Also,
increasing the pre-pruning threshold does not improve the predictive accuracy.
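As a rough illustration of how a percentage-based pre-pruning threshold of this kind typically acts during tree construction (a generic sketch; AQDT-2's exact stopping criteria may differ), a node can be closed into a leaf once the classes other than the majority one account for less than the threshold fraction of the mass at that node:

```python
def close_as_leaf(class_counts, threshold_pct):
    """Generic pre-pruning test: stop expanding a node when all classes
    other than the majority one account for less than threshold_pct
    percent of the examples (or rule weights) reaching the node."""
    total = sum(class_counts.values())
    majority = max(class_counts.values())
    minority_pct = 100.0 * (total - majority) / total
    return minority_pct < threshold_pct

node = {"positive": 17, "negative": 3}      # 15% minority mass at this node
stays_open = not close_as_leaf(node, 3)     # default 3% threshold: keep splitting
closes = close_as_leaf(node, 20)            # a looser 20% threshold: make a leaf
```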
[Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem. Four panels (<Disj, Char>, <Intr, Char>, <Disj, Disc>, <Intr, Disc>) plot predictive accuracy against the relative sample size (%) of the training data.]
[Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data. Two panels plot predictive accuracy against the relative sample size (%) of the training data.]
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by
AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a
summary of the predictive accuracy, the complexity of the learned decision trees, and the learning
time. The reason that there is a drop in the predictive accuracy at some sample sizes is that
the testing data is not fixed for each sample. In other words, one error may represent a
1.01% error rate when testing against 90% of the data, and the same error may represent 10% when
testing against 10% of the data. These curves do not represent the learning curve.
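The arithmetic behind this effect is simple: a single misclassified example contributes 1/n of error, where n is the size of the testing set, so the same mistake weighs very differently as the training/testing split changes. A sketch, assuming for illustration a pool of 110 examples (a hypothetical figure, not from the text):

```python
def error_rate_of_one_mistake(total, training_fraction):
    """Percentage error contributed by one misclassified example when
    the complement of the training set is used for testing."""
    n_test = round(total * (1 - training_fraction))
    return 100.0 / n_test

# Assumed pool of 110 examples (illustrative):
small_train = error_rate_of_one_mistake(110, 0.10)  # tested on 99 examples, ~1.01%
large_train = error_rate_of_one_mistake(110, 0.90)  # tested on 11 examples, ~9.1%
```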
[Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem. Three panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]
4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing
Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are
based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990).
The data has 699 examples, represented using ten attributes and grouped into two decision classes
(Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3)
Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial
Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes
except the sample code number had a domain of ten values (they were scaled).

In this experiment the parameters of AQ15 and AQDT-2 were set to their defaults, and the
experiment was performed to compare decision trees learned by AQDT-2 and C4.5. All the
results reported here were based on the average of 100 runs. For each data set we report the
predictive accuracy, the complexity of the learned decision trees, and the time taken for learning.
Figure 4-27 summarizes these experiments.
The reason that there is a drop in the predictive accuracy at some sample sizes is that
the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error
rate when testing against 90% of the data, and the same error may represent 10% when testing
against 10% of the data. These curves do not represent the learning curve.
[Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem. Three panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]
4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom
Classification

Learning from the Mushroom database involves classifying mushrooms into edible and
poisonous classes. The data was drawn from the Audubon Society Field Guide to North American
Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected
to perform the experiment. Each example was described by 22 attributes: 1)
Cap-shape, 2) Cap-surface, 3) Cap-color, 4) Bruises, 5) Odor, 6) Gill-attachment, 7) Gill-spacing,
8) Gill-size, 9) Gill-color, 10) Stalk-shape, 11) Stalk-root, 12) Stalk-surface-above-ring, 13)
Stalk-surface-below-ring, 14) Stalk-color-above-ring, 15) Stalk-color-below-ring, 16) Veil-type, 17)
Veil-color, 18) Ring-number, 19) Ring-type, 20) Spore-print-color, 21) Population, and 22) Habitat.

To perform the experiment, the parameters of AQ15 and AQDT-2 were set to their defaults,
and the experiment was performed to compare decision trees learned by AQDT-2 and C4.5.
All the results reported here are the average of 100 runs. For each data set we report the
predictive accuracy, the complexity of the learned decision trees, and the time taken for learning.
Figure 4-28 shows a simple summary of these experiments.

In this problem C4.5 produces better accuracy with more complex decision trees (almost twice the
size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees.
The average difference in accuracy is less than 2%, the average difference in tree complexity is
greater than 10 nodes, and the average learning time is about the same.
The reason that there is a drop in the predictive accuracy at some sample sizes is that
the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error
rate when testing against 90% of the data, and the same error may represent 10% when testing
against 10% of the data. These curves do not represent the learning curve.
[Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem. Three panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]
4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West
Trains

Learning task-oriented decision structures from structural data: This subsection
briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision
structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was
to classify a set of trains into two classes, eastbound and westbound. The data was structured such
that each train consisted of two to four cars. Each car was described in terms of two main
features: the body of the car and the content of the car. The body of the car was described by six
different attributes and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts
rules or examples in the form of an array of attribute-value assignments. It can also accept
examples with different numbers of attribute-value pairs (i.e., examples of different length). To
describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was
generated such that they could completely describe any car in the train (see Table 4-7). Each train
was described by one example of varying length. To recognize the number (position) of a given car
in the train, each of the eight attributes was associated with a two-digit code (ij): the first digit
identifies the location of the car and the second identifies the number of the attribute itself. For
example, the number 3 in the attribute name x32 refers to the third car and the number 2 refers to
the second attribute (the car shape). In other words, attribute x32 is the label of the attribute
describing the shape of the third car.
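This two-digit naming scheme is straightforward to mechanize. A sketch that generates the attribute labels for a train of a given length (the mapping of j to particular attributes follows Table 4-7 and is not reproduced here):

```python
def car_attribute_labels(n_cars, n_attributes=8):
    """Build labels x{i}{j}: i is the car position (1-based) and j the
    attribute number, so x32 labels attribute 2 (the car shape) of car 3."""
    return [f"x{i}{j}" for i in range(1, n_cars + 1)
                       for j in range(1, n_attributes + 1)]

labels = car_attribute_labels(3)   # a three-car train: x11 ... x38
```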
Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1-4).
Decision-making situations: In the first decision-making situation, a decision structure that
classifies any given train as either eastbound or westbound was learned using only attributes
describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out
of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was
hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation
where only attributes describing the second car are used in classifying the trains. It correctly
classified 18 of the trains.

Both decision structures have leaves with multiple decisions, which means there are identical first or
second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using
attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or
more (14/14). In Figure 4-29-d a similar decision-making situation was given, but x37 and x34 were
given lower cost than x31. Both decision structures classified the 14 trains with three or more cars
correctly. These last two decision structures classified any train with three or more cars correctly,
and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).
[Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: a) using only descriptions of Car 1, b) using only descriptions of Car 2, c) using only descriptions of Car 3.]
4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional
Voting Records (1984)

In the Congressional Voting problem each example was described in terms of 16 attributes. There
were two decision classes and a total of 216 examples. The experiments tested the change in the
number of nodes and the predictive accuracy when varying the number of training examples used for
generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two
window options: the default option (the maximum of 20% of the number of examples and twice the
square root of the number of examples) and a 100% window size (one trial per setting). In
the Congressional Voting-1984 problem the sizes of the sets of training examples were 8%, 16%,
24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of
the examples were in one class and the other half in the other class).
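The default initial window size quoted above can be written out directly (a sketch of the stated formula only, not of C4.5's windowing mechanism as a whole):

```python
import math

def default_window_size(n_examples):
    """C4.5's stated default initial window: the maximum of 20% of the
    number of examples and twice its square root."""
    return max(0.20 * n_examples, 2 * math.sqrt(n_examples))

# For the 216-example voting data: 20% of 216 = 43.2 while 2*sqrt(216) ~ 29.4,
# so the 20% term dominates.
window = default_window_size(216)
```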
Table 4-8 and Figures 4-30 a and b show the results graphically for the Congressional Voting-1984
problem. The results indicate that AQDT-2 generated decision trees with higher predictive
accuracy that were simpler than the decision trees produced by C4.5. Also, the variation of the size
of AQDT-2's trees with the change of the size of the training example set was smaller.
Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.
[Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2: a) accuracy of the decision tree as a function of the size of the set of training examples, b) size of the decision tree as a function of the size of the set of training examples.]
4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4-2 to 4-9. The analysis
covers the relationship between different characteristics of the input data and the learning
parameters for both subfunctions of the approach. A set of visualization diagrams is used to
illustrate the relationship between concepts represented by decision rules and concepts represented
by the decision trees learned from these rules. This section also includes some examples of
describing different decision-making situations and the task-oriented decision structures learned for
each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases.
The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2
from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5,
and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the
difference in predictive accuracy between two widths of the beam search is less than 2%, then the
smaller width is better. Another one was: if the predictive accuracy of different types of covers
varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain
rule type, and lower with others), the best cover is determined according to the best width of the
beam search and the best rule type.
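The first heuristic can be stated as a small decision procedure (a sketch of the rule as described, with hypothetical accuracy figures):

```python
def prefer_beam_width(accuracy_by_width):
    """Prefer the smallest beam-search width whose predictive accuracy
    is within 2% of the best accuracy observed."""
    best = max(accuracy_by_width.values())
    near_best = [w for w, acc in accuracy_by_width.items() if best - acc < 2.0]
    return min(near_best)

# Hypothetical accuracies: width 10 is best, but width 1 is within 2% of it.
chosen = prefer_beam_width({1: 93.1, 5: 93.8, 10: 94.6})
```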
Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.
It was clear that AQDT-2 works better with characteristic rules rather than discriminant rules. In most
problems, when changing the width of the beam search of the AQ15c system, the changes in the
predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better
than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting
rules were slightly bigger than those learned from disjoint rules.
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of
heuristics was used to summarize the results. These heuristics are: 1) if the difference between
the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is
considered to be the same, otherwise the predictive accuracy is considered higher or lower; 2) if the
average learning times are within ±0.1 seconds of one another, the learning time is considered the same.
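Both tie-breaking rules reduce to one comparison function (a sketch with hypothetical numbers; higher is better for accuracy, lower for time):

```python
def compare(aqdt_value, c45_value, tolerance, higher_is_better=True):
    """Report which system performs better, or 'Same' when the difference
    is within the stated tolerance (±2% for accuracy, ±0.1 s for time)."""
    if abs(aqdt_value - c45_value) <= tolerance:
        return "Same"
    aqdt_wins = (aqdt_value > c45_value) == higher_is_better
    return "AQDT-2" if aqdt_wins else "C4.5"

accuracy_verdict = compare(96.3, 93.8, tolerance=2.0)
time_verdict = compare(0.45, 0.40, tolerance=0.1, higher_is_better=False)
```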
Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary
includes comparing the predictive accuracy, the size of the learned decision trees, and the learning
time. The value in each cell refers to the system which performed better (possible values are AQDT-2,
C4.5, and Same). When the two systems produced similar or close results, a letter is associated
with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).
Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better. Same-X means similar performance of both systems, with AQDT-2 having a slight advantage if X=A and C4.5 if X=C.
Some conclusions can be drawn from this comparison. When the training data represents a small
portion of the representation space, AQDT-2 produces bigger but more accurate decision trees,
while C4.5 produces smaller but less accurate decision trees. When the training data
represents a very large portion of the representation space, AQDT-2 usually produces smaller
decision trees with better accuracy, except with noisy data. The size of decision trees learned by
C4.5 grows relatively higher as the training data increases. Also, C4.5 works better than AQDT-2
with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules and
C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be
much less than that of C4.5. However, on some data sets it takes more time, because there are
situations where there is not enough information to reach a decision and the program goes into a loop of
testing all attributes. The probabilistic approach for handling this problem is not implemented yet.
To explain the relationship between the input to and the output from AQDT-2, and to explain some
of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of
diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2
system. The experiment contains 169 training examples for both the positive and negative decision
classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The
shaded areas represent decision rules of the positive decision class. The white areas represent
non-positive coverage.
[Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem.]
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. All marked shaded
cells indicate false positive errors (AQ15c classifies the cell as positive while it should be
negative), and all marked non-shaded cells indicate false negative errors (AQ15c classifies the
cell as negative while it should be positive).
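Counting the two error types from paired predictions and true classes is mechanical (a generic sketch, not AQ15c's own evaluation code):

```python
def count_errors(predicted, actual):
    """False positive: predicted positive but actually negative.
    False negative: predicted negative but actually positive."""
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    false_neg = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return false_pos, false_neg

# One error of each kind in this toy evaluation:
fp, fn = count_errors([True, True, False, False],
                      [True, False, True, False])
```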
[Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31.]
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this
diagram, one shading indicates portions of the representation space that were classified as
positive by both AQ15c and AQDT-2; another indicates portions of the representation
space that were classified as positive by AQ15c but as negative by AQDT-2; and a third
represents portions of the representation space where AQDT-2 over-generalized decision rules
belonging to the positive decision class. The decision tree shown in this diagram was learned with
the default settings (i.e., with a 10% generalization threshold).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the
MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy.
This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.
[Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem.]
Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative
errors: cells with one marking indicate portions of the representation space with false positive errors,
and cells with another marking represent portions of the representation space with false negative
errors. Comparing Figures 4-34 and 4-32 shows that more errors occurred because of the
over-generalization.
[Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree.]
[Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%.]
CHAPTER 5: CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for
efficiently determining a single-parent structure from a set of decision rules. A decision structure is
an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a
given object or situation. Having higher expressive power than the familiar decision tree, a
decision structure is able to represent some decision processes in a much simpler way than a
decision tree.
The proposed methodology advocates storing the decision knowledge in the declarative form of
decision rules, which are determined by induction from examples or by an expert. A decision
structure is generated on line, when it is needed, and in the form most suitable for the given
decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this
methodology: that in order to determine a decision structure from examples it is necessary to go
through two levels of processing, while there exist methods that produce decision trees efficiently
and directly from examples. Putting aside the issue that decision structures are more general than
decision trees, it is argued here that this methodology has many advantages that fully justify it. The
main advantages include: 1) decision structures produced by the method in the experiments
conducted had higher predictive accuracy and were simpler (sometimes significantly so) than
decision trees produced from the same data; 2) decision structures produced from rules can be
easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive
attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in
the declarative form of modular decision rules, the methodology makes it easy to modify decision
knowledge to account for new facts or changing conditions; 4) the process of deriving a decision
structure from a set of rules is very fast and efficient, because the number of rules per class is
usually much smaller than the number of examples per class; and 5) the presented method produces
decision structures whose nodes can be original attributes or constructed attributes that extend the
original knowledge representation (this is due to the application of the constructive induction programs
AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate
decision rules first and then create decision structures from them. In the AQDT-2 method this first
phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based
methods were computationally complex, the most recent implementation is very fast
(Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further.
First of all, there is a need for further testing of the method. Although the experiments conducted so
far have produced more accurate and simpler decision structures than the decision trees obtained in a
standard way from the same input data, more experiments are necessary to arrive at conclusive
results. A mathematical analysis of the method has not been performed and is highly desirable.
The current method generates only single-parent decision structures (every node has only one
parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in
which a node can have several parents) will make it more powerful. It will enable the method to
represent much more simply decision processes that are difficult to represent by a decision tree
(e.g., a symmetric logical function). The decision structures produced by the method are usually
more general than the decision rules from which they were created (they may assign decisions to
cases that the rules could not classify). Further research is needed to determine the relationship
between the certainty of decision rules and the certainty of decision structures derived from them.
The AQ-based program allows a user to generate both characteristic and discriminant decision rules
(Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating
decision structures from different types of rules.
5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for
efficiently determining a single-parent structure from a set of decision rules. A major advantage of
the proposed method is that it allows one to efficiently determine a decision structure that is
optimized for any given decision-making situation. For example, when some attribute is difficult
to measure, the method creates a decision structure that shows the situations in which measuring
this attribute can be avoided. The method is quite efficient, and the time for determining a decision
structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to
experiment with different criteria for structure generation in order to obtain the most desirable
structure.

Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be
simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e.,
directly from examples. In the experiments involving artificial problems and real-world problems,
AQDT-2-generated decision structures outperformed those generated by the well-known C4.5
decision tree learning program in most problems, both in terms of average predictive accuracy and
average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the
method is independent, it could potentially be applied also with other decision rule learning
systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski T Bloedorn E Michalski R Mustafa M and Wnek J (1992) Constructive Induction in Structural Design Report of the Machine Learning and Inference Labratory MLI-92-7 Center for AI GeOIge Mason Univerity
Bergadano F Giordana A Saitta L DeMarchi D and Brancadori F (1990) Integrated Learning in Real Domain Proceedings of the 7th International Conference on Machine Learning (pp 322-329) Austin TX
Bergadano F Matwin S Michalski R S and Zhang J (1992) Learning Twoshytiered Descriptions of Flexible Concepts The POSEIDON System Machine Learning bl 8 No I pp 5-43
Bloedorn E Wnek J Michalski RS and Kaufman K (1993) AQI7 A Multistrategy Learning System The Method and Users Guide Report of Machine Learning and Inference Labratory MLI-93-12 Center for AI Geoxge Mason University
Bohanec M and Bratko I (1994) Trading Accuracy for Simplicity in Decision 1iees Machine Learning Iournal Vol 15 No3 Kluwer Academic Publishers
Bratko Land Lavrac N (1987) (Eds) Progress in Machine Learning Sigma Wilmslow England Press
Bratko I and Kononenko I (1986) Learning Diagnostic Rules from Incomplete and Noisy Data in BPhelps (edt) Interactions in AI and statistical method Gower Technical Press
Breiman L Friedman JH Olshen RA and Stone CJ (1984) Classification and Regression nees Belmont California Wadsworth Int Group
Clark P and Niblett T (1987) Induction in Noisy Domains in I Bratko and N Lavrac (Eds) Progress in Machine Learning Sigma Press Wilmslow
Cestnik B and Bratko I (1991) On Estimating Probabilities in free Pruning Proceeding ofEWSL91 (pp 138-150) Porto Portugal March 6-8
Cestnik B and KaraJic A (1991) The Estimation of Probabilities in Attribute Selection Measures for Decision nee Induction Proceedings of the European Summer School on Machine Learning Iuly 22-31 Priory Corsendonk Belgium
Gaines B (1994) Exception DAGS as Knowledge Structures Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases pp 13-24 Seattle WA
Hart A (1984) Experience in the use of an inductive system in knowledge engineering Research and Developments in Expert Systems M Bramer (Ed) Cambridge Cambridge University Press
Hunt E Marin J and Stone P (1966) Experiments in induction New York Academic Press
Imam IF and Michalski RS (1993a) Should Decision frees be Learned from Examples or from Decision Rules Lecture Notes in Artificial Intelligence (689) Komorowski 1 and Ras Zw (Eds) pp 395-404 from the proceeding of the 7th International SymXgtsium on Methodologies for Intelligent Systems ISMIS-93 1iondheiJn Norway June 15-18 Springer erlag
100
Imam IE and Michalski RS (1993b) Learning Decision Trees from Decision Rules A method and initial results f1um a comparative study in Journal of Intelligent Information Systems JIIS hI 2 No3 pp 279-304 KerschbeIg L Ras Z amp Zemankova M (Eds) Kluwer Academic Pub MA
Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.
Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam, I.F. and Michalski, R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, Seattle, Washington, July.
Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), pp. 613-618, AAAI Press/MIT Press.
Kohavi, R. and Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), "International East-West Challenge," Oxford University, UK.
Michalski, R.S. (1973), "AQVAL/1--Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition, pp. 3-17, Washington, DC, October 30-November 1.
Michalski, R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.
Michalski, R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20, pp. 111-161.
Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86, pp. 1041-1045, Philadelphia, PA.
Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, pp. 63-111, San Mateo, CA: Morgan Kaufmann, June.
Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer-Verlag, from the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3, pp. 319-342, Kluwer Academic Publishers.
Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4, pp. 227-243, Kluwer Academic Publishers.
Niblett, T. and Bratko, I. (1986), "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton; Cambridge: Cambridge University Press.
Quinlan, J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan, J.R. (1983), "Learning Efficient Classification Procedures and Their Application to Chess End Games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.
Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1, pp. 81-106, Kluwer Academic Publishers.
Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27, pp. 221-234.
Quinlan, J.R. (1990), "Probabilistic Decision Trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, pp. 63-111, San Mateo, CA: Morgan Kaufmann, June.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.
Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI-90, Stockholm, August.
Sokal, R. and Rohlf, F. (1981), Biometry, Freeman, San Francisco.
Thrun, S.B., Mitchell, T. and Cheng, J. (Eds.) (1991), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.
Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
VITA

Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall 1991, he joined the graduate program at GMU.

Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference, or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition on Machine and Human Intelligence organized by Oxford University: two solutions obtained by that program ranked second and third in one competition, and two other solutions ranked sixth and seventh among 65 entries from around the world in another competition.

Ibrahim is the founder and co-chair of the first International Workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and on the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI).

Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
Dedication

To my mother, my brothers, and my sister.
TABLE OF CONTENTS

TITLE Page
ABSTRACT 1
CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6
CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19
CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
3.3 Generating Decision Structures from Decision Rules 28
3.3.1 The AQDT-2 Attribute Selection Method 29
3.3.2 The AQDT-2 Algorithm 37
3.3.3 An Example Illustrating the Algorithm 42
3.4 Tailoring Decision Structures to a Decision-Making Situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with Noise in Training Data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53
CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments with Average Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments with Small Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments with Small Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments with Small Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments with Large Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments with Large Size, Complex, and Noisy Problems: Mushroom Classifications 84
4.8 Experiments with Small Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments with Small Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88
CHAPTER 5
CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96
REFERENCES 98
VITA 102
LIST OF TABLES

No. TITLE Page
2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria on four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing a software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and conditions for using the AQDT-2 criteria 53
3-7 Comparison between decision structures and decision trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90
LIST OF FIGURES

No. TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing a software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example, where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES

Ibrahim M. Fahmi Imam, Ph.D.
School of Information Technology and Engineering
George Mason University, Fall 1995
Ryszard S. Michalski, Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, to generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was developed by Imam and Michalski (1993a, b).

This approach separates the function of generating a knowledge base from the function of using the knowledge base for decision-making. The first function focuses on learning an accurate, consistent, and complete concept description expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).

The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that the decision structures it learns usually outperform, in terms of accuracy and average size, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
1.1 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.
A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation); the branches are assigned possible test outcomes or ranges of outcomes; and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when the leaves are assigned single, definite decisions. Thus, the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
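The representation just described can be made concrete with a small sketch (hypothetical, not taken from AQDT-2): each internal node stores a test, branches map test outcomes to children, and a leaf may hold a single decision, a set of candidate decisions with probabilities, or None for an undetermined decision.

```python
class Node:
    """One node of a decision structure."""
    def __init__(self, test=None, branches=None, decision=None):
        self.test = test               # function: example -> test outcome
        self.branches = branches or {} # outcome -> child Node
        self.decision = decision       # used only at leaves

    def is_leaf(self):
        return self.test is None

def classify(node, example):
    """Follow branches until a leaf; return its decision, which may be a
    single label, a dict of candidate decisions with probabilities, or None."""
    while not node.is_leaf():
        node = node.branches[node.test(example)]
    return node.decision

# A tiny decision structure: test x1 first; when x1 == 1 the leaf holds two
# candidate decisions with probabilities instead of one definite decision.
tree = Node(test=lambda e: e["x1"],
            branches={0: Node(decision="A1"),
                      1: Node(decision={"A2": 0.7, "A3": 0.3})})

print(classify(tree, {"x1": 0}))   # -> A1
print(classify(tree, {"x1": 1}))   # -> {'A2': 0.7, 'A3': 0.3}
```

Under this representation, the familiar decision tree is simply the special case in which every test reads a single attribute and every leaf holds one definite decision.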
Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain, and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
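For illustration, two of the measures mentioned above, the information gain and the gini index, can be computed in a few lines; the toy `data` sample and attribute name below are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of diversity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gain(examples, attr):
    """Information gain of splitting `examples` (a list of (features, label)
    pairs) on `attr`: entropy before the split minus the weighted entropy
    of the resulting subsets."""
    labels = [y for _, y in examples]
    splits = {}
    for x, y in examples:
        splits.setdefault(x[attr], []).append(y)
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in splits.values())
    return entropy(labels) - remainder

data = [({"outlook": "sunny"}, "no"), ({"outlook": "sunny"}, "no"),
        ({"outlook": "rain"}, "yes"), ({"outlook": "rain"}, "yes")]
print(gain(data, "outlook"))   # a perfect split recovers the full 1 bit
```

The gain-ratio and entropy-reduction criteria are small variations on the same counting scheme.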
A decision tree or decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful either to modify the structure so that it does not contain that attribute or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).
Restructuring a decision structure (or a tree) to suit new requirements can be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation such as a set of decision rules: tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees) which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify and adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.
An attractive solution to these opposing requirements is to acquire and store knowledge in a declarative form, and to transform it into a decision structure when it is needed for decision-making.
This method allows one to create the decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generating one from training examples. Thus, this process could be done on line, without any delay noticeable to the user. Such "virtual" decision structures are easy to tailor to any given decision-making situation.
This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in a given decision-making situation, or one that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part that concerns the decision classes of interest. Thus, the approach has many potential advantages.
This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned either by the rule learning system AQ15 (Michalski et al., 1986) or by the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).
To adapt the decision rules to a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating "unknown" nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to the inability to evaluate an attribute associated with an intermediate node.
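As a rough illustration of the last feature (the tuple representation and the pooling scheme below are assumptions for the sketch, not AQDT-2's actual mechanism), a traversal can fall back to pooling the class distributions of all branches when a node's attribute cannot be measured, so the user still receives candidate decisions with estimated probabilities.

```python
from collections import Counter

def decide(node, example):
    """node = (attribute name, {value: child}, leaf label).
    Returns a dict of decisions and their probabilities."""
    test, branches, label = node
    if test is None:                       # leaf: one definite decision
        return {label: 1.0}
    if example.get(test) is not None:      # attribute measurable: descend
        return decide(branches[example[test]], example)
    # Attribute cannot be evaluated: merge the distributions of all subtrees.
    pooled = Counter()
    for child in branches.values():
        for lab, p in decide(child, example).items():
            pooled[lab] += p
    total = sum(pooled.values())
    return {lab: p / total for lab, p in pooled.items()}

leaf = lambda c: (None, None, c)
tree = ("x2", {0: leaf("A1"), 1: leaf("A2"),
               2: ("x1", {0: leaf("A1"), 1: leaf("A3")}, None)}, None)

print(decide(tree, {"x2": 0}))       # x2 measurable: a definite decision
print(decide(tree, {"x2": None}))    # x2 unmeasurable: candidate decisions
```

A weighting by branch frequencies, rather than the uniform pooling used here, would be a natural refinement.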
To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments was designed to examine different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structures learned from them, and comparing decision trees learned by the AQDT-2 system with those learned by C4.5 (Quinlan, 1993), the well-known system for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2, and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures: MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.
1.2 The Problem Statement
There are many limitations and problems associated with using decision trees for decision-making.
CHAPTER 2 RELATED RESEARCH
2.1 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision tables. That work proposed several attribute selection criteria of increasing power, based on the main criterion, the cost estimate (the nth order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms; these terms will be used later in the dissertation.
Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint; in other words, for any two rules there exists a condition with the same attribute but with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram the condition parts of the given rules, and marking them with the action specified by each rule.
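Definition 2-2 can be checked mechanically. In the sketch below, a rule is modeled as a mapping from attributes to their sets of allowed values (a simplification of the rule notation used later); an attribute absent from a rule is unconstrained.

```python
def disjoint(rule_a, rule_b):
    """Two rules are logically disjoint if some attribute appearing in both
    is constrained to non-overlapping value sets."""
    return any(not (rule_a[attr] & rule_b[attr])
               for attr in rule_a.keys() & rule_b.keys())

def is_disjoint_cover(rules):
    """True if all rules in the cover are pairwise logically disjoint."""
    return all(disjoint(rules[i], rules[j])
               for i in range(len(rules)) for j in range(i + 1, len(rules)))

r1 = {"x2": {0}}                 # [x2=0]
r2 = {"x2": {1}}                 # [x2=1]
r3 = {"x1": {0}, "x2": {2}}      # [x1=0][x2=2]
print(is_disjoint_cover([r1, r2, r3]))   # -> True: x2 separates every pair
```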
Michalski's algorithm requires the construction of a minimal cover, which should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree must have at least n leaves). Michalski (1978) has shown that if even one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or its decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.
The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if it divides the rule into two or more sub-rules. Figure 2-1 shows two example sets of rules. In the first, x1 and x3 each break at least two rules: x1 breaks the rule [x4=2] & [x1=1 v 3] & [x3=2 v 3] and the rule [x4=3] & [x1=1 v 2] & [x3=1], while x3 breaks three rules, namely [x4=2] & [x1=2] & [x3=2 v 3], [x4=1] & [x1=3] & [x3=1 v 3], and [x4=3] & [x1=4] & [x3=2 v 3]. In the diagram on the right, x1 is the only attribute that does not break any rule.

Figure 2-1: An example to illustrate how attributes break rules

One of the criteria defined by Michalski is the first degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated, over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute selected is the one that breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).

Another criterion introduced by Michalski was DMAL. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex: once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.
Example: Learn a decision tree from the decision table in Table 2-1.
The minimal cover consists of the following rules:
A1 <- [x2=0] v [x1=0][x2=2]    A2 <- [x2=1] v [x1=2][x2=2]    A3 <- [x1=1][x2=2]
The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5
for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two
leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of
the decision tree is x2. Three branches are then attached to the root node, and the decision rules
are divided into subsets, each corresponding to one branch. For x2=0 or 1, a leaf node is
generated. For x2=2, another attribute is selected to be a node in the tree; in this case x1 has the
minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.
Figure 2-2: A decision tree learned from the decision table in Table 2-1
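As a concrete illustration, the MAL static cost estimate can be sketched in a few lines of code. This is a hypothetical reconstruction, not the original implementation: rules are encoded as mappings from attributes to sets of allowed values, the attribute domains are assumed, and an attribute is counted as breaking a rule whenever the rule does not fix that attribute to a single value.

```python
def mal_cost(attribute, rules, domains):
    """Static cost estimate (MAL): count the rules broken by `attribute`.
    A rule is a dict mapping attribute -> set of allowed values; an
    attribute absent from a rule is unconstrained, so the rule would
    reappear under every branch of a node testing that attribute."""
    broken = 0
    for rule in rules:
        allowed = rule.get(attribute, domains[attribute])
        if len(allowed) >= 2:   # rule splits into two or more sub-rules
            broken += 1
    return broken

# Minimal cover from the example above:
# A1 <- [x2=0] v [x1=0][x2=2], A2 <- [x2=1] v [x1=2][x2=2], A3 <- [x1=1][x2=2]
rules = [{"x2": {0}}, {"x1": {0}, "x2": {2}},
         {"x2": {1}}, {"x1": {2}, "x2": {2}},
         {"x1": {1}, "x2": {2}}]
domains = {a: {0, 1, 2} for a in ("x1", "x2", "x3", "x4")}  # assumed domains
scores = {a: mal_cost(a, rules, domains) for a in domains}
print(scores)  # x2 scores 0, so it is chosen as the root
```

With this encoding the scores reproduce the values in the text: 2 for x1, 0 for x2, and 5 for x3 and x4.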
2.2 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating decision trees that classify a set
of examples according to the decision classes they belong to. The essential aspect of any
inductive decision tree method is the attribute selection criterion, which measures how good
the attributes are for discriminating among the given set of decision classes. The best attribute
according to the selection criterion is chosen to be assigned to a node in the tree. The first
algorithm for generating decision trees from examples was proposed by Hunt, Marin and Stone
(1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This
algorithm was subsequently modified by Quinlan (1979) and applied by many researchers to a
variety of learning problems.
Attribute selection criteria can be divided into three categories: logic-based, information-based,
and statistics-based. The logic-based criteria use logical relationships between the attributes
and the decision classes to determine the best attribute to be a node in the decision tree; an
example is the MAL criterion of minimizing added leaves (Michalski, 1978), which uses
conjunction and disjunction operators. The information-based criteria are based on information
theory. These criteria measure the information conveyed by dividing the training examples into
subsets. Examples of such criteria include the information measure IM, the entropy reduction
measure and the gain criteria (Quinlan, 1979, 1983), the gini index of diversity (Breiman et al.,
1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko &
Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation
between the decision classes and the attributes. These criteria use statistical distributions to
determine whether or not there is a correlation. The attribute with the highest correlation is
selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and
the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).
Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the
method of learning decision trees to also handle data with noise (by pruning). Handling noise
extended the process of learning decision trees to include the creation of an initial complete
decision tree, followed by tree pruning, which is done by removing subtrees with small statistical
validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been
used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994).
Pruning decision trees improves their simplicity but reduces their predictive accuracy on the
training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value
problem by exploring the probabilities of an example belonging to different classes.
The rest of this section includes a brief description of the attribute selection criterion used by
the C4.5 learning system (Quinlan, 1993), which is an information-based criterion for selecting
an attribute to be a node in the tree. The section also includes a brief description of the Chi-square
method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an
attribute to be a node in the tree.
2.2.1 Building Decision Trees Using Information-based Criteria
This section presents a description of the inductive decision tree learning system C4.5. The
C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs
for learning decision trees from examples.
Learning decision trees from examples requires a collection of examples, each represented by a
fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces
classification decision trees from a set of given examples. The C4.5 learning system is
descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for
constructing decision trees from a set of cases (Hunt, Marin & Stone, 1966).
The C4.5 system uses an attribute selection criterion called the gain ratio. This criterion
calculates the gain in classifying information based on the residual information needed to
classify cases in a set of training examples and the information yielded by the test, based on the
relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is
based on an earlier criterion used by ID3, called the gain criterion, which uses the frequency
of each decision class in the given set of training examples.
Once an attribute is chosen to be a node in the tree, the system generates as many links as the
number of its values and partitions the set of examples based on these values. If all the
examples at a certain node belong to one decision class, the system generates a leaf node and
assigns it to that class. Otherwise, the system searches for another attribute to be a node in the
tree.
The Gain Criterion: The gain criterion is based on information theory: the information
conveyed by a message depends on its probability and can be measured in bits as minus the
logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given
problem X1, ..., Xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is
any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S
is the number of examples in S that belong to class Ci:
freq(Ci, S) = number of examples in S belonging to Ci    (2-1)
Suppose that |S| is the total number of examples in S; then the probability that an example
selected at random from S belongs to class Ci is freq(Ci, S)/|S|. The information conveyed by
the message that a selected example belongs to a given decision class Ci is
-log2(freq(Ci, S)/|S|) bits.
The expected information from such a message stating class membership is given by

info(S) = - Σi=1..k (freq(Ci, S)/|S|) log2(freq(Ci, S)/|S|) bits    (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples,
info(T) determines the average amount of information needed to identify the class of an
example in T.
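Equation 2-2 is straightforward to compute. Below is a minimal sketch; the class counts passed in are purely illustrative:

```python
import math

def info(class_counts):
    """Entropy of a set S (equation 2-2): given the frequency of each
    decision class in S, return -sum (f/|S|) * log2(f/|S|), in bits."""
    total = sum(class_counts)
    return -sum((f / total) * math.log2(f / total)
                for f in class_counts if f > 0)

print(round(info([9, 5]), 2))  # a 9-vs-5 class split needs ~0.94 bits
# a pure set (all examples in one class) needs 0 bits: info([14]) == 0
```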
Suppose that we select an attribute X to be the root of the tree, and suppose that X has k
possible values. The training set T will be divided into k subsets T1, ..., Tk, each corresponding
to one of X's values. The expected information of selecting X to partition the training set T,
infoX(T), is the sum over all subsets of the information conveyed by each subset weighted by
its probability:
infoX(T) = Σi=1..k (|Ti|/|T|) info(Ti)    (2-3)
The information gained by partitioning the training examples T into subsets using the attribute
X is given by

gain(X) = info(T) - infoX(T)    (2-4)
The attribute selected is the one with the maximum gain value.
The Gain Ratio Criterion: This criterion indicates the proportion of information generated by
the split that appears helpful for classification. Quinlan (1993) pointed out that the gain
criterion has a serious deficiency: it is strongly biased toward attributes with many
outcomes (values). For example, for any data set that contains an attribute such as a social
security number, the gain criterion will select that attribute to be the root of the decision tree.
However, selecting such attributes increases the size of the decision tree. Quinlan provided a
solution to this problem by introducing the gain ratio criterion, which takes the ratio of the
information gained by partitioning the initial set of examples T on the attribute X to the
potential information generated by dividing T into n subsets.
Following steps similar to those used to obtain the information conveyed by dividing T into
subsets, the expected information generated by dividing T into n subsets, by analogy to
equation 2-2, is determined by

split info(T) = - Σi=1..n (|Ti|/|T|) log2(|Ti|/|T|)    (2-5)
The gain ratio is given by

gain ratio(X) = gain(X) / split info(X)    (2-6)

and it expresses the proportion of information generated by the split that is useful for
classification.
Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the
set of training examples.
First, determine the amount of information gained by selecting the attribute outlook to be the
root of the decision tree. This attribute divides the training examples into three subsets:
sunny, with five examples, two of which belong to the class Play; overcast, with four
examples, all of which belong to the class Play; and rain, with five examples, three of
which belong to the class Play. To determine info(T), the average information needed to
identify the class of an example in T: there are 14 training examples and two decision classes;
nine of these examples belong to the class Play and five belong to the class Don't Play.
info(T) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94 bits

When using outlook to divide the training examples, the information becomes
info_outlook(T) = 5/14 × (-2/5 log2(2/5) - 3/5 log2(3/5))
               + 4/14 × (-4/4 log2(4/4) - 0/4 log2(0/4))
               + 5/14 × (-3/5 log2(3/5) - 2/5 log2(2/5)) = 0.694 bits
Substituting into equation 2-4, the gain of information resulting from using the attribute
outlook to split the training examples equals 0.246. The information gain for windy is
0.048.
Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split
information for outlook is determined as follows:

split info(T) = - 5/14 log2(5/14) - 4/14 log2(4/14) - 5/14 log2(5/14) = 1.577 bits

The gain ratio for outlook = 0.246/1.577 = 0.156.
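The worked example above can be checked in code. The sketch below recomputes info(T), the outlook split information, the gain, the split information, and the gain ratio directly from the class counts in each subset (the small difference in the gain, 0.247 here versus 0.246 in the text, comes from the text's intermediate rounding):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# (Play, Don't Play) counts in each outlook subset of Quinlan's data
subsets = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}
n = sum(sum(s) for s in subsets.values())                        # 14 examples

info_T = entropy([9, 5])                                         # equation 2-2
info_x = sum(sum(s) / n * entropy(s) for s in subsets.values())  # equation 2-3
gain = info_T - info_x                                           # equation 2-4
split = entropy([sum(s) for s in subsets.values()])              # equation 2-5
ratio = gain / split                                             # equation 2-6
print(round(info_T, 2), round(info_x, 3), round(gain, 3),
      round(split, 3), round(ratio, 3))
```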
Figure 2-3: A decision tree learned using the gain criterion for selecting attributes
The C4.5 system handles discrete values as well as continuous values. To handle an attribute
with continuous values, C4.5 uses a threshold to transform the continuous domain into two
intervals. In other words, for each continuous attribute, C4.5 generates two branches: one
where the value of that attribute is greater than the determined threshold, and the other where
the value is less than or equal to the threshold.
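The thresholding idea can be sketched as follows. This is a simplified illustration, not C4.5's exact procedure: candidate cuts are taken at midpoints between consecutive distinct values, and the one maximizing the information gain of the resulting binary split is kept. The data at the bottom is made up for illustration.

```python
import math

def entropy(labels):
    n = len(labels)
    counts = (labels.count(l) for l in set(labels))
    return -sum((c / n) * math.log2(c / n) for c in counts)

def best_threshold(values, labels):
    """Pick the cut t on a continuous attribute that maximizes the gain
    of the binary split (value <= t vs. value > t)."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no cut between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - len(left) / len(pairs) * entropy(left) \
                    - len(right) / len(pairs) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

t, g = best_threshold([64, 65, 68, 69, 70, 71], ["P", "N", "P", "P", "P", "N"])
print(t, round(g, 3))
```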
Tree pruning in C4.5 is a process of replacing subtrees that have small classification validity
by leaves. The C4.5 system uses the Laplace ratio for estimating the error rate of different
subtrees. This ratio is defined as (e+1)/(n+2), where n is the number of training examples and
e is the number of misclassified examples at a given leaf.
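The Laplace ratio itself is a one-liner; a minimal sketch:

```python
def laplace_error(n, e):
    """Laplace error estimate (e+1)/(n+2) for a leaf covering n training
    examples, e of them misclassified."""
    return (e + 1) / (n + 2)

# Even an error-free leaf gets a nonzero estimate, and the estimate
# shrinks as the leaf covers more examples
print(laplace_error(6, 0), laplace_error(98, 0))
```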
2.2.2 Building Decision Trees Using Statistics-based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a)
in building decision trees. The method uses the Chi-square statistic to measure the association
between two attributes. When building decision trees, the method is implemented such that it
determines the association between each attribute and the decision classes. The attribute
selected is the one with the greatest value.
To determine the Chi-square value for an attribute, let aij be the number of examples in
class number i for which the attribute A takes value number j. In other words, aij is the
frequency of the combination of decision class number i and attribute value number j. The
Chi-square value for attribute A is given by
Chi-square(A) = Σi=1..n Σj=1..m [ (aij - Eij)² / Eij ]    (2-7)
where n is the number of decision classes and m is the number of values of the given attribute.
Also,
Eij = (TCi × TVj) / T    (2-8)

where TCi and TVj are the total number of examples belonging to the decision class Ci and the
total number of examples where the attribute A takes value vj, respectively, and T is the total
number of examples.
Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of the different
combinations of values between the decision classes and both the Outlook and the Windy
attributes. Table 2-4 shows the expected values (computed from TCi and TVj) of the
frequencies in Table 2-3 for the different attribute values and decision classes.
To determine the association between the decision classes and both the attribute Windy and
the attribute Outlook, the observed Chi-square values are:

Chi-square(Windy, Class) = (3-3.9)²/3.9 + (3-2.1)²/2.1 + (6-5.1)²/5.1 + (2-2.9)²/2.9
  = 0.21 + 0.39 + 0.25 + 0.25 = 1.1

Chi-square(Outlook, Class) = (2-3.2)²/3.2 + (4-2.6)²/2.6 + (3-3.2)²/3.2 + (3-1.8)²/1.8
  + (0-1.4)²/1.4 + (2-1.8)²/1.8 = 0.45 + 0.75 + 0.2 + 0.8 + 1.4 + 0.02 = 3.62
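The computation in equations 2-7 and 2-8 can be checked with a small sketch; the observed frequency tables below are reconstructed from the worked terms above. The exact totals (0.93 and 3.55 here) differ slightly from the text's, which were computed from expected frequencies rounded to one decimal place, but the conclusion that Outlook has the stronger association is unchanged.

```python
def chi_square(observed):
    """Chi-square association between decision classes (rows) and the
    values of one attribute (columns), per equations 2-7 and 2-8:
    expected counts are products of the row and column marginals."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    total = sum(row_tot)
    return sum((observed[i][j] - row_tot[i] * col_tot[j] / total) ** 2
               / (row_tot[i] * col_tot[j] / total)
               for i in range(len(observed))
               for j in range(len(observed[0])))

# Observed frequencies (rows: Play, Don't Play)
windy = [[3, 6], [3, 2]]           # columns: windy / not windy
outlook = [[2, 4, 3], [3, 0, 2]]   # columns: sunny / overcast / rain
print(round(chi_square(windy), 2), round(chi_square(outlook), 2))
```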
Applying the same method to the other attributes, the results favor the attribute Outlook.
Once that attribute is selected to be a node in the tree, the remaining set of examples is divided
into subsets and the same process is repeated on each subset.
Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5: Attribute selection criteria and their basic evaluation measures

Info Measure (IM), Gain, G-statistic, and Gain Ratio:
  Entropy(S) = - Σi (freq(Ci, S)/|S|) log2(freq(Ci, S)/|S|)
  G-statistic = 2N × IM (N = number of examples)
Chi-square:
  Chi-square(A, B) = Σi=1..n Σj=1..m [ (aij - Eij)² / Eij ]
2.2.3 Analysis of Attribute Selection Criteria

This subsection briefly introduces the analysis of different selection criteria performed by
Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision
tree programs: the Information Measure (IM), Chi-square, the G statistic, the Gini index of
diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain
Ratio criterion produced the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples
(i.e., examples that may belong to more than one decision class), to observe how the selected
criteria evaluate the given attributes. The problem has two decision classes and two attributes,
X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y.
The training examples were unevenly spread between the two values of X; attribute Y has three
values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the
contingency tables for both attributes, and Table 2-7 shows a summary of the goodness of split
provided by the six criteria. Mingers noted that the measures that are not based on information
theory give radiation (attribute X here) less weight. This may be because the zero in the first
row of radiation has a greater influence in the log calculation. In the case of the Chi-square
criterion, the value zero adds the maximum association between any two attributes, because
the Chi-square value of a zero cell is the expected value of this cell.
Now let us examine results from another experiment by Mingers. In this experiment,
Mingers used four different data sets to generate decision trees for eleven different criteria. In
the final results, he compared the total number of nodes and the total error rate produced by
each criterion over all the given problems. Table 2-8 shows the final results for five selected
criteria only.
Table 2-8: Results comparing the total accuracy and size of decision trees obtained with different attribute selection criteria on four problems
This experiment was performed on four real-world data sets, concerned with profiles of BA
Business Studies degree students, recurrence of breast cancer, classifying types of iris, and
recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for
testing. For more details, see Mingers (1989a).
2.3 Learning Decision Structures
Considering the proposed definition of decision structures given above, two related lines of
research are described in this section. Gaines (1994) and Kohavi (1994) proposed two
approaches for generating decision structures that share some of the earlier ideas of Imam and
Michalski (1993b).
In the first approach, Brian Gaines introduced a method for transforming decision rules or
decision trees into exceptional decision structures. The method builds an Exception Directed
Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning
either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion
to the root node, it places it on a temporary conclusion list. Then it generates a new child node
and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the
conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if
the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new
conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules
that have common conditions with the rule at the root are evaluated. The method then creates a
new child node from the root and repeats the process until all rules are evaluated. In the
resulting decision structure, nodes containing rules only represent conditions common to all of
their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a
decision structure. Also, such a structure is more complex than the traditional decision trees
that are used for decision-making.
Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision
structure. The decision structure can be read as follows: it is Safe, except if x1=1 & x2=1 &
x3=1 and either x4=3 or x5=1, in which case it is Lost, except if x6=1, in which case it is Safe,
except if x7=1, in which case it is Lost.
The second approach, introduced by Ronny Kohavi, learns decision structures from examples
using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing
oblivious read-once decision graphs (HOODG). A read-once decision graph is defined as a
decision graph where each attribute occurs at most once along any computational path. In other
words, on each path from the root to the leaves of the decision structure, an attribute may occur
as a node at most once. However, there may be more than one node with the same attribute in
the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are
partitioned into a sequence of pairwise disjoint sets, such that outgoing edges from each level
terminate at the next level. An oblivious decision graph is a decision graph where all nodes at
a given level are labeled by the same attribute.
Safe <- [x1=2]
Safe <- [x2=2]
Safe <- [x3=2]
Safe <- [x4=1] & [x5=2]
Safe <- [x4=1] & [x5=3]
Safe <- [x6=1] & [x7=2]
Safe <- [x6=1] & [x7=3]
Safe <- [x4=2] & [x5=2]
Safe <- [x4=2] & [x5=3]

Lost <- [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a
nondeterministic method to select an attribute to build a new level of the decision structure.
This attribute is removed from the data, and the data is divided into subsets, each corresponding
to a combination of that attribute's values. For each subset the process is repeated until all
the examples of a given subset belong to one decision class. For example, suppose the selected
attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The
data is divided into two subsets: the first subset contains the examples where A takes value 0
and the class is C0, or A takes value 1 and the class is C1; the second subset contains the
examples where A takes value 0 and the class is C1, or A takes value 1 and the class is C0.
The number of nodes at the first level (after the leaf nodes) is expected to be less than or
equal to k^n, where k is the number of decision classes and n is the number of values of the
selected attribute, and it can increase exponentially before that number is reduced
exponentially to one.
It is easy for the reader to identify some major disadvantages of this approach. The average
size of such decision structures is estimated to be very large, especially when there is no
similarity (i.e., no strong patterns) or logical relationship in the data; the time needed to learn
such a decision structure is relatively very high compared to systems for learning decision trees
from examples; and finally, it could be better to search for an attribute which reduces the number
of generated subsets of the data, instead of nondeterministically selecting an attribute to build a
new level of the decision structure. Kohavi provided a comparison between the C4.5 system
and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a
deterministic method for attribute selection which minimizes the width of the penultimate level
of the graph.
Table 2-9 shows a comparison between the proposed approach and those two approaches. The
EDAG and HOODG systems are unreleased prototype systems.
One row of the table compares readability: two of the approaches yield decision structures that
are easy to understand, while the third yields structures that are difficult to read.
CHAPTER 3 DESCRIPTION OF THE APPROACH
3.1 General Methodology
In the proposed approach, the function of learning or discovery is separated from the function of
using the discovered knowledge for decision-making. The first function is performed by an
inductive learning program that searches for knowledge relevant to a given class of decisions,
and stores the learned knowledge in the form of decision rules. The second function is performed
when there is a need for assigning a decision to new data points in the database (e.g., a
classification decision), by a program that transforms the obtained knowledge into a decision
structure optimized according to the given decision-making situation.
The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given:
- A set of decision rules in conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.
The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17
(Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of
declarative knowledge are that they do not impose any order on the evaluation of the attributes,
and, due to the lack of order constraints, decision rules can be evaluated in many different
ways, which increases the flexibility of adapting them to the different tasks of decision-making
(Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller
than the number of examples per class, generating a decision structure from decision rules can
potentially be done on-line.
Such virtual decision structures can be tailored to any given decision-making situation. The
needed decision rules have to be generated only once, and then they can be used many times for
generating decision structures according to the changing requirements of decision-making tasks.
The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures
from decision rules. Decision structures represent a procedural form of knowledge, which makes
them easy to implement but also harder to change. Consequently, decision structures can be quite
effective and useful as long as they are used in decision-making situations for which they are
optimized, and the attributes specified by the decision structure can be measured without much
cost. Figure 3-1 shows an architecture of the proposed methodology.
Figure 3-1: Architecture of the AQDT approach (learning knowledge from the database; the decision-making process)
It is assumed that the database is not static but is regularly updated. A decision-making problem
arises when there is a case, or a set of cases, to which the system has to assign a decision based on
the knowledge discovered. Each decision-making situation is defined by a set of attribute-values;
some attribute-values may be missing or unknown. A new decision structure is obtained such
that it suits the given decision-making problem. The learned decision structure associates the
new set of cases with the proper decisions.
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system,
specifically AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of
this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions, using
the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology,
called AQ, starts with a seed example of a given decision class and generates a set of the
most general conjunctive descriptions of the seed (alternative decision rules for the seed
example). Such a set is called the star of the seed example. The algorithm selects from the star
a description that optimizes a criterion reflecting the needs of the problem domain. If the
criterion is not defined, the program uses a default criterion that selects the description that
covers the largest number of positive examples (to minimize the total number of rules needed)
and, as a second priority, the one that involves the smallest number of attributes (to minimize the
number of attributes needed for arriving at a decision).
If the selected description does not cover all the examples of a given decision class, a new seed is
selected from the uncovered examples, and the process continues until a complete class description
is generated. The algorithm can work with few or many examples, and can optimize the
description according to a variety of easily-modifiable hypothesis quality criteria.
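The covering loop just described can be sketched as follows. This is a toy reconstruction, not the actual AQ algorithm: real AQ builds the star with the extend-against generalization operator, whereas the stand-in below simply tries ever-longer conjunctions of the seed's attribute values until all negatives are excluded. All names and data are illustrative.

```python
from itertools import combinations

def covers(rule, example):
    """A rule is a dict attribute -> set of allowed values (a conjunction)."""
    return all(example[a] in vals for a, vals in rule.items())

def simple_star(seed, negatives, attrs):
    """Toy stand-in for star generation: the most general conjunctions of
    the seed's attribute values that exclude every negative example."""
    for r in range(1, len(attrs) + 1):
        star = [{a: {seed[a]} for a in subset}
                for subset in combinations(attrs, r)]
        star = [rule for rule in star
                if not any(covers(rule, n) for n in negatives)]
        if star:
            return star            # fewest conditions = most general
    return []

def aq_cover(positives, negatives, attrs):
    """Pick a seed, build its star, keep the rule covering the most
    positives, then repeat on the still-uncovered positives."""
    uncovered, learned = list(positives), []
    while uncovered:
        star = simple_star(uncovered[0], negatives, attrs)
        best = max(star, key=lambda r: sum(covers(r, p) for p in uncovered))
        learned.append(best)
        uncovered = [p for p in uncovered if not covers(best, p)]
    return learned

pos = [{"x1": 0, "x2": 1}, {"x1": 0, "x2": 2}]
neg = [{"x1": 1, "x2": 1}, {"x1": 1, "x2": 2}]
learned = aq_cover(pos, neg, ["x1", "x2"])
print(learned)  # a single rule on x1 covers both positive examples
```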
The learned descriptions are represented as a set of decision rules expressed in an
attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A
distinctive feature of this representation is that it employs, in addition to standard logic operators,
the internal disjunction operator (a disjunction of values of the same attribute in a condition) and
the range operator (to express conditions involving a range of discrete or continuous values).
These operators help to simplify rules involving multivalued discrete attributes; the second
operator is also used for creating logical expressions involving continuous attributes.
AQ15 can generate decision rules that represent either characteristic or discriminant concept
descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic
description states properties that are true for all objects in the concept. The simplest
characteristic concept description is in the form of a single conjunctive rule (in general, it can be
a set of such rules). The most desirable is the maximal characteristic description, that is, a rule
with the longest condition part, i.e., stating as many common properties of objects of the given
class as can be determined. A discriminant description states properties that discriminate a given
concept from a fixed set of other concepts. The most desirable is the minimal discriminant
description, that is, a rule with the shortest condition part. For example, to distinguish a given
set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top."
A characteristic description of the tables would also include properties such as "have four legs,"
"have no back," "have four corners," etc. Discriminant descriptions are usually much shorter than
characteristic descriptions.
Another option provided in AQ15 controls the relationship among the generated descriptions
(rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode,
rulesets of different classes may logically intersect over areas of the description space in which
there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different
classes are logically disjoint. The DC mode descriptions are usually more complex, both in the
number of rules and the number of conditions. There is also a DL mode (a Decision List mode,
also called VL mode, for variable-valued logic mode), in which the program generates rulesets
that are linearly ordered. To assign a decision to an example using such rulesets, the program
evaluates them in order: if ruleset i is satisfied by the example, then the corresponding decision
is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes,
rulesets can be evaluated in any order.
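The DL-mode evaluation order can be sketched directly. The rulesets below are a hypothetical encoding of the small cover from Section 2.1, used here only to illustrate the ordered evaluation:

```python
def classify_dl(rulesets, example, default=None):
    """Decision-list evaluation: rulesets are tried in their linear order,
    and the first one satisfied by the example yields the decision."""
    for decision, rules in rulesets:   # each ruleset: (class, list of rules)
        if any(all(example[a] in vals for a, vals in rule.items())
               for rule in rules):
            return decision
    return default                     # no ruleset matched

rulesets = [("A1", [{"x2": {0}}, {"x1": {0}, "x2": {2}}]),
            ("A2", [{"x2": {1}}, {"x1": {2}, "x2": {2}}]),
            ("A3", [{"x1": {1}, "x2": {2}}])]
print(classify_dl(rulesets, {"x1": 1, "x2": 2}))  # falls through to A3
```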
Alternatively, the system can use rules from the AQ17-DCI program for Data-driven
Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for
generating additional attributes. These attributes are various logical or mathematical
combinations of the original attributes. The program generates a large number of potential new
attributes, and selects from them the most promising ones based on an attribute quality criterion.
To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is
shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form
expression) describes a voting record of Democratic Representatives in the US Congress.
Each rule is a conjunction of elementary conditions, and each condition expresses a simple
relational statement. For example, the condition [State = northeast v northwest] states that the
attribute State (of the Representative) should take the value northeast or northwest to satisfy
the condition.
R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"
The above rules were generated from examples of the voting records. For illustration, below is
an example of a voting record of a Democratic representative:

Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes,
Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes,
Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no,
Federal help to education = no, State from = northeast, State population = large,
Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler corp = not registered.
By expressing the elementary statements in the example as conditions and linking the conditions by
conjunction, the examples can be re-expressed as decision rules. Thus, decision rules and
examples formally differ only in their degree of generality.
3.3 Generating Decision Structures/Trees from Decision Rules
This section describes the AQDT-1 system for learning decision structures (trees) from decision
rules (Imam & Michalski, 1993a,b). Also, a description of the AQDT-2 method for learning task-
oriented decision structures from decision rules is included, and finally the methodology is
illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning
due to their simplicity. Decision trees built this way can be quite efficient, as long as they are
used in the decision-making situations for which they are optimized and these situations remain
relatively stable. Problems arise when these situations change significantly and the assumptions
under which the tree was built no longer hold. For example, in some situations it may be
difficult to determine the value of the attribute assigned to some node. One would like to avoid
measuring this attribute and still be able to classify the example, if this is potentially possible
(Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure
the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also
desirable if there is a significant change in the frequency of occurrence of examples from
different classes. A restructuring of a decision tree to suit the above requirements is, however,
difficult to do. The reason for this is that a decision tree is a form of decision structure
representation that imposes constraints on the evaluation order of the attributes that are not
logically necessary.
One problem in developing a method for generating decision structures from decision rules is to
design an attribute selection criterion that is based on the properties of the rules rather than of
the training examples. A decision rule normally describes a number of possible examples, only
some of which are examples that have actually been observed, i.e., training examples. An attribute
selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based
on counting the numbers of training examples covered by each attribute-value and the
frequency of decision classes in the training examples, as is done in learning decision trees from
examples, because the training examples are assumed to be unavailable.
Another problem in learning decision trees from decision rules stems from the fact that decision
rules constitute a more powerful knowledge representation than decision trees. They can directly
represent a description in an arbitrary disjunctive normal form, while decision trees can directly
represent only descriptions in disjoint disjunctive normal form, in which all
conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary
decision rules into a decision tree, one faces the additional problem of handling logically
intersecting rules.
The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2
system is based on earlier work by Michalski (1978), which introduced a general method for
generating decision trees from decision rules. The method aimed at producing decision trees
with the minimum number of nodes or the minimum cost (where the cost was defined as the
total cost of classifying unknown examples, given the cost of measuring individual attributes and
the expected probability distribution of examples of different decision classes). More
explanation is provided in the following section.
3.3.1 The AQDT-2 attribute selection method
This section describes the AQDT-2 method for building a decision structure from decision rules.
The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples. The major difference is that it assigns tests
(attributes) to the nodes using criteria based on the properties of the decision rules (including
statistics about the examples covered by each rule, in the case of rules learned from examples),
rather than statistics characterizing the frequency of training examples per decision class, per
attribute-value, or per conjunction of both. Other differences are that the branches may be
assigned an internal disjunction of values (not only a single value, as in a typical decision tree),
and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be
attributes or names standing for logical or mathematical expressions that involve several
attributes or variables. In the following, we use the terms test and attribute interchangeably
(to distinguish between an attribute and a name standing for an expression, the latter is called a
constructed attribute).
At each step, the method chooses from the available set of tests the one that has the highest utility
(see below) for the given set of decision rules. This test is assigned to the node. The branches
stemming from this node are assigned test values or disjoint groups of values (in the form of a
logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each
branch is associated with a reduced set of rules, determined by removing the conditions in which the
selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset
indicate the same decision class, a leaf node is created and assigned this decision class. The
process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further
because some attribute is declared unavailable (infinite cost), then the leaf is assigned a set of
candidate decisions with associated probabilities (see Sec. 4.2).
The test (attribute) utility is a combination of one or more of the following elementary criteria:
1) cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which
captures the effectiveness of the test in discriminating among decision rules for different decision
classes; 3) importance, which determines the importance of a test in the rules; 4) value
distribution, which characterizes the distribution of the test importance over its set of values; and 5)
dominance, which measures the test presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.
Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, i.e., the
disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and
decision rule sets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote
the sets of values (outcomes) of A that are present in the rule sets for classes C1, C2, ..., Cm,
respectively. If a ruleset for some class, say Ct, contains a rule that does not involve test A, then
Vt is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is
the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for
Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is
defined by:

                0  if Vi = Vj
                1  if Vi ⊂ Vj or Vi ⊃ Vj
D(A, Ci, Cj) =                                                            (3-1)
                2  if Vi ∩ Vj ≠ φ and Vi ∩ Vj ≠ Vi and Vi ∩ Vj ≠ Vj
                3  if Vi ∩ Vj = φ

where φ denotes the empty set. Note that exchanging the second and third conditions of equation
(3-1) may seem to yield an improved criterion; however, it would not clearly distinguish between the two
cases (i.e., both situations would receive a similar disjointness). The current equation is better
because it gives higher scores to attributes that separate different subsets of the two decision
classes than to attributes that separate only a subset of one decision class.
Definition 3-2: The disjointness of test A for evaluating a given set of decision rules is the
sum of the degrees of class disjointness over all decision classes:

Disjointness(A) = Σ(i=1..m) D(A, Ci), where D(A, Ci) = Σ(j=1..m, j≠i) D(A, Ci, Cj)    (3-2)
The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are
all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test
values. If two tests have the same disjointness value, the one with the smaller number of values
is selected.
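As a minimal sketch (not the author's implementation), equations (3-1) and (3-2) can be computed directly from the per-class value sets of a test; the class names and the dictionary layout here are illustrative assumptions:

```python
from itertools import permutations

def degree_of_disjointness(vi, vj):
    # Equation (3-1): degree of disjointness between two value sets
    if vi == vj:
        return 0
    if vi < vj or vi > vj:      # one set strictly contains the other
        return 1
    if vi & vj:                 # overlapping, but neither contains the other
        return 2
    return 3                    # completely disjoint value sets

def disjointness(value_sets):
    # Equation (3-2): sum D(A, Ci, Cj) over all ordered class pairs.
    # value_sets maps each class to the set of values of test A
    # appearing in that class's rules.
    return sum(degree_of_disjointness(value_sets[ci], value_sets[cj])
               for ci, cj in permutations(value_sets, 2))
```

With two classes the result ranges from 0 (identical value sets) to 3·2·1 = 6 (disjoint value sets), matching the 3m(m-1) bound stated above.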
Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree
is defined as the average number of tests (attributes) to be examined from the root of the tree to
any leaf node in order to reach a decision.
Definition: A decision structure is a one-node-per-level decision structure if at each level there is
only one node and zero or more leaves.
Such a decision structure can be generated by combining into one branch all branches whose
associated sets of decision rules belong to more than one decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two
decision classes. The disjointness criterion ranks first the attributes that add the minimum number of
tests to the decision tree.
Proof: Consider all possible distributions of an attribute's values within two decision classes Ci
and Cj. There are three cases: 1) subset (the same as superset); 2) non-empty intersection but not
subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the
same set of values in both classes is a trivial one). Assume that branches leading to subsets
with the same decision class are combined into one branch. In the first case there will be only two
branches; the first leads to a leaf node, and the other leads to an intermediate node where
another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three
branches are created. Two branches lead to leaf nodes, where all values at each branch
belong to only one (and a different) decision class; the third branch leads to an intermediate node
where another attribute should be selected that further classifies the decision classes. The minimum
ANT in this case is 6/4. In the third case, only two branches are generated, each leading
to a leaf node with a different decision class. In this case the minimum ANT is 1.
Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that when
more than one attribute-value occurs at branches leading to leaves of one decision
class, these branches are combined into one branch in the decision structure. The symbol "1" means
that an attribute is needed to classify the two decision classes. In such cases there will be at least
two additional paths.
D(A, Ci) = 0, D(A, Cj) = 1    D(A, Ci) = 2, D(A, Cj) = 2    D(A, Ci) = 3, D(A, Cj) = 3
Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is shown
in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks
highly the attributes that reduce the average number of tests required for decision-making. The
theorem can be proved in the general case.
ANT = 3/2    ANT = 5/3    ANT = 1
("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify
the decision classes.
Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes
A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship
between any two decision classes, this means that there are more decision classes where D(A,
Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness
for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj)
than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci,
Cj), then B classifies the decision classes better than A.
For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute
are 0, 2, 4, or 6. For all positive values of D(B) = 2, 4, or 6, it is clear that attribute B has a
smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) <
D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.
Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is
a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the
importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained
rules, each test is assigned a score that represents the total number of training examples
covered by the rules involving this test. Decision rules learned by an AQ learning program
are accompanied by information on their strength. Rule strength is characterized by the t-weight
and the u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of
that class covered by the rule. The importance score of a test is the aggregation of the total-
weights of all rules that contain that test in their condition part. Given a set of decision rules for
m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated
with class Ci denoted by ri, the importance score is defined as follows.
Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by

IS(Aj) = Σ(i=1..m) IS(Aj, Ci)    (3-3.1)

where

IS(Aj, Ci) = Σ(k=1..ri) Rik(Aj)    (3-3.2)

and Rik(Aj), the weight of test Aj in the rule Rik of class Ci, is given by

Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0 otherwise    (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
The importance score method has been separately compared as a feature selection method with
a genetic algorithm-based method (Imam amp Vafaie 1994) The importance score method
produced an equal or higher accuracy on three real-world problems than those reported by the
GA method while selecting fewer attributes In addition the IS method was significantly faster
than the GA
Value distribution: The third elementary criterion, value distribution, concerns the number of
legal values of tests. Given two tests with equal importance scores, this criterion prefers the test
with the smaller number of legal values. Experiments have shown that this criterion is especially
useful when using discriminant decision rules.
Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by

VD(Aj) = IS(Aj) / vj    (3-5)

where vj is the number of legal values of Aj.
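A small sketch of Definitions 3-3 and 3-4, under an assumed rule representation (each rule as a tuple of its condition part, class, and t-weight; this layout is illustrative, not AQ15's actual data structure):

```python
def importance_score(rules, test):
    # IS(A_j), equations (3-3.1)/(3-3.2)/(3-4): sum the t-weights of
    # every rule whose condition part mentions the test.
    return sum(t_weight for conditions, _cls, t_weight in rules
               if test in conditions)

def value_distribution(rules, test, n_legal_values):
    # VD(A_j) = IS(A_j) / v_j, equation (3-5)
    return importance_score(rules, test) / n_legal_values

# Hypothetical rules: ({attribute: value set}, class, t-weight)
rules = [({"x1": {2}, "x2": {2}}, "T1", 5),
         ({"x1": {3}, "x4": {1}}, "T1", 3),
         ({"x2": {1}},            "T3", 4)]
```

For instance, with these made-up t-weights, x2 appears in rules carrying 5 + 4 = 9 examples, so its value distribution over four legal values is 9/4.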
Dominance: The fourth elementary criterion, dominance, prefers tests that appear in a large
number of rules, as this indicates their high relevance for discriminating among the rulesets of
the given decision classes. Since some conditions in the rules have values linked by internal
disjunction, counting such rules directly would not properly reflect their relevance. Therefore,
for computing the dominance, the rules are counted as if they were converted to rules that do not
have internal disjunction. Such a conversion is done by multiplying out the condition parts of
the rules containing internal disjunction. For example, the condition part [x3=1 v 3]&[x4=1] is
multiplied out to two rules with condition parts [x3=1]&[x4=1] and [x3=3]&[x4=1].
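The multiplying-out step can be sketched as follows (a hedged illustration; the function names and dictionary-based rule format are assumptions, not AQDT-2's internals):

```python
from itertools import product

def multiply_out(condition_part):
    # Expand internal disjunctions into single-value condition parts:
    # [x3=1 v 3]&[x4=1]  ->  [x3=1]&[x4=1]  and  [x3=3]&[x4=1]
    attrs = sorted(condition_part)
    return [dict(zip(attrs, values))
            for values in product(*(sorted(condition_part[a]) for a in attrs))]

def dominance(rules, test):
    # Count the expanded, disjunction-free rules that mention the test
    return sum(len(multiply_out(conditions))
               for conditions, _cls in rules if test in conditions)
```

Applied to the example in the text, `multiply_out({"x3": {1, 3}, "x4": {1}})` yields the two single-value condition parts, so a test occurring in that rule is counted twice.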
The above criteria are combined into one general test measure using the lexicographic
evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of
the above elementary criteria, each associated with a "tolerance threshold" in percent. The
criteria are applied to tests in the order defined by LEF. A test passes to the next criterion only if
it scores on the previous criterion within the range defined by the tolerance (from the top value).
The default LEF is:

<Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>    (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percent); their default values are 0.
The default value of the cost of each test is 1.
The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost.
If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two
or more attributes still share the same top score, or their scores differ by less than the assumed
tolerance threshold t2, the method evaluates these attributes using the second (importance)
criterion. If again two or more attributes share the same top score, or their scores differ by less than
the tolerance threshold t3, then the third criterion, normalized IS, is used, and then, similarly, the
fourth criterion (dominance). If there is still a tie, the method selects among the tied attributes
randomly.
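The LEF-style tie-breaking described above can be sketched as follows (a simplified illustration; the criterion functions and the way percentage tolerances are applied are assumptions, not the exact AQDT-2 mechanism):

```python
def lef_select(tests, criteria):
    # criteria: ordered list of (score_fn, tolerance, maximize) triples.
    # A test survives a criterion if its score is within `tolerance`
    # (a fraction of the best score) of the best surviving score.
    candidates = list(tests)
    for score_fn, tol, maximize in criteria:
        if len(candidates) == 1:
            break
        scores = {t: score_fn(t) for t in candidates}
        best = max(scores.values()) if maximize else min(scores.values())
        candidates = [t for t in candidates
                      if abs(scores[t] - best) <= abs(best) * tol]
    return candidates[0]   # a remaining tie would be broken randomly
```

With zero tolerances this reduces to a strict lexicographic ranking: all costs equal, disjointness filters the field, and importance breaks the remaining tie.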
If there is a non-uniform frequency distribution of examples of different classes, then the
selection criterion uses a modified definition of the disjointness: namely, the previously defined
disjointness for each class is multiplied by the frequency of the class occurrence. The class
occurrence is the expected number of future examples that are to be classified into a given class.

Disjointness(A) = Σ(i=1..m) D(A, Ci) · Frq(Ci)    (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given
by the user. The attribute ranking criterion in this case is defined by the LEF:

<Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other
elementary criteria are treated the same way as in the default version.
3.3.2 The AQDT-2 algorithm
The AQDT-2 algorithm constructs a decision structure from decision rules by recursively
selecting at each step the best test according to the ranking criteria described above and
assigning it to a new node. The process stops when the algorithm creates terminal branches that
are assigned decision classes. To facilitate this process, the system creates a special data
structure for each concept description (ruleset). This structure has fields such as the number of
rules, the number of decision classes, and the number of attributes present in the rules. A set of
pointers connects this data structure to a set of data structures, each representing one decision class.
The decision class structure contains fields with information on the number of rules belonging to
that class, the frequency of the decision class, etc. It is also connected to a set of data structures
representing the decision rules within each decision class. The system independently creates a set
of data structures, each corresponding to one attribute. Each attribute description contains the
attribute's name, domain, type, number of legal values, a list of the values, the number of
rules that contain that attribute, and the values of that attribute in each rule. The attributes are
arranged in an array in lexicographic order: first in descending order of the number of rules
that contain the attribute, and second in ascending order of the number of the attribute's
legal values.
The system can work in two modes. In the standard mode, the system generates standard
decision trees in which each branch has a specific attribute-value assigned. In the compact
mode, the system builds a decision structure that may contain:
A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this
leads to simpler structures. For example, if a node assigned attribute A has a branch marked by
values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The
program creates "or" branches on the basis of the analysis of the value sets Vi while computing
the degree of attribute disjointness.
B) nodes that are assigned derived attributes, that is, attributes that are certain logical or
mathematical combinations of the original attributes. To produce decision structures with
derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The
AQ17 rules may contain conditions involving attributes constructed by the program rather than
those originally given.
To generate decision structures from rules, the AQDT-2 method prefers either characteristic or
discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules
are more suitable for building decision structures. Assume that the description of each class is in
the form of a ruleset and that this set is the initial ruleset context. The AQDT-2 algorithm is
given below.
The AQDT-2 Algorithm
Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking
measure. Select the highest-ranked attribute, and let A represent this highest-ranked
attribute.
Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and
assign to it the attribute A. In standard mode, create as many branches from the node as
there are legal values of attribute A, and assign these values to the branches. In
compact mode (decision structures), create as many branches as there are disjoint value
sets of this attribute in the decision rules, and assign these sets to the branches.
Step 3: For each branch, associate with it the group of rules from the ruleset context that contain a
condition satisfied by the value(s) assigned to this branch. For example, if a branch is
assigned value i of attribute A, then associate with it all rules containing the condition [A=
i v ...]. If a branch is assigned values i v j, then associate with it all rules containing the
condition [A= i v j v ...]. Remove these conditions from the rules. If there are rules in
the ruleset context that do not contain attribute A, add these rules to all rule groups
associated with the branches stemming from the node assigned attribute A. (This step is
justified by the consensus law, [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming
that a and b are the only legal values of y.) All rules associated with a given branch
constitute the ruleset context for this branch.
Step 4: If all the rules in the ruleset context of some branch belong to the same class, create a leaf
node and assign that class to it. If all branches of the tree end in leaf nodes, stop.
Otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
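Steps 1-4 can be sketched as a short recursion (standard mode only; the tuple-based rule representation and the pluggable `select` function stand in for the data structures and LEF measure described in the text, and are assumptions of this sketch):

```python
def build_tree(rules, attributes, select):
    # rules: list of (conditions, cls); conditions maps an attribute
    # to the set of values appearing in that rule's condition for it.
    classes = {cls for _conds, cls in rules}
    if len(classes) == 1:                       # Step 4: leaf node
        return classes.pop()
    attr = select(rules, attributes)            # Step 1: pick the best test
    values = set().union(*(c.get(attr, set()) for c, _ in rules))
    branches = {}
    for value in values:                        # Step 2: one branch per value
        reduced = []
        for conds, cls in rules:                # Step 3: reduce the context
            if attr not in conds:               # rule silent on attr: keep it
                reduced.append((conds, cls))
            elif value in conds[attr]:          # condition satisfied: drop it
                rest = {a: v for a, v in conds.items() if a != attr}
                reduced.append((rest, cls))
        branches[value] = build_tree(
            reduced, [a for a in attributes if a != attr], select)
    return (attr, branches)
```

Rules that do not mention the selected attribute are copied into every branch, exactly as the consensus-law justification in Step 3 prescribes.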
To select an attribute to become a node of the decision tree (Steps 1 and 2 of the algorithm), the
algorithm performs two independent iterations. In the first iteration it parses all decision
rules and determines information about each attribute. This information includes the importance
score of each attribute, the number of rules containing a given attribute, the disjoint value sets of
each attribute, and the attribute values used in describing each decision class. The second
iteration is performed only if the disjointness criterion is ranked first in the LEF. The
second iteration evaluates each attribute's disjointness for each decision class against the other
decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions
formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is
the total number of decision rules (in all decision classes):

r = Σ(i=1..m) Ri    (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be
determined as:

Cmpx(Iter1) = O(r·s)

In the second iteration, the disjointness is calculated between the decision classes over all
attributes. The complexity of the second iteration is given by:

Cmpx(Iter2) = O(n·m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at
this node and the number of rules associated with this node:

l = max {m, r}    (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm
for building one node of the decision tree, say the node complexity NC(AQDT), is given by:

NC(AQDT) = O(l·n)

Usually l equals the number of rules associated with the given node. Thus, the AQDT
complexity for building one node is a function of the number of attributes multiplied by the
number of rules associated with this node.
At each level of the decision tree, these two iterations are repeated for each non-leaf node. The
level complexity of the AQDT algorithm, LC(AQDT), is given by:

LC(AQDT) ≤ O(l·n)

which is no more than the complexity of generating the root of the decision tree. To see this,
consider that the maximum possible number of non-leaf nodes at one level is half the number of the
initial decision rules (⌊r/2⌋). Figure 3-5-a shows an example of such a situation, where at
the lowest level each node classifies only two rules, each belonging to a different decision class. This
decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level
will be twice or more the number of non-leaf nodes at the previous level. Consider the
level complexity of the AQDT algorithm to be (l·s·o), where o is the number of non-leaf
nodes at the given level. In such cases, either (l·o ≤ r) or (l·s < r). In Figure 3-5-a, the
complexity of the AQDT algorithm at any lower level is given by:

LC(AQDT) = O(2·s·(r/2)) ≤ O(n·l) = NC(AQDT)
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes: a) per one level; b) per one path
Note also that after an attribute is selected to be the root of the decision structure, this attribute
and all conditions containing it are removed from the data structures of the algorithm.
Also, if a leaf node is generated, all rules belonging to the corresponding branch are not
tested again.
Since the disjointness criterion selects the attribute which minimizes the average number of tests
(ANT), the AQDT algorithm generates decision trees with the fewest levels.
The number of levels of a decision tree should be less than or equal to the minimum of
the number of attributes and the number of rules. Let k be the number of levels in a
given decision tree:

k ≤ min {n, r}    (3-10)
Two cases represent the most complex situations: Figures 3-5-a and 3-5-b. In the first
case, where the decision rules are divided evenly, the number of levels is a function of the
logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for
generating a decision tree from a set of decision rules is given by:

Complexity(AQDT) = O(l·n·log r)    (3-11)

The other situation occurs when the generated decision tree has the maximum number of levels. The
maximum possible number of levels of a decision tree equals one less than the number of
decision rules (Figure 3-5-b). Using the disjointness criterion, it is unlikely to obtain such a decision
tree, because it has the maximum average number of tests (ANT) that can be obtained from the
same set of nodes and leaves. However, such a decision tree can be generated if the number of
decision classes is one less than the number of attributes. In this case, any disjoint decision rules
should have a maximum length that is less than or equal to the floor of the logarithm of the number of
attributes. Thus, the level complexity of this decision tree is estimated as:

LC(AQDT) = O(l·log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT
algorithm in such cases is given by:

Complexity(AQDT) = O(l·k·log n)    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12)
r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT
algorithm is determined by:

Cmplx(AQDT) = O(r·k·log l)    (3-13)
3.3.3 An example illustrating the algorithm
The following simple example illustrates how AQDT is used in selecting the optimal set of
testing resources for testing software. Suppose there are three tools for testing software: 1)
modeling (T1); 2) checklist (T2); and 3) par_simul (T3). Also assume that there are four
different factors that affect the selection of any tool: 1) the cost of using the tool (x1); 2) the
metrics that support the tool (x2); 3) the best phase for applying the tool (x3); and 4) the type of tool
(automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible
values.
Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6
shows a sample of these rules in AQ15c format.
Table 3-1: The available tools and the factors that affect the selection process
T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]
Figure 3-6 Decision rules for selecting the best tool for testing software
These rules can be interpreted as follows:
Rule 1: Use the first tool for testing if you need average cost and the tool is
supported by the requirement metric.
Rule 2: Use the first tool for testing if you can afford high cost for testing, either
in the requirement or the analysis phase, and you need an automated tool.
Rule 3: Use the second tool for testing if the cost limit is low or average and the
tool is supported by the system usage metric or the intractability metric.
Rule 4: Use the second tool for testing if you can afford high cost for testing,
either in the requirement or the design phase, and you need a manual tool.
Rule 5: Use the third tool for testing if you are limited to low cost and the
tool is supported by the error rate metric.
Rule 6: Use the third tool for testing if you can afford very high cost for testing,
either in the requirement or the system usage phase, and you need a semi-
automated tool.
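For illustration, the rules of Figure 3-6 can be encoded as data and matched against a requirement profile (a hypothetical encoding sketched for this example; the value codes follow the rules above):

```python
# Decision rules from Figure 3-6: class -> list of condition parts
RULES = {
    "T1": [{"x1": {2}, "x2": {2}},
           {"x1": {3}, "x3": {1, 3}, "x4": {1}}],
    "T2": [{"x1": {1, 2}, "x2": {3, 4}},
           {"x1": {3}, "x3": {1, 2}, "x4": {2}}],
    "T3": [{"x1": {1}, "x2": {1}},
           {"x1": {4}, "x3": {2, 3}, "x4": {3}}],
}

def matching_tools(example):
    # A rule fires when every one of its conditions holds for the example
    return sorted({cls for cls, rule_list in RULES.items()
                   for rule in rule_list
                   if all(example[a] in vals for a, vals in rule.items())})
```

For instance, a profile with average cost (x1=2) and the requirement metric (x2=2) fires Rule 1 and recommends T1 only.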
Table 3-2 presents information on these rules and the disjointness values for all attributes. For
each class, the row marked "Values" lists the values occurring in the ruleset for this class. For
evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not
contain attribute A is treated as having an additional condition [A = a v b v ...], where a, b, ...
are all the legal values of A.
The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1
has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume
the tolerances for each elementary criterion equal 0.
From the rules in Figure 3-6 we can also determine the disjoint groupings of attribute-values used
in the compact mode of the algorithm. This is done as follows: 1) determine for each attribute the
sets of values that the attribute takes in the individual decision rules, and 2) remove those value sets
that subsume other value sets. The remaining value sets are assigned to branches stemming from
the node marked by the given attribute. For example, x1 has the following value sets in the
individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). The value set {1, 2} is
removed, as it subsumes {2} and {1}. In this case, branches are assigned the individual values of the
domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case,
branches are assigned the value sets {1}, {2}, and {3, 4}.
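The value-set grouping for the compact mode can be sketched as follows (an illustrative helper, not the program's actual routine):

```python
def branch_value_sets(value_sets):
    # Drop every value set that subsumes (strictly contains) another
    # value set occurring in the rules; the survivors label the branches.
    unique = [set(s) for s in {frozenset(s) for s in value_sets}]
    return [s for s in unique if not any(other < s for other in unique)]

x1_sets = [{2}, {3}, {1, 2}, {1}, {4}]      # value sets of x1 in Figure 3-6
x2_sets = [{2}, {3, 4}, {1}, {1, 2, 3, 4}]  # value sets of x2 in Figure 3-6
```

Applied to the sets above, {1, 2} is dropped for x1 (it contains {1} and {2}), and {1, 2, 3, 4} is dropped for x2, reproducing the groupings described in the text.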
Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the
tree. Four branches are created, each corresponding to one of x1's possible values. Since all
rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf T3. Rules
containing other values of x1 belong to more than one class. This process is repeated for each
subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned
by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in
making decisions about which tools to use for testing given software.
Figure 3-7: A decision structure learned for classifying software testing tools (rooted at x1; complexity: 4 nodes, 7 leaves)
Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows
the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each
representing one combination of attribute-values. Attributes and their legal values are shown on
scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3 and 4).
Rules are represented by collections of cells in the intersection of the rows and columns
corresponding to the conditions in the rules.
The shaded areas correspond to decision rules; rules of the same class have the same type of
shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For
illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21,
R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the
first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1]
& [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] &
[x4=3].
Figure 3-8: a) Decision rules; b) Derived decision tree
Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to know which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T3. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.
Figure 3-9: Decision trees learned while ignoring a) the supporting metric, and b) the type of the testing tool
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest ranked attribute, cannot be measured. The algorithm selected x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.
Figure 3-10: A decision tree learned without the cost attribute
3.4 Tailoring Decision Structures to the Decision-Making Situation
Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy, but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples, but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm was developed and implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.
Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.
3.4.1 Learning Cost-Dependent Decision Structures
As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test's cost is the first criterion, and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation, involving the other elementary criteria. If an attribute has a high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute, if possible.
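A single LEF filtering step of this kind might be sketched as follows (an illustrative reading, not the exact AQDT-2 implementation; with a zero tolerance, only the cheapest attributes survive to the next elementary criterion):

```python
def lef_filter(candidates, criterion, tolerance=0.0, maximize=False):
    """One LEF step: keep the candidates whose score on this criterion
    is within `tolerance` (a relative margin) of the best score."""
    scores = {a: criterion(a) for a in candidates}
    best = max(scores.values()) if maximize else min(scores.values())
    margin = tolerance * abs(best)
    return [a for a in candidates if abs(scores[a] - best) <= margin]

# Hypothetical attribute costs; with tolerance 0 only x2 and x3 pass
costs = {"x1": 5, "x2": 1, "x3": 1, "x4": 9}
print(lef_filter(costs, costs.get, tolerance=0.0))  # ['x2', 'x3']
```

A subsequent LEF step would apply the next elementary criterion (e.g., disjointness, maximized) to the survivors only.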
3.4.2 Assigning Decisions Under Insufficient Information
In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained, but a decision has to be made, it is useful to know the probability distribution for the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have:

P(Ci | b1, ..., bk) = P(Ci) P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk, given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from the different classes, we have:
P(Ci) = twi / Σj=1..m twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σj=1..m wj / Σj=1..m twj    (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain:

P(Ci | b1, ..., bk) = wi / Σj=1..m wj    (3-13)
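Equation (3-13) says that the class probabilities at a node reduce to the relative frequencies of the training examples that reach it. A minimal sketch, using the node frequencies discussed later in Section 4.2 (w1=31, w2=11, w3=0, w4=5):

```python
def node_class_probabilities(w):
    """Estimate P(Ci | b1, ..., bk) at a node from the counts w[i] of
    training examples of each class that passed the tests leading to
    the node -- equation (3-13): P(Ci | b1..bk) = wi / sum_j wj."""
    total = sum(w.values())
    return {c: wi / total for c, wi in w.items()}

probs = node_class_probabilities({"C1": 31, "C2": 11, "C3": 0, "C4": 5})
print({c: round(p, 2) for c, p in probs.items()})
# approximately .66, .23, 0 and .11, matching the estimates under
# node x2 in Figure 4-6
```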
A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.
3.4.3 Coping with Noise in Training Data
The proposed methodology can be easily extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision-tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision-tree pruning, which can only prune attributes within a subtree, and thus cannot freely choose the attributes to prune). Examples are presented in Section 4.
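The truncation step can be sketched as follows (the rule representation and the per-class reading of the threshold are simplifying assumptions; the t-weights and class totals below are taken from the wind bracing rules of Figure 4-2, where the C1 rules cover 31 examples and the C2 rules 104):

```python
def truncate_rules(rules, class_totals, noise_level=0.10):
    """Remove rules whose t-weight covers no more than `noise_level`
    of the training examples of their decision class (cf. the 10%
    truncation used for the structure in Figure 4-7)."""
    return [r for r in rules
            if r["t"] > noise_level * class_totals[r["class"]]]

rules = [
    {"class": "C1", "t": 18},  # kept: 18 > 10% of 31
    {"class": "C1", "t": 3},   # removed: 3 <= 3.1
    {"class": "C2", "t": 28},  # kept: 28 > 10% of 104
    {"class": "C2", "t": 9},   # removed: 9 <= 10.4
]
kept = truncate_rules(rules, {"C1": 31, "C2": 104}, noise_level=0.10)
print(len(kept))  # 2
```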
3.5 Analysis of the AQDT-2 Attribute Selection Criteria
This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes. The problem has ambiguous examples (i.e., examples belonging to more than one decision class).
In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <:: [outlook = overcast]
Play <:: [outlook = sunny] & [humidity ≤ 75]
Play <:: [outlook = rain] & [windy = false]
For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.
Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute to be the root of the tree. The goal of this experiment is first to
select the correct attribute, and then to test how the given criterion may evaluate the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all the other criteria did.
Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.
The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when it was applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion would perform better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules,
which cannot be measured from examples. When these criteria were applied to the learned rules, they provided very good results.
The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the most balanced appearance of its values in the different rules. The dominance criterion prefers attributes that exist in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.
In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
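One plausible formalization of the disjointness score (an illustrative reading, not necessarily the exact AQDT-2 definition): for every pair of decision classes, compare the sets of values the attribute takes in their rulesets, scoring 3 when the sets are disjoint, 2 when they merely overlap, 1 when one contains the other, and 0 when they are equal, then sum over all pairs.

```python
from itertools import combinations

def pair_score(v1, v2):
    """How well an attribute separates two classes, judged from the
    value sets it takes in their rules: 3 = disjoint, 2 = partial
    overlap, 1 = one set contains the other, 0 = identical."""
    if v1 == v2:
        return 0
    if v1 <= v2 or v2 <= v1:
        return 1
    if v1 & v2:
        return 2
    return 3

def disjointness(values_per_class):
    """Class disjointness of an attribute: the sum of pair scores over
    all pairs of decision classes."""
    return sum(pair_score(a, b)
               for a, b in combinations(values_per_class, 2))

# Hypothetical attribute taking {1,2} in C1's rules, {3,4} in C2's,
# and {1,2,3,4} in C3's: pair scores 3 + 1 + 1
print(disjointness([{1, 2}, {3, 4}, {1, 2, 3, 4}]))  # 5
```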
3.6 Decision Structures vs. Decision Trees
This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex, and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.
Table 3-7: Comparison between Decision Structures and Decision Trees
Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11. Both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes; this decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.
Figure 3-11: Decision structures learned by AQDT-2 using different criteria: a) using the disjointness criterion (5 nodes); b) using the importance score criterion (7 nodes, 9 leaves). (P = Positive, N = Negative)
To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called "the Imam's example", that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example is the fact that the information-based criteria depend on the frequency of the training examples per decision class, and on the frequency of the training examples over the different values of a given attribute.
The concept to learn is: P if x1 = x2, and N otherwise.
The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "-" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples per value of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select
either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.
Figure 3-12: The Imam's example, a case where learning decision structures (trees) from rules is better than learning them from examples: a) training examples; b) the optimal decision tree
AQ15c learned the following rules from this data:

P <:: [x1=1][x2=1]
P <:: [x1=2][x2=2]
N <:: [x1=1][x2=2]
N <:: [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
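This failure mode can be reproduced on a small, hypothetical dataset of the same XOR-like shape (a sketch, not the exact dataset of Figure 3-12): the truly relevant attributes x1 and x2 receive zero information gain, while a weakly predictive, skewed attribute x3 receives a positive score.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Information gain of splitting `examples` (dicts with a 'class'
    key) on attribute `attr`."""
    labels = [e["class"] for e in examples]
    groups = {}
    for e in examples:
        groups.setdefault(e[attr], []).append(e["class"])
    remainder = sum(len(g) / len(examples) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder

# Concept "P iff x1 = x2"; x3 is only loosely correlated with the class
data = []
for x1, x2 in [(1, 1), (2, 2), (1, 2), (2, 1)]:
    cls = "P" if x1 == x2 else "N"
    for x3 in ([1, 1, 1, 2] if cls == "P" else [1, 2, 2, 2]):
        data.append({"x1": x1, "x2": x2, "x3": x3, "class": cls})

print(info_gain(data, "x1"))      # 0.0 -- a truly relevant attribute
print(info_gain(data, "x2"))      # 0.0
print(info_gain(data, "x3") > 0)  # True -- the skewed attribute wins
```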
An example of a problem in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <:: [x1=2]
P <:: [x2=2]
N <:: [x1=1][x2=1 v 3]
N <:: [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that, for such a problem, learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10/9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and .85 for
the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).
When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using the new attribute "x1=2 v x2=2", with values 0 for "no" and 1 for "yes".

Figure 3-13: An example where decision rules are simpler than decision trees: a) the training data; b) the correct decision tree
CHAPTER 4 Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.
The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.
In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
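The sampling protocol above can be sketched as follows (a simplified sketch; `n_samples` would be 100 in the actual experiments, giving the 900 training samples and 900 complementary testing samples per problem):

```python
import random

def learning_curve_splits(examples, n_samples=100, seed=0):
    """Yield (fraction, train, test) splits: for each relative size
    10%..90%, draw `n_samples` random training samples; the remaining
    examples form the complementary test set."""
    rng = random.Random(seed)
    for tenth in range(1, 10):
        k = round(len(examples) * tenth / 10)
        for _ in range(n_samples):
            train = rng.sample(examples, k)
            test = [e for e in examples if e not in train]
            yield tenth / 10, train, test

# 9 sizes x 2 samples = 18 splits over a toy set of 50 examples
splits = list(learning_curve_splits(list(range(50)), n_samples=2))
print(len(splits))  # 18
```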
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushrooms, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.
Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (the best path from top to bottom), in terms of accuracy, time and complexity, were used as the default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times, with different sets of different sizes of training examples.
Figure 4-1: Design of a complete experiment
For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed where the learning system AQ17 was used instead of AQ15c. The analysis of some experiments included visualization of the training examples and of the concept learned by each learning program (AQ15c, AQDT-2 and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different, but equivalent, decision structures learned for a given set of training examples.
For each problem (i.e., database): 9 different relative sample sizes of training examples are selected (10%, ..., 90%); 100 random samples of each size are drawn from the original data for training; the 100 complementary samples, which remain from the original data after drawing the training data, are used for testing (900 samples for training and their 900 complementary samples for testing).
162 different parametrical experiments per training dataset (18 x 9); 16,200 experiments / one sample size (9 samples); 145,800 experiments / first portion of a problem (AQ15c followed by AQDT-2); 199,800 experiments / problem (first portion + C4.5 + constructive induction); 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.); 73 days (estimated running time).
The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection following that will describe a partial or full experimental analysis of one of the other problems.
4.2 Experiments with an Average-Size, Complex, and Noise-Free Problem: Wind Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples, and 115 (34%) were used for testing the obtained decision
structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.
Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1,3] (t: 18, u: 18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t: 3, u: 3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t: 2, u: 2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t: 2, u: 2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t: 2, u: 2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t: 2, u: 2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t: 2, u: 2)

Decision class C2:
1. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t: 28, u: 19)
2. [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t: 17, u: 6)
3. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t: 10, u: 4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t: 10, u: 2)
5. [x1=3,5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t: 9, u: 4)
6. [x1=2][x2=1,2][x3=1,2][x5=1,2,3][x4=1][x6=1][x7=1] (t: 7, u: 6)
7. [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t: 6, u: 4)
8. [x1=3,5][x2=2][x3=1][x7=1][x4=1,2][x5=1,2,3][x6=1,3] (t: 5, u: 5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t: 4, u: 4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1,3] (t: 4, u: 4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t: 4, u: 2)

Decision class C3:
1. [x1=2..5][x2=1,2][x3=1,2][x7=1..4][x4=1,2][x5=1,3][x6=2..4] (t: 41, u: 32)
2. [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2,4] (t: 27, u: 20)
3. [x1=1,3][x2=1][x3=1,2][x7=1,4][x4=2][x5=1,2][x6=2,3] (t: 19, u: 6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t: 13, u: 8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2,4] (t: 5, u: 5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1,4] (t: 4, u: 4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t: 1, u: 1)
Figure 4-2: Decision rules determined by AQ15c from the wind bracing data
62
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by values of x6 (in general, these could be groups of values), according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for a branch until all rules assigned to that branch are of the same class. That class is then assigned to the leaf.
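The recursion just described can be written compactly. The following is a minimal, hypothetical sketch (the rule representation is invented for illustration, and the toy `pick_first` stands in for AQDT-2's LEF-based attribute selection):

```python
def build_structure(rules, select_attr):
    """Recursive skeleton of the rule-to-structure transformation.
    `rules` is a list of (conditions, class) pairs, where conditions
    maps an attribute name to the set of values it allows (an absent
    attribute allows all values). A branch whose rules all share one
    class becomes a leaf; otherwise the criterion picks an attribute."""
    classes = {cls for _, cls in rules}
    if len(classes) == 1:
        return classes.pop()               # leaf: a single class
    attr, groups = select_attr(rules)      # e.g., highest disjointness
    node = {}
    for group in groups:
        # rules consistent with this value group (used attribute dropped)
        subset = [({a: v for a, v in cond.items() if a != attr}, cls)
                  for cond, cls in rules
                  if attr not in cond or cond[attr] & group]
        node[(attr, frozenset(group))] = build_structure(subset, select_attr)
    return node

def pick_first(rules):
    """Toy stand-in for the LEF criterion: take the first attribute
    alphabetically and create one branch per single value."""
    attr = sorted({a for cond, _ in rules for a in cond})[0]
    values = sorted({v for cond, _ in rules for v in cond.get(attr, set())})
    return attr, [{v} for v in values]

# Rules echoing Figure 3-6: R11 = [x1=2]&[x2=2] -> T1,
# R21 = [x1=1 v 2]&[x2=3 v 4] -> T2, and an [x1=4] rule -> T3
rules = [({"x1": {2}, "x2": {2}}, "T1"),
         ({"x1": {1, 2}, "x2": {3, 4}}, "T2"),
         ({"x1": {4}}, "T3")]
tree = build_structure(rules, pick_first)
print(tree[("x1", frozenset({4}))])  # T3
```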
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).
Since all rules containing [x6=4] belong to class C3, the branch marked by 4 is ended by a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated, only for those rules which contain [x6=1] as one of their conjunctions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.
For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window
setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly-selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some unclassified examples to the original ones, and continues until either all the training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly, and 18 were mismatches.
Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (Complexity: 17 nodes, 43 leaves)
Figure 4-4 shows a decision structure learned, in the default setting of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the (indefinite) "?" decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.
Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (Complexity: 5 nodes, 9 leaves)
Figure 4-5: A decision structure that does not contain attribute x1 (Complexity: 6 nodes, 8 leaves)
Figure 4-6 presents the decision structure from Figure 4-5, in which leaves were assigned candidate decisions with decision class probability estimates. Let us consider node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (3-13), the probability estimates for classes C1, C2, C3 and C4 under node x2 can be approximated as P(C1) = .66, P(C2) = .23, P(C3) = 0, and P(C4) = .11.
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that the rules whose combined t-weight represented 10% or less of the coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (Complexity: 5 nodes, 7 leaves)
Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class (Complexity: 3 nodes, 5 leaves)
To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram indicates different decision classes by different shades. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4 or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided, with their appropriate probabilities.
Figure 4-8: Diagrammatic visualization of the decision trees learned for different decision-making situations for the wind bracing data (white means the system cannot produce a decision without the missing attribute)
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings of AQ15c (two types of decision rules, characteristic or discriminant; three coverage modes, intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths, 1, 5 and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>; see Table 4-2) were selected for the experiments with Subsystem II.
These experiments were performed for four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the wind bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training example set.
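The evaluation protocol used throughout these experiments (random training samples, testing on the complement, averaging over repeated runs) can be sketched as below. The `learn` and `classify` callables stand in for the learning and classification programs; all names are assumptions:

```python
import random

# Illustrative sketch of the evaluation protocol: for each run, draw a
# random training sample of the given size, train on it, and test on its
# complement; report the mean accuracy over all runs.
def evaluate(examples, train_size, learn, classify, runs=100, seed=0):
    """examples: list of (attributes, decision_class) pairs."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        train = rng.sample(examples, train_size)
        test = [e for e in examples if e not in train]  # complement set
        model = learn(train)
        correct = sum(classify(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / runs
```

Because the test set is the complement of the training sample, its size shrinks as the training fraction grows, a detail that matters for interpreting single-error fluctuations in the accuracy curves later in this chapter.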
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I are fixed and selected parameters of Subsystem II are modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment are calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
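A minimal sketch of how such a generalization threshold could act as a stopping test, assuming it is compared against the minority-class fraction of examples at a node (the exact AQDT-2 computation may differ):

```python
# Hypothetical sketch of a generalization-degree stopping test: stop
# expanding a node (and assign the majority class) when examples covered
# by rules of the minority classes make up no more than `degree` percent
# of all examples reaching that node.
def should_generalize(class_counts, degree=10.0):
    """class_counts: mapping from decision class to the number of
    examples covered by that class's rules at the node."""
    total = sum(class_counts.values())
    minority = total - max(class_counts.values())
    return 100.0 * minority / total <= degree

print(should_generalize({"pos": 97, "neg": 3}))   # 3% minority -> stop
print(should_generalize({"pos": 80, "neg": 20}))  # 20% minority -> keep splitting
```

Raising `degree` yields smaller, more general trees at the risk of misclassifying the minority examples, which is exactly the trade-off explored in the Subsystem II experiments.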
[Four plots: predictive accuracy vs. relative training-sample size (%) under the settings <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>, each comparing AQDT-2 with AQ15c.]
Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem.
Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained with the default settings of AQDT-2: a pre-pruning threshold of 3% and a generalization degree of 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For
each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.
[Two plots: predictive accuracy vs. relative training-sample size (%) for the <Disj, Char> and <Intr, Char> settings.]
Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data.
[Three plots over the relative size of training examples (%) for the wind bracing data, comparing predictive accuracy, tree complexity, and learning time.]
Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data.
4.3 Experiments with Small-Size, Simple, and Noise-Free Problems: MONK-1
This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no).
The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.
Figure 4-12: A visualization diagram of the MONK-1 problem.
The AQDT-2 program, running in its default mode and with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with
41 nodes. For comparison, the C4.5 program for learning decision trees from examples was also applied to this same problem.
Positive rules:                     Negative rules:
1. [x5 = 1]                         1. [x1 = 1][x2 = 2, 3][x5 = 2..4]
2. [x1 = 3][x2 = 3]                 2. [x1 = 2][x2 = 1, 3][x5 = 2..4]
3. [x1 = 2][x2 = 2]                 3. [x1 = 3][x2 = 1, 2][x5 = 2..4]
4. [x1 = 1][x2 = 1]
Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.
The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% of the number of examples and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These rules were:
Pos <= [x5 = 1] v [x1 = x2] and Neg <= [x5 ≠ 1] & [x1 ≠ x2]
Table 4-3: Evaluation of the attribute selection criteria for the MONK-1 problem.
From these rules, the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on testing examples (which means that they represent exactly the target concept). By running AQDT-2 on the AQ15 rules, a simpler decision structure was produced (Figure 4-15-a).
Figure 4-14: The decision tree for the MONK-1 problem generated by C4.5 (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative).
Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: a) from the AQ15 rules (5 nodes, 7 leaves); b) from the AQ17 rules (2 nodes, 3 leaves).
Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>) were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by
AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.
Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.
[Four plots: predictive accuracy vs. relative training-sample size (%) under the settings <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>, each comparing AQDT-2 with AQ15c.]
Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem.
Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree
is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
[Two plots: predictive accuracy vs. relative training-sample size (%) for the <Disj, Char> and <Intr, Char> settings on MONK-1.]
Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data.
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.
[Three plots over the relative size of training examples (%) for MONK-1, comparing predictive accuracy, tree complexity, and learning time.]
Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem.
4.4 Experiments with Small-Size, Complex, and Noise-Free Problems: MONK-2
The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using its original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.
Figure 4-19: A visualization diagram of the MONK-2 problem.
Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>). They were selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules.
Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing set that represents the complement of the training examples.
Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
[Four plots: predictive accuracy vs. relative training-sample size (%) under the settings <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>, each comparing AQDT-2 with AQ15c.]
Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem.
Experiments with Subsystem II: Again, the parameters of Subsystem I are fixed and selected parameters of Subsystem II are modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained with the default settings of AQDT-2: a pre-pruning threshold of 3% and a generalization degree of 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
[Two plots: predictive accuracy vs. relative training-sample size (%) for the <Disj, Char> and <Intr, Char> settings on MONK-2.]
Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data.
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.
[Three plots over the relative size of training examples (%) for MONK-2, comparing predictive accuracy, tree complexity, and learning time.]
Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem.
4.5 Experiments with Small-Size, Simple, and Noisy Problems: MONK-3
MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples. Noisy examples are examples that are assigned the wrong decision class.
Figure 4-23: A visualization diagram of the MONK-3 problem.
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of the two programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.
[Four plots: predictive accuracy vs. relative training-sample size (%) under the settings <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>, each comparing AQDT-2 with AQ15c.]
Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem.
[Two plots: predictive accuracy vs. relative training-sample size (%) for the <Disj, Char> and <Intr, Char> settings on MONK-3.]
Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data.
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent a learning curve.
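The arithmetic behind these dips is simple; the sketch below uses a hypothetical pool of 110 examples, so the complements of 10% and 90% training samples contain 99 and 11 test examples, respectively:

```python
# A single misclassification weighs very differently depending on the
# size of the (complement) test set.
def error_rate(errors, test_size):
    """Error rate in percent for `errors` mistakes on `test_size` examples."""
    return 100.0 * errors / test_size

print(round(error_rate(1, 99), 2))  # one error on 99 test examples -> 1.01
print(round(error_rate(1, 11), 2))  # the same error on 11 examples -> 9.09
```

So a fixed number of mistakes produces a much larger apparent accuracy swing at large training fractions, where the test set is small.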
[Three plots over the relative size of training examples (%) for MONK-3, comparing predictive accuracy, tree complexity, and learning time.]
Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem.
4.6 Experiments with Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer
The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number; 2) Clump Thickness; 3) Uniformity of Cell Size; 4) Uniformity of Cell Shape; 5) Marginal Adhesion; 6) Single Epithelial Cell Size; 7) Bare Nuclei; 8) Bland Chromatin; 9) Normal Nucleoli; and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).
In this experiment, the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.
The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error
rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent a learning curve.
[Three plots over the relative size of training examples (%) for the breast cancer data, comparing predictive accuracy, tree complexity, and learning time.]
Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem.
4.7 Experiments with Large, Complex, and Noisy Problems: Mushroom Classification
Learning from the mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes: 1) cap-shape; 2) cap-surface; 3) cap-color; 4) bruises; 5) odor; 6) gill-attachment; 7) gill-spacing; 8) gill-size; 9) gill-color; 10) stalk-shape; 11) stalk-root; 12) stalk-surface-above-ring; 13) stalk-surface-below-ring; 14) stalk-color-above-ring; 15) stalk-color-below-ring; 16) veil-type; 17) veil-color; 18) ring-number; 19) ring-type; 20) spore-print-color; 21) population; and 22) habitat.
To perform the experiment, the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.
In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees.
The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning time is about the same.
The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent a learning curve.
[Three plots over the relative size of training examples (%) for the mushroom data, comparing predictive accuracy, tree complexity, and learning time.]
Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem.
4.8 Experiments with Small, Structured, and Noise-Free Problems: East-West Trains
Learning task-oriented decision structures from structural data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes, eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.
The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different lengths). To
describe the train problem in a format suitable for AQDT-2, a set of eight attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To record the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (i,j): the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1..4).
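The encoding convention can be sketched as follows. The helper function and the concrete attribute values are assumptions for illustration; only the i,j labeling scheme comes from the text:

```python
# Illustrative sketch: a train with k cars (each described by 8 per-car
# attributes) becomes a single variable-length example whose attribute
# labels "xij" encode both the car position i and the attribute number j.
def encode_train(cars, label):
    """cars: list of per-car attribute-value lists (8 values each)."""
    example = {}
    for i, car in enumerate(cars, start=1):        # i = car position
        for j, value in enumerate(car, start=1):   # j = attribute number
            example["x%d%d" % (i, j)] = value
    return example, label

# A hypothetical two-car eastbound train: 16 attribute-value pairs.
train, cls = encode_train([[1, 2, 1, 2, 1, 1, 3, 1],
                           [2, 1, 2, 1, 2, 1, 1, 2]], "eastbound")
print(train["x22"])  # shape attribute (j=2) of the second car (i=2) -> 1
```

Trains with different numbers of cars simply yield examples with different numbers of attribute-value pairs, which matches the variable-length input format described above.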
Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.
Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).
Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: a) using only descriptions of Car 1; b) using only descriptions of Car 2; c) using only descriptions of Car 3.
4.9 Experiments with Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984)
In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of examples (216 examples in total; half of the examples were in one class and the other half in the other class).
Table 4-8 and Figures 4-30-a and 4-30-b show the results graphically for the Congressional Voting-1984 problem. The results indicate that AQDT-2 generated decision trees that had a higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change of the size of the training example set was smaller.
Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the data.
[Two plots over the relative size of the training examples (%).]
Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2: a) accuracy of the decision tree as a function of the size of the set of training examples; b) size of the decision tree as a function of the size of the set of training examples.
4.10 Analysis of the Results
This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to
illustrate the relationship between concepts represented by decision rules and concepts represented
by the decision tree learned from these rules. This section also includes some examples of
describing different decision-making situations and the task-oriented decision structures learned for
each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases.
The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2
from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5
and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the
difference in predictive accuracy between two widths of the beam search is less than 2%, then the
smaller is better. Another was: if the predictive accuracy of different types of covers varies
(i.e., for one type of cover it is higher with some widths of the beam search or with a certain
rule type, and lower with others, than for another type of cover), the best cover is determined
according to the best width of the beam search and the best rule type.
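The first heuristic above can be made concrete. This is a minimal sketch (not part of the AQDT-2 implementation): given the observed accuracies for several beam-search widths, it prefers the smallest width whose accuracy is within 2 percentage points of the best.

```python
def pick_beam_width(accuracy_by_width, tolerance=2.0):
    """Among beam-search widths whose predictive accuracy is within
    `tolerance` percentage points of the best observed accuracy,
    prefer the smallest width (the 'smaller is better' heuristic)."""
    best = max(accuracy_by_width.values())
    close_enough = [w for w, acc in accuracy_by_width.items()
                    if best - acc < tolerance]
    return min(close_enough)
```

For example, with accuracies {1: 90.5, 5: 91.8, 10: 92.0}, width 1 is within 2 points of the best (92.0), so it wins despite not being the most accurate.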
Table 4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics
It was clear that AQDT-2 works better with characteristic rules than with discriminant ones. In most
problems, when changing the width of the beam search of the AQ15c system, the changes in the
predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better
than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting
rules were slightly bigger than those learned from disjoint rules.
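The disjoint-versus-intersecting distinction can be checked mechanically. In the sketch below (an assumed simplification of AQ15c's attributional conditions, not its actual representation), a rule is a mapping from attributes to the sets of values it allows; two rules intersect, i.e. cover at least one common example, exactly when every attribute constrained by both allows some common value.

```python
def rules_intersect(rule_a, rule_b):
    """Rules are dicts: attribute -> set of allowed values.
    An attribute missing from a rule places no constraint on it.
    Two rules intersect iff every attribute constrained by both
    rules admits at least one common value."""
    shared = set(rule_a) & set(rule_b)
    return all(rule_a[attr] & rule_b[attr] for attr in shared)
```

A cover is disjoint when no pair of its rules passes this test.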
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of
heuristics was used to summarize the results. These heuristics are: 1) if the difference between
the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is
considered to be the same; otherwise, one system's predictive accuracy is considered higher and
the other's lower; 2) if the difference between the average learning times is within ±0.1 seconds,
the learning time is considered the same.
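These two tolerance rules can be sketched directly. The functions below are illustrative only; they reproduce the verdict logic described above, returning the name of the better system or "Same" when the difference falls within the tolerance.

```python
def accuracy_verdict(aqdt2_acc, c45_acc, tolerance=2.0):
    """Heuristic 1: accuracies within +/- 2 percentage points count as the same;
    otherwise the system with the higher accuracy wins."""
    if abs(aqdt2_acc - c45_acc) <= tolerance:
        return "Same"
    return "AQDT-2" if aqdt2_acc > c45_acc else "C4.5"

def time_verdict(aqdt2_time, c45_time, tolerance=0.1):
    """Heuristic 2: learning times within +/- 0.1 seconds count as the same;
    otherwise the faster (smaller time) system wins."""
    if abs(aqdt2_time - c45_time) <= tolerance:
        return "Same"
    return "AQDT-2" if aqdt2_time < c45_time else "C4.5"
```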
Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary
includes comparing the predictive accuracy, the size of the learned decision trees, and the learning
time. The value in each cell refers to the system which performed better (possible values are AQDT-2,
C4.5, and Same). When the two systems produced similar or close results, a letter is associated
with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).
Table 4-10 Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same-X means similar performance of both systems, where AQDT-2 has the advantage if X=A and C4.5 has the advantage if X=C.
Some conclusions can be drawn from this comparison. When the training data represents a small
portion of the representation space, AQDT-2 produces bigger but more accurate decision trees,
while C4.5 produces smaller but less accurate decision trees. When the training data
represents a very large portion of the representation space, AQDT-2 usually produces smaller
decision trees with better accuracy, except with noisy data. The size of the decision trees learned by
C4.5 grows relatively quickly as the training data increases. Also, C4.5 works better than AQDT-2
with noisy data. The reasons are that AQDT-2 over-generalizes the decision rules, while
C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be
much less than that of C4.5. However, on some data sets it takes more time, because in some
situations where there is not enough information to reach a decision, the program goes into a loop of
testing all attributes. The probabilistic approach for handling this problem is not implemented yet.
To explain the relationship between the input to and the output from AQDT-2, and to clarify some
of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of
diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.
Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2
system. The experiment contains 169 training examples covering both the positive and negative decision
classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The
shaded areas represent decision rules of the positive decision class. The white areas represent
non-positive coverage.
Figure 4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. Shaded cells with a
dot marker indicate false positive errors (AQ15c classifies the cell as positive while it should be
negative), and non-shaded cells with a dot marker indicate false negative errors (AQ15c classifies the
cell as negative while it should be positive).
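The two error types read off the diagrams can be counted directly from predictions. This is a generic illustrative helper, not part of the AQ15c or AQDT-2 tooling:

```python
def count_errors(predicted, actual, positive=True):
    """Count false positives (predicted positive, actually negative) and
    false negatives (predicted negative, actually positive) for one class."""
    fp = sum(1 for p, a in zip(predicted, actual)
             if p == positive and a != positive)
    fn = sum(1 for p, a in zip(predicted, actual)
             if p != positive and a == positive)
    return fp, fn
```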
Figure 4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this
diagram, cells with one shading pattern indicate portions of the representation space that were classified as
positive by both AQ15c and AQDT-2; cells with a second pattern are portions of the representation
space that were classified as positive by AQ15c but as negative by AQDT-2; and cells with a third
pattern represent portions of the representation space where AQDT-2 over-generalized decision rules
belonging to the positive decision class. The decision tree shown in this diagram was learned with
default settings (i.e., with a 10% generalization threshold).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike the
MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy.
This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.
Figure 4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem
Figure 4-34 is similar to Figure 4-33, but illustrates the false positive and false negative
errors. Cells with one marker indicate portions of the representation space with false positive errors; cells
with another marker represent portions of the representation space with false negative errors. Comparing
Figures 4-34 and 4-32 shows that more errors occurred because of the over-generalization.
Figure 4-34 A visualization diagram showing the testing errors of the AQDT-2 decision tree
Figure 4-35 A visualization diagram of a decision tree learned by AQDT-2
after reducing the generalization threshold to 1%
CHAPTER 5 CONCLUSIONS
5.1 Summary
This thesis introduces the concept of a decision structure and describes a methodology for
efficiently determining a single-parent structure from a set of decision rules. A decision structure is
an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a
given object or situation. Having higher expressive power than the familiar decision tree, a
decision structure is able to represent some decision processes in a much simpler way than a
decision tree.
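A decision structure in this general sense can be sketched as a small directed acyclic graph of test nodes. The sketch below is illustrative only (it is not the AQDT-2 representation, which generates single-parent structures): it encodes "at least two of x1, x2, x3 equal 1", and the test node on x3 is shared by two parents, something a strict decision tree would have to duplicate.

```python
class Node:
    """Either a test node (attribute set, decision None) or a leaf (decision set)."""
    def __init__(self, attribute=None, decision=None):
        self.attribute = attribute
        self.decision = decision
        self.branches = {}  # attribute value -> child Node; children may be shared

def classify(node, example):
    """Follow the conditional order of tests until a decision is reached."""
    while node.decision is None:
        node = node.branches[example[node.attribute]]
    return node.decision

# "At least two of x1, x2, x3 are 1". The x3 test has two parents (a and b),
# which is legal in a decision structure but not in a decision tree.
yes, no = Node(decision="yes"), Node(decision="no")
x3 = Node(attribute="x3"); x3.branches = {1: yes, 0: no}
a = Node(attribute="x2"); a.branches = {1: yes, 0: x3}   # reached when x1 = 1
b = Node(attribute="x2"); b.branches = {1: x3, 0: no}    # reached when x1 = 0
root = Node(attribute="x1"); root.branches = {1: a, 0: b}
```

An equivalent tree would need two copies of the x3 subtree, one under each branch of x2; sharing the node is what makes symmetric functions compact in a structure.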
The proposed methodology advocates storing the decision knowledge in the declarative form of
decision rules, which are determined by induction from examples or obtained from an expert. A decision
structure is generated on line, when it is needed, and in the form most suitable for the given
decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this
methodology: in order to determine a decision structure from examples, it is necessary to go
through two levels of processing, while there exist methods that produce decision trees efficiently
and directly from examples. Putting aside the issue that decision structures are more general than
decision trees, it is argued here that this methodology has many advantages that fully justify it. The
main advantages include: 1) in the experiments conducted, decision structures produced by the
method had higher predictive accuracy and were simpler (sometimes significantly so) than
decision trees produced from the same data; 2) decision structures produced from rules can be
easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive
attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in
the declarative form of modular decision rules, the methodology makes it easy to modify decision
knowledge to account for new facts or changing conditions; 4) the process of deriving a decision
structure from a set of rules is very fast and efficient, because the number of rules per class is
usually much smaller than the number of examples per class; and 5) the presented method produces
decision structures whose nodes can be original attributes or constructed attributes that extend the
original knowledge representation (this is due to the application of the constructive induction programs
AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate
decision rules first and then create decision structures from them. In the AQDT-2 method, this first
phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based
methods were computationally complex, the most recent implementation is very fast
(Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further.
First of all, there is a need for further testing of the method. Although the experiments conducted so
far have produced more accurate and simpler decision structures than decision trees obtained in a
standard way from the same input data, more experiments are necessary to arrive at conclusive
results. A mathematical analysis of the method has not been performed and is highly desirable.
The current method generates only single-parent decision structures (every node has only one
parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in
which a node can have several parents) will make it more powerful. It will enable the method to
represent much more simply decision processes that are difficult to represent by a decision tree
(e.g., a symmetric logical function). The decision structures produced by the method are usually
more general than the decision rules from which they were created (they may assign decisions to
cases that the rules could not classify). Further research is needed to determine the relationship
between the certainty of decision rules and the certainty of decision structures derived from them.
The AQ-based program allows a user to generate both characteristic and discriminant decision rules
(Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating
decision structures from different types of rules.
5.2 Contributions
The dissertation introduces the concept of a decision structure and describes a methodology for
efficiently determining a single-parent structure from a set of decision rules. A major advantage of
the proposed method is that it allows one to efficiently determine a decision structure that is
optimized for any given decision-making situation. For example, when some attribute is difficult
to measure, the method creates a decision structure that shows the situations in which measuring
this attribute can be avoided. The method is quite efficient, and the time of determining a decision
structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to
experiment with different criteria for structure generation in order to obtain the most desirable
structure.
Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be
simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e.,
directly from examples. In the experiments involving artificial problems and real-world problems,
AQDT-2-generated decision structures outperformed those generated by the well-known C4.5
decision tree learning program in most problems, both in terms of average predictive accuracy and
average simplicity of the generated decision trees.
The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the
method is independent of the rule learning step, it could potentially be applied with other decision rule learning
systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992). "Constructive Induction in Structural Design." Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.
Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990). "Integrated Learning in a Real Domain." Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.
Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992). "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System." Machine Learning, Vol. 8, No. 1, pp. 5-43.
Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993). "AQ17: A Multistrategy Learning System: The Method and User's Guide." Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.
Bohanec, M. and Bratko, I. (1994). "Trading Accuracy for Simplicity in Decision Trees." Machine Learning, Vol. 15, No. 3, Kluwer Academic Publishers.
Bratko, I. and Lavrac, N. (Eds.) (1987). Progress in Machine Learning. Sigma Press, Wilmslow, England.
Bratko, I. and Kononenko, I. (1986). "Learning Diagnostic Rules from Incomplete and Noisy Data." In B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and Regression Trees. Belmont, California: Wadsworth Int. Group.
Clark, P. and Niblett, T. (1987). "Induction in Noisy Domains." In I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.
Cestnik, B. and Bratko, I. (1991). "On Estimating Probabilities in Tree Pruning." Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.
Cestnik, B. and Karalic, A. (1991). "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction." Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.
Gaines, B. (1994). "Exception DAGs as Knowledge Structures." Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.
Hart, A. (1984). "Experience in the Use of an Inductive System in Knowledge Engineering." In M. Bramer (Ed.), Research and Developments in Expert Systems, Cambridge: Cambridge University Press.
Hunt, E., Marin, J. and Stone, P. (1966). Experiments in Induction. New York: Academic Press.
Imam, I.F. and Michalski, R.S. (1993a). "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, from the proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.
Imam, I.F. and Michalski, R.S. (1993b). "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study." Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. & Zemankova, M. (Eds.), Kluwer Academic Publishers, MA.
Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993). "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques." Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.
Imam, I.F. and Vafaie, H. (1994). "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection." Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam, I.F. and Michalski, R.S. (1994). "From Fact to Rules to Decisions: An Overview of the FRO-1 System." Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.
Kohavi, R. (1994). "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations." Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.
Kohavi, R. and Li, C. (1995). "Oblivious Decision Trees, Graphs, and Top-Down Pruning." Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian, O.L. and Wolberg, W.H. (1990). "Cancer Diagnosis via Linear Programming." SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994). "International East-West Challenge." Oxford University, UK.
Michalski, R.S. (1973). "AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition." Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.
Michalski, R.S. (1978). "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams." Technical Report No. 898, Urbana: University of Illinois, March.
Michalski, R.S. (1983). "A Theory and Methodology of Inductive Learning." Artificial Intelligence, Vol. 20 (pp. 111-116).
Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986). "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains." Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.
Michalski, R.S. (1990). "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation." In Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Michalski, R.S. and Imam, I.F. (1994). "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System." Lecture Notes in Artificial Intelligence, Springer-Verlag, from the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers, J. (1989a). "An Empirical Comparison of Selection Measures for Decision-Tree Induction." Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.
Mingers, J. (1989b). "An Empirical Comparison of Pruning Methods for Decision-Tree Induction." Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.
Niblett, T. and Bratko, I. (1986). "Learning Decision Rules in Noisy Domains." Proceedings of Expert Systems '86, Brighton; Cambridge: Cambridge University Press.
Quinlan, J.R. (1979). "Discovering Rules by Induction from Large Collections of Examples." In D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan, J.R. (1983). "Learning Efficient Classification Procedures and Their Application to Chess End Games." In R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.
Quinlan, J.R. (1986). "Induction of Decision Trees." Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.
Quinlan, J.R. (1987). "Simplifying Decision Trees." International Journal of Man-Machine Studies, 27 (pp. 221-234).
Quinlan, J.R. (1990). "Probabilistic Decision Trees." In Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, California.
Smyth, P., Goodman, R.M. and Higgins, C. (1990). "A Hybrid Rule-based/Bayesian Classifier." Proceedings of ECAI-90, Stockholm, August.
Sokal, R. and Rohlf, F. (1981). Biometry. Freeman Publishers, San Francisco.
Thrun, S.B., Mitchell, T. and Cheng, J. (Eds.) (1991). "The MONK's Problems: A Performance Comparison of Different Learning Algorithms." Technical Report, Carnegie Mellon University, October.
Wnek, J. and Michalski, R.S. (1994). "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments." Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
Vita
Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in
Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He
received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU.
Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, and conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition.
Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium FLAIRS-95 and the program committee of the Florida Artificial Intelligence Research Symposium FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive
systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
[Arabic inscription]
Dedication
To my mother my brothers and my sister
TABLE OF CONTENTS
TITLE Page
ABSTRACT 1
CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6
CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19
CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
3.3 Generating Decision Structures From Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structures to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53
CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments With Large Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large Size, Complex, and Noisy Problems: Mushroom classification 84
4.8 Experiments With Small Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88
CHAPTER 5 CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96
REFERENCES 98
VITA 102
LIST OF TABLES
No. TITLE Page
2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria from four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using conditions of AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90
LIST OF FIGURES
No TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept Voting pattern of
Democratic Representatives 27
3-3 Ven Diagrams ofPossible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding the Ven diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of none leaf nodes 41
3-6 Decision rules for selecting the best tool for testing a software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 The Imams example Example where learning decision structures (trees)
from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
Xl
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C45 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute xl 64
4-6 A decision structure without Xl with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the
assumption of 10 classification error in the training data 65
4-8 Diagramatic visualization of decision trees learned for different decision
making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram shows the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram shows testing errors of AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing
the generalization degree to 1 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M Fahmi Imam PhD
School of Information Technology and Engineering
George Mason University, Fall 1995
Ryszard S Michalski, Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).
This approach separates the function of generating a knowledge base from the function of using the knowledge base for decision-making. The first function focuses on learning accurate, consistent and complete concept descriptions expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).
The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that decision structures learned by it usually outperform, in terms of accuracy and average size of the decision structures, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using artificial problems such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al, 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms) and political data (for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
1.1 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge but also
to use this knowledge for decision-making The main step in the development of systems for
decision-making is the creation of a knowledge structure that characterizes the decision-making
process The form in which knowledge can be easily obtained may however differ from the form
in which it is most readily used for decision-making It is therefore important to identify the form
of knowledge representation that is most appropIiate for learning (eg due to ease of its
modification) and the form that is most convenient for decision making
A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests
(which may correspond to a single attribute a function of attributes or a relation) the branches are
assigned possible test outcomes or ranges of outcomes and the leaves are assigned a specific
decision a set of candidate decisions with corresponding probabilities or an undetermined
decision A decision structure reduces to a familiar decision tree when each node is assigned a
single attribute and has at most one parent when the branches from each node are assigned single
values of that attribute and when leaves are assigned single definite decisions Thus the problem
of generating a decision structure is a generalization of the problem of generating a decision tree
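The decision-structure representation just described can be sketched as a small data type. This is a minimal illustration under assumed names and encodings (the classes, the tuple-keyed branch dictionaries and `classify` are not part of the dissertation):

```python
# Minimal sketch of a decision structure: nodes hold a test, branches map
# outcome ranges to subtrees, leaves hold candidate decisions with
# probabilities. All names and encodings here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Leaf:
    decisions: dict            # decision -> probability; {} = undetermined

@dataclass
class Node:
    test: str                  # attribute (or function/relation) name
    branches: dict = field(default_factory=dict)   # outcome range -> subtree

def classify(tree, example):
    """Follow the branch whose outcome range contains the example's value."""
    while isinstance(tree, Node):
        outcome = example[tree.test]
        tree = next(sub for rng, sub in tree.branches.items() if outcome in rng)
    return tree.decisions

# A decision tree is the special case of single-value branches and single
# definite decisions; the second leaf shows the more general case.
tree = Node("x1", {(0,): Leaf({"A1": 1.0}),
                   (1, 2): Leaf({"A2": 0.7, "A3": 0.3})})
print(classify(tree, {"x1": 2}))   # {'A2': 0.7, 'A3': 0.3}
```

The final leaf illustrates the general case: a set of candidate decisions with corresponding probabilities rather than one definite decision.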
Decision trees are typically generated from a set of examples of decisions The essential
characteristic of any such method is the attribute selection criterion used for choosing attributes to
be assigned to the nodes of the decision tree being built Such criteria include the entropy
reduction, the gain and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman et al, 1984) and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions
do not hold For example in some situations measuring certain attributes may be difficult or costly
(eg in the doctor-patient example a brain or blood test is needed which is very expensive or the
tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the
root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far
away from the root) If an attribute cannot be measured at all it is useful to either modify the
structure so that it does not contain that attribute or-when this is impossible-to indicate
alternative candidate decisions and their probabilities A restructuring is also desirable if there is a
significant change in the frequency of occurrence of different decisions (eg in the doctor-patient
example the doctor may request a decision structure expressed in a specific set of symptoms
biased to classify one or more diseases or specify a certain order of testing)
A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite
difficult This is because a decision structure is a procedural representation that imposes an
evaluation order on the tests In contrast no evaluation order is imposed by a declarative
representation such as a set of decision rules Tests (conditions) of rules can be evaluated in any
order Thus for a given set of rules one can usually build a huge number of logically equivalent
decision structures (trees) which differ in the test ordering Due to the lack of order constraints
a declarative representation (rules) is much easier to modify to adapt to different situations than a
procedural one (a decision structure or a tree) On the other hand to apply decision rules to make a
decision one needs to decide in which order tests are evaluated and thus needs a decision
structure
An attractive solution to these opposite requirements is to acquire and store knowledge in a
declarative form and transform it to a decision structure when it is needed for decision-making
This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than by generation from training examples. Thus this process could be done on line without any delay noticeable to the user. Such virtual decision structures are easy to tailor to any given decision-making situation.
This approach allows one to generate a decision structure that avoids or delays evaluating an
attribute that is difficult to measure in some decision-making situation or that fits well a particular
frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus such an approach has many potential advantages.
This dissertation presents a new system called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15 (Michalski et al, 1986) or the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al, 1993).
To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating unknown nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to inability to evaluate an attribute associated with an intermediate node.
To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments has been designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structure learned from them, and comparing decision trees learned by the AQDT-2 system with those of the well-known C4.5 system (Quinlan, 1993) for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The
experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2 and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al, 1994), Engineering Design-wind bracings (Arciszewski et al, 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990) and Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes). MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.
1.2 The Problem Statement
There are many limitations and problems associated with using decision trees for decision-making
CHAPTER 2 RELATED RESEARCH
2.1 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. The method proposed several attribute selection criteria of increasing power, based on the main criterion, the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.
Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.
Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint; in other words, for any two rules there exists a condition with the same attribute but with different values in each rule.
Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.
Definition 2-4: A diagram for a given cover is a table constructed graphically by representing in a two-dimensional space all possible combinations of attribute values, locating on the diagram the condition parts of the given rules, and marking them with the action specified by each rule.
Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal
decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has shown that if only one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree) there will have to be at least two leaves representing this rule in the final decision tree.
The attribute selection criterion MAL introduced in (Michalski, 1978) prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1], and x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3] and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.
Figure 2-1 An example to illustrate how attributes break rules
One of the criteria defined by Michalski is the first degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute to be selected is the one which breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).
Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.
Example: Learn a decision tree from the following decision table.
The minimal cover consists of the following rules:
A1 <:: [x2=0] v [x1=0][x2=2]    A2 <:: [x2=1] v [x1=2][x2=2]    A3 <:: [x1=1][x2=2]
The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3 and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it will add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as a root of
the decision tree is x2. Then three branches are attached to the root node and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1 a leaf node is generated. For x2 = 2 another attribute is selected to be a node in the tree; in this case x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.
Figure 2-2 A decision tree learned from the decision table in Table 2-1
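The static cost estimate can be sketched in a few lines of code. The encoding below (rules as dictionaries mapping attributes to sets of admitted values, with an unmentioned attribute admitting its whole domain) and the attribute domains are assumptions made for illustration, applied to the minimal cover of the worked example:

```python
# Sketch of the first-degree (static) cost estimate: count how many rules
# an attribute would break. The rule encoding and the domains are
# illustrative assumptions, not the dissertation's actual format.

def breaks(rule, attr, domains):
    """An attribute breaks a rule if the rule admits more than one of the
    attribute's values (an unmentioned attribute admits its whole domain)."""
    return len(rule.get(attr, domains[attr])) > 1

def mal(rules, attr, domains):
    """Static cost estimate (MAL): the number of rules broken by attr."""
    return sum(breaks(r, attr, domains) for r in rules)

# Minimal cover from the worked example:
rules = [
    {"x2": {0}},              # A1 <:: [x2=0]
    {"x1": {0}, "x2": {2}},   # A1 <:: [x1=0][x2=2]
    {"x2": {1}},              # A2 <:: [x2=1]
    {"x1": {2}, "x2": {2}},   # A2 <:: [x1=2][x2=2]
    {"x1": {1}, "x2": {2}},   # A3 <:: [x1=1][x2=2]
]
domains = {"x1": {0, 1, 2}, "x2": {0, 1, 2}, "x3": {0, 1}, "x4": {0, 1}}

for a in ("x1", "x2", "x3", "x4"):
    print(a, mal(rules, a, domains))   # x1: 2, x2: 0, x3: 5, x4: 5
```

Under this encoding the counts reproduce the MAL values quoted in the example (2 for x1, 0 for x2, 5 for x3 and x4), so x2 is selected as the root.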
2.2 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating a decision tree that classifies a set of examples according to the decision classes they belong to. The essential aspect of any
inductive decision tree method is the attribute selection criterion The attribute selection
criterion measures how good the attributes are for discriminating among the given set of
decision classes The best attribute according to the selection criterion is chosen to be assigned
to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has been subsequently modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.
Attribute selection criteria can be divided into three categories: logic-based, information-based and statistics-based. The logic-based criteria for selecting attributes use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory. These criteria measure the information conveyed
by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure and the gain criteria (Quinlan, 1979, 1983), the gini index of diversity (Breiman et al, 1984), the gain-ratio measure (Quinlan, 1986) and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions for determining whether or not there is a correlation. The attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).
Niblett and Bratko (1986), Quinlan (1987) and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial complete decision tree and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring probabilities of an example belonging to different classes.
The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.
2.2.1 Building Decision Trees Using Information-based Criteria
This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered to be one of the most stable, accurate and fastest programs for learning decision trees from examples.
Learning decision trees from examples requires a collection of examples, each represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing a decision tree from a set of cases (Hunt, Marin & Stone, 1966).
The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the Gain Criterion, which uses the frequency of each decision class in the given set of training examples.
Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and classifies the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.
The Gain Criterion: The gain criterion is based on information theory; that is, the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem x1, ..., xn are given attributes and C1, ..., Ck are decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:
freq(Ci, S) = number of examples in S belonging to Ci        (2-1)
Suppose that |S| is the total number of examples in S; the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.
The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2(freq(Ci, S) / |S|) bits.
The expected information from such a message stating class membership is given by:
info(S) = - Σ (i=1..k) (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|) bits        (2-2)
info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.
Suppose that we selected an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, infoX(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:
infoX(T) = Σ (i=1..k) (|Ti| / |T|) info(Ti)        (2-3)
The information gained by partitioning the training examples T into subsets using the attribute X is given by:
gain(X) = info(T) - infoX(T)        (2-4)
The attribute to be selected is the attribute with the maximum gain value.
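Equations 2-1 through 2-4 can be sketched directly in code; the function names are illustrative, and the counts used in the demonstration are those of the weather example worked through next (9 Play and 5 Don't Play examples, split by outlook into 2/3, 4/0 and 3/2):

```python
# Sketch of the gain criterion (equations 2-2 to 2-4); names are illustrative.
from math import log2

def info(class_counts):
    """Entropy of a set given its per-class example counts (eq. 2-2)."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c)

def gain(class_counts, partition):
    """gain(X) = info(T) - infoX(T), where partition lists the per-subset
    class counts produced by splitting on X (eq. 2-3 and 2-4)."""
    total = sum(class_counts)
    info_x = sum(sum(s) / total * info(s) for s in partition)
    return info(class_counts) - info_x

print(round(info([9, 5]), 2))                      # 0.94
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))      # ≈ 0.247 (0.246 after truncation)
```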
The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains attributes such as a social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information that is gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.
Following steps similar to those used above, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by:
split info(X) = - Σ (i=1..n) (|Ti| / |T|) log2(|Ti| / |T|)        (2-5)
The gain ratio is given by:
gain ratio(X) = gain(X) / split info(X)        (2-6)
and it expresses the proportion of information generated by the split that is useful for classification.
Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the set of training examples.
First, determine the amount of information gained by selecting the attribute outlook to be the root of the decision tree. This attribute divides the training examples into three subsets: sunny, with five examples, two of which belong to the class Play; overcast, with four examples, all of which belong to the class Play; and rain, with five examples, three of which belong to the class Play. To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class Play and five belong to the class Don't Play.
info(T) = - (9/14) log2(9/14) - (5/14) log2(5/14) = 0.94 bits
When using outlook to divide the training examples, the information becomes:
infoX(T) = (5/14) [-(2/5) log2(2/5) - (3/5) log2(3/5)]
         + (4/14) [-(4/4) log2(4/4) - (0/4) log2(0/4)]
         + (5/14) [-(3/5) log2(3/5) - (2/5) log2(2/5)] = 0.694 bits
By substituting in equation 2-4, the gain of information resulting from using the attribute outlook to split the training examples equals 0.246. The gain information for windy is 0.048.
Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for outlook is determined as follows:
split info = - (5/14) log2(5/14) - (4/14) log2(4/14) - (5/14) log2(5/14) = 1.577 bits
The gain ratio for outlook = 0.246 / 1.577 = 0.156
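The split-info and gain-ratio arithmetic above can be checked with a short sketch of equations 2-5 and 2-6 (the function name is illustrative):

```python
# Sketch of split info (eq. 2-5) and the gain ratio (eq. 2-6).
from math import log2

def split_info(subset_sizes):
    """Potential information of dividing T into subsets of the given sizes."""
    total = sum(subset_sizes)
    return -sum(n / total * log2(n / total) for n in subset_sizes if n)

s = split_info([5, 4, 5])    # outlook's three branches: sunny, overcast, rain
print(round(s, 3))           # 1.577
print(round(0.246 / s, 3))   # 0.156  (gain ratio for outlook)
```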
Figure 2-3 A decision tree learned using the gain criterion for selecting attributes
The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute C4.5 generates two branches: one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.
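Binary thresholding of a continuous attribute can be sketched as follows. This is a simplification for illustration (candidate cuts are placed midway between consecutive sorted values and scored by gain) and is not claimed to match C4.5's exact candidate-selection procedure; the data values are toy assumptions:

```python
# Sketch of threshold selection for a continuous attribute: try each
# candidate cut between consecutive sorted values and keep the cut with
# the best information gain. Illustrative only, not C4.5's exact rule.
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(labels.count(c) / total * log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    best = (float("-inf"), None)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue
        cut = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [c for v, c in pairs if v <= cut]
        right = [c for v, c in pairs if v > cut]
        g = entropy([c for _, c in pairs]) - (
            len(left) / len(pairs) * entropy(left)
            + len(right) / len(pairs) * entropy(right))
        best = max(best, (g, cut))
    return best[1]

print(best_threshold([64, 65, 68, 70, 75, 80],
                     ["y", "y", "y", "n", "n", "n"]))   # 69.0
```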
Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
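The Laplace ratio is straightforward to compute; the numbers in the demonstration below are illustrative:

```python
# Sketch of the Laplace error ratio used when comparing subtrees for
# pruning: (e + 1) / (n + 2) for a leaf with n training examples of
# which e are misclassified.
def laplace_error(n, e):
    return (e + 1) / (n + 2)

print(laplace_error(10, 0))   # 1/12 ≈ 0.083: even a pure leaf keeps a nonzero error estimate
```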
2.2.2 Building Decision Trees Using Statistics-based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses Chi-square statistics to measure the association between two attributes. When building decision trees, the method is implemented such that it determines the association between each attribute and the decision classes. The attribute to be selected is the one with the greatest value.
To determine the Chi-square value for an attribute, let aij be the number of examples in class number i where the attribute A takes value number j; in other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by:
Chi-square(A) = Σ (i=1..n) Σ (j=1..m) [ (aij - Eij)² / Eij ]        (2-7)
where n is the number of decision classes and m is the number of values of the given attribute. Also,
Eij = (TCi × TVj) / T        (2-8)
where TCi and TVj are the total number of examples belonging to the decision class Ci and the total number of examples where the attribute A takes value vj, respectively, and T is the total number of examples.
Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequency of different combinations of values between the decision class and both the Outlook and the Windy attributes. Table 2-4 shows the expected values, TCi and TVj, of the frequencies in Table 2-3 of different attribute values for different decision classes.
To determine the association value between the decision classes and both the attribute Windy and the attribute Outlook, the observed Chi-square values are:
Chi-square(Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/3.2] + [(2-2.9)²/3.2]
= 0.21 + 0.39 + 0.25 + 0.25 = 1.1
Chi-square(Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2] + [(3-1.8)²/1.8]
+ [(0-1.4)²/1.4] + [(2-1.8)²/1.8] = 0.45 + 0.75 + 0.2 + 0.8 + 1.4 + 0.02 = 3.62
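The same Chi-square values can also be computed directly from the contingency tables. The sketch below applies equations 2-7 and 2-8 without rounding the expected frequencies, so its totals differ somewhat from the hand calculation above, which works with rounded expected values; the table encoding is an assumption for illustration:

```python
# Sketch of the Chi-square attribute measure (eq. 2-7 and 2-8), computed
# with exact expected frequencies rather than the rounded values used in
# the hand calculation.
def chi_square(table):
    """table[i][j] is a_ij: the count of class i with attribute value j."""
    row = [sum(r) for r in table]            # per-class totals (TCi)
    col = [sum(c) for c in zip(*table)]      # per-value totals (TVj)
    total = sum(row)                         # T
    return sum((table[i][j] - row[i] * col[j] / total) ** 2
               / (row[i] * col[j] / total)
               for i in range(len(row)) for j in range(len(col)))

windy   = [[3, 6], [3, 2]]           # Play / Don't Play vs. true / false
outlook = [[2, 4, 3], [3, 0, 2]]     # Play / Don't Play vs. sunny / overcast / rain
print(round(chi_square(windy), 2))    # 0.93
print(round(chi_square(outlook), 2))  # 3.55
```

Either way the comparison comes out the same: Outlook shows a far stronger association with the class than Windy.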
Applying the same method to the other attributes, the results favor the attribute Outlook. Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets and the same process is repeated on each subset.
Table 2-5 shows a summary of these criteria and their basic evaluation functions.
Table 2-5 Attribute selection criteria and their basic evaluation measures
Info Measure (IM), Gain and Gain Ratio:  Entropy(S) = - Σ (i=1..k) (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)
G-statistic:  G = 2N × IM  (N = number of examples)
Chi-square:  Chi-square(A, B) = Σ (i=1..n) Σ (j=1..m) [ (aij - Eij)² / Eij ]
2.2.3 Analysis of Attribute Selection Criteria
This subsection briefly introduces the analysis of different selection criteria done by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, G statistic, Gini index of diversity, Marshall correction and Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples
(i.e., examples may belong to more than one decision class) to observe how the selected criteria
evaluate the given attributes. The problem has two decision classes and two attributes, X and
Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The
training examples were unevenly spread between the two values of X. Attribute Y has three
values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the
contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split
provided by the six criteria. Mingers noted that the measures that are not based on information
theory give radiation (attribute X here) less weight. This may be because the zero in the first
row of radiation has a greater influence in the log calculation. In the case of the Chi-square
criterion, a zero cell adds the maximum association between any two attributes, because the
Chi-square contribution of a zero cell is the expected value of that cell.
Now let us examine results from another experiment done by Mingers. In this experiment,
Mingers used four different data sets to generate decision trees for eleven different criteria. In
the final results, he compared the total number of nodes and the total error rate produced by
each criterion over all given problems. Table 2-8 shows the final results for five selected
criteria only.
Table 2-8: Results comparing the total accuracy and size of the decision trees produced by different attribute selection criteria on four problems
This experiment was performed on four real-world data sets. These data are concerned with
profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types
of Iris, and recognizing LCD display digits. The data was divided randomly, 70% for training
and 30% for testing. For more details, see Mingers (1989a).
2.3 Learning Decision Structures

Considering the proposed definition of decision structures given above, two related lines of
research are described in this section. Gaines (1994) and Kohavi (1994) proposed two
approaches for generating decision structures that share some of the earlier ideas of Imam and
Michalski (1993b).
In the first approach, Brian Gaines introduces a method for transforming decision rules or
decision trees into exceptional decision structures. The method builds an Exception Directed
Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning
either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion
to the root node, it places it on a temporary conclusion list. Then it generates a new child node
and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the
conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if
the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new
conclusion C1, which is satisfied by the rule R0. The same process is repeated until all rules
that have common conditions with the rule at the root are evaluated. The method then creates a
new child node from the root and repeats the process until all rules are evaluated. In the decision
structure, nodes containing only rules represent conditions common to all of their children.
The main disadvantage of this approach is that it requires discriminant rules to build such a
decision structure. Also, such a structure is more complex than the traditional decision trees
that are used for decision-making.

Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure.
The decision structure can be read as follows: It is Safe, except if x1=1 & x2=1 & x3=1 and
either x4=3 or x5=1, then it is Lost, except if x6=1 it is Safe, except if x7=1 it is Lost.
The second approach, introduced by Ronny Kohavi, learns decision structures from examples
using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing
Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a
decision graph where each attribute occurs at most once along any computational path. In other
words, for each path from the root to the leaves of the decision structure, an attribute may occur
as a node at most once. However, there may be more than one node with the same attribute in
the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are
partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level
terminate at the next level. An oblivious decision graph is a decision graph where all nodes at
a given level are labeled by the same attribute.
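The read-once property defined above can be checked mechanically. The following sketch verifies it for a small decision graph; the graph encoding (a dict mapping a node name to its attribute and children, with leaves absent from the dict) is our own illustrative assumption, not Kohavi's representation:

```python
# Sketch: checking the "read-once" property of a decision graph:
# every attribute occurs at most once along any root-to-leaf path.

def is_read_once(graph, root):
    """graph: {node: (attribute, [children])}; nodes absent from graph are leaves."""
    def walk(node, seen):
        entry = graph.get(node)
        if entry is None:                 # leaf node: path is valid
            return True
        attr, children = entry
        if attr in seen:                  # attribute repeated on this path
            return False
        return all(walk(child, seen | {attr}) for child in children)
    return walk(root, set())

# Two nodes test x2 (allowed: same attribute on *different* paths),
# but no single path tests the same attribute twice.
g = {
    "n0": ("x1", ["n1", "n2"]),
    "n1": ("x2", ["safe", "lost"]),
    "n2": ("x2", ["lost", "safe"]),
}
print(is_read_once(g, "n0"))  # True

bad = {"m0": ("x1", ["m1"]), "m1": ("x1", ["leaf"])}
print(is_read_once(bad, "m0"))  # False
```

Note that `g` is also oblivious in the sense above: both nodes at the second level test the same attribute, x2.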
Safe <:: [x1=2]
Safe <:: [x2=2]
Safe <:: [x3=2]
Safe <:: [x4=1] & [x5=2]
Safe <:: [x4=1] & [x5=3]
Safe <:: [x6=1] & [x7=2]
Safe <:: [x6=1] & [x7=3]
Safe <:: [x4=2] & [x5=2]
Safe <:: [x4=2] & [x5=3]

Lost <:: [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a
nondeterministic method to select an attribute to build a new level of the decision structure. This
attribute is removed from the data, and the data is divided into subsets, each corresponding to a
combination of that attribute's values. For each subset, the process is repeated until all
the examples of a given subset belong to one decision class. For example, suppose the selected
attribute, say A, has two values (0 and 1) and there are two decision classes (C0 and C1). The
data is divided into two subsets: the first subset contains the examples where A takes value 0 and
belong to class C0, or where A takes value 1 and belong to class C1. The second subset contains the
examples where A takes value 0 and belong to class C1, or where A takes value 1 and belong to class C0.
The number of nodes of the first level (after the leaf nodes) is expected to be less than or
equal to k^n, where k is the number of decision classes and n is the number of values of the
selected attribute, and it can increase exponentially before that number is reduced
exponentially to one.
It is easy for the reader to figure out some major disadvantages of such an approach. The
average size of such decision structures is estimated to be very large, especially when there
is no similarity (i.e., strong patterns) or logical relationship in the data. The time used to learn
such a decision structure is very high compared to systems for learning decision trees
from examples. Finally, it could be better to search for an attribute that reduces the number
of generated subsets of the data, instead of nondeterministically selecting an attribute to build a
new level of the decision structure. Kohavi provided a comparison between the C4.5 system
and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a
deterministic method for attribute selection which minimizes the width of the penultimate level
of the graph.
Table 2-9 shows a comparison between the proposed approach and those two approaches. The
EDAG and HOODG systems are unreleased prototype systems. For example, the decision
structures produced by the proposed approach and by HOODG are easy to understand, while
EDAGs are difficult to read.
CHAPTER 3 DESCRIPTION OF THE APPROACH

3.1 General Methodology

In the proposed approach, the function of learning or discovery is separated from the function of
using the discovered knowledge for decision-making. The first function is performed by an
inductive learning program that searches for knowledge relevant to a given class of decisions
and stores the learned knowledge in the form of decision rules. The second function is performed,
when there is a need for assigning a decision to new data points in the database (e.g., a
classification decision), by a program that transforms the obtained knowledge into a decision
structure optimized according to the given decision-making situation.
The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description, in a declarative form of knowledge (decision rules), that satisfies the learning goal.

The Decision-making Task
Given:
- A set of decision rules in conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.
The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17
(Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of
declarative knowledge are that they do not impose any order on the evaluation of the attributes,
and, due to the lack of order constraints, decision rules can be evaluated in many different
ways, which increases the flexibility of adapting them to the different tasks of decision-making
(Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller
than the number of examples per class, generating a decision structure from decision rules can
potentially be done on line.
Such virtual decision structures can be tailored to any given decision-making situation. The
needed decision rules have to be generated only once, and then they can be used many times for
generating decision structures according to the changing requirements of decision-making tasks. The
method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures
from decision rules. Decision structures represent a procedural form of knowledge, which makes
them easy to implement but also harder to change. Consequently, decision structures can be quite
effective and useful as long as they are used in decision-making situations for which they are
optimized, and the attributes specified by the decision structure can be measured without much
cost. Figure 3-1 shows an architecture of the proposed methodology.
[Figure: data from the database feeds the component for learning knowledge from the database, whose decision rules feed the decision-making process.]
Figure 3-1: Architecture of the AQDT approach
It is assumed that the database is not static but is regularly updated. A decision-making problem
arises when there is a case, or a set of cases, to which the system has to assign a decision based on
the knowledge discovered. Each decision-making situation is defined by a set of attribute-values.
Some attribute-values may be missing or unknown. A new decision structure is obtained such
that it suits the given decision-making problem. The learned decision structure associates the
new set of cases with the proper decisions.
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system,
specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of
this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions, using
the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology,
called AQ, starts with a "seed" example of a given decision class and generates a set of the
most general conjunctive descriptions of the seed (alternative decision rules for the seed
example). Such a set is called the "star" of the seed example. The algorithm selects from the star
a description that optimizes a criterion reflecting the needs of the problem domain. If the
criterion is not defined, the program uses a default criterion that selects the description that
covers the largest number of positive examples (to minimize the total number of rules needed)
and, with second priority, that involves the smallest number of attributes (to minimize the
number of attributes needed for arriving at a decision).
If the selected description does not cover all examples of a given decision class, a new seed is
selected from the uncovered examples, and the process continues until a complete class description
is generated. The algorithm can work with few or with many examples, and can
optimize the description according to a variety of easily modifiable hypothesis quality criteria.
The learned descriptions are represented in the form of a set of decision rules expressed in an
attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski, 1973). A
distinctive feature of this representation is that it employs, in addition to standard logic operators,
the internal disjunction operator (a disjunction of values of the same attribute in a condition) and
the range operator (to express conditions involving a range of discrete or continuous values).
These operators help to simplify rules involving multi-valued discrete attributes; the second
operator is also used for creating logical expressions involving continuous attributes.
AQ15 can generate decision rules that represent either characteristic or discriminant concept
descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic
description states properties that are true for all objects in the concept. The simplest
characteristic concept description is in the form of a single conjunctive rule (in general, it can be
a set of such rules). The most desirable is the maximal characteristic description, that is, a rule
with the longest condition part, i.e., one stating as many common properties of objects of the given
class as can be determined. A discriminant description states properties that discriminate a given
concept from a fixed set of other concepts. The most desirable is the minimal discriminant
description, that is, a rule with the shortest condition part. For example, to distinguish a given
set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top."
A characteristic description of the tables would also include properties such as "have four legs,"
"have no back," "have four corners," etc. Discriminant descriptions are usually much shorter than
characteristic descriptions.
Another option provided in AQ15 controls the relationship among the generated descriptions
("rulesets" or "covers") of different decision classes. In the IC (Intersecting Covers) mode,
rulesets of different classes may logically intersect over areas of the description space in which
there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different
classes are logically disjoint. The DC mode descriptions are usually more complex, both in the
number of rules and the number of conditions. There is also a DL mode (a Decision List mode,
also called VL mode, for variable-valued logic mode), in which the program generates rulesets
that are linearly ordered. To assign a decision to an example using such rulesets, the program
evaluates them in order. If ruleset i is satisfied by the example, then the decision is made;
otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes,
rulesets can be evaluated in any order.
Alternatively, the system can use rules from the AQ17-DCI program for Data-driven
Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for
generating additional attributes. These attributes are various logical or mathematical
combinations of the original attributes. The program generates a large number of potential new
attributes and selects from them those most promising, based on an attribute quality criterion.
To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is
shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form
expression) describes a voting record of Democratic Representatives in the US Congress.
Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational
statement. For example, the condition [State = northeast v northwest] states that the attribute
State (of the Representative) should take the value northeast or northwest to satisfy the
condition.

R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"
The above rules were generated from examples of the voting records. For illustration, below is
an example of a voting record by a Democratic representative:

Draft registration=no, Ban of aid to Nicaragua=no, Cut expenditure on MX
missiles=yes, Federal subsidy to nuclear power stations=yes, Subsidy to national parks
in Alaska=yes, Fair housing bill=yes, Limit on PAC contributions=yes, Limit on food
stamp program=no, Federal help to education=no, State From=northeast, State
Population=large, Occupation=unknown, Cut in social security spending=no, Federal
help to Chrysler corp=not registered
By expressing the elementary statements in the example as conditions and linking the conditions by
conjunction, the examples can be re-expressed as decision rules. Thus, decision rules and
examples formally differ only in the degree of generality.
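The evaluation of such attributional rules against examples can be sketched compactly. The dict/set encoding below is our own illustrative assumption (AQ15 itself uses the VL1 notation); a condition with internal disjunction, such as [State = northeast v northwest], becomes a set of admissible values:

```python
# Sketch: matching an example against a conjunctive rule whose
# conditions may contain internal disjunction.

def matches(rule, example):
    """rule: {attribute: set of admissible values}; example: {attribute: value}.
    The rule is satisfied when every condition admits the example's value."""
    return all(example.get(attr) in values for attr, values in rule.items())

# A fragment of rule R2 above, in the assumed encoding:
rule_r2_fragment = {
    "State": {"northeast", "northwest"},
    "Food_stamp_cap": {"no"},
}
example = {"State": "northeast", "Food_stamp_cap": "no", "Income": "low"}
print(matches(rule_r2_fragment, example))  # True
```

A fully specific example is simply a rule whose value sets are all singletons, which is the generality relationship the paragraph above describes.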
3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision
rules (Imam & Michalski, 1993a, b). Also, a description of the AQDT-2 method for learning
task-oriented decision structures from decision rules is included, and finally the methodology is
illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning,
due to their simplicity. Decision trees built this way can be quite efficient, as long as they are
used in decision-making situations for which they are optimized and these situations remain
relatively stable. Problems arise when these situations significantly change and the assumptions
under which the tree was built do not hold anymore. For example, in some situations it may be
difficult to determine the value of the attribute assigned to some node. One would like to avoid
measuring this attribute and still be able to classify the example, if this is potentially possible
(Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure
the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also
desirable if there is a significant change in the frequency of occurrence of examples from
different classes. A restructuring of a decision tree to suit the above requirements is, however,
difficult to do. The reason for this is that decision trees are a form of decision structure
representation that imposes constraints on the evaluation order of the attributes that are not
logically necessary.
One problem in developing a method for generating decision structures from decision rules is to
design an attribute selection criterion that is based on the properties of the rules, rather than of
the training examples. A decision rule normally describes a number of possible examples. Only
some of them are examples that have actually been observed, i.e., training examples. An attribute
selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based
on counting the numbers of training examples covered by each attribute-value and the
frequency of decision classes in the training examples, as is done in learning decision trees from
examples, because the training examples are assumed to be unavailable.
Another problem in learning decision trees from decision rules stems from the fact that decision
rules constitute a more powerful knowledge representation than decision trees. They can directly
represent a description in an arbitrary disjunctive normal form, while decision trees can directly
represent only descriptions in the disjoint disjunctive normal form, in which all
conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary
decision rules into a decision tree, one faces the additional problem of handling logically
intersecting rules.
The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2
system is based on the earlier work by Michalski (1978), which introduced a general method for
generating decision trees from decision rules. The method aimed at producing decision trees
with the minimum number of nodes or the minimum cost (where the cost was defined as the
total cost of classifying unknown examples, given the cost of measuring individual attributes and
the expected probability distribution of examples of different decision classes). More
explanations are provided in the following section.
3.3.1 The AQDT-2 Attribute Selection Method

This section describes the AQDT-2 method for building a decision structure from decision rules.
The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples. The major difference is that it assigns tests
(attributes) to the nodes using criteria based on the properties of the decision rules (this includes
statistics about the examples covered by each rule, in the case of learning rules from examples),
rather than statistics characterizing the frequency of training examples per decision class, per
attribute-value, or per conjunctions of both. Other differences are that the branches may be
assigned an internal disjunction of values (not only a single value, as in a typical decision tree),
and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be
attributes, or names standing for logical or mathematical expressions that involve several
attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably
(to distinguish between an attribute and a name standing for an expression, the latter is called a
"constructed attribute").
At each step, the method chooses from the available set of tests the test that has the highest utility
(see below) for the given set of decision rules. This test is assigned to the node. The branches
stemming from this node are assigned test values or disjoint groups of values (in the form of a
logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each
branch is associated with a reduced set of rules, determined by removing conditions in which the
selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset
indicate the same decision class, a leaf node is created and assigned this decision class. The
process continues until all nodes are leaf nodes. If it is not possible to reduce the rule set further,
because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of
candidate decisions with associated probabilities (see Section 4.2).
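The branch-reduction step described above can be sketched as follows. The rule encoding, a (class, {attribute: set of values}) pair, is our own simplification of the VL1 rules, not AQDT-2's internal representation:

```python
# Sketch: reducing a ruleset down one branch of a node. A branch keeps
# only the rules consistent with its value(s), with the satisfied
# condition removed from each kept rule.

def reduce_rules(rules, attr, branch_values):
    """rules: list of (class, {attribute: set of values}) pairs."""
    reduced = []
    for cls, conds in rules:
        if attr not in conds:                 # rule does not test attr: keep as-is
            reduced.append((cls, conds))
        elif conds[attr] & branch_values:     # rule consistent with this branch
            rest = {a: v for a, v in conds.items() if a != attr}
            reduced.append((cls, rest))       # drop the satisfied condition
    return reduced

rules = [
    ("Safe", {"x1": {2}}),
    ("Lost", {"x1": {1}, "x2": {1}}),
]
down_x1_eq_1 = reduce_rules(rules, "x1", {1})
print(down_x1_eq_1)                                  # [('Lost', {'x2': {1}})]
print(len({cls for cls, _ in down_x1_eq_1}) == 1)    # True: single class -> leaf node
```

Here the branch x1=1 leaves only rules of class Lost, so that branch would terminate in a leaf, exactly the stopping condition described above.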
The test (attribute) utility is a combination of one or more of the following elementary criteria: 1)
cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which
captures the effectiveness of the test in discriminating among decision rules for different decision
classes; 3) importance, which determines the importance of a test in the rules; 4) value
distribution, which characterizes the distribution of the test importance over its values; and 5)
dominance, which measures the test's presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.

Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, that
is, the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm,
and decision rule sets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote
the sets of the values (outcomes) of A that are present in the rule sets for classes C1, C2, ..., Cm,
respectively. If a ruleset for some class, say Ct, contains a rule that does not involve test A, then
Vt is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is
the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for
Cj, j = 1, 2, ..., m, j =/= i. The degree of disjointness between the ruleset for Ci and the ruleset for
Cj is defined by:

               | 0,  if Vi is a subset of Vj
D(A, Ci, Cj) = | 1,  if Vi is a proper superset of Vj                    (3-1)
               | 2,  if Vi n Vj =/= phi, and =/= Vi, and =/= Vj
               | 3,  if Vi n Vj = phi

where phi denotes the empty set. Note that exchanging the second and third conditions of equation
(3-1) may seem to give an improved criterion. However, it would not clearly distinguish between the
two cases (i.e., for both situations the disjointness would be similar). The current equation is better
because it gives higher scores to attributes that classify different subsets of the two decision
classes than to attributes that classify only a subset of one decision class.
Definition 3-2: The disjointness of the test A, for evaluating a given set of decision rules, is the
sum of the degrees of class disjointness of each decision class:

Disjointness(A) = SUM(i=1..m) D(A, Ci), where D(A, Ci) = SUM(j=1..m, j=/=i) D(A, Ci, Cj)   (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are
all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test
values. If two tests have the same disjointness value, the test selected is the one with the
smaller number of values.
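Equations (3-1) and (3-2) can be sketched directly. The encoding of Vi as a Python set per class is our own assumption; the value sets below are illustrative:

```python
# Sketch of the disjointness criterion (equations 3-1 and 3-2).
# value_sets[i] = Vi, the set of values of test A appearing in the
# ruleset for class Ci (the full domain if some rule omits A).

def degree(vi, vj):
    """D(A, Ci, Cj) from equation (3-1)."""
    if not vi & vj:
        return 3          # value sets are disjoint
    if vi <= vj:
        return 0          # Vi is a subset of Vj
    if vi > vj:
        return 1          # Vi is a proper superset of Vj
    return 2              # partial overlap, neither contains the other

def disjointness(value_sets):
    """Sum of D(A, Ci, Cj) over all ordered pairs of classes (equation 3-2)."""
    return sum(degree(vi, vj)
               for i, vi in enumerate(value_sets)
               for j, vj in enumerate(value_sets) if i != j)

# Two classes; a test whose values split the classes cleanly scores the
# maximum 3m(m-1) = 6, while an overlapping test scores lower.
print(disjointness([{1}, {2, 3}]))   # 6
print(disjointness([{1, 2}, {2}]))   # 1
```

For a pair of classes, the two ordered degrees sum to 0, 1, 4, or 6, which matches the values used later in the proof of Theorem 2.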
Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree
is defined as the average number of tests (attributes) to be examined from the root of the tree to
any leaf node in order to reach a decision.

Definition: A decision structure is a one-node-per-level decision structure if at each level there is
only one node and zero or more leaves.

Such a decision structure can be generated by combining into one branch all branches whose
associated sets of decision rules belong to more than one decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two
decision classes. The disjointness criterion ranks first the attributes that add the minimum number
of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci
and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not a
subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the
same set of values in both classes is a trivial one). Assume that branches leading to subsets
with the same decision class are combined into one branch. In the first case, there will be two
branches only. The first leads to a leaf node, and the other leads to an intermediate node where
another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three
branches are created. Two branches lead to leaf nodes, where all values at each branch
belong to only one, and a different, decision class. The third branch leads to an intermediate node
where another attribute should be selected that further classifies the decision classes. The minimum
ANT in this case is 6/4. In the third case, only two branches are generated, each leading
to a leaf node with a different decision class. In this case, the minimum ANT is 1.
Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that
when more than one attribute-value leads to leaves belonging to one decision class, the branches
are combined into one branch in the decision structure. The symbol "1" means that an attribute is
needed to classify the two decision classes; in such cases there will be at least two additional paths.

D(A, Ci) = 0, D(A, Cj) = 1;  D(A, Ci) = 2, D(A, Cj) = 2;  D(A, Ci) = 3, D(A, Cj) = 3
Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is given in
Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks
highly the attributes that reduce the average number of tests required for decision-making. The
theorem can be proved in the general case.
ANT = 5/3; ANT = 6/4; ANT = 1
("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify
the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes,
A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship
between any two decision classes, this means that there are more decision classes where D(A,
Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness
for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj)
than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci,
Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute
are 0, 1, 4, or 6. For all positive values of D(B) = 1, 4, or 6, it is clear that attribute B should have
a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) <
D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.
Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is
a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the
importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained
rules, each test is assigned a score that represents the total number of training examples that
are covered by the rules involving this test. Decision rules learned by an AQ learning program
are accompanied by information on their strength. Rule strength is characterized by t-weight
and u-weight. The t-weight (total weight) of a rule for some class is the number of examples of
that class covered by the rule. The importance score of a test is the aggregation of the total
weights of all rules that contain that test in their condition part. Given a set of decision rules for
m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated
with class Ci denoted by ri, the importance score is defined as follows:
Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

IS(Aj) = SUM(i=1..m) IS(Aj, Ci)                                          (3-3.1)

where

IS(Aj, Ci) = SUM(k=1..ri) Rik(Aj)                                        (3-3.2)

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by:

Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0 otherwise        (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
The importance score method has been separately compared, as a feature selection method, with
a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method
produced equal or higher accuracy on three real-world problems than that reported for the
GA method, while selecting fewer attributes. In addition, the IS method was significantly faster
than the GA.
Value distribution. The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: A value distribution VD(Aj) of a test Aj is defined by

    VD(Aj) = IS(Aj) / vj                                     (3-5)

where vj is the number of legal values of Aj.
Dominance. The fourth elementary criterion, dominance, prefers tests that appear in large numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore, for computing the dominance, the rules are counted as if they had been converted to rules without internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is multiplied out to two condition parts: [x3=1] & [x4=1] and [x3=3] & [x4=1].
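The multiplying-out operation is a Cartesian product over the value sets of a condition part. A minimal sketch, with an illustrative list-of-pairs representation of condition parts:

```python
# "Multiplying out" a condition part that contains internal disjunction, as
# used when computing the dominance criterion. A condition part is a list of
# (attribute, value-set) pairs; the result is the list of elementary
# condition parts in which every attribute takes a single value.
from itertools import product

def multiply_out(condition_part):
    attrs = [attr for attr, _ in condition_part]
    value_lists = [sorted(values) for _, values in condition_part]
    return [list(zip(attrs, combo)) for combo in product(*value_lists)]

# [x3=1 v 3] & [x4=1] multiplies out to [x3=1]&[x4=1] and [x3=3]&[x4=1]:
elementary = multiply_out([("x3", {1, 3}), ("x4", {1})])
print(elementary)   # [[('x3', 1), ('x4', 1)], [('x3', 3), ('x4', 1)]]
```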
The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed in percent. The criteria are applied to tests in the order defined by LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (measured from the top value). The default LEF is

    <Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>      (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percentages); their default values are 0. The default value of the cost of each test is 1.
The above LEF ranks attributes in the following way. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the next (Importance) criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the value distribution criterion (the normalized IS) is used, and then similarly the fourth criterion (dominance). If there is still a tie, the method selects the best attribute randomly.
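The ranking procedure above can be sketched as follows. The representation (score functions passed as callables, relative tolerances) is an illustrative assumption, and the cost criterion is folded into the general criterion list rather than treated specially:

```python
# A sketch of lexicographic evaluation with tolerances (LEF). Each criterion
# is a (scoring-function, tolerance-in-percent) pair; attributes scoring
# within the tolerance of the best score survive to the next criterion.

def lef_select(attributes, criteria):
    candidates = list(attributes)
    for score, tolerance in criteria:
        best = max(score(a) for a in candidates)
        # keep attributes whose score is within `tolerance` percent of the best
        candidates = [a for a in candidates
                      if score(a) >= best - abs(best) * tolerance / 100.0]
        if len(candidates) == 1:
            break
    return candidates[0]   # ties after all criteria: pick the first (or randomly)

# Hypothetical scores for three attributes:
disjointness = {"x1": 11, "x2": 10, "x3": 6}
importance   = {"x1": 14, "x2": 20, "x3": 3}

best = lef_select(["x1", "x2", "x3"],
                  [(disjointness.get, 10),   # disjointness, 10% tolerance
                   (importance.get, 0)])     # importance, 0% tolerance
print(best)   # x1 and x2 both pass the first criterion; importance picks x2
```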
If there is a non-uniform frequency distribution of examples of the different classes, then the selection criterion uses a modified definition of the disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified into a given class:

    Disjointness(A) = SUM[i=1..m] D(A, Ci) * Frq(Ci)         (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF

    <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>           (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.
332 The AQDT-2 algorithm
The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting, at each step, the best test according to the ranking criteria described above and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is also connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain type, number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first in descending order of the number of rules that contain the attribute, and second in ascending order of the number of the attribute's legal values.
The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch is assigned a specific attribute-value. In the compact mode, the system builds a decision structure that may contain:
A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.
B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.
To generate decision structures from rules, the AQDT-2 method prefers disjoint rule descriptions, either characteristic or discriminant (given by an expert or learned by a system). Disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The AQDT-2 algorithm is:
The AQDT-2 Algorithm
Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent it.
Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.
Step 3: For each branch, associate with it the group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing the condition [A = i v ...]; if a branch is assigned values i v j, then associate with it all rules containing the condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] == [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with a given branch constitute the ruleset context for this branch.
Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign that class to it. If all branches of the tree end in leaf nodes, stop; otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
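The four steps above can be sketched as a short recursive procedure. The rule representation is an illustrative assumption, and the full LEF ranking of Step 1 is replaced by a simple stand-in (pick the attribute occurring in the most rules) to keep the sketch compact; only the standard mode (one branch per legal value) is shown:

```python
# A simplified, illustrative rendering of Steps 1-4. A rule is a pair
# (conditions, class), where conditions maps an attribute to its set of
# allowed values. `domains` maps each attribute to its list of legal values.

def build_tree(ruleset, domains):
    if not ruleset:
        return None
    classes = {cls for _, cls in ruleset}
    if len(classes) == 1:                      # Step 4: leaf node
        return classes.pop()
    # Step 1 (stand-in for the LEF): attribute occurring in the most rules
    counts = {}
    for conds, _ in ruleset:
        for attr in conds:
            counts[attr] = counts.get(attr, 0) + 1
    a = max(counts, key=counts.get)
    node = {}
    for value in domains[a]:                   # Step 2: one branch per value
        branch_rules = []
        for conds, cls in ruleset:             # Step 3: distribute the rules
            if a not in conds:                 # rule without A goes to every branch
                branch_rules.append((conds, cls))
            elif value in conds[a]:            # condition satisfied: remove it
                rest = {k: v for k, v in conds.items() if k != a}
                branch_rules.append((rest, cls))
        node[(a, value)] = build_tree(branch_rules, domains)
    return node

rules = [({"x1": {1}, "x2": {1}}, "P"), ({"x1": {2}, "x2": {2}}, "P"),
         ({"x1": {1}, "x2": {2}}, "N"), ({"x1": {2}, "x2": {1}}, "N")]
tree = build_tree(rules, {"x1": [1, 2], "x2": [1, 2]})
print(tree)
```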
To select an attribute to be a node in the decision tree (Steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF; it evaluates each attribute's disjointness for each decision class against the other decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute value in one rule is s (where s <= n, the number of attributes), and r is the total number of decision rules (in all decision classes):

    r = SUM[i=1..m] Ri        (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as

    Cmpx(Iter1) = O(r * s)

In the second iteration, the disjointness is calculated between the decision classes for all attributes. The complexity of the second iteration is

    Cmpx(Iter2) = O(n * m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

    l = max{m, r}                                            (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node of the decision tree, the node complexity NC(AQDT), is given by

    NC(AQDT) = O(l * n)

Usually l equals the number of rules associated with the given node. Thus, the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.
At each level of the decision tree, these two iterations are repeated for each non-leaf node. The level complexity of the AQDT algorithm, LC(AQDT), satisfies

    LC(AQDT) < O(l * n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm equal to (l * s * o), where o is the number of non-leaf nodes at the given level; in such cases either (l * o <= r) or (l * s < r) holds. In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by

    LC(AQDT) = O(2 * s * r/2) < O(n * l) = NC(AQDT)
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes (a: per one level; b: per one path)
Note also that after an attribute is selected to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structures of the algorithm. Also, once a leaf node is generated, the rules belonging to the corresponding branch are not tested again.
Since the disjointness criterion selects the attribute that minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels per decision tree is at most the minimum of the number of attributes and the number of rules. Let k be the number of levels in a given decision tree:

    k <= min{n, r}                                           (3-10)
Two cases represent the most complex situations (Figures 3-5-a and 3-5-b). In the first case, where the decision rules are divided evenly, the number of levels is a function of the logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by

    Complexity(AQDT) = O(l * n * log r)                      (3-11)

The other situation occurs when the generated decision tree has the maximum number of levels. The maximum possible number of levels per decision tree equals one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, it is unlikely to obtain such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In such a case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus, the level complexity of this decision tree is estimated as

    LC(AQDT) = O(l * log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT algorithm in such cases is given by

    Complexity(AQDT) = O(l * k * log n)                      (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT algorithm is bounded by

    Cmplx(AQDT) = O(r * k * log l)                           (3-13)
333 An example illustrating the algorithm
The following simple example illustrates how AQDT-2 is used to select the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of the tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.
Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the selection process

T1 <= [x1=2] & [x2=2]  v  [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4]  v  [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1]  v  [x1=4] & [x3=2 v 3] & [x4=3]
Figure 3-6 Decision rules for selecting the best tool for testing software
These rules can be interpreted as:
Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.
Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phase, and you need an automated tool.
Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.
Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phase, and you need a manual tool.
Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.
Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phase, and you need a semi-automated tool.
Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is treated as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.
The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances of all elementary criteria equal 0.
From the rules in Figure 3-6 we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: determine, for each attribute, the sets of values that the attribute takes in the individual decision rules, and remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). The value set {1, 2} is removed, as it subsumes {2} and {1}; in this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}; in this case, branches are assigned the value sets {1}, {2}, and {3, 4}.
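The value-set grouping described above amounts to dropping every set that properly contains another set. A minimal sketch, using Python's proper-subset operator on sets:

```python
# A sketch of the value-set grouping for the compact mode: collect the value
# sets an attribute takes in the individual rules, then drop any set that
# subsumes (is a proper superset of) another remaining set.

def branch_value_sets(value_sets):
    unique = [set(s) for s in {frozenset(s) for s in value_sets}]
    return sorted((s for s in unique
                   if not any(other < s for other in unique)),  # `<` = proper subset
                  key=sorted)

# x1 takes value sets {2}, {3}, {1, 2}, {1} and {4} in the rules of
# Figure 3-6; {1, 2} is dropped because it subsumes {1} and {2}:
print(branch_value_sets([{2}, {3}, {1, 2}, {1}, {4}]))
# x2 takes {1}, {2}, {3, 4} and {1, 2, 3, 4}; the full domain set is dropped:
print(branch_value_sets([{1}, {2}, {3, 4}, {1, 2, 3, 4}]))
```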
Attribute x1 ranks highest (it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools to use for testing a given piece of software.
Figure 3-7: A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)
Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3, and 4). Rules are represented by collections of cells at the intersections of the rows and columns corresponding to the conditions in the rules.
The shaded areas correspond to decision rules; rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].
Figure 3-8: Diagrammatic visualization of (a) the decision rules and (b) the derived decision tree
Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of the concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools; in other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, obtained when the type of the tool is ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.
Figure 3-9: Decision trees learned ignoring (a) the supporting metric and (b) the type of the tool
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined as the highest-ranked attribute, cannot be measured. The algorithm then selects x4 as the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10: A decision tree learned without the cost attribute
34 Tailoring Decision Structures to the Decision-making situation
Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of the different decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm was developed and implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.
Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.
341 Learning Cost-Dependent Decision Structures
As described in Sec. 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) when developing a decision structure. In the default setting of LEF, the test's cost is the first criterion and its tolerance is 0%. This means that only the least expensive attributes pass to the next step of evaluation involving the other elementary criteria. If an attribute has a high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.
342 Assigning Decisions Under Insufficient Information
In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision in some cases. If no more information can be obtained, but a decision has to be made, it is useful to know the probability distribution over the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequencies at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values, and let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have
    P(Ci | b1, ..., bk) = P(Ci) * P(b1, ..., bk | Ci) / P(b1, ..., bk)        (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that an example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let wi be the number of training examples of class Ci that passed the tests leading to this node, and twi the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequencies of training examples from the different classes, we have

    P(Ci) = twi / SUM[j=1..m] twj                                             (3-10)

    P(b1, ..., bk | Ci) = wi / twi                                            (3-11)

    P(b1, ..., bk) = SUM[j=1..m] wj / SUM[j=1..m] twj                         (3-12)

By substituting (3-10), (3-11) and (3-12) into (3-9), we obtain

    P(Ci | b1, ..., bk) = wi / SUM[j=1..m] wj                                 (3-13)
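Estimate (3-13) reduces to a simple normalization of the per-class example counts at the node. A minimal sketch, with hypothetical counts:

```python
# Estimate (3-13): at a node reached by branches b1, ..., bk, the probability
# of class Ci is wi / sum(wj), where wi is the number of training examples of
# class Ci that passed the tests leading to that node.

def class_distribution(examples_passed):
    """examples_passed maps class name -> wi at the given node."""
    total = sum(examples_passed.values())
    return {cls: w / total for cls, w in examples_passed.items()}

# Hypothetical counts: 6 examples of T1 and 2 of T2 reached this node.
dist = class_distribution({"T1": 6, "T2": 2})
print(dist)   # {'T1': 0.75, 'T2': 0.25}
```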
A related method for handling the unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.
343 Coping with noise in training data
The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision making, regardless of the evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
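The truncation step described above can be sketched in a few lines; the rule representation (dicts carrying a t-weight field) is a hypothetical stand-in for AQ15's output:

```python
# A sketch of rule truncation: when noise is expected, drop the rules whose
# t-weight falls below a threshold reflecting the expected noise level.

def truncate_rules(rules_by_class, min_t_weight):
    return {cls: [r for r in rules if r["t_weight"] >= min_t_weight]
            for cls, rules in rules_by_class.items()}

# Hypothetical rules; the t-weight-1 rule likely covers only noise.
rules = {"P": [{"cond": "[x1=1]&[x2=1]", "t_weight": 9},
               {"cond": "[x3=2]", "t_weight": 1}],
         "N": [{"cond": "[x1=2]", "t_weight": 7}]}
print(truncate_rules(rules, min_t_weight=2))
```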
35 Analysis of the AQDT-2 Attribute Selection Criteria
This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis uses two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion ranks that attribute first. The first problem was introduced by Quinlan in 1993; it has four attributes (see Table 2-2) and two decision classes, and the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989; it has two attributes and two decision classes, and contains ambiguous examples (i.e., examples belonging to more than one decision class).
In the first problem, the following disjoint rules were learned by AQ15c from the given data:

    Play <= [outlook = overcast]
    Play <= [outlook = sunny] & [humidity <= 75]
    Play <= [outlook = rain] & [windy = false]

For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.
Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute for the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y); however, the gain ratio criterion gave X a higher score than all the other criteria did.
Table 3-5 shows the performance of the AQDT-2 attribute selection criteria on Mingers' first problem. The criteria were tested when applied both to the examples and to the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for a class C, all ambiguous examples belonging to C are considered to be examples that belong to C only.
The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when applied both to the original examples and to the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion performs well when evaluating the training examples directly. This is because these two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When applied to the learned rules, they provide very good results.
The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute with the most balanced appearance of its values in the different rules. The dominance criterion prefers attributes that occur in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.
In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
36 Decision Structures vs. Decision Trees
This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.

Table 3-7: A comparison between decision structures and decision trees

Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11; both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes, and it is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes (a total of 16 nodes), but it is equivalent to a decision tree with 37 nodes.
Figure 3-11: Decision structures learned by AQDT-2 using different criteria (a: using the disjointness criterion, 5 nodes; b: using the importance score criterion, 7 nodes and 9 leaves; P = Positive, N = Negative)
To show another advantage of learning decision structures (trees) from decision rules rather than
from examples, I created an example, called the Imam's example, that represents a class of
problems for which the information gain criteria of decision tree learning programs do not work
properly. The basic idea behind this example is that information-based
criteria depend on the frequency of the training examples per decision class and the frequency
of the training examples over the different values of a given attribute.
The concept to learn is P if x1=x2, and N otherwise.
The known training examples are shown in Figure 3-12-a, where "+" means the example belongs
to class P and "-" means the example belongs to class N. Figure 3-12-b shows the correct
decision tree to be learned. As the reader can see, the number of examples per class is 12,
and the frequency of the training examples per value of x1 and x2 (the most important
attributes) is 6. However, the frequencies of the training examples over the values of x3 and
x4 are different. This combination makes it difficult for criteria using information gain to select
either x1 or x2. When C4.5 was run with different window settings, it was not able to select
either x1 or x2 as the root of the decision tree.
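The effect described above can be reproduced with a small worked example. The dataset below is not the exact 24 examples of Figure 3-12; it is a hypothetical XOR-style set built so that the class depends on whether x1=x2, while an irrelevant attribute x3 is given a skewed distribution. Information gain then scores zero for both relevant attributes and a positive value for the irrelevant one.

```python
# Demonstration (on hypothetical data, not the Figure 3-12 dataset) that
# information gain can score 0 for the relevant attributes of an x1=x2
# concept while favoring an irrelevant, skewed attribute.
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gain(examples, attr):
    labels = [c for _, c in examples]
    g = entropy(labels)
    for v in {e[attr] for e, _ in examples}:
        subset = [c for e, c in examples if e[attr] == v]
        g -= len(subset) / len(examples) * entropy(subset)
    return g

# Concept: P if x1 = x2, N otherwise; x3 is irrelevant but skewed.
examples = []
for x1 in (1, 2):
    for x2 in (1, 2):
        for x3 in (1, 2, 3):
            cls = "P" if x1 == x2 else "N"
            # skew x3's distribution so it looks (spuriously) informative
            if not (cls == "N" and x3 == 3):
                examples.append(({"x1": x1, "x2": x2, "x3": x3}, cls))
```

Here gain(x1) = gain(x2) = 0, yet gain(x3) > 0, so a greedy information-gain learner would put the irrelevant attribute at the root.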
Figure 3-12: The Imam's example, in which learning decision structures (trees) from rules is better than learning them from examples: a) the training examples; b) the optimal decision tree. (Diagrams not reproduced.)
AQ15c learned the following rules from this data:
P <= [x1=1][x2=1] v [x1=2][x2=2]
N <= [x1=1][x2=2] v [x1=2][x2=1]
From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
An example of a problem for which decision trees may not be an efficient way to represent
knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the
given data. The decision rules learned from this data are:
P <= [x1=2] v [x2=2]
N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]
Using different measures of complexity, it is clear that for such a problem learning decision trees
directly from examples is not an efficient method. Examples of these measures are: 1) comparing
the number of nodes in the decision tree to the number of examples; 2) comparing the
average number of tests required to make a decision with the decision tree and with the
decision rules; 3) comparing the number of nodes to the number of conditions (10 nodes and 10
conditions).
When a decision tree is learned from rules obtained with constructive induction, a decision tree
with three nodes can be determined, using a new attribute representing the condition "x1=2 v x2=2",
with the value 0 for "no" and 1 for "yes".
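The constructive-induction step above can be sketched directly. The reading of the garbled attribute name as "x1=2 or x2=2" is my interpretation of the rules P <= [x1=2] v [x2=2]; the function names are hypothetical.

```python
# Sketch (under the assumption stated in the lead-in) of the constructed
# binary attribute and the resulting three-node decision structure:
# one test node and two leaves.

def new_attr(example):
    # 1 ("yes") when x1 = 2 or x2 = 2, else 0 ("no")
    return 1 if example["x1"] == 2 or example["x2"] == 2 else 0

def classify(example):
    # root tests the constructed attribute; leaves are P and N
    return "P" if new_attr(example) == 1 else "N"
```

A 10-node tree over the original attributes collapses to a single test once the derived attribute is available.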
Figure 3-13: An example where decision rules are simpler than decision trees: a) the training data; b) the correct decision tree. (Diagrams not reproduced.)
CHAPTER 4 Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different
problems, using different sizes of training data and different settings of the systems'
parameters. For comparison, it also presents results from applying a well-known decision tree
learning system (C4.5) to the same problems. This section also includes some analysis and
visualization of the concepts learned by AQ15c and AQDT-2.
The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3,
Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer,
Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned
with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type
description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily
described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from
noisy data. The Engineering Design dataset involves learning conditions for applying different
types of wind bracing in tall buildings. Mushrooms is concerned with learning classification
rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer
involves learning concept descriptions for recognizing breast cancer. Congressional Voting
Records describes the voting records of Republican and Democratic US senators in 1984.
East-West Trains characterizes eastbound and westbound trains using a structural representation.
In order to determine the learning curve, the system was run with different relative sizes of the
training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each
problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training,
that is, for learning a concept description. The remaining examples in each case were used for
testing the obtained descriptions, to determine their predictive accuracy.
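The sampling scheme just described can be sketched as follows; this is a minimal illustration, not the original experiment scripts, and the example "dataset" is a stand-in list of identifiers.

```python
# Sketch of the train/test sampling scheme described above: for each relative
# size 10%..90%, draw random training samples and use the complement of each
# sample for testing.
import random

def make_splits(examples,
                rel_sizes=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
                samples_per_size=100, seed=0):
    rng = random.Random(seed)
    splits = []
    for p in rel_sizes:
        k = round(len(examples) * p)
        for _ in range(samples_per_size):
            train = rng.sample(examples, k)          # random sample of size p
            held_out = set(train)
            test = [e for e in examples if e not in held_out]
            splits.append((train, test))
    return splits

# Stand-in dataset of 50 example ids; 2 samples per size for brevity.
splits = make_splits(list(range(50)), samples_per_size=2)
```

With 100 samples per size and 9 sizes, this yields the 900 training samples (and 900 complementary testing sets) per problem mentioned above.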
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided
into two subsets. The first set of problems (the MONK-1, MONK-2, MONK-3, and Wind
Bracing problems) was used to test and analyze the approach. The second set of problems
(Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for
additional testing and comparison with other systems.
Figure 4-1 shows a description of the planned experiments performed on the first set of problems. The
best settings (the best path from top to bottom) in terms of accuracy, time, and complexity were used as
default settings for experiments on the second set of problems. Each path from the top of the graph
to the bottom represents a single experiment. For each path, the experiment was repeated over 900
times with different sets of different sizes of training examples.
Figure 4-1: Design of a complete experiment.
For each of these experiments, the testing examples were selected as the complement of the
training examples. Other experiments were performed in which the learning system AQ17 was used
instead of AQ15c. Analysis of some experiments included visualization of the training examples
and of the concept learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different
decision structures learned for different decision-making situations were visualized, as were different
but equivalent decision structures learned from a given set of training examples.
For each problem (i.e., database), 9 different relative sample sizes of training examples were selected (10%, ..., 90%); 100 random samples of each size were drawn from the original data for training, and the 100 complements that remain after drawing the training data were used for testing (900 samples for training and their 900 complementary samples for testing).
162 different parametrical experiments per training dataset (18 x 9); 16,200 experiments per sample size; 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2); 199,800 experiments per problem (first portion + C4.5 + constructive induction); 999,000 intermediate processes (e.g., changing formats, refining data, storing results, etc.);
73 days (estimated running time).
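The experiment counts quoted above can be checked arithmetically; the reading of "18 x 9" as 18 AQ15c settings times 9 AQDT-2 settings is an assumption, and the 999,000 figure includes bookkeeping steps not broken down here.

```python
# Checking the experiment counts quoted above (interpretation of the
# "18 x 9" factor as AQ15c settings x AQDT-2 settings is assumed).
aq15c_settings = 18   # AQ15c parameter settings
aqdt_settings = 9     # AQDT-2 parameter settings (assumed)
samples = 100         # random samples per relative size
sizes = 9             # relative sizes 10%..90%

per_dataset = aq15c_settings * aqdt_settings   # parametrical experiments
per_size = per_dataset * samples               # experiments per sample size
first_portion = per_size * sizes               # first portion of one problem
```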
The following subsection includes a complete experimental analysis of the wind bracing problem.
Each subsection after that describes a partial or full experimental analysis of one of the other
problems.
4.2 Experiments with Average-Size, Complex, and Noise-Free Problems: Wind Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for
determining the structural quality of a tall building design. The quality of the design is partitioned
into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is
characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3),
number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of
horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly
selected to serve as training examples and 115 (34%) were used for testing the obtained decision
structure. In the first phase, the training examples were used to determine a set of decision rules. This
was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules
obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents values
of the four elementary criteria for each attribute occurring in the rules, for the step of determining
the root of the decision structure. For each class, the row marked "values" lists the values occurring in
the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the
ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b
v ...], where a, b, ... are all legal values of A.
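The padding step just described can be sketched as follows; this is an illustrative fragment (rule and domain contents are made up), not the AQDT-2 implementation. Rules are represented as dictionaries mapping each mentioned attribute to the set of values it allows.

```python
# Sketch of the padding step described above: a rule that does not mention
# attribute A is treated as containing [A = a v b v ...] over all of A's
# legal values before disjointness is evaluated.

def pad_rules(rules, domains):
    """domains: attribute -> set of all legal values of that attribute."""
    padded = []
    for rule in rules:
        # missing attributes default to their full value set
        full = {a: set(rule.get(a, vals)) for a, vals in domains.items()}
        padded.append(full)
    return padded

# Illustrative domains and rules (not taken from Figure 4-2).
domains = {"x1": {1, 2, 3}, "x6": {1, 2, 3, 4}}
rules = [{"x6": {1}}, {"x1": {2}, "x6": {2, 3}}]
padded = pad_rules(rules, domains)
```

After padding, every rule constrains every attribute, so value overlaps between rules of different classes can be compared attribute by attribute.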
Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1,3] (t: 18, u: 18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t: 3, u: 3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t: 2, u: 2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t: 2, u: 2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t: 2, u: 2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t: 2, u: 2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t: 2, u: 2)
Decision class C2:
1. [x1=2,4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t: 28, u: 19)
2. [x1=2,4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t: 17, u: 6)
3. [x1=2,4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t: 10, u: 4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t: 10, u: 2)
5. [x1=3,5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t: 9, u: 4)
6. [x1=2][x2=1,2][x3=1,2][x5=1,2,3][x4=1][x6=1][x7=1] (t: 7, u: 6)
7. [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t: 6, u: 4)
8. [x1=3,5][x2=2][x3=1][x7=1][x4=1,2][x5=1,2,3][x6=1,3] (t: 5, u: 5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t: 4, u: 4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1,3] (t: 4, u: 4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t: 4, u: 2)
Decision class C3:
1. [x1=2,5][x2=1,2][x3=1,2][x7=1,4][x4=1,2][x5=1,3][x6=2,4] (t: 41, u: 32)
2. [x1=1,4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2,4] (t: 27, u: 20)
3. [x1=1,3][x2=1][x3=1,2][x7=1,4][x4=2][x5=1,2][x6=2,3] (t: 19, u: 6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t: 13, u: 8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2,4] (t: 5, u: 5)
Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1,4] (t: 4, u: 4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t: 1, u: 1)
Figure 4-2: Decision rules determined by AQ15c from the wind bracing data.
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single
highest, and all other attributes are beyond the tolerance threshold, no other attributes are
considered). Branches stemming from the root are marked by the values of x6 (in general, these could be
groups of values) according to the way they occur in the decision rules; groups subsumed by
other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the
rules containing these values. The process repeats for a branch until all rules assigned to the
branch are of the same class. That class is then assigned to the leaf.
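The construction loop just described can be sketched in code. This is a simplified paraphrase, not the AQDT-2 algorithm: attribute selection is reduced to "take the next attribute in a given order" instead of the full LEF, and value grouping and subsumption removal are omitted.

```python
# Simplified sketch of building a decision structure from rules: branch on an
# attribute's values as they occur in the rules, and stop when all rules
# assigned to a branch belong to one class (that class becomes the leaf).

def build(rules, attrs):
    """rules: list of (conditions, cls); conditions maps attr -> allowed values."""
    classes = {cls for _, cls in rules}
    if len(classes) == 1:          # leaf: all remaining rules agree
        return classes.pop()
    attr = attrs[0]                # stand-in for the LEF choice
    values = set().union(*(conds.get(attr, set()) for conds, _ in rules))
    node = {}
    for v in sorted(values):
        # rules not mentioning attr match every value of attr
        subset = [(c, cls) for c, cls in rules if v in c.get(attr, values)]
        node[v] = build(subset, attrs[1:])
    return (attr, node)

# The x1=x2 ruleset from the Imam's example, as a small illustration.
rules = [({"x1": {1}, "x2": {1}}, "P"), ({"x1": {2}, "x2": {2}}, "P"),
         ({"x1": {1}, "x2": {2}}, "N"), ({"x1": {2}, "x2": {1}}, "N")]
tree = build(rules, ["x1", "x2"])
```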
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2
(using the default LEF). The structure was evaluated on the testing examples. The prediction
accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).
Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf C3.
Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are
recalculated only for those rules which contain [x6=1] as one of their conditions. In this
example, x1 has the highest importance score, so it was selected as the next node in the structure. This
process is repeated for each subset of rules until the decision structure is completed.
For comparison, the program C4.5 for learning decision trees from examples was also applied to
this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window
setting (the maximum of 20% of the number of examples and twice the square root of the number of
examples) and with the number of trials set to one. C4.5 was chosen for the comparative studies
because it is one of the most accurate and efficient systems for learning decision trees from examples,
and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a
randomly selected subset of the training examples). It starts with a randomly selected window of
examples, generates a trial tree, tests this tree against the remaining examples, adds some
misclassified examples to the original ones, and continues until either all training examples are
classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was
learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97
examples were classified correctly and 18 were mismatched.
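The windowing loop just described can be sketched generically. This is a simplified paraphrase of the procedure, not C4.5's actual implementation; the toy learner below (a threshold rule) and all names are hypothetical, standing in for the tree grower.

```python
# Simplified sketch of windowing: grow a model on a window of examples, test it
# on the rest, add the misclassified examples to the window, and repeat until
# nothing is misclassified or the trial budget runs out.
import random

def window_learn(examples, grow, classify, window_size, max_trials=10, seed=0):
    rng = random.Random(seed)
    window = rng.sample(examples, min(window_size, len(examples)))
    model = grow(window)
    for _ in range(max_trials):
        missed = [e for e in examples
                  if e not in window and classify(model, e) != e[-1]]
        if not missed:
            break
        window += missed          # enlarge the window with misclassified cases
        model = grow(window)
    return model

# Toy stand-in learner: examples are (x, class); the "model" is a threshold.
examples = [(i, "P" if i >= 5 else "N") for i in range(10)]

def grow(window):
    positives = [x for x, c in window if c == "P"]
    return min(positives) if positives else float("inf")

def classify(threshold, example):
    return "P" if example[0] >= threshold else "N"

model = window_learn(examples, grow, classify, window_size=3)
```

Regardless of the initial window, the loop keeps pulling in misclassified positives until the threshold settles at the true boundary.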
Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves; diagram not reproduced).
Figure 4-4 shows a decision structure learned with the default settings of the AQDT-2 parameters from
the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing
examples resulted in 102 examples matched correctly and 13 examples mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition
that x1 cannot be measured. Some leaves represent situations in which a definite decision
cannot be made without knowing x1. This incomplete decision structure was tested on the 115 testing
examples, from which the value of x1 was removed. The decision structure classified 71 examples
correctly, 14 incorrectly, and 30 were assigned the (indefinite) decision. Such leaves can be
replaced by sets of candidate decisions with their corresponding probability distributions.
Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves; diagram not reproduced).
Figure 4-5: A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves; diagram not reproduced).
Figure 4-6 presents the decision structure from Figure 4-5 in which leaves were assigned candidate
decisions with decision class probability estimates. Let us consider the node x2. The example
frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using
equation (11), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be
approximated as P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11.
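The quoted probabilities are consistent with normalizing the example frequencies w_i; that reading (each class probability is w_i divided by the sum of the w_i, with the tw values not entering this particular computation) is an assumption, since equation (11) is not reproduced here.

```python
# Reconstructing the node-x2 probability estimates quoted above, under the
# assumption that P(Ci) = w_i / sum(w), rounded to two decimals.
w = {"C1": 31, "C2": 11, "C3": 0, "C4": 5}
total = sum(w.values())  # 47 examples reaching the node
p = {cls: round(wi / total, 2) for cls, wi in w.items()}
```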
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were
truncated under the assumption of a 10% noise level (this means that rules whose combined
t-weight represented 10% or less of the coverage of the training examples in a given class were removed).
The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89%
for the decision structure in Figure 4-4).
Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves; diagram not reproduced).
Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves; diagram not reproduced).
To demonstrate changes in the concept description learned by AQDT-2 under different decision-making
situations, four attributes were selected for visualizing the change in the learned concept
after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in
the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes
and six leaves; the predictive accuracy of this decision structure was 86.1%. In the second decision-making
situation, x1 was given a high cost. AQDT-2 learned a decision structure with five
nodes and seven leaves; the predictive accuracy of this decision structure was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal
situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified,
using only the four attributes which were used in building the initial decision trees. The visualization
diagram uses different shades for different decision classes. Another shade is used to illustrate
cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7).
Also, white cells indicate that an accurate decision cannot be derived from the rules without
knowing the value of the removed attribute. In such cases, multiple decisions can be provided with
their appropriate probabilities.
Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data (white cells mean the system cannot produce a decision without the missing attribute; diagram not reproduced).
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c
for a set of learning problems with 18 different parameter settings (two types of
decision rules, characteristic or discriminant; three coverage modes, intersecting, disjoint, or
ordered, i.e., decision lists; and three beam search widths, 1, 5, and 10). The two settings that
gave the best results in terms of predictive accuracy (<Chr, Dij, 10> and <Chr, Int, 1>; see Table
4-2) were selected for experiments with Subsystem II.
These experiments were performed on four learning problems (the three MONK's problems [Thrun,
Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two
parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2.
Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the
predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value
in this table is an average predictive accuracy over 100 runs of either program
on 100 distinct, randomly selected training sets of the given size. Each of these runs was
tested with testing examples that represent the complement of the training example set.
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c
and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting
covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means
discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed
and selected parameters of Subsystem II were modified. The experiments were performed on
characteristic decision rules that were learned in intersecting or disjoint modes. For each
dataset, the result reported from each experiment is calculated as the average of 100 runs on different
training data for 9 different sample sizes. The parameters changed in this experiment were the
threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2
algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples
covered by rules belonging to different decision classes at a given node of the decision
structure/tree.
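The generalization-degree test can be sketched as follows. This follows the text's description (a ratio of examples covered by rules of different classes at a node); the exact formula used in AQDT-2 may differ, and the function name is hypothetical.

```python
# Sketch of the generalization-degree stopping test described above: if the
# examples covered by minority-class rules at a node amount to no more than
# the given ratio of the majority-class coverage, the node becomes a leaf.

def generalize_to_leaf(class_coverage, degree=0.10):
    """class_coverage: class -> number of training examples covered at the node."""
    ranked = sorted(class_coverage.items(), key=lambda kv: kv[1], reverse=True)
    majority_cls, majority = ranked[0]
    minority = sum(n for _, n in ranked[1:])
    if majority > 0 and minority / majority <= degree:
        return majority_cls   # stop growing; assign the majority class
    return None               # keep splitting

leaf = generalize_to_leaf({"C1": 95, "C2": 5})
```

Raising the degree prunes more aggressively (smaller trees, possibly lower accuracy); lowering it toward zero reproduces the exact, unpruned structure.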
Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem. (Four panels, <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>, each plotting predictive accuracy against the relative sample size (%) of the training data; plots not reproduced.)
Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2
with different parameter settings. The "default" curve shows the predictive accuracy obtained with the
default settings of AQDT-2; the default pre-pruning threshold is 3% and the default generalization
degree is 10%. The results show that with the wind bracing data it is better to reduce the
generalization degree to 3%. However, changing the pre-pruning degree did not improve the
predictive accuracy.
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and by the C4.5 program for learning decision trees (Quinlan, 1990). Both systems
were set to their default parameters. All the results reported here are averages of 100 runs. For
each dataset, we report the predictive accuracy, the complexity of the learned decision trees, and
the time taken for learning. Figure 4-11 shows a simple summary of these experiments.
Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data. (Two panels, <Disj, Char> and <Intr, Char>, plotting predictive accuracy against the relative sample size (%) of the training data; plots not reproduced.)
Figure 4-11: A comparison of AQ15c and AQDT-2 against C4.5 on the wind bracing data. (Three panels plotting predictive accuracy, complexity, and learning time against the relative size (%) of the training examples; plots not reproduced.)
4.3 Experiments with Small-Size, Simple, and Noise-Free Problems: MONK-1
This subsection describes an experimental analysis of the AQDT approach on the MONK-1
problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification
rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists
of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values:
octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3,
is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color
(values: red, yellow, green, or blue); and x6, has-tie (values: yes or no).
The original problem was to learn a concept from 124 training examples (62 positive and 62
negative). These training examples constitute 29% of all possible examples (432); thus the
density of the training examples is relatively high. Figure 4-12 shows a visualization diagram,
obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and
negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c
for the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2
criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned
when using different criteria.
Figure 4-12: A visualization diagram of the MONK-1 problem (diagram not reproduced).
The AQDT-2 program, running in its default mode with the optimality criterion set to minimize
the number of nodes (i.e., with the disjointness criterion ranked first), produced a decision tree with
41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also
applied to this same problem.
Positive rules: 1. [x5=1]; 2. [x1=3][x2=3]; 3. [x1=2][x2=2]; 4. [x1=1][x2=1]
Negative rules: 1. [x1=1][x2=2,3][x5=2..4]; 2. [x1=2][x2=1,3][x5=2..4]; 3. [x1=3][x2=1,2][x5=2..4]
Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.
The C4.5 program did not produce a consistent and complete decision tree when run with its
default window size (the maximum of 20% and twice the square root of the number of examples), nor with a
100% window size. After 10 trials with different window sizes, we succeeded in making C4.5
produce the same optimal decision tree as AQDT-2 (using the window size of 725). This tree is
presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was
used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that
takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These
rules were:
Pos <= [x5=1] v [x1=x2] and Neg <= [x5≠1] & [x1≠x2]
Table 4-3: Evaluation of the attribute selection criteria for the MONK-1 problem.
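The AQ17-DCI rules above reduce MONK-1 to a two-test classifier, which can be transcribed directly; encoding value 1 of x5 as "red" follows the attribute description given earlier, and the function name is illustrative.

```python
# Direct transcription of the constructive-induction rules for MONK-1:
# Pos <= [x5=1] v [x1=x2]; Neg otherwise.

def monk1(example):
    if example["x5"] == 1 or example["x1"] == example["x2"]:
        return "Pos"
    return "Neg"
```

Because the two rules partition the example space exactly, this classifier represents the target concept with 100% accuracy, matching the compact structure of Figure 4-15-b.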
From these rules, the system produced the compact decision structure presented in Figure 4-15-b.
It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically
equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they
represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler
decision structure was produced (Figure 4-15-a).
Figure 4-14: The decision tree for the MONK-1 problem (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative; diagram not reproduced).
Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: a) a compact decision structure for the AQ15 rules (5 nodes, 7 leaves); b) a compact decision structure for the AQ17 rules (2 nodes, 3 leaves). (Diagrams not reproduced.)
Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments
involved running AQ15c for a set of learning problems with 18 different parameter settings (two
types of decision rules, characteristic or discriminant; three coverage modes, intersecting,
disjoint, or ordered, i.e., decision lists; and three widths of the beam search, 1, 5, and 10).
The two settings that gave the best results in terms of predictive accuracy (<Ch, Dij, 10> and
<Ch, Int, 1>) were selected for experiments with Subsystem II. These experiments were
performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by
AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from
the decision rules.
Each value in that table is an average predictive accuracy over 100 runs of either program
on 100 distinct, randomly selected training sets of the given size. Each of these
runs was tested with a testing example set that represented the complement of the training example
set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between
AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers,
<Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant
rules.
Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem. (Four panels, <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>, each plotting predictive accuracy against the relative sample size (%) of the training data; plots not reproduced.)
Experiments with Subsystem II: The same experiments were performed on the MONK-1
problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were
modified. The experiments were performed on characteristic decision rules that were learned in
intersecting or disjoint modes. For each dataset, the results reported from each experiment
were calculated as the average of 100 runs on different training data for 9 different sample sizes.
The parameters changed in this experiment were the threshold of pre-pruning of the decision
rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure
4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with
different parameter settings. The "default" curve shows the predictive accuracy obtained with the default
settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree
is 10%. The results show that with the MONK-1 data it is slightly better to reduce the
generalization degree to 3%. However, increasing the pre-pruning degree did not improve the
predictive accuracy.
Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data. (Two panels, <Disj, Char> and <Intr, Char>, plotting predictive accuracy against the relative sample size (%) of the training data; plots not reproduced.)
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and by the C4.5 program for learning decision trees. Both systems were set to
their default parameters. The experiments were divided into two parts. All the results reported here
are averages of 100 runs. For each dataset, we report the predictive accuracy, the complexity
of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary
of these experiments.
Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem. (Three panels plotting predictive accuracy, complexity, and learning time against the relative size (%) of the training examples; plots not reproduced.)
4.4 Experiments with Small-Size, Complex, and Noise-Free Problems: MONK-2
The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be
easily described as a DNF expression using the original attributes). The problem is described in a
similar way to the MONK-1 problem. The data consists of two decision classes, Positive and
Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape
(values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding
(values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and
x6, has-tie (values: yes or no). The original problem was to learn a concept from 169 training
examples (62 positive and 62 negative). These training examples constitute 40% of all possible
examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and
negative) and the concept to be learned.
Figure 4-19: A visualization diagram of the MONK-2 problem (diagram not reproduced).
Experiments with Subsystem I: The two settings that gave the best results in terms of predictive
accuracy were the same as for the other problems (<Ch, Dij, 10> and <Ch, Int, 1>); they were
selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules
learned by AQ15c from examples and the predictive accuracy of decision structures learned by
AQDT-2 from these decision rules.
Each value in that table is an average predictive accuracy over 100 runs of either program
on 100 distinct, randomly selected training sets of the given size. Each of these
runs was tested with a testing set that is the complement of the training examples.
Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> disjoint covers, <Char> characteristic rules, and <Disc> discriminant rules; the number is the width of the beam search.
Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem
Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the result reported from each experiment was calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the pre-pruning threshold applied to the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained with the default settings of AQDT-2; the default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning threshold did not improve the predictive accuracy.
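AQDT-2's actual pre-pruning criterion is defined in (Michalski & Imam, 1994); as a rough illustration only, the following hypothetical sketch discards rules whose share of the covered training examples falls below a threshold such as the 3% default mentioned above:

```python
def prune_rules(rules, threshold):
    # Hypothetical pre-pruning sketch: drop any rule whose share of the
    # class's covered training examples falls below `threshold`
    # (e.g., 0.03 for a 3% threshold). This is an illustration of the
    # idea, not AQDT-2's actual criterion.
    total = sum(r["coverage"] for r in rules)
    return [r for r in rules if r["coverage"] / total >= threshold]

rules = [{"name": "R1", "coverage": 60},
         {"name": "R2", "coverage": 38},
         {"name": "R3", "coverage": 2}]   # covers only 2% of examples
kept = prune_rules(rules, 0.03)
print([r["name"] for r in kept])   # ['R1', 'R2']
```

The intuition is that low-coverage rules are the ones most likely to encode noise, so removing them before tree construction trades a little completeness for robustness.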
Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and by the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a brief summary of these experiments.
Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem
4.5 Experiments With Small, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded areas and the plus signs in the unshaded areas are considered noisy examples. Noisy examples are examples that are assigned the wrong decision class.
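The kind of class noise described above can be simulated by flipping a fixed fraction of labels. The 5% rate and the record layout below are illustrative assumptions, not values taken from the MONK-3 data:

```python
import random

def add_class_noise(examples, noise_rate, rng):
    # Flip the class label of a random `noise_rate` fraction of examples,
    # producing "noisy examples" in the sense defined above: examples
    # assigned the wrong decision class.
    noisy = [dict(e) for e in examples]
    k = int(round(noise_rate * len(noisy)))
    for i in rng.sample(range(len(noisy)), k):
        noisy[i]["cls"] = "neg" if noisy[i]["cls"] == "pos" else "pos"
    return noisy

rng = random.Random(1)
data = [{"id": i, "cls": "pos" if i % 2 else "neg"} for i in range(100)]
noisy = add_class_noise(data, 0.05, rng)
flipped = sum(1 for a, b in zip(data, noisy) if a["cls"] != b["cls"])
print(flipped)   # 5
```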
Figure 4-23: A visualization diagram of the MONK-3 problem
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of each of the two programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed, and selected parameters of Subsystem II (the decision-making process) were changed. The result reported from each experiment was calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.
Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem
Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The drop in predictive accuracy at some sample sizes is due to the fact that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem
4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3) Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes except the sample code number have a domain of ten values (they were scaled).

In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by AQDT-2 and C4.5. All the results reported here are based on the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.
The drop in predictive accuracy at some sample sizes is due to the fact that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error
rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem
4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible and poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes: 1) Cap-shape, 2) Cap-surface, 3) Cap-color, 4) Bruises, 5) Odor, 6) Gill-attachment, 7) Gill-spacing, 8) Gill-size, 9) Gill-color, 10) Stalk-shape, 11) Stalk-root, 12) Stalk-surface-above-ring, 13) Stalk-surface-below-ring, 14) Stalk-color-above-ring, 15) Stalk-color-below-ring, 16) Veil-type, 17) Veil-color, 18) Ring-number, 19) Ring-type, 20) Spore-print-color, 21) Population, and 22) Habitat.

To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a brief summary of these experiments.

In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees.
The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning time is about the same. The drop in predictive accuracy at some sample sizes is again due to the fact that the testing data is not fixed for each sample: one error may represent a 1.01% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem
4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains

Learning Task-oriented Decision Structures from Structural Data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West trains problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes, eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different lengths). To describe the train problem in a format suitable for AQDT-2, a set of eight attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To identify the position of a given car in the train, each of the eight attributes was associated with a two-digit code (ij): the first digit identifies the position of the car, and the second identifies the number of the attribute itself. For example, in the attribute name x32, the 3 refers to the third car and the 2 refers to the second attribute (the car shape). In other words, attribute x32 labels the attribute describing the shape of the third car.
Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1-4)
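The two-digit coding scheme can be sketched as follows. The attribute order is assumed, except that attribute 2 is the car shape and attribute 7 the load shape, matching the x32 and x17 examples in the text; the attribute names and values are otherwise hypothetical:

```python
def encode_train(cars):
    # Encode a train (a list of cars, each a dict of eight per-car
    # attributes) as a flat attribute-value mapping, labeling each
    # attribute with the two-digit code x<i><j>: attribute j of car i.
    attr_order = ["length", "shape", "wheels", "roof",
                  "walls", "load_qty", "load_shape", "wall_type"]
    encoded = {}
    for i, car in enumerate(cars, start=1):
        for j, name in enumerate(attr_order, start=1):
            encoded[f"x{i}{j}"] = car[name]
    return encoded

train = [
    {"length": "long", "shape": "rect", "wheels": 2, "roof": "none",
     "walls": "double", "load_qty": 1, "load_shape": "circle",
     "wall_type": "flat"},
    {"length": "short", "shape": "u_shaped", "wheels": 2, "roof": "none",
     "walls": "single", "load_qty": 1, "load_shape": "triangle",
     "wall_type": "flat"},
]
enc = encode_train(train)
print(enc["x22"])   # 'u_shaped' (attribute 2, the shape, of car 2)
```

A two-car train yields 16 attribute-value pairs, a four-car train 32, which is how examples of different lengths arise.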
Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure classified 19 trains (out of 20) correctly. The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.

Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies all 14 trains with three or more cars. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other six trains correctly using a flexible matching method (Michalski et al., 1986).
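Flexible matching as defined in (Michalski et al., 1986) is more elaborate, but its core idea, grading partial matches instead of requiring strict rule satisfaction, can be sketched as follows; the rule encoding and attribute values here are hypothetical:

```python
def match_degree(example, rule):
    # A simple form of flexible matching: the fraction of a rule's
    # conditions the example satisfies. The exact measure used by AQ
    # systems is given in (Michalski et al., 1986); this is a sketch.
    satisfied = sum(1 for attr, allowed in rule.items()
                    if example.get(attr) in allowed)
    return satisfied / len(rule)

def classify_flexibly(example, rules_by_class):
    # Assign the class whose best-matching rule has the highest degree
    # of match, so even examples covered by no rule get a decision.
    return max(rules_by_class,
               key=lambda c: max(match_degree(example, r)
                                 for r in rules_by_class[c]))

rules = {
    "east": [{"x11": {"short"}, "x12": {"closed"}}],
    "west": [{"x11": {"long"}, "x12": {"open"}}],
}
ex = {"x11": "short", "x12": "jagged"}   # matches no rule exactly
print(classify_flexibly(ex, rules))      # 'east'
```

This is how trains with fewer than three cars, which fall outside the strict coverage of the Car-3 decision structures, can still receive a classification.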
Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: a) using only descriptions of Car 1; b) using only descriptions of Car 2; c) using only descriptions of Car 3
4.9 Experiments With Small, Simple, and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of examples (216 examples in total; half of the examples were in one class and the other half in the other class).
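C4.5's default initial window, as described above, can be written as a one-line formula; whether the two quantities are truncated or rounded is an assumption here:

```python
import math

def default_window_size(n_examples):
    # C4.5's default initial window: the larger of 20% of the number of
    # examples and twice the square root of the number of examples.
    return max(int(0.2 * n_examples), int(2 * math.sqrt(n_examples)))

print(default_window_size(216))   # 43: 20% of 216 exceeds 2 * sqrt(216)
```

For the 216-example voting data, the percentage term dominates; the square-root term only matters for small data sets.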
Table 4-8 and Figures 4-30 a and b show the results graphically for the Congressional Voting-1984 problem. The results indicate that the decision trees generated by AQDT-2 had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change in the size of the training example set was smaller.
Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data

Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2: a) accuracy of the decision tree as a function of the size of the set of training examples; b) size of the decision tree as a function of the size of the set of training examples
4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4-2 to 4-9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from those rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others), the best cover is determined according to the best width of the beam search and the best rule type.
Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics
It was clear that AQDT-2 works better with characteristic rules than with discriminant ones. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results: 1) if the difference between the average predictive accuracies of the two systems is within ±2%, the predictive accuracy is considered the same; otherwise, one predictive accuracy is considered higher and the other lower; 2) if the average learning times are within ±0.1 seconds of each other, the learning time is considered the same.
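The ±2% tie-breaking heuristic can be expressed directly; the function name and percentage-point inputs are illustrative:

```python
def compare_systems(acc_a, acc_c, name_a="AQDT-2", name_c="C4.5"):
    # The +/- 2% heuristic described above: accuracy differences within
    # 2 percentage points are treated as a tie ("Same"); otherwise the
    # system with the higher average accuracy wins.
    diff = acc_a - acc_c
    if abs(diff) <= 2.0:
        return "Same"
    return name_a if diff > 0 else name_c

print(compare_systems(94.3, 93.1))   # 'Same'
print(compare_systems(96.5, 91.2))   # 'AQDT-2'
print(compare_systems(88.0, 92.7))   # 'C4.5'
```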
Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparisons of the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same-X means similar performance, where AQDT-2 has the advantage if X=A and C4.5 has the advantage if X=C.
Some conclusions can be drawn from these comparisons. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, while C4.5 produces smaller but less accurate ones. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except on noisy data. The size of the decision trees learned by C4.5 grows relatively quickly as the training data increases. Also, C4.5 works better than AQDT-2 on noisy data. The reasons are that AQDT-2 over-generalizes the decision rules, while C4.5 uses a window for learning decision trees. The learning time of AQDT-2 should be much less than that of C4.5; however, on some data sets it takes more time, because in situations where there is not enough information to reach a decision, the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not yet implemented.
To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.

Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class; the white areas represent non-positive coverage.
Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. Marked shaded cells indicate false positive errors (AQ15c classifies the cell as positive when it should be negative), and marked non-shaded cells indicate false negative errors (AQ15c classifies the cell as negative when it should be positive).
Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, one shading pattern indicates portions of the representation space classified as positive by both AQ15c and AQDT-2; a second marks portions classified as positive by AQ15c but as negative by AQDT-2; and a third represents portions where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors produced by the AQDT-2 decision tree.

Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem
Figure 4-34 is similar to Figure 4-33, but illustrates the false positive and false negative errors: some marked cells indicate portions of the representation space with false positive errors, and others indicate portions with false negative errors. Comparing Figures 4-34 and 4-32, more errors occurred because of the over-generalization.
Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree

Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%
CHAPTER 5 CONCLUSION
5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.
The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or obtained from an expert. A decision structure is generated on line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: that in order to determine a decision structure from examples, it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) the decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than the decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply those decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.
5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.

Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of the rule learner, it could potentially be applied with other decision rule learning systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.
Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.
Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.
Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.
Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.
Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.
Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.
Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.
Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.
Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.
Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.
Hart, A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.
Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.
Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, from the proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.
Imam, I.F. and Michalski, R.S. (1993b), "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. and Zemankova, M. (Eds.), Kluwer Academic Pub., MA.
Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.
Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," in the Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam, I.F. and Michalski, R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," in the Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.
Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.
Kohavi, R. and Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," in the Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), "International East-West Challenge," Oxford University, UK.
Michalski, R.S. (1973), "AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.
Michalski, R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.
Michalski, R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-161).
Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.
Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer-Verlag, from the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.
Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 4, No. 2 (pp. 227-243), Kluwer Academic Publishers.
Niblett, T. and Bratko, I. (1986), "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.
Quinlan, J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan, J.R. (1983), "Learning Efficient Classification Procedures and Their Application to Chess End Games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.
Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.
Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).
Quinlan, J.R. (1990), "Probabilistic Decision Trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers, June.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.
Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI-90, Stockholm, August.
Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.
Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.
Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
Vita
Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in
Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He
received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition.
Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-96). He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 and MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on machine learning, intelligent agents and adaptive
systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
Associate Dean of the School of Information Technology and Engineering, for guidance on
preparing the Ph.D. proposal.
I would like to thank the conference organizers who supported me in attending their conferences and
presenting parts of my Ph.D. work. The organizers include Professor Moonis Ali, Professor Frank
Anger, Professor Zbigniew Ras, the American Association for Artificial Intelligence, and others. I
would also like to thank the organizing committee of the Florida Artificial Intelligence Research
Symposium (FLAIRS-95) for organizing my first workshop. Thanks go to Dr. Doug Dankel, Dr.
Howard Hamilton, Dr. John Stewman, and Dr. Dan Tamir.
I would also like to thank the many individuals who helped me in any way during my Ph.D. Those
include Dr. Ashraf Abdel-Wahab, Dr. Jerzy Bala, Dr. Alex Brodsky, Dr. Richard Carver, Dr.
Kenneth De Jong, John Doulamis, Tom Dybala, Dean Ibrahim Farag, Dr. Ophir Frieder, Csilla
Frakes, Dr. Mohamed Habib, Ali Hadjarian, Dr. Hugo De Garis, Zenon Kulpa, Dr. Paul Lehner,
Dr. George Michaels, Dr. Eugene Norris, Dean James Palmer, Mitch Potter, Dr. Ahmed Rafea,
Jim Ribeiro, Jayshree Sarma, Dr. Arun Sood, Dr. Clifton Sutton, Bradley Utz, Dr. Tibor Vamos,
Patricia Zahra, Dr. Shaker Zahra, and Dr. Jianping Zhang.
Dedication
To my mother my brothers and my sister
TABLE OF CONTENTS
TITLE Page
ABSTRACT 1
CHAPTER 1
INTRODUCTION 3
11 Motivation and Overview 3
12 The Problem Statement 6
CHAPTER 2
RELATED RESEARCH 7
21 Learning Decision Trees from Decision Diagrams 7
22 Learning Decision Trees from Examples 10
221 Building Decision Trees Using Information-based Criteria 11
222 Building Decision Trees Using Statistics-based Criteria 16
223 Analysis of Attribute Selection Criteria 18
23 Learning Decision Structures 19
CHAPTER 3
DESCRIPTION OF THE APPROACH 23
31 General Methodology 23
32 A Brief Description of the AQ15 and AQ17 Rule-Learning Systems 25
33 Generating Decision Structures From Decision Rules 28
331 The AQDT-2 attribute selection method 29
332 The AQDT-2 algorithm 37
333 An example illustrating the algorithm 42
34 Tailoring Decision Structure to a decision-making situation 47
341 Learning Cost-Dependent Decision Structures 49
342 Assigning Decision Under Insufficient Information 49
343 Coping with noise in training data 50
35 Analysis of the AQDT-2 Attribute Selection Criteria 51
36 Decision Structures vs Decision Trees 53
CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
41 Description of the Experimental Analysis 59
42 Experiments With Average Size Complex and Noise-Free Problems
Wind Bracings 60
43 Experiments With Small Size Simple and Noise-Free Problems
MONK-1 69
44 Experiments With Small Size Complex and Noise-Free Problems
MONK-2 76
45 Experiments With Small Size Simple and Noisy Problems
MONK-3 79
46 Experiments With Large Size Complex and Noise-Free Problems
Diagnosing Breast Cancer 83
47 Experiments With Large Size Complex and Noisy Problems
Mushroom classifications 84
48 Experiments With Small Size Structured and Noise-Free Problems
East-West Trains 85
49 Experiments With Small Size Simple and Noisy Problems
Congressional Voting Records (1984) 87
410 Analysis of the Results 88
CHAPTER 5 CONCLUSIONS 95
51 Summary 95
52 Contributions 96
REFERENCES 98
VITA 102
LIST OF TABLES
No TITLE Page
2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees
provided by different attribute selection criteria on four problems 19
2-9 Comparing the AQDT approach with EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible rankings, domains, and usage conditions of AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-l problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-l problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained
by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach
with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90
LIST OF FIGURES
No TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept Voting pattern of
Democratic Representatives 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees)
from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute xl 64
4-6 A decision structure without Xl with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the
assumption of 10 classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making
situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 and AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem 91
4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem 93
4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing
the generalization degree to 1 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M Fahmi Imam PhD
School of Information Technology and Engineering
George Mason University, Fall 1995
Ryszard S Michalski Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from
decision rules. The philosophy behind this research is that it is more appropriate to learn
knowledge and store it in a declarative form, and then, when a decision-making situation occurs,
to generate from this knowledge the decision structure that is most suitable for that
decision-making situation. Learning decision structures from decision rules was first introduced
by Michalski (1978). The first implementation of this approach, by Imam and Michalski
(1993a, b), was called AQDT-1.
This approach separates the function of generating a knowledge base from the function of using
the knowledge base for decision-making. The first function focuses on learning an accurate,
consistent, and complete concept description expressed in a declarative form. The second function
is performed whenever a new decision-making situation occurs: a task-oriented decision structure
is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted
for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam,
1994).
The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from
decision rules or examples. Each decision-making situation is defined by a set of parameters that
controls the learning process of the AQDT-2 system. The extensive experiments with AQDT-2 show
that the decision structures it learns usually outperform, in terms of accuracy and average size,
the decision structures learned from examples by other well-known systems. The results
also show that the system does not work very well with noisy data. The system is illustrated and
compared using applications to artificial problems, such as the three MONK's problems (Thrun,
Mitchell, & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also
applied to real-world problems of learning decision structures in the areas of construction
engineering (for determining the best wind bracing design for tall buildings), medical
diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for
learning classification rules for distinguishing between poisonous and non-poisonous
mushrooms), and political data (for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
11 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge, but also
to use this knowledge for decision-making. The main step in the development of systems for
decision-making is the creation of a knowledge structure that characterizes the decision-making
process. The form in which knowledge can be easily obtained may, however, differ from the form
in which it is most readily used for decision-making. It is therefore important to identify the form
of knowledge representation that is most appropriate for learning (e.g., due to ease of its
modification) and the form that is most convenient for decision-making.
A simple and effective tool for describing decision processes is a decision structure, which is a
directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to
arrive at a decision about that object. The nodes of the structure are assigned individual tests
(which may correspond to a single attribute, a function of attributes, or a relation); the branches are
assigned possible test outcomes or ranges of outcomes; and the leaves are assigned a specific
decision, a set of candidate decisions with corresponding probabilities, or an undetermined
decision. A decision structure reduces to a familiar decision tree when each node is assigned a
single attribute and has at most one parent, when the branches from each node are assigned single
values of that attribute, and when the leaves are assigned single, definite decisions. Thus, the problem
of generating a decision structure is a generalization of the problem of generating a decision tree.
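The distinction between a decision structure and a decision tree can be made concrete with a small sketch (an illustrative data structure only; the type and field names are my own, not taken from AQDT-2):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple, Union

@dataclass
class Leaf:
    # One definite decision, several candidate decisions with
    # probabilities, or None for an "undetermined" decision.
    decisions: Optional[Dict[str, float]]

@dataclass
class Node:
    test: str  # a single attribute, or a function of attributes
    # Each branch key is a tuple of outcomes (a range of test outcomes);
    # a decision tree is the special case of single-value keys.
    branches: Dict[Tuple, Union["Node", Leaf]] = field(default_factory=dict)

# A decision structure: branches carry outcome ranges, and leaves may
# hold candidate decisions with probabilities or be undetermined.
structure = Node("x1", {
    (1, 2): Leaf({"C1": 1.0}),                 # definite decision
    (3,): Node("x2", {
        (0,): Leaf({"C2": 0.7, "C3": 0.3}),    # candidate decisions
        (1,): Leaf(None),                      # undetermined decision
    }),
})
```

The structure reduces to an ordinary decision tree when every branch key holds a single value and every leaf holds exactly one definite decision.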
Decision trees are typically generated from a set of examples of decisions. The essential
characteristic of any such method is the attribute selection criterion used for choosing attributes to
be assigned to the nodes of the decision tree being built. Such criteria include the entropy
reduction, the gain, and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman
et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
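To make the flavor of such criteria concrete, here is a minimal sketch of the entropy-reduction (information gain) idea; it illustrates the general principle only, not the exact formulation used by any one of the cited systems, and the toy data is invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy reduction achieved by splitting `examples` on `attribute`.

    Each example is a (features_dict, class_label) pair.
    """
    labels = [label for _, label in examples]
    before = entropy(labels)
    by_value = {}
    for features, label in examples:
        by_value.setdefault(features[attribute], []).append(label)
    after = sum(len(subset) / len(examples) * entropy(subset)
                for subset in by_value.values())
    return before - after

# Toy data: 'outlook' separates the classes perfectly, 'windy' does not.
examples = [
    ({"outlook": "sunny", "windy": True}, "play"),
    ({"outlook": "sunny", "windy": False}, "play"),
    ({"outlook": "rain", "windy": True}, "stay"),
    ({"outlook": "rain", "windy": False}, "stay"),
]
print(information_gain(examples, "outlook"))  # 1.0 (a perfect split)
```

Splitting on "windy" leaves both subsets as mixed as before, so its gain is 0; a gain-based criterion would therefore place "outlook" at the node.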
A decision tree/decision structure representation can be an effective tool for describing a decision
process, as long as all the required tests can be performed and the decision-making situations it
was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine
that answers for all symptoms appear in the decision tree). Problems arise when these assumptions
do not hold. For example, in some situations measuring certain attributes may be difficult or costly
(e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the
tools needed are not available). In such situations it is desirable to reformulate the decision
structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the
root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far
away from the root). If an attribute cannot be measured at all, it is useful either to modify the
structure so that it does not contain that attribute or, when this is impossible, to indicate
alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a
significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient
example, the doctor may request a decision structure expressed in a specific set of symptoms,
biased to classify one or more diseases, or specifying a certain order of testing).
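One simple way to realize such cost-sensitive ordering is to discount each attribute's selection score by its measurement cost and to exclude unmeasurable attributes altogether. This is a generic sketch with invented score and cost figures, not the actual mechanism of AQDT-2 or any other system discussed here:

```python
def rank_attributes(scores, costs, unavailable=()):
    """Order attributes for node assignment: cheap, informative ones first.

    `scores` maps attribute -> selection-criterion value (higher = better);
    `costs` maps attribute -> measurement cost. Unavailable attributes are
    excluded entirely, forcing a structure built without them.
    """
    usable = [a for a in scores if a not in unavailable]
    return sorted(usable, key=lambda a: scores[a] / costs[a], reverse=True)

# Hypothetical doctor-patient attributes (figures invented for illustration).
scores = {"temperature": 0.4, "blood_test": 0.9, "brain_scan": 0.95}
costs = {"temperature": 1.0, "blood_test": 20.0, "brain_scan": 500.0}
print(rank_attributes(scores, costs))
# temperature comes first: its score per unit cost dominates
```

With this ranking, the expensive tests end up far from the root and are evaluated only when the cheap tests fail to decide.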
A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite
difficult. This is because a decision structure is a procedural representation that imposes an
evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative
representation such as a set of decision rules. Tests (conditions) of rules can be evaluated in any
order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent
decision structures (trees) which differ in the test ordering. Due to the lack of order constraints,
a declarative representation (rules) is much easier to modify and adapt to different situations than a
procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a
decision, one needs to decide in which order tests are evaluated, and thus needs a decision
structure.
An attractive solution to these opposite requirements is to acquire and store knowledge in a
declarative form, and transform it to a decision structure when it is needed for decision-making.
This method allows one to create a decision structure that is most appropriate in a given
decision-making situation. Because the number of decision rules per decision class is usually small (each
rule is a generalization of a set of examples), generating a decision structure from decision rules
can potentially be performed much faster than generation from training examples. Thus, this
process could be done on line, without any delay noticeable to the user. Such virtual decision
structures are easy to tailor to any given decision-making situation.
This approach allows one to generate a decision structure that avoids or delays evaluating an
attribute that is difficult to measure in some decision-making situation, or that fits well a particular
frequency distribution of decision classes. In other situations it may be unnecessary to generate a
complete decision structure; it may be sufficient to generate only the part of it that concerns
the decision classes of interest. Thus, such an approach has many potential advantages.
This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a
task-oriented decision structure (a decision structure that is adapted to the given decision-making
situation) from decision rules. The decision rules are learned by either the rule learning system
AQ15 (Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive
induction capabilities (Bloedorn et al., 1993).
To associate the decision rules with a given decision-making task, AQDT-2 provides a set of
features, including: 1) enabling the system to include in the decision structure nodes corresponding
to new attributes constructed during the process of learning the decision rules; 2) controlling the
degree of generalization needed during the development of the decision structure; 3) providing four
new criteria for selecting an attribute to be a node in the decision structure, which allow the system to
generate many different but equivalent decision structures from the same set of rules; 4) generating
"unknown" nodes in situations when there is insufficient information for generating a complete
decision structure; 5) learning decision structures from discriminant rules as well as
characteristic rules; and 6) providing the most likely decision when the decision process stops
due to inability to evaluate an attribute associated with an intermediate node.
To test the methodology of generating decision structures from decision rules, an extensive set of
planned experiments was designed to test different aspects of the approach. The experiments
include testing different combinations of parameters for each sub-function of the approach,
analyzing the relationship between decision rules and the decision structures learned from them, and
comparing decision trees learned by the AQDT-2 system with those of the well-known C4.5 (Quinlan,
1993) system for learning decision trees from examples. Different experiments were designed to examine
the new features of the AQDT approach for learning task-oriented decision structures. The
experiments were applied to artificial domains as well as real-world domains, including MONK-1,
MONK-2, and MONK-3 (Thrun, Mitchell, & Cheng, 1991), East-West trains (Michie et al.,
1994), Engineering Design: wind bracings (Arciszewski et al., 1992), Mushrooms, Breast
Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The
MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1
requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description
(one that cannot be easily described as a DNF rule using the original attributes). MONK-3 concerns
learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that
classifies two sets of trains (Eastbound and Westbound). The Engineering Design: wind
bracing data involves learning conditions for applying different types of wind bracing for tall
buildings. The Mushrooms data is concerned with learning classification rules for distinguishing
between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning
concept descriptions for recognizing breast cancer. The congressional voting data includes voting
records on different issues. AQDT-2 outperformed C4.5, on average, with respect to both
predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy
data or with problems that have many rules covering very few examples.
12 The Problem Statement
There are many limitations and problems that accompany using decision trees for decision-making.
CHAPTER 2 RELATED RESEARCH
21 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which
introduced an algorithm for generating decision trees from decision lists. That work proposed
several attribute selection criteria, versions of increasing power of the main criterion, the
order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two
specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree,
based on properties extracted from the decision diagram. In order to better explain the method,
it is necessary to define some terms. These terms will be used later in the dissertation.
Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all
examples of class C and does not include any examples of other decision classes.
Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically
disjoint; in other words, for any two rules there exists a condition on the same attribute but
with different values in each rule.
Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules
among all possible covers.
Definition 2-4: A diagram for a given cover is a table constructed graphically by representing
in a two-dimensional space all possible combinations of attribute values, locating on the
diagram all the condition parts of the given rules, and marking them with the action specified
by each rule.
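Definition 2-2 can be illustrated directly in code (a minimal sketch of my own; a rule is represented as a mapping from attributes to the sets of values its condition admits):

```python
def disjoint(rule_a, rule_b):
    """True if the two rules are logically disjoint: some shared attribute
    is constrained to non-overlapping value sets (Definition 2-2)."""
    return any(rule_a[a].isdisjoint(rule_b[a])
               for a in rule_a.keys() & rule_b.keys())

def is_disjoint_cover(rules):
    """A cover is disjoint if all its rules are pairwise logically disjoint."""
    return all(disjoint(r, s)
               for i, r in enumerate(rules) for s in rules[i + 1:])

# Two rules that share attribute x1 with non-overlapping value sets:
r1 = {"x1": {1, 2}, "x2": {0}}
r2 = {"x1": {3}, "x2": {0, 1}}
print(is_disjoint_cover([r1, r2]))  # True: the x1 conditions cannot both hold
```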
Michalski's algorithm requires construction of a minimal cover. The minimal cover should be
consistent and complete. The method is based on the fact that if there are n decision classes,
any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal
decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has
shown that if only one rule is broken by a selected attribute, then instead of having one leaf
(which could potentially represent this rule or the decision class in the tree), there will have to
be at least two leaves representing this rule in the final decision tree.
The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do
not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can
divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules.
In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] &
[x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1]; x3 breaks three rules: the rule [x4=2]
& [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] &
[x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.
Figure 2-1 An example to illustrate how attributes break rules
One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each
attribute an integer equal to the number of rules broken by that attribute. This criterion is also
called the static cost estimate of an attribute, or the criterion of minimizing added leaves
(MAL).
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the
estimated number of additional nodes in the decision tree being generated over a hypothetical
minimal decision tree. When there is a tie between two attributes, the attribute selected is
the one which breaks smaller rules (rules that cover fewer examples, i.e., more specialized
rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).
Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion
(Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but
is more complex, because once an attribute is selected as a node in the tree, some rules and/or
parts of the broken rules at each branch are merged into one rule. DMAL ensures that the
value of the total cost estimate of an attribute is decreased by a value equal to the number of
merged rules minus one.
Example: Learn a decision tree from the following decision table (Table 2-1).
The minimal cover consists of the following rules:
A1 <= [x2=0] v [x1=0][x2=2]    A2 <= [x2=1] v [x1=2][x2=2]    A3 <= [x1=1][x2=2]
The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5
for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two
leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of
the decision tree is x2. Then three branches are attached to the root node, and the decision rules
are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is
generated. For x2=2, another attribute is selected to be a node in the tree; in this case x1 has the
minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.
Figure 2-2 A decision tree learned from the decision table in Table 2-1
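The MAL values in this example can be reproduced with a short sketch. Here an attribute breaks a rule unless the rule constrains it to exactly one value, since otherwise splitting on that attribute divides the rule among two or more branches; the rule encoding and the domains assumed for x3 and x4 are illustrative, not from the dissertation:

```python
# Sketch of the first-degree (MAL) cost estimate: count the rules that
# the attribute would break. A rule is {attribute: set of allowed
# values}; an attribute absent from a rule leaves it unconstrained
# (the whole domain), so splitting on that attribute breaks the rule.
def mal(rules, attribute, domains):
    return sum(1 for conditions in rules
               if len(conditions.get(attribute, domains[attribute])) != 1)

# The minimal cover from the example above (A1, A2, A3):
cover = [{"x2": {0}}, {"x1": {0}, "x2": {2}},        # A1
         {"x2": {1}}, {"x1": {2}, "x2": {2}},        # A2
         {"x1": {1}, "x2": {2}}]                     # A3
domains = {"x1": {0, 1, 2}, "x2": {0, 1, 2},
           "x3": {0, 1, 2}, "x4": {0, 1, 2}}         # assumed domains
```

With this encoding the scores come out as in the text: 2 for x1, 0 for x2, and 5 for x3 and x4 (neither appears in any rule, so each breaks all five rules).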
2.2 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating decision trees that classify a set
of examples according to the decision classes they belong to. The essential aspect of any
inductive decision tree method is the attribute selection criterion. The attribute selection
criterion measures how good the attributes are for discriminating among the given set of
decision classes. The best attribute according to the selection criterion is chosen to be assigned
to a node in the tree. The first algorithm for generating decision trees from examples was
proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer
approach for building decision trees. This algorithm has been subsequently modified by
Quinlan (1979) and applied by many researchers to a variety of learning problems.
Attribute selection criteria can be divided into three categories: logic-based,
information-based, and statistics-based. The logic-based criteria for selecting attributes
use logical relationships between the attributes and the decision classes to determine the best
attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves
(Michalski, 1978), which uses conjunction and disjunction operators. The information-based
criteria are based on information theory. These criteria measure the information conveyed
by dividing the training examples into subsets. Examples of such criteria include the
information measure IM, the entropy reduction measure and the gain criterion (Quinlan, 1979,
83), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and
others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The
statistics-based criteria measure the correlation between the decision classes and the attributes.
These criteria use statistical distributions for determining whether or not there is a correlation.
The attribute with the highest correlation is selected to be a node in the tree. Examples of
statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984;
Mingers, 1989a).
Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the
method of learning decision trees to also handle data with noise (by pruning). Handling noise
extended the process of learning decision trees to include the creation of an initial complete
decision tree, followed by tree pruning, which is done by removing subtrees with small statistical validity
and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used
for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994).
Pruning decision trees improves their simplicity but reduces their predictive accuracy on the
training examples. Quinlan (1990) also proposed a method to handle the unknown
attribute-value problem by exploring probabilities of an example belonging to different classes.
The rest of this section includes a brief description of the attribute selection criterion used by
the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting
an attribute to be a node in the tree. The section also includes a brief description of the
Chi-square method for attribute selection (Mingers, 1989a), a statistics-based
method for selecting an attribute to be a node in the tree.
2.2.1 Building Decision Trees Using Information-based Criteria
This section presents a description of the inductive decision tree learning system C4.5. The
C4.5 learning system is considered one of the most stable, accurate, and fastest programs
for learning decision trees from examples.
Learning decision trees from examples requires a collection of examples. Each example is
represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning
program that induces classification decision trees from a set of given examples. The C4.5
learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on
Hunt's method for constructing decision trees from a set of cases (Hunt, Marin & Stone, 1966).
The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion
calculates the gain in classifying information based on the residual information needed to
classify cases in a set of training examples and the information yielded by the test, based on the
relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is
based on an earlier criterion used by ID3, called the Gain Criterion. The Gain Criterion uses
the frequency of each decision class in the given set of training examples.
Once an attribute is chosen to be a node in the tree, the system generates as many links as the
number of its values, and classifies the set of examples based on these values. If all the
examples at a certain node belong to one decision class, the system generates a leaf node and
assigns it to that class. Otherwise, the system searches for another attribute to be a node in the
tree.
The Gain Criterion: The gain criterion is based on information theory; that is, the
information conveyed by a message depends on its probability and can be measured in bits as
minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for
a given problem X1, ..., Xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is
any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S
is the number of examples in S that belong to class Ci.
freq(Ci, S) = the number of examples in S belonging to Ci   (2-1)
Suppose that |S| is the total number of examples in S. The probability that an example selected
at random from S belongs to class Ci is freq(Ci, S) / |S|.
The information conveyed by the message that a selected example belongs to a given decision
class Ci is determined by -log2(freq(Ci, S) / |S|) bits.
The expected information from such a message stating class membership is given by
info(S) = - Σ (i=1..k) (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)  bits   (2-2)
info(S) is also known as the entropy of the set S. When S is the initial set of training examples,
info(T) determines the average amount of information needed to identify the class of an
example in T.
Suppose that we select an attribute X to be the root of the tree, and suppose that X has k
possible values. The training set T will be divided into k subsets, each corresponding to one of
X's values. The expected information of selecting X to partition the training set T, infoX(T), can
be found as the sum, over all subsets, of the information conveyed by each subset multiplied
by its probability:
infoX(T) = Σ (i=1..k) (|Ti| / |T|) info(Ti)   (2-3)
The information gained by partitioning the training examples T into subsets using the attribute
X is given by
gain(X) = info(T) - infoX(T)   (2-4)
The attribute to be selected is the attribute with the maximum gain value.
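Equations 2-1 through 2-4 translate directly into code. The sketch below is illustrative (the data encoding is an assumption); it is exercised on Quinlan's weather data, which is used in the worked example later in this section:

```python
from collections import Counter
from math import log2

def info(examples):
    """Entropy of a set of (attribute value, class) pairs -- eq. 2-2."""
    total = len(examples)
    freq = Counter(cls for _, cls in examples)
    return -sum(n / total * log2(n / total) for n in freq.values())

def gain(examples):
    """Eq. 2-4: info(T) minus the weighted subset entropy of eq. 2-3."""
    total = len(examples)
    info_x = sum(len(subset) / total * info(subset)
                 for value in {v for v, _ in examples}
                 for subset in [[e for e in examples if e[0] == value]])
    return info(examples) - info_x

# Quinlan's weather data projected onto the attribute "outlook":
outlook = ([("sunny", "Play")] * 2 + [("sunny", "Dont")] * 3
           + [("overcast", "Play")] * 4
           + [("rain", "Play")] * 3 + [("rain", "Dont")] * 2)
```

With this data, info(T) comes to about 0.940 bits and gain(outlook) to about 0.246, matching the worked example below.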
The Gain Ratio Criterion: This criterion indicates the proportion of information generated by
the split that appears helpful for classification. Quinlan (1993) pointed out that the gain
criterion has a serious deficiency: it is strongly biased toward attributes with many
outcomes (values). For example, for any data that contains attributes such as social security
number, the gain criterion will select that attribute to be the root of the decision tree. However,
selecting such attributes increases the size of the decision tree. Quinlan provided a solution to
this problem by introducing the gain ratio criterion, which takes the ratio of the information that
is gained by partitioning the initial set of examples T by the attribute X to the potential
information generated by dividing T into n subsets.
Following steps similar to those used to obtain the information conveyed by dividing T into n
subsets, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is
determined by
split info(T) = - Σ (i=1..n) (|Ti| / |T|) log2(|Ti| / |T|)   (2-5)
The gain ratio is given by
gain ratio(X) = gain(X) / split info(X)   (2-6)
and it expresses the proportion of information generated by the split that is useful for
classification.
Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the
set of training examples.
First, determine the amount of information gained by selecting the attribute outlook to be the
root of the decision tree. This attribute divides the training examples into three subsets:
sunny, with five examples, two of which belong to the class Play; overcast, with four
examples, all of which belong to the class Play; and rain, with five examples, three of
which belong to the class Play. To determine info(T), the average information needed to
identify the class of an example in T: there are 14 training examples and two decision classes;
nine of these examples belong to the class Play and five belong to the class Don't Play.
info(T) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.940 bits
When using outlook to divide the training examples, the information becomes
info_outlook(T) = 5/14 × (-2/5 log2(2/5) - 3/5 log2(3/5))
+ 4/14 × (-4/4 log2(4/4) - 0/4 log2(0/4))
+ 5/14 × (-3/5 log2(3/5) - 2/5 log2(2/5)) = 0.694 bits
By substituting in equation 2-4, the gain of information resulting from using the attribute
outlook to split the training examples equals 0.246. The information gain for windy is
0.048.
Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split
information for outlook is determined as follows:
split info(T) = - 5/14 log2(5/14) - 4/14 log2(4/14) - 5/14 log2(5/14) = 1.577 bits
The gain ratio for outlook = 0.246 / 1.577 = 0.156
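The split-information arithmetic above is easy to verify from the subset sizes alone (5, 4 and 5 examples), reusing the gain of 0.246 computed earlier in the text:

```python
from math import log2

# Verify the worked gain-ratio numbers for "outlook" (eq. 2-5 and 2-6):
# split info depends only on the sizes of the three subsets.
sizes, total = (5, 4, 5), 14
split_info = -sum(n / total * log2(n / total) for n in sizes)

gain_ratio = 0.246 / split_info   # gain(outlook) = 0.246, from the text
```

This reproduces split info ≈ 1.577 bits and a gain ratio ≈ 0.156.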
Figure 2-3 A decision tree learned using the gain criterion for selecting attributes
The C4.5 system handles discrete values as well as continuous values. To handle an attribute
with continuous values, C4.5 uses a threshold to transform the continuous domain into two
intervals. In other words, for each continuous attribute, C4.5 generates two branches: one
where the value of that attribute is greater than the determined threshold, and the other where the
value is less than or equal to the threshold.
Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by
leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees.
This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is
the number of misclassified examples at a given leaf.
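The Laplace ratio itself is a one-line computation (the subtree-replacement logic of the pruning procedure is omitted in this sketch):

```python
# Laplace error estimate used in C4.5-style pruning: (e + 1) / (n + 2),
# where n is the number of training examples reaching a leaf and
# e is the number of them the leaf misclassifies.
def laplace_error(e, n):
    return (e + 1) / (n + 2)

# A leaf seeing 10 examples and misclassifying 2 of them is estimated
# at (2+1)/(10+2) = 0.25; with no examples the estimate defaults to 0.5.
```

The +1/+2 correction keeps the estimate away from 0 and 1 for leaves covering few examples, which is what makes it usable for comparing small subtrees.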
2.2.2 Building Decision Trees Using Statistics-based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a)
in building decision trees. The method uses the Chi-square statistic to measure the association
between two attributes. When building decision trees, the method is implemented such that it
determines the association between each attribute and the decision classes. The attribute
selected is the one with the greatest value.
To determine the Chi-square value for an attribute, let aij be the number of examples in
class number i where the attribute A takes value number j. In other words, aij is the frequency
of the combination of decision class number i and attribute value number j. The Chi-square
value for attribute A is given by
Chi-square(A) = Σ (i=1..n) Σ (j=1..m) (aij - Eij)² / Eij   (2-7)
where n is the number of decision classes and m is the number of values of a given attribute. Also,
Eij = (TCi × TVj) / T   (2-8)
where TCi and TVj are the total number of examples belonging to the decision class Ci and the total
number of examples where the attribute A takes value vj, respectively, and T is the total number of
examples.
Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of the different
combinations of values between the decision class and both the Outlook and the Windy
attributes. Table 2-4 shows the expected values Eij of the frequencies in Table 2-3
for the different attribute values and decision classes.
To determine the association value between the decision classes and both the attribute Windy
and the attribute Outlook, the observed Chi-square values are:
Chi-square(Windy, Class) = (3-3.9)²/3.9 + (3-2.1)²/2.1 + (6-5.1)²/5.1 + (2-2.9)²/2.9
= 0.21 + 0.39 + 0.25 + 0.25 = 1.1
Chi-square(Outlook, Class) = (2-3.2)²/3.2 + (4-2.6)²/2.6 + (3-3.2)²/3.2 + (3-1.8)²/1.8
+ (0-1.4)²/1.4 + (2-1.8)²/1.8 = 0.45 + 0.75 + 0.2 + 0.8 + 1.4 + 0.02 = 3.62
Applying the same method to the other attributes, the results favor the attribute Outlook.
Once that attribute is selected to be a node in the tree, the remaining set of examples is divided
into subsets, and the same process is repeated on each subset.
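Equations 2-7 and 2-8 can be computed directly from a contingency table. In the sketch below the expected frequencies are kept unrounded, so the results (about 0.93 for Windy and 3.55 for Outlook) differ slightly from the figures above, which use expected values rounded to one decimal; the ranking of the attributes is the same:

```python
def chi_square(table):
    """Eq. 2-7 and 2-8: chi-square association between the decision
    classes (rows) and the values of an attribute (columns)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum((table[i][j] - e) ** 2 / e
               for i, r in enumerate(row_totals)
               for j, c in enumerate(col_totals)
               for e in [r * c / total])       # Eij = TCi * TVj / T

# Contingency tables for Quinlan's weather data; rows: Play, Don't Play.
windy = [[3, 6], [3, 2]]                 # columns: windy true, windy false
outlook = [[2, 4, 3], [3, 0, 2]]         # columns: sunny, overcast, rain
```

As in the text, Outlook shows the stronger association with the decision classes.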
Table 2-5 shows a summary of these criteria and their basic evaluation function
Table 2-5 Attribute selection criteria and their basic evaluation measure
Info Measure (IM), Gain, and Gain Ratio:
  Entropy(S) = - Σi (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)
G-statistic:
  G = 2N × IM   (N = number of examples)
Chi-square:
  Chi-square(A, B) = Σ (i=1..n) Σ (j=1..m) (aij - Eij)² / Eij
2.2.3 Analysis of Attribute Selection Criteria
This subsection briefly introduces the analysis of different selection criteria that was done by
Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision
tree programs. These criteria are the Information Measure (IM), Chi-square, G statistic, Gini
index of diversity, Marshall correction, and Gain Ratio. The overall results show that the Gain
Ratio criterion had the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples
(i.e., examples may belong to more than one decision class) to observe how the selected criteria
evaluate the given attributes. The problem has two decision classes and two attributes, X and
Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The
training examples were unevenly spread between the two values of X. Attribute Y has three
values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the
contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split
provided by the six criteria. Mingers noted that the measures that are not based on information
theory give radiation (attribute X here) less weight. This may be because the zero in the first
row of radiation has a greater influence in the log calculation. In the case of the Chi-square
criterion, the value zero adds the maximum association between any two attributes,
because the Chi-square value of a zero cell is the expected value of this cell.
Now let us consider results from another experiment done by Mingers. In this experiment,
Mingers used four different data sets to generate decision trees using eleven different criteria. In
the final results, he compared the total number of nodes and the total error rate provided by
each criterion over all given problems. Table 2-8 shows the final results for five selected
criteria only.
Table 2-8 Results of comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four problems
This experiment was performed on four real-world data sets. These data are concerned with
profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types
of Iris, and recognizing LCD display digits. The data was divided randomly: 70% for training
and 30% for testing. For more details see Mingers (1989a).
2.3 Learning Decision Structures
Considering the proposed definition of decision structures given above, two related lines of
research are described in this section. Gaines (1994) and Kohavi (1994) proposed two
approaches for generating decision structures that share some of the earlier ideas by Imam and
Michalski (1993b).
In the first approach, Brian Gaines introduces a method for transforming decision rules or
decision trees into exceptional decision structures. The method builds an Exception Directed
Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning
either a rule, say R0, or a conclusion, say C0, or both, to the root node. If it assigns a conclusion
to the root node, it places it on a temporary conclusion list. Then it generates a new child node
and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the
conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if
the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new
conclusion C1, which is satisfied by the rule R0. The same process is repeated until all rules
that have common conditions with the rule at the root are evaluated. The method then creates a
new child node from the root and repeats the process until all rules are evaluated. In the decision
structure, nodes containing only rules represent conditions common to all of their children.
The main disadvantage of this approach is that it requires discriminant rules to build such a
decision structure. Also, such a structure is more complex than the traditional decision trees
that are used for decision-making.
Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure.
The decision structure can be read as follows: it is Safe, except if x1=1 & x2=1 & x3=1 and
either x4=3 or x5=1; then it is Lost, except if x6=1 it is Safe, except if x7=1 it is Lost.
The second approach, introduced by Ronny Kohavi, learns decision structures from examples
using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing
Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a
decision graph where each attribute occurs at most once along any computational path. In other
words, on each path from the root to the leaves of the decision structure, an attribute may occur
as a node at most once. However, there may be more than one node with the same attribute in
the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are
partitioned into a sequence of pairwise disjoint sets, such that outgoing edges from each level
terminate at the next level. An oblivious decision graph is a decision graph where all nodes at
a given level are labeled by the same attribute.
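The read-once property can be stated operationally: no attribute may label two nodes on the same root-to-leaf path, even though an attribute may label several nodes overall. A small sketch of such a check (the tuple encoding of graphs is an assumption for illustration):

```python
# Sketch: verify the read-once property of a decision graph. An
# internal node is a pair (attribute, list of children); a leaf is a
# decision-class label (a string). Shared children model graph nodes
# reached by more than one path.
def read_once(node, seen=frozenset()):
    if isinstance(node, str):                 # leaf: a decision class
        return True
    attribute, children = node
    if attribute in seen:                     # attribute repeated on path
        return False
    return all(read_once(child, seen | {attribute}) for child in children)

# x2 labels two nodes but never appears twice on one path: read-once.
ok_graph = ("x1", [("x2", ["C0", "C1"]), ("x2", ["C1", "C0"])])
bad_graph = ("x1", [("x1", ["C0", "C1"]), "C1"])   # x1 twice on a path
```

Note the check walks every root-to-leaf path, which is exactly how the definition is phrased.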
Safe <= [x1=2]
Safe <= [x2=2]
Safe <= [x3=2]
Safe <= [x4=1] & [x5=2]
Safe <= [x4=1] & [x5=3]
Safe <= [x6=1] & [x7=2]
Safe <= [x6=1] & [x7=3]
Safe <= [x4=2] & [x5=2]
Safe <= [x4=2] & [x5=3]

Lost <= [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a
nondeterministic method to select an attribute to build a new level of the decision structure. This
attribute is removed from the data, and the data is divided into subsets, each corresponding to a
combination of that attribute's values. For each subset the process is repeated until all
the examples of a given subset belong to one decision class. For example, suppose the selected
attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The
data is divided into two subsets: the first subset contains examples where A takes value 0 and
belong to class C0, or takes value 1 and belong to class C1. The second subset contains examples
where A takes value 0 and belong to class C1, or takes value 1 and belong to class C0.
The number of nodes of the first level (after the leaf nodes) is expected to be less than or
equal to k^n, where k is the number of decision classes and n is the number of values of the
selected attribute, and it can increase exponentially before that number is reduced
exponentially to one.
It is easy for the reader to figure out some major disadvantages of such an approach.
The average size of such decision structures is estimated to be very large, especially when there
is no similarity (i.e., strong patterns) or logical relationship in the data. The time used to learn
such a decision structure is relatively very high compared to systems for learning decision trees
from examples. And finally, it could be better to search for an attribute which reduces the number
of generated subsets of the data, instead of nondeterministically selecting an attribute to build a
new level of the decision structure. Kohavi provided a comparison between the C4.5 system
and his system, which can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a
deterministic method for attribute selection which minimizes the width of the penultimate level
of the graph.
Table 2-9 shows a comparison between the proposed approach and those two approaches. The
EDAG and HOODG systems are unreleased prototype systems.
AQDT: decision structures are easy to understand. EDAG: decision structures are difficult to read. HOODG: decision structures are easy to understand.
CHAPTER 3 DESCRIPTION OF THE APPROACH
3.1 General Methodology
In the proposed approach, the function of learning or discovery is separated from the function of
using the discovered knowledge for decision-making. The first function is performed by an
inductive learning program that searches for knowledge relevant to a given class of decisions
and stores the learned knowledge in the form of decision rules. The second function is performed
when there is a need for assigning a decision to new data points in the database (e.g., a
classification decision), by a program that transforms the obtained knowledge into a decision
structure optimized according to the given decision-making situation.
The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.
The Decision-making Task
Given:
- A set of decision rules in conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order, preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.
The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17
(Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of
declarative knowledge are that they do not impose any order on the evaluation of the attributes,
and, due to the lack of order constraints, decision rules can be evaluated in many different
ways, which increases the flexibility of adapting them to the different tasks of decision-making
(Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller
than the number of examples per class, generating a decision structure from decision rules can
potentially be done on-line.
Such virtual decision structures can be tailored to any given decision-making situation. The
needed decision rules have to be generated only once, and then they can be used many times for
generating decision structures according to the changing requirements of decision-making tasks. The
method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures
from decision rules. Decision structures represent a procedural form of knowledge, which makes
them easy to implement but also harder to change. Consequently, decision structures can be quite
effective and useful as long as they are used in decision-making situations for which they are
optimized, and the attributes specified by the decision structure can be measured without much
cost. Figure 3-1 shows an architecture of the proposed methodology.
Figure 3-1 Architecture of the AQDT approach
It is assumed that the database is not static but is regularly updated. A decision-making problem
arises when there is a case, or a set of cases, to which the system has to assign a decision based on
the knowledge discovered. Each decision-making situation is defined by a set of attribute-values;
some attribute-values may be missing or unknown. A new decision structure is obtained such
that it suits the given decision-making problem. The learned decision structure associates the
new set of cases with the proper decisions.
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs
The decision rules are generated from examples by an AQ-type inductive learning system,
specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of
this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions, using
the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology,
called AQ, starts with a seed example of a given decision class and generates a set of the
most general conjunctive descriptions of the seed (alternative decision rules for the seed
example). Such a set is called the star of the seed example. The algorithm selects from the star
a description that optimizes a criterion reflecting the needs of the problem domain. If the
criterion is not defined, the program uses a default criterion that selects the description that
covers the largest number of positive examples (to minimize the total number of rules needed)
and, with second priority, that involves the smallest number of attributes (to minimize the
number of attributes needed for arriving at a decision).
If the selected description does not cover all examples of a given decision class, a new seed is
selected from the uncovered examples, and the process continues until a complete class description
is generated. The algorithm can work with few or many examples, and can
optimize the description according to a variety of easily modifiable hypothesis quality criteria.
The learned descriptions are represented in the form of a set of decision rules expressed in an
attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A
distinctive feature of this representation is that it employs, in addition to standard logic operators,
the internal disjunction operator (a disjunction of values of the same attribute in a condition) and
the range operator (to express conditions involving a range of discrete or continuous values).
These operators help to simplify rules involving multivalued discrete attributes; the second
operator is also used for creating logical expressions involving continuous attributes.
AQ15 can generate decision rules that represent either characteristic or discriminant concept
descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic
description states properties that are true for all objects in the concept. The simplest
characteristic concept description is in the form of a single conjunctive rule (in general, it can be
a set of such rules). The most desirable is the maximal characteristic description, that is, a rule
with the longest condition part, i.e., stating as many common properties of objects of the given
class as can be determined. A discriminant description states properties that discriminate a given
concept from a fixed set of other concepts. The most desirable is the minimal discriminant
description, that is, a rule with the shortest condition part. For example, to distinguish a given
set of tables from a set of chairs, one may only need to indicate that tables have a large flat top.
A characteristic description of the tables would also include properties such as: have four legs,
have no back, have four corners, etc. Discriminant descriptions are usually much shorter than
characteristic descriptions.
Another option provided in AQ15 controls the relationship among the generated descriptions
(rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode,
rulesets of different classes may logically intersect over areas of the description space in which
there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different
classes are logically disjoint. The DC mode descriptions are usually more complex, both in the
number of rules and the number of conditions. There is also a DL mode (a Decision List mode,
also called VL mode, for variable-valued logic mode), in which the program generates rulesets
that are linearly ordered. To assign a decision to an example using such rulesets, the program
evaluates them in order: if ruleset i is satisfied by the example, then the decision is made;
otherwise the program proceeds to the evaluation of ruleset i+1. In IC and DC modes,
rulesets can be evaluated in any order.
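The ordered evaluation of DL mode can be sketched in a few lines; the rule encoding, the ruleset contents, and the fallback value are illustrative assumptions, not AQ15 output:

```python
# Sketch of DL (decision list) mode evaluation: rulesets are tried in
# order, and the first one satisfied by the example gives the decision.
def satisfies(rule, example):
    # rule: {attribute: set of allowed values} (internal disjunction)
    return all(example.get(attr) in values for attr, values in rule.items())

def classify(ordered_rulesets, example, default=None):
    for decision_class, ruleset in ordered_rulesets:
        if any(satisfies(rule, example) for rule in ruleset):
            return decision_class
    return default

# Illustrative ordered rulesets (hypothetical, not from the dissertation):
rulesets = [("A1", [{"x2": {0}}]),
            ("A2", [{"x2": {1}}, {"x1": {2}, "x2": {2}}])]
```

Because evaluation stops at the first satisfied ruleset, later rulesets are implicitly conditioned on the earlier ones failing, which is exactly why DL-mode rulesets cannot be evaluated in arbitrary order.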
Alternatively, the system can use rules from the AQ17-DCI program for Data-driven
Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for
generating additional attributes. These attributes are various logical or mathematical
combinations of the original attributes. The program generates a large number of potential new
attributes and selects from them the most promising, based on an attribute quality criterion.
To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is
shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form
expression) describes a voting record of Democratic Representatives in the US Congress.
Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational
statement. For example, the condition [State = northeast v northwest] states that the attribute
State (of the Representative) should take the value northeast or northwest to satisfy the
condition.
R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] & [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]
Figure 3-2 A ruleset generated by AQ15 for the concept Voting pattern of Democratic Representatives
The above rules were generated from examples of the voting records. For illustration, below is
an example of a voting record of a Democratic representative:
Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, StateFrom = northeast, State population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler corp. = not registered
By expressing the elementary statements in the example as conditions and linking the conditions by
conjunction, the example can be re-expressed as a decision rule. Thus, decision rules and
examples formally differ only in the degree of generality.
3.3 Generating Decision Structures/Trees from Decision Rules
This section describes the AQDT-1 system for learning decision structures (trees) from decision
rules (Imam & Michalski, 1993a,b). Also, a description of the AQDT-2 method for learning task-oriented
decision structures from decision rules is included, and finally the methodology is
illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning
due to their simplicity. Decision trees built this way can be quite efficient as long as they are
used in decision-making situations for which they are optimized, and these situations remain
relatively stable. Problems arise when these situations significantly change and the assumptions
under which the tree was built no longer hold. For example, in some situations it may be
difficult to determine the value of the attribute assigned to some node. One would like to avoid
measuring this attribute and still be able to classify the example, if this is potentially possible
(Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure
the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also
desirable if there is a significant change in the frequency of occurrence of examples from
different classes. A restructuring of a decision tree to suit the above requirements is, however,
difficult to do. The reason for this is that decision trees are a form of decision structure
representation that imposes constraints on the evaluation order of the attributes which are not
logically necessary.
One problem in developing a method for generating decision structures from decision rules is to
design an attribute selection criterion that is based on the properties of the rules rather than of
the training examples. A decision rule normally describes a number of possible examples. Only
some of them are examples that have actually been observed, i.e., training examples. An attribute
selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based
on counting the numbers of training examples covered by each attribute-value and the
frequency of decision classes in the training examples, as is done in learning decision trees from
examples, because the training examples are assumed to be unavailable.
Another problem in learning decision trees from decision rules stems from the fact that decision
rules constitute a more powerful knowledge representation than decision trees. They can directly
represent a description in an arbitrary disjunctive normal form, while decision trees can directly
represent only descriptions in the disjoint disjunctive normal form, in which all
conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary
decision rules into a decision tree, one faces an additional problem of handling logically
intersecting rules.
The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2
system is based on the earlier work by Michalski (1978), which introduced a general method for
generating decision trees from decision rules. The method aimed at producing decision trees
with the minimum number of nodes or the minimum cost (where the cost was defined as the
total cost of classifying unknown examples, given the cost of measuring individual attributes and
the expected probability distribution of examples of different decision classes). More
explanations are provided in the following section.
3.3.1 The AQDT-2 attribute selection method
This section describes the AQDT-2 method for building a decision structure from decision rules.
The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples. The major difference is that it assigns tests
(attributes) to the nodes using criteria based on the properties of the decision rules (including
statistics about the examples covered by each rule, in the case of learning rules from examples),
rather than statistics characterizing the frequency of training examples per decision class, per
attribute-value, or per conjunction of both. Other differences are that the branches may be
assigned an internal disjunction of values (not only a single value, as in a typical decision tree),
and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be
attributes or names standing for logical or mathematical expressions that involve several
attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably
(to distinguish between an attribute and a name standing for an expression, the latter is called a
constructed attribute).
At each step, the method chooses from the available set of tests the test that has the highest utility
(see below) for the given set of decision rules. This test is assigned to the node. The branches
stemming from this node are assigned test values or disjoint groups of values (in the form of a
logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each
branch is associated with a reduced set of rules, determined by removing conditions in which the
selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset
indicate the same decision class, a leaf node is created and assigned this decision class. The
process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further,
because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of
candidate decisions with associated probabilities (see Sec. 4.2).
The test (attribute) utility is a combination of one or more of the following elementary criteria: 1)
cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which
captures the effectiveness of the test in discriminating among decision rules for different decision
classes; 3) importance, which determines the importance of a test in the rules; 4) value
distribution, which characterizes the distribution of the test importance over its set of values; and 5)
dominance, which measures the test presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.
Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, i.e., the
disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and
decision rulesets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote the
sets of the values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm,
respectively. If the ruleset for some class, say Ci, contains a rule that does not involve test A, then
Vi is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is
the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for
Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is
defined by

D(A, Ci, Cj) = 0,  if Vi ⊇ Vj
             = 1,  if Vi ⊂ Vj
             = 2,  if Vi ∩ Vj ≠ ∅ & Vi ∩ Vj ≠ Vi & Vi ∩ Vj ≠ Vj        (3-1)
             = 3,  if Vi ∩ Vj = ∅

where ∅ denotes the empty set. Note that exchanging the second and third conditions of equation
(3-1) may seem to be an improved criterion; however, it would not clearly distinguish between the two
cases (i.e., for both situations the disjointness would be similar). The current equation is better
because it gives higher scores to attributes that classify different subsets of the two decision
classes than to attributes that classify only a subset of one decision class.
Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the
sum of the degrees of class disjointness of each decision class:
Disjointness(A) = Σ(i=1..m) D(A, Ci),  where  D(A, Ci) = Σ(j=1..m, j≠i) D(A, Ci, Cj)        (3-2)
The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are
all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test
values. If two tests have the same disjointness value, the attribute to be selected is the one with
the smaller number of values.
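The computation behind Definition 3-1 and Equation 3-2 can be sketched as follows. This is an illustrative reconstruction, not the AQDT-2 implementation; value sets are represented as plain Python sets:

```python
# Sketch of Definition 3-1 / Equation 3-2. Vi is the set of values of
# test A appearing in the rules of class Ci (the full domain if some
# rule of Ci omits A).

def degree(vi, vj):
    """Degree of disjointness D(A, Ci, Cj) between two value sets."""
    if vi >= vj:                 # Vi is a superset of (or equal to) Vj: 0
        return 0
    if vi < vj:                  # Vi is a proper subset of Vj: 1
        return 1
    if vi & vj:                  # partial intersection: 2
        return 2
    return 3                     # disjoint value sets: 3

def disjointness(value_sets):
    """Disjointness(A): sum of class disjointness over all class pairs."""
    return sum(degree(vi, vj)
               for i, vi in enumerate(value_sets)
               for j, vj in enumerate(value_sets) if i != j)

# Two classes with disjoint value sets score 3 + 3 = 6,
# the maximum 3m(m-1) for m = 2:
print(disjointness([{1, 2}, {3, 4}]))   # -> 6
```

Note how the asymmetric superset/subset scores (0 versus 1) make a strict-subset pair sum to 1 rather than 0, which is what produces the 0, 1, 4, 6 pairwise values discussed in the proof of Theorem 2 below.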
Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree
is defined as the average number of tests (attributes) to be examined from the root of the tree to
any leaf node in order to reach a decision.
Definition: A decision structure is a one-node-per-level decision structure if at each level there is
only one node and zero or more leaves. Such a decision structure can be generated by combining
into one branch all branches whose associated sets of decision rules belong to more than one
decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two
decision classes. The disjointness criterion ranks first the attributes that add the minimum number of
tests to the decision tree.
Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci
and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not
subset; and 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the
same set of values in both classes is a trivial one). Assume that branches leading to one subset
with the same decision class are combined into one branch. In the first case, there will be only two
branches; the first leads to a leaf node, and the other leads to an intermediate node where
another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three
branches should be created. Two branches lead to leaf nodes, where all values at each branch
belong to only one, and a different, decision class. The third branch leads to an intermediate node
where another attribute should be selected that further classifies the decision classes. The minimum
ANT in this case is 6/4 = 3/2. In the third case, only two branches will be generated, each leading
to a leaf node with a different decision class. In this case, the minimum ANT is 1.
Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that if
more than one attribute-value occurs at branches leading to leaves of one decision class, these
branches are combined into one branch in the decision structure. The symbol "1" means
that another attribute is needed to classify the two decision classes; in such cases there will be at least
two additional paths.
Case 1: D(A, Ci) = 0, D(A, Cj) = 1;  Case 2: D(A, Ci) = 2, D(A, Cj) = 2;  Case 3: D(A, Ci) = 3, D(A, Cj) = 3
Figure 3-3: Venn diagrams of the possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is determined
in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks
highest the attributes that reduce the average number of tests required for decision-making. The
theorem can be proved similarly in the general case.
[Figure: three decision trees, one per case, with ANT = 5/3, ANT = 3/2, and ANT = 1, respectively]
"1" means at least one more attribute is needed to complete the decision tree.
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify
the decision classes.
Proof: Suppose that the number of decision classes is m. Assume also that there are two attributes,
A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship
between any two decision classes, this means that there are more decision classes where D(A,
Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness
for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj)
than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci,
Cj), then B classifies the decision classes better than A.
For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute
are 0, 1, 4, or 6. For all positive values of D(B) = 1, 4, or 6, it is clear that attribute B should have a
smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) <
D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.
Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is
a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the
importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained
rules, each test is assigned a score that represents the total number of training examples that
are covered by the rules involving this test. Decision rules learned by an AQ learning program
are accompanied by information on their strength. Rule strength is characterized by the t-weight
and u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of
that class covered by the rule. The importance score of a test is the aggregation of the t-weights
of all rules that contain that test in their condition part. Given a set of decision rules for
m decision classes, C1, ..., Cm, involving n tests, A1, ..., An, and with the number of rules associated
with class Ci denoted by ri, the importance score is defined as follows.
Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by

IS(Aj) = Σ(i=1..m) IS(Aj, Ci)                                   (3-3.1)

where

IS(Aj, Ci) = Σ(k=1..ri) Rik(Aj)                                 (3-3.2)

and Rik(Aj), the weight of a test Aj in the rule Rik of class Ci, is given by

Rik(Aj) = t-weight of Rik,  if Aj belongs to rule Rik
        = 0,                otherwise                           (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
The importance score method has been separately compared, as a feature selection method, with
a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method
produced equal or higher accuracy on three real-world problems than that reported for the
GA method, while selecting fewer attributes. In addition, the IS method was significantly faster
than the GA method.
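The aggregation in Definition 3-3 can be sketched as follows (an illustrative sketch, not the AQDT-2 implementation; the rule representation and all class and attribute names are hypothetical):

```python
# Sketch of Definition 3-3: the importance score of a test is the sum
# of the t-weights of all rules whose condition part mentions it.
# Rules are (condition-dict, t_weight) pairs grouped by class.

def importance_scores(rulesets):
    """Return {attribute: IS(attribute)} aggregated over all classes."""
    scores = {}
    for rules in rulesets.values():
        for conditions, t_weight in rules:
            for attr in conditions:          # each test mentioned in the rule
                scores[attr] = scores.get(attr, 0) + t_weight
    return scores

rulesets = {
    "C1": [({"x1": {2}, "x2": {2}}, 10)],          # rule covers 10 examples
    "C2": [({"x1": {1}}, 5), ({"x2": {3}}, 4)],
}
print(importance_scores(rulesets))   # -> {'x1': 15, 'x2': 14}
```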
Value distribution: The third elementary criterion, value distribution, concerns the number of
legal values of tests. Given two tests with equal importance scores, this criterion prefers the test
with the smaller number of legal values. Experiments have shown that this criterion is especially
useful when using discriminant decision rules.

Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by

VD(Aj) = IS(Aj) / vj                                            (3-5)

where vj is the number of legal values of Aj.
Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large
numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the
given decision classes. Since some conditions in the rules have values linked by internal
disjunction, counting such rules directly would not properly reflect their relevance. Therefore,
for computing the dominance, the rules are counted as if they were converted to rules that do not
have internal disjunction. Such a conversion is done by multiplying out the condition parts of
the rules containing internal disjunction. For example, the condition part [x3=1 v 3]&[x4=1] is
multiplied out to two rules with condition parts [x3=1]&[x4=1] and [x3=3]&[x4=1].
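The multiplying-out step can be sketched as follows (an illustrative sketch under the same hypothetical condition-dict representation, not the AQDT-2 implementation):

```python
# Sketch of the dominance computation: rules with internal disjunction
# are "multiplied out" before counting, so [x3=1 v 3]&[x4=1] counts as
# two rules, [x3=1]&[x4=1] and [x3=3]&[x4=1].
from itertools import product

def expanded_rule_count(rule):
    """Number of single-value rules a condition-dict multiplies out to
    (the product of the sizes of its internal disjunctions)."""
    return len(list(product(*(values for values in rule.values()))))

def dominance(attr, rules):
    """Total multiplied-out rules whose condition part mentions attr."""
    return sum(expanded_rule_count(r) for r in rules if attr in r)

rules = [{"x3": {1, 3}, "x4": {1}},   # multiplies out to 2 rules
         {"x4": {2}}]                 # already a single rule
print(dominance("x3", rules))   # -> 2
print(dominance("x4", rules))   # -> 3
```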
The above criteria are combined into one general test measure using the lexicographic
evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of
the above elementary criteria, each associated with a "tolerance threshold" expressed in percent. The
criteria are applied to tests in the order defined by the LEF. A test passes to the next criterion only if
it scores on the previous criterion within the range defined by the tolerance (from the top value).
The default LEF is

<Cost, t1; Disjointness, t2; Importance, t3; Value distr., t4; Dominance, t5>        (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percent); their default values are 0%.
The default value of the cost of each test is 1.
The above LEF ranks attributes in the following way. First, attributes are evaluated on the basis of their cost.
If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two
or more attributes still share the same top score, or their scores differ by less than the assumed
tolerance threshold t2, the method evaluates these attributes using the next (Importance)
criterion. If again two or more attributes share the same top score, or their scores differ by less than
the tolerance threshold t3, then the value distribution criterion (the normalized IS) is used, and then,
similarly, the fourth criterion (dominance). If there is still a tie, the method selects the best attribute
randomly.
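The filtering behavior of a LEF can be sketched as follows. This is a hypothetical illustration of lexicographic evaluation with tolerances, not AQDT-2 code; the scores and attribute names are invented, and scores are assumed to be "higher is better" (a cost criterion would be negated first):

```python
# Sketch of a lexicographic evaluation functional (LEF) with tolerances:
# each criterion narrows the candidates to those within the tolerance of
# the best score before the next criterion is applied.

def lef_select(candidates, criteria):
    """criteria: list of (score_function, tolerance) pairs; tolerance is
    a fraction of the top score (e.g. 0.10 for a 10% tolerance)."""
    for score, tol in criteria:
        best = max(score(c) for c in candidates)
        candidates = [c for c in candidates
                      if score(c) >= best - abs(best) * tol]
        if len(candidates) == 1:
            break
    return candidates[0]          # ties after all criteria: pick any

disjointness = {"x1": 11, "x2": 10, "x3": 6}
importance   = {"x1": 20, "x2": 30, "x3": 25}
# With a 10% tolerance on disjointness, x1 and x2 both survive the
# first criterion, and importance then prefers x2:
best = lef_select(["x1", "x2", "x3"],
                  [(disjointness.get, 0.10), (importance.get, 0.0)])
print(best)   # -> x2
```

With all tolerances at 0 (the default described above), each criterion keeps only the exact top scorers, so later criteria act purely as tie-breakers.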
If there is a non-uniform frequency distribution of examples of different classes, then the
selection criterion uses a modified definition of the disjointness. Namely, the previously defined
disjointness for each class is multiplied by the frequency of the class occurrence. The class
occurrence is the expected number of future examples that are to be classified to a given class:

Disjointness(A) = Σ(i=1..m) D(A, Ci) * Frq(Ci)                  (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, and it is assumed to be given
by the user. The attribute ranking criterion in this case is defined by the LEF

<Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>        (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other
elementary criteria are treated the same way as in the default version.
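The frequency-weighted disjointness of Equation 3-7 can be sketched as follows (an illustrative sketch, not AQDT-2 code; `degree()` implements the 0/1/2/3 cases of Definition 3-1, and the value sets and frequencies are invented):

```python
# Sketch of Equation 3-7: class disjointness weighted by the expected
# class frequencies supplied by the user.

def degree(vi, vj):
    """D(A, Ci, Cj) per Definition 3-1."""
    if vi >= vj:
        return 0
    if vi < vj:
        return 1
    return 2 if vi & vj else 3

def weighted_disjointness(value_sets, freqs):
    """Disjointness(A) = sum over classes of Frq(Ci) * D(A, Ci)."""
    return sum(freqs[i] * degree(vi, vj)
               for i, vi in enumerate(value_sets)
               for j, vj in enumerate(value_sets) if i != j)

# Two disjoint classes, the first expected three times as often:
print(weighted_disjointness([{1}, {2}], [3, 1]))   # -> 12
```

The weighting biases attribute selection toward tests that separate the frequently occurring classes early, shortening the expected classification path.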
3.3.2 The AQDT-2 algorithm
The AQDT-2 algorithm constructs a decision structure from decision rules by recursively
selecting, at each step, the best test according to the ranking criteria described above and
assigning it to a new node. The process stops when the algorithm creates terminal branches that
are assigned decision classes. To facilitate this process, the system creates a special data
structure for each concept description (ruleset). This structure has fields such as the number of
rules, the number of decision classes, and the number of attributes present in the rules. A set of
pointers connects this data structure to a set of data structures, each representing one decision class.
The decision class structure contains fields with information on the number of rules belonging to
that class, the frequency of the decision class, etc. It is also connected to a set of data structures
representing the decision rules within each decision class. The system independently creates a set
of data structures, each corresponding to one attribute. Each attribute description contains the
attribute's name, domain, type, the number of legal values, a list of the values, the number of
rules that contain that attribute, and the values of that attribute for each rule. The attributes are
arranged in an array in lexicographic order: first, in the descending order of the number of rules
that contain that attribute, and second, in the ascending order of the number of the attribute's
legal values.
The system can work in two modes. In the standard mode, the system generates standard
decision trees, in which each branch has a specific attribute-value assigned. In the compact
mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this
leads to simpler structures. For example, if a node assigned attribute A has a branch marked by
values 1 v 2, then the control passes along this branch whenever A takes value 1 or 2. The
program creates "or" branches on the basis of the analysis of the value sets Vi while computing
the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or
mathematical combinations of the original attributes. To produce decision structures with
derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The
AQ17 rules may contain conditions involving attributes constructed by the program, rather than
only those originally given.
To generate decision structures from rules, the AQDT-2 method prefers either characteristic or
discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules
are more suitable for building decision structures. Assume that the description of each class is in
the form of a ruleset and that this set is the initial ruleset context. The AQDT algorithm is:
The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking
measure. Select the highest-ranked attribute; let A represent this attribute.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch), and
assign to it the attribute A. In standard mode, create as many branches from the node as
there are legal values of the attribute A, and assign these values to the branches. In
compact mode (decision structures), create as many branches as there are disjoint value
sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it a group of rules from the ruleset context that contain a
condition satisfied by the value(s) assigned to this branch. For example, if a branch is
assigned value i of attribute A, then associate with it all rules containing the condition [A =
i v ...]. If a branch is assigned values i v j, then associate with it all rules containing the
condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in
the ruleset context that do not contain attribute A, add these rules to all rule groups
associated with the branches stemming from the node assigned attribute A. (This step is
justified by the consensus law: [x=1] == [x=1] & [y=a] v [x=1] & [y=b], assuming
that a and b are the only legal values of y.) All rules associated with the given branch
constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf
node and assign to it that class. If all branches of the tree have leaf nodes, stop;
otherwise, repeat steps 1 to 4 for each branch that has no leaf.
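Steps 1 to 4 can be condensed into the following sketch of the standard mode. This is hypothetical code, not the AQDT-2 implementation: a rule is a pair of a class label and a dict mapping each attribute to its set of allowed values, `domains` maps attributes to their legal values, and `select_attribute` stands in for the LEF ranking:

```python
# Condensed sketch of Steps 1-4 (standard mode). select_attribute()
# abstracts the LEF-based attribute ranking of Step 1.

def build_tree(rules, domains, select_attribute):
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:                       # Step 4: single class -> leaf
        return classes.pop()
    attr = select_attribute(rules)              # Step 1: best test
    node = {}
    for value in domains[attr]:                 # Step 2: one branch per value
        branch_rules = []
        for cls, cond in rules:                 # Step 3: reduce the context
            if attr not in cond:                # consensus law: keep the rule
                branch_rules.append((cls, cond))
            elif value in cond[attr]:           # condition satisfied: remove it
                reduced = {a: v for a, v in cond.items() if a != attr}
                branch_rules.append((cls, reduced))
        if branch_rules:
            node[value] = build_tree(branch_rules, domains, select_attribute)
    return (attr, node)

rules = [("A", {"x": {1}}), ("B", {"x": {2}})]
tree = build_tree(rules, {"x": [1, 2]}, lambda rs: "x")
print(tree)   # -> ('x', {1: 'A', 2: 'B'})
```

The sketch omits the compact-mode value-set branches, the unavailable-attribute leaves with decision probabilities, and the bookkeeping data structures described above.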
To select an attribute to become a node of the decision tree (steps 1 and 2 of the algorithm), the
algorithm performs two independent iterations. In the first iteration, it parses through all decision
rules and determines information about each attribute. This information includes the importance
score of each attribute, the number of rules containing a given attribute, the disjoint value sets of
each attribute, and the attribute values used in describing each decision class. The second
iteration is performed only if the disjointness criterion is ranked first in the LEF. The
second iteration evaluates each attribute's disjointness for each decision class against the other
decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions
formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is
the total number of decision rules (in all decision classes):

r = Σ(i=1..m) Ri   (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be
determined as

Cmpx(Iter1) = O(r * s)

In the second iteration, the disjointness is calculated between the decision classes and all
attributes. The complexity of the second iteration can be given by

Cmpx(Iter2) = O(n * m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at
this node and the number of rules associated with this node:

l = max {m, r}                                                  (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm
for building one node in the decision tree, say the Node Complexity NC(AQDT), is given by

NC(AQDT) = O(l * n)

Usually, l equals the number of rules associated with the given node. Thus, the AQDT
complexity for building one node is a function of the number of attributes multiplied by the
number of rules associated with this node.
At each level of the decision tree, these two iterations are repeated for each non-leaf node. The
Level Complexity of the AQDT algorithm, LC(AQDT), can be given by

LC(AQDT) < O(l * n)

which is less than the complexity of generating the root of the decision tree. To explain this,
consider that the maximum possible number of non-leaf nodes at one level is half the number of the
initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at
the lowest level each node classifies only two rules, each belonging to a different decision class. This
decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level
will be twice or more the number of non-leaf nodes at the previous level. Consider the level
complexity of the AQDT algorithm to be equal to (l * s * q), where q is the number of non-leaf
nodes at the given level. In such cases, either (l * q ≤ r) or (l * s < r) holds. In Figure 3-5-a, the
complexity of the AQDT algorithm at any lower level is given by

LC(AQDT) = O(2 * s * r/2) < O(n * l) = NC(AQDT)
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes: a) per one level; b) per one path
Note also that after an attribute is selected to be the root of the decision structure, this attribute
and all conditions containing it are removed from the data structure of the algorithm.
Also, if a leaf node is generated, all rules belonging to the corresponding branch will not be
tested again.
Since the disjointness criterion selects the attribute which minimizes the average number of tests,
ANT, the AQDT algorithm generates decision trees with the least number of levels.
The number of levels of a decision tree is expected to be less than or equal to the minimum of
the number of attributes and the number of rules. Consider k as the number of levels in a
given decision tree:

k ≤ min {n, r}                                                  (3-10)
Two cases represent the most complex situations: Figure 3-5-a and Figure 3-5-b. In the first
case, where the decision rules are divided evenly, the number of levels will be a function of the
logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for
generating a decision tree from a set of decision rules is given by

Complexity(AQDT) = O(l * n * log r)                             (3-11)
The other situation is when the generated decision tree has the maximum number of levels. The
maximum possible number of levels of a decision tree equals one less than the number of
decision rules (Figure 3-5-b). Using the disjointness criterion, it is not likely to obtain such a decision
tree, because it has the maximum average number of tests (ANT) that can be determined from the
same set of nodes and leaves. However, such a decision tree can be generated if the number of
decision classes is one less than the number of attributes. In this case, any disjoint decision rules
should have a maximum length that is less than or equal to the floor of the logarithm of the number of
attributes. Thus, the level complexity of this decision tree is estimated as

LC(AQDT) = O(l * log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT
algorithm in such cases is given by

Complexity(AQDT) = O(l * k * log n)                             (3-12)
Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12)
r = n+1, log r is greater than log n. From (3-10), (3-11), and (3-12), the complexity of the AQDT
algorithm is determined by

Cmplx(AQDT) = O(r * k * log l)                                  (3-13)
3.3.3 An example illustrating the algorithm
The following simple example illustrates how AQDT is used in selecting the optimal set of
testing resources for testing software. Suppose there are three tools for testing software: 1)
modeling (T1); 2) checklist (T2); and 3) par_simul (T3). Also assume that there are four
different factors that affect the selection of any tool: 1) the cost of using the tool (x1); 2) the
metrics that support the tool (x2); 3) the best phase for applying the tool (x3); and 4) the type of tool
(automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible
values.
Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6
shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the selection process
T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software
These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is
supported by the requirement metric.
Rule 2: Use the first tool for testing if you can afford high cost for testing, either
in the requirement or the analysis phase, and you need an automated tool.
Rule 3: Use the second tool for testing if the cost limit is low or average and the
tool is supported by the system usage metric or the intractability metric.
Rule 4: Use the second tool for testing if you can afford high cost for testing,
either in the requirement or the design phase, and you need a manual tool.
Rule 5: Use the third tool for testing if you are limited to low cost and the
tool is supported by the error rate metric.
Rule 6: Use the third tool for testing if you can afford very high cost for testing,
either in the requirement or the system usage phase, and you need a semi-automated tool.
Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.
The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.
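The disjointness computation described above can be reproduced in a few lines. The sketch below is illustrative (the rule encoding and helper names are mine, not AQDT-2's actual code); it assumes the 0-3 pairwise scoring of the disjointness criterion (0 if the first class's value set contains the second's, 1 if it is a proper subset, 2 if the two sets overlap, 3 if they are disjoint), summed over all ordered pairs of classes. On the rules of Figure 3-6 it reproduces the disjointness of 11 for x1.

```python
DOMAINS = {"x1": {1, 2, 3, 4}, "x2": {1, 2, 3, 4}, "x3": {1, 2, 3}, "x4": {1, 2, 3}}

# Decision rules from Figure 3-6: each rule maps an attribute to the set of
# values it allows; an absent attribute is unconstrained.
RULES = {
    "T1": [{"x1": {2}, "x2": {2}}, {"x1": {3}, "x3": {1, 3}, "x4": {1}}],
    "T2": [{"x1": {1, 2}, "x2": {3, 4}}, {"x1": {3}, "x3": {1, 2}, "x4": {2}}],
    "T3": [{"x1": {1}, "x2": {1}}, {"x1": {4}, "x3": {2, 3}, "x4": {3}}],
}

def class_value_set(attr, rules):
    """Union of values attr takes in the ruleset; a rule that omits attr is
    treated as containing [attr = a v b v ...] over the whole domain."""
    values = set()
    for rule in rules:
        values |= rule.get(attr, DOMAINS[attr])
    return values

def disjointness(attr):
    """Attribute disjointness: sum of 0-3 scores over ordered class pairs."""
    vsets = {c: class_value_set(attr, rs) for c, rs in RULES.items()}
    total = 0
    for ci, vi in vsets.items():
        for cj, vj in vsets.items():
            if ci == cj:
                continue
            if vi >= vj:          # Vi contains (or equals) Vj
                total += 0
            elif vi < vj:         # Vi is a proper subset of Vj
                total += 1
            elif vi & vj:         # overlap, neither contains the other
                total += 2
            else:                 # disjoint value sets
                total += 3
    return total

print({a: disjointness(a) for a in DOMAINS})
```

With these rules, x1 scores 11 while x2, x3 and x4 score 0 (their class value sets all expand to the full domain), so x1 is ranked first, as in the text.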
From the rules in Figure 3-6 we can also determine the disjoint groupings of attribute values used for the compact mode of the algorithm. This is done as follows: 1) determine, for each attribute, the sets of values that the attribute takes in the individual decision rules, and remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1,2}, {1} and {4} (Figure 3-6). Value set {1,2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2 the value sets are {1}, {2}, {3,4} and {1,2,3,4}. In this case, branches are assigned the value sets {1}, {2} and {3,4}.
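This subsumption-removal step can be sketched directly. The helper name below is mine; the value sets are the ones just listed for x1 and x2 (a rule lacking the attribute contributes the full domain):

```python
def prune_subsumed(value_sets):
    """Drop any value set that strictly contains another one (it subsumes it)."""
    sets = [set(s) for s in value_sets]
    return [s for s in sets if not any(other < s for other in sets)]

# x1's value sets from the individual rules of Figure 3-6:
print(prune_subsumed([{2}, {3}, {1, 2}, {1}, {4}]))      # {1,2} is removed
# x2's value sets (a rule lacking x2 contributes the full domain {1,2,3,4}):
print(prune_subsumed([{2}, {1, 2, 3, 4}, {3, 4}, {1}]))  # {1,2,3,4} is removed
```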
Attribute x1 ranks highest (it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 is ended by a leaf T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools can be used for testing given software.
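The expansion loop just described can be sketched as follows. This is a simplified, hypothetical reconstruction (single-value branches only, attribute choice by disjointness alone, without the compact-mode value grouping or the full LEF), intended only to show the control flow; on the Figure 3-6 rules it selects x1 for the root and closes the branch x1=4 with the leaf T3:

```python
DOM = {"x1": {1, 2, 3, 4}, "x2": {1, 2, 3, 4}, "x3": {1, 2, 3}, "x4": {1, 2, 3}}

# The Figure 3-6 rules as (class, {attribute: allowed values}) pairs;
# an absent attribute is unconstrained.
RULES = [
    ("T1", {"x1": {2}, "x2": {2}}),
    ("T1", {"x1": {3}, "x3": {1, 3}, "x4": {1}}),
    ("T2", {"x1": {1, 2}, "x2": {3, 4}}),
    ("T2", {"x1": {3}, "x3": {1, 2}, "x4": {2}}),
    ("T3", {"x1": {1}, "x2": {1}}),
    ("T3", {"x1": {4}, "x3": {2, 3}, "x4": {3}}),
]

def disjointness(attr, rules):
    """Sum the 0-3 pairwise scores over the per-class value sets of attr."""
    vsets = {}
    for cls, rule in rules:
        vsets.setdefault(cls, set()).update(rule.get(attr, DOM[attr]))
    score = 0
    for ci, vi in vsets.items():
        for cj, vj in vsets.items():
            if ci != cj:
                score += 0 if vi >= vj else 1 if vi < vj else 2 if vi & vj else 3
    return score

def grow(rules, attrs):
    """Recursively expand nodes until each branch's rules share one class."""
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:
        return classes.pop()                      # leaf: a single decision class
    best = max(attrs, key=lambda a: disjointness(a, rules))
    node = {"attr": best}
    for value in DOM[best]:
        subset = [(c, r) for c, r in rules if value in r.get(best, DOM[best])]
        if subset:                                # skip branches matching no rule
            node[value] = grow(subset, attrs - {best})
    return node

tree = grow(RULES, set(DOM))
```

Because branches are not grouped, this sketch produces more leaves than the compact-mode structure of Figure 3-7, but the root and the single-class branches agree with the text.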
Complexity: No. of nodes: 4, No. of leaves: 7
Figure 3-7. A decision structure learned for classifying software testing tools (root: x1)
Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each representing one combination of attribute values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3 and 4). Rules are represented by collections of cells in the intersection of the rows and columns corresponding to the conditions in the rules.
The shaded areas correspond to decision rules. Rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute values not assigned to any class. For illustration, collections of cells corresponding to some of the initial rules are marked R11, R21, R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].
Figure 3-8. a) Decision rules; b) Derived decision tree
Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to know which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1 the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1 the recommended tool can be either T1 or T2. However, for the value 3 of x1 one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.
Figure 3-9. Decision trees learned ignoring: a) the supporting metric; b) the type of the testing tool
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.
Figure 3-10. A decision tree learned without the cost attribute (root: x4)
3.4 Tailoring Decision Structures to the Decision-Making Situation
Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This report presents an approach to building such task-oriented decision structures which advocates that they are built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm was developed and implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.
Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.
3.4.1 Learning Cost-Dependent Decision Structures
As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation, involving the other elementary criteria. If an attribute has high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.
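The LEF (lexicographic evaluation functional with tolerances) filtering can be sketched as below. This is an illustrative reconstruction, not AQDT-2's code; in particular, normalizing the tolerance by the score range at each stage is my assumption, and the cost values (and all disjointness scores except x1's 11, which is from the text) are made up for the example:

```python
def lef_select(candidates, criteria):
    """LEF sketch: apply (score_fn, tolerance, maximize) criteria in order,
    keeping at each stage only the candidates whose score lies within the
    tolerance (a fraction of the current score range) of the best score."""
    pool = list(candidates)
    for score_fn, tol, maximize in criteria:
        scores = {c: score_fn(c) for c in pool}
        best = max(scores.values()) if maximize else min(scores.values())
        span = (max(scores.values()) - min(scores.values())) or 1
        pool = [c for c in pool if abs(scores[c] - best) <= tol * span]
        if len(pool) == 1:
            break            # a single attribute survived; later criteria unused
    return pool[0]

# Hypothetical attribute costs and illustrative disjointness scores.
cost = {"x1": 1, "x2": 9, "x3": 1, "x4": 1}
disj = {"x1": 11, "x2": 0, "x3": 0, "x4": 5}
# Cost first with tolerance 0 (only the cheapest pass), then disjointness.
pick = lef_select(cost, [(cost.get, 0.0, False), (disj.get, 0.0, True)])
```

With tolerance 0 on cost, the expensive x2 is filtered out before disjointness is even consulted; making x1 expensive as well would push the choice to the next-best surviving attribute.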
3.4.2 Assigning Decisions Under Insufficient Information
In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained, but a decision has to be made, it is useful to know the probability distribution over the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have
P(Ci | b1, ..., bk) = P(Ci) P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)
where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has the attribute values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from the different classes, we have
P(Ci) = twi / Σj=1..m twj    (3-10)
P(b1, ..., bk | Ci) = wi / twi    (3-11)
P(b1, ..., bk) = Σj=1..m wj / Σj=1..m twj    (3-12)
By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain
P(Ci | b1, ..., bk) = wi / Σj=1..m wj    (3-13)
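The chain (3-9) through (3-13) can be checked numerically. The sketch below (illustrative code, not part of the thesis) estimates the class probabilities through the intermediate quantities and confirms they reduce to wi / Σj wj; the example counts are the wi/twi frequencies reported for node x2 of the wind bracing structure in Section 4.2:

```python
def class_probabilities(w, tw):
    """Estimate P(Ci | b1, ..., bk) at a node from wi (training examples of
    Ci that reached the node) and twi (all training examples of Ci)."""
    total_tw = sum(tw.values())
    prior = {c: tw[c] / total_tw for c in tw}               # equation (3-10)
    likelihood = {c: w[c] / tw[c] for c in tw}              # equation (3-11)
    evidence = sum(w.values()) / total_tw                   # equation (3-12)
    # Bayes (3-9); algebraically this reduces to wi / sum_j wj, i.e. (3-13).
    return {c: prior[c] * likelihood[c] / evidence for c in w}

# Counts reported for node x2 of the wind bracing structure (Section 4.2).
p = class_probabilities({"C1": 31, "C2": 11, "C3": 0, "C4": 5},
                        {"C1": 45, "C2": 139, "C3": 169, "C4": 5})
# p["C1"] ~ .66, p["C2"] ~ .23, p["C3"] = 0, p["C4"] ~ .11
```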
A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.
3.4.3 Coping with Noise in Training Data
The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree, and thus cannot freely choose the attributes to prune). Examples are presented in Section 4.
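A minimal sketch of this truncation step follows. It is illustrative, and it assumes a class's total coverage is the sum of its rules' t-weights (exact for disjoint covers); the t-weights are those of the seven C1 rules in Figure 4-2, for which a 10% threshold leaves only the first rule, as in the Section 4.2 experiment:

```python
def truncate_rules(rules_by_class, noise_level):
    """Drop every rule whose t-weight is at or below noise_level of its
    class's total coverage (taken here as the sum of the class's t-weights)."""
    kept = {}
    for cls, rules in rules_by_class.items():
        total = sum(t for _, t in rules)
        kept[cls] = [(name, t) for name, t in rules if t / total > noise_level]
    return kept

# t-weights of the seven C1 rules in Figure 4-2 (total coverage 31).
c1_rules = [("r1", 18), ("r2", 3), ("r3", 2), ("r4", 2),
            ("r5", 2), ("r6", 2), ("r7", 2)]
kept = truncate_rules({"C1": c1_rules}, 0.10)   # 10% expected noise level
# Only r1 (18/31, about 58%) survives; r2 (3/31, about 9.7%) and the rest go.
```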
3.5 Analysis of the AQDT-2 Attribute Selection Criteria
This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes. The problem has ambiguous examples (i.e., examples belonging to more than one decision class).
In the first problem, the following are the disjoint rules learned by AQ15c from the given data:
Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity <= 75]
Play <= [outlook = rain] & [windy = false]
For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.
Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute for the root of the tree. The goal of this experiment is first to
select the correct attribute, and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all the other criteria did.
Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.
The table shows that the disjointness criterion outperformed all other criteria in Table 2-6
including the gain ratio criterion when it was applied to both the original examples and the rules
learned from these examples. It was clear that neither the importance score nor the value distribution criterion would perform better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria were applied to the learned rules, however, they provided very good results.
The disjointness criterion selects the attribute that best discriminates between the decision
classes The importance criterion gives the highest score to the attribute that appears in rules
covering the largest number of examples The value distribution criterion ranks first the attribute
which has the maximum balanced appearance of its values in different rules The dominance
criterion prefers attributes that exist in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.
In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
3.6 Decision Structures vs. Decision Trees
This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex, and take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.
Table 3-7. A comparison between decision structures and decision trees
Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11; both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes. This decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.
a) Using the disjointness criterion (root: x5; P = Positive, N = Negative; No. of nodes: 5)
b) Using the importance score criterion (root: x1; P = Positive, N = Negative; No. of nodes: 7, No. of leaves: 9)
Figure 3-11. Decision structures learned by AQDT-2 using different criteria
To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called "Imam's example", that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example rests on the fact that the information-based criteria depend on the frequency of the training examples per decision class and on the frequency of the training examples over the different values of a given attribute.
The concept to learn is: P if x1 = x2, and N otherwise.
The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "-" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.
Figure 3-12. Imam's example, where learning decision structures (trees) from rules is better than learning them from examples: a) training examples; b) the optimal decision tree
AQ15c learned the following rules from this data:
P <= [x1=1] & [x2=1] v [x1=2] & [x2=2]
N <= [x1=1] & [x2=2] v [x1=2] & [x2=1]
From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
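The failure of information gain on this kind of concept is easy to verify. The snippet below is an illustration, not the exact Figure 3-12 dataset (whose x3/x4 frequencies are deliberately skewed): it computes the information gain of x1 and x2 for the XOR-style concept "P if x1 = x2"; both gains are exactly zero, so a gain-based criterion has no reason to place either attribute at the root:

```python
from math import log2
from itertools import product

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(examples, attr):
    """Entropy of the whole set minus the weighted entropy of each split."""
    labels = [e["class"] for e in examples]
    gain = entropy(labels)
    for v in {e[attr] for e in examples}:
        part = [e["class"] for e in examples if e[attr] == v]
        gain -= len(part) / len(examples) * entropy(part)
    return gain

# The XOR-style concept: P when x1 = x2, N otherwise.
examples = [{"x1": a, "x2": b, "class": "P" if a == b else "N"}
            for a, b in product([1, 2], repeat=2)]
print(info_gain(examples, "x1"), info_gain(examples, "x2"))  # both 0.0
```

Each value of x1 (or x2) still splits the examples 50/50 between P and N, so no information is gained; rule-based criteria such as disjointness do not suffer from this, because the learned rules make the x1-x2 dependency explicit.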
An example of a problem for which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a; Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:
P <= [x1=2] v [x2=2]
N <= [x1=1] & [x2=1 v 3] v [x1=3] & [x2=1 v 3]
Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10:9); 2) comparing the average number of tests required to make a decision (13n for the decision tree and 85 for decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).
When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using the new attribute "x1=2 v x2=2" with values 0 for "no" and 1 for "yes".
Figure 3-13. An example where decision rules are simpler than decision trees: a) the training data; b) the correct decision tree
CHAPTER 4 Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.
The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot easily be described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.
In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
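The sampling protocol just described can be sketched as follows (illustrative code; the function and parameter names are my own). For each relative training size, it draws the stated number of random training samples, and uses the complement of each sample as the test set:

```python
import random

def learning_curve_splits(n_examples, fractions, samples=100, seed=0):
    """For each relative training size, draw `samples` random index sets;
    the complement of each training set becomes the test set."""
    rng = random.Random(seed)
    for frac in fractions:
        k = round(frac * n_examples)
        for _ in range(samples):
            chosen = set(rng.sample(range(n_examples), k))
            train = sorted(chosen)
            test = [i for i in range(n_examples) if i not in chosen]
            yield frac, train, test

# E.g., for the 335-example wind bracing data at the 10% training size,
# each split has 34 training and 301 complementary testing examples.
splits = list(learning_curve_splits(335, fractions=(0.1,), samples=3))
```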
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushrooms, Breast Cancer, Congressional Voting, and the East-West Trains) was used for additional testing and comparison with other systems.
Figure 4-1 shows a description of the planned experiments on the first set of problems. The best settings (the best path from top to bottom), in terms of accuracy, time and complexity, were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times, with different sets, of different sizes, of training examples.
Figure 4-1. Design of a complete experiment
For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. The analysis of some experiments included visualization of the training examples and of the concepts learned by each learning program (AQ15c, AQDT-2 and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different, but equivalent, decision structures learned for a given set of training examples.
For each problem (i.e., database), 9 different relative sample sizes of training examples are selected (10%, ..., 90%). 100 random samples of each size are drawn from the original data for training; the 100 sample complements which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).
162 different parametrical experiments per training dataset (18 x 9)
16,200 experiments / one sample size (9 samples)
145,800 experiments / first portion of a problem (AQ15c followed by AQDT-2)
199,800 experiments / problem (first portion + C4.5 + constructive induction)
999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
73 days (estimated running time)
The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsequent subsection describes a partial or full experimental analysis of one of the other problems.
4.2 Experiments with Average-Size, Complex and Noise-Free Problems: Wind Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3) and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision
61
structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.
Decision class C1:
1. [x1=1][x6=1][x2=1 v 2][x3=1 v 2][x4=1 v 3][x5=1 v 2][x7=1 v 3] (t:18, u:18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1 v 3][x7=1 v 3 v 4] (t:3, u:3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2 v 3] (t:2, u:2)
4. [x1=1][x6=1][x2=2][x3=1 v 2][x4=3][x5=1 v 2][x7=4] (t:2, u:2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1 v 2] (t:2, u:2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1 v 3][x7=1 v 3][x5=3] (t:2, u:2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1 v 2][x4=3][x7=4] (t:2, u:2)
Decision class C2:
1. [x1=2 v 4][x2=1 v 2][x3=1 v 2][x4=3][x5=2 v 3][x6=1][x7=2 v 3] (t:28, u:19)
2. [x1=2 v 4][x2=2][x3=1 v 2][x4=3][x5=1 v 2][x6=1][x7=3 v 4] (t:17, u:6)
3. [x1=2 v 4][x2=1 v 2][x3=1 v 2][x4=3][x5=1][x6=1][x7=3 v 4] (t:10, u:4)
4. [x1=1 v 3 v 5][x2=1 v 2][x3=1 v 2][x4=3][x5=3][x6=1][x7=2 v 4] (t:10, u:2)
5. [x1=3 v 5][x2=1 v 2][x3=1 v 2][x4=3][x5=2 v 3][x6=1][x7=1 v 4] (t:9, u:4)
6. [x1=2][x2=1 v 2][x3=1 v 2][x5=1 v 2 v 3][x4=1][x6=1][x7=1] (t:7, u:6)
7. [x1=3 v 4][x2=2][x3=2][x4=1 v 3][x5=1 v 3][x6=1][x7=1 v 2] (t:6, u:4)
8. [x1=3 v 5][x2=2][x3=1][x7=1][x4=1 v 2][x5=1 v 2 v 3][x6=1 v 3] (t:5, u:5)
9. [x1=1][x2=1][x6=1][x3=1 v 2][x4=3][x5=1 v 2][x7=4] (t:4, u:4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1 v 2][x7=1 v 3] (t:4, u:4)
11. [x1=1 v 2][x2=1][x6=1][x3=1 v 2][x4=1 v 3][x5=3][x7=1 v 4] (t:4, u:2)
Decision class C3:
1. [x1=2 v 5][x2=1 v 2][x3=1 v 2][x7=1 v 4][x4=1 v 2][x5=1 v 3][x6=2 v 4] (t:41, u:32)
2. [x1=1 v 4][x2=1 v 2][x3=1 v 2][x4=2][x5=2][x6=2 v 3][x7=2 v 4] (t:27, u:20)
3. [x1=1 v 3][x2=1][x3=1 v 2][x7=1 v 4][x4=2][x5=1 v 2][x6=2 v 3] (t:19, u:6)
4. [x1=1 v 2 v 4][x2=1 v 2][x3=1 v 2][x4=2][x5=2 v 3][x6=3 v 4][x7=1] (t:13, u:8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1 v 2][x6=3][x7=2 v 4] (t:5, u:5)
Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1 v 3][x5=1][x6=1][x7=1 v 4] (t:4, u:4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t:1, u:1)
Figure 4-2. Decision rules determined by AQ15c from the wind bracing data
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by values of x6 (in general, they could be groups of values), according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for a branch until all rules assigned to each branch are of the same class. That class is then assigned to the leaf.
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 were misclassified).
Since all rules containing [x6=4] belong to class C3, the branch marked by 4 is ended by a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated, only for those rules which contain [x6=1] as one of their conditions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.
For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some misclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatches.
Complexity: No. of nodes: 17, No. of leaves: 43
Figure 4-3. A decision tree learned by C4.5 for the wind bracing data (root: x6)
Figure 4-4 shows a decision structure learned, in the default setting of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples results in 102 examples matched correctly and 13 examples mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the (indefinite) "?" decision. These leaves can be replaced by sets of candidate decisions with their corresponding probability distribution.
Complexity: No. of nodes: 5, No. of leaves: 9
Figure 4-4. A decision structure learned from the AQ15c wind bracing rules
Complexity: No. of nodes: 6, No. of leaves: 8
Figure 4-5. A decision structure that does not contain attribute x1
Figure 4-6 presents the decision structure from Figure 4-5 in which the leaves were assigned candidate decisions with decision class probability estimates. Let us consider the node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (3-13), the probability estimates for classes C1, C2, C3 and C4 under node x2 can be approximated as P(C1) = .66, P(C2) = .23, P(C3) = 0, and P(C4) = .11.
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that the rules whose combined t-weight represented 10% or less coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
Complexity: No. of nodes: 5, No. of leaves: 7
Figure 4-6. A decision structure without x1, with candidate decisions and their probabilities assigned to leaves
Complexity: No. of nodes: 3, No. of leaves: 5
Figure 4-7. A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class
To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4: in the first situation, x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves; the predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves; the predictive accuracy of this decision structure was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified, using only the four attributes which were used in building the initial decision trees. The visualization diagram uses different shades for different decision classes. Another shade is used to mark cells that require another attribute to correctly classify the testing examples (e.g., x3, x4 or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.
Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data. (White cells mean the system cannot produce a decision without the missing attribute.)
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c
for a set of learning problems with 18 different parameter settings for AQ15c (two types of
decision rules--characteristic or discriminant; three coverage modes--intersecting, disjoint or
ordered, i.e., decision lists; and three beam search widths--1, 5 and 10). The two settings that
gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and <Chr, Intr, 1>; see Table
4-2) were selected for experiments with Subsystem II.
These experiments were performed for four learning problems (the three MONK's problems [Thrun,
Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two
parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2.
Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the
predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value
in this table is an average of the predictive accuracy of running either one of the two programs
100 times on 100 distinct, randomly selected training data sets of the given size. Each of these runs was
tested with testing examples that represent the complement of the training example set.
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c
and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting
covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means
discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I are fixed
and selected parameters of Subsystem II are modified. The experiments were performed on
characteristic decision rules that were learned in intersecting or disjoint modes. For each data
set, the results reported from each experiment are calculated as the average of 100 runs on different
training data for 9 different sample sizes. The parameters changed in this experiment were the
threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2
algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples
covered by rules belonging to different decision classes at a given node of the decision
structure/tree.
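The text defines the generalization degree only briefly. One plausible reading, sketched below under that assumption (not AQDT-2's actual code), is that a node becomes a leaf when the minority classes together account for no more than the given fraction of the node's examples:

```python
def stop_as_leaf(class_weights, degree=0.10):
    """Hedged sketch of a generalization-degree stopping test.

    Assumption: a node is generalized into a leaf when classes other than
    the dominant one cover at most `degree` of the examples at the node.
    """
    total = sum(class_weights.values())
    dominant = max(class_weights.values())
    return (total - dominant) / total <= degree

print(stop_as_leaf({"C1": 95, "C2": 5}))   # True: minority share is 5%
print(stop_as_leaf({"C1": 80, "C2": 20}))  # False: minority share is 20%
```

Lowering the degree (e.g., to 3%, as in the experiments below) makes the test stricter and the resulting structures less generalized.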
Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem. (Four panels--<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1> and <Intr, Disc, 1>--plot predictive accuracy against the relative sample size (%) of the training data.)
Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2
with different parameter settings. The default curve shows the predictive accuracy obtained with the
default settings of AQDT-2: the default pre-pruning threshold is 3% and the default generalization
degree is 10%. The results show that with the wind bracing data it is better to reduce the
generalization degree to 3%. However, changing the pre-pruning degree did not improve the
predictive accuracy.
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems
were set to their default parameters. All the results reported here are the average of 100 runs. For
each data set we reported the predictive accuracy, the complexity of the learned decision trees, and
the time taken for learning. Figure 4-11 shows a simple summary of these experiments.
Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data. (Both panels plot predictive accuracy against the relative sample size (%) of the training data.)
Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data. (The panels plot predictive accuracy, tree complexity, and learning time against the relative size (%) of the training examples.)
4.3 Experiments With Small Size, Simple and Noise-Free Problems: MONK-1
This subsection describes an experimental analysis of the AQDT approach on the MONK-1
problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification
rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists
of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are
octagonal, square or round); x2, body-shape (values are octagonal, square or round); x3, is-smiling
(values are yes or no); x4, holding (values are sword, flag or balloon); x5, jacket-color
(values are red, yellow, green or blue); and x6, has-tie (values are yes or no).
The original problem was to learn a concept from 124 training examples (62 positive and 62
negative). These training examples constitute 29% of all possible examples (432); thus the
density of the training examples is relatively high. Figure 4-12 shows a visualization diagram,
obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and
negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c
from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2
criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned
when using different criteria.
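The example-space arithmetic quoted above can be checked directly: the six attribute domains multiply out to 432 possible robots, of which the 124 training examples cover about 29%:

```python
# Domain sizes of the six MONK attributes listed above.
domain_sizes = {"x1": 3, "x2": 3, "x3": 2, "x4": 3, "x5": 4, "x6": 2}

total = 1
for size in domain_sizes.values():
    total *= size

print(total)                     # number of possible examples
print(round(124 / total * 100))  # training-set density, in percent
```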
Figure 4-12: A visualization diagram of the MONK-1 problem.
The AQDT-2 program, running in its default mode and with the optimality criterion set to minimize
the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with
41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also
applied to this same problem.
Positive rules:                    Negative rules:
1. [x5 = 1]                        1. [x1 = 1] [x2 = 2, 3] [x5 = 2..4]
2. [x1 = 3] [x2 = 3]               2. [x1 = 2] [x2 = 1, 3] [x5 = 2..4]
3. [x1 = 2] [x2 = 2]               3. [x1 = 3] [x2 = 1, 2] [x5 = 2..4]
4. [x1 = 1] [x2 = 1]
Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.
The C4.5 program did not produce a consistent and complete decision tree when run with its
default window size (the maximum of 20% and twice the square root of the number of examples), nor with
a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5
produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is
presented in Figure 4-14. Also in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was
used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that
takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These
rules were:
Pos <= [x5 = 1] v [x1 = x2]   and   Neg <= [x5 ≠ 1] & [x1 ≠ x2]
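The constructed attribute and the resulting compact rules can be sketched as follows (the dictionary-based example representation is illustrative, not AQ17-DCI's actual format):

```python
def derived_attribute(example):
    """AQ17-DCI's constructed feature: 'T' iff x1 equals x2."""
    return "T" if example["x1"] == example["x2"] else "F"

def classify(example):
    # Pos <= [x5 = 1] v [x1 = x2];  Neg otherwise
    if example["x5"] == 1 or derived_attribute(example) == "T":
        return "Positive"
    return "Negative"

print(classify({"x1": 2, "x2": 2, "x5": 3}))  # Positive: head and body shapes match
print(classify({"x1": 1, "x2": 3, "x5": 4}))  # Negative
```

With the derived attribute available, the concept reduces to a two-test decision structure, which is why the structure in Figure 4-15-b is so small.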
Table 4-3: Evaluation of the attribute selection criteria for the MONK-1 problem.
From these rules, the system produced the compact decision structure presented in Figure 4-15-b.
It should be noted that the decision structures in Figures 4-14, 4-15-a and 4-15-b are all logically
equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they
represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler
decision structure was produced (Figure 4-15-a).
Figure 4-14: The decision tree for the MONK-1 problem generated by AQDT-2 (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative).
Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: a) compact decision structure for the AQ15 rules (5 nodes, 7 leaves); b) compact decision structure for the AQ17 rules (2 nodes, 3 leaves). P = Positive, N = Negative.
Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments
involved running AQ15c for a set of learning problems with 18 different parameter settings for
AQ15c (two types of decision rules--characteristic or discriminant; three coverage modes--intersecting,
disjoint or ordered, i.e., decision lists; and three widths of the beam search--1, 5 and
10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and
<Chr, Intr, 1>) were selected for experiments with Subsystem II. These experiments were
performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by
AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from
the decision rules.
Each value in that table is an average of the predictive accuracy of running either one of the two
programs 100 times on 100 distinct, randomly selected training data sets of the given size. Each of these
runs was tested with a testing example set that represented the complement of the training example
set. Figure 4-16 shows diagrams illustrating the difference in the predictive accuracy between
AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers,
<Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant
rules.
Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem. (Four panels--<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1> and <Intr, Disc, 1>--plot predictive accuracy against the relative sample size (%) of the training data.)
Experiments with Subsystem II: The same experiments were performed on the MONK-1
problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were
modified. The experiments were performed on characteristic decision rules that were learned in
intersecting or disjoint modes. For each data set, the results reported from each experiment
were calculated as the average over 100 runs on different training data for 9 different sample sizes.
The parameters changed in this experiment were the threshold of pre-pruning of the decision
rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure
4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with
different parameter settings. The default curve shows the predictive accuracy obtained with the default
settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree
is 10%. The results show that with the MONK-1 data it is slightly better to reduce the
generalization degree to 3%. However, increasing the pre-pruning degree did not improve the
predictive accuracy.
Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data. (Two panels--<Disj, Char> and <Intr, Char>--plot predictive accuracy against the relative sample size (%) of the training data.)
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to
their default parameters. The experiments were divided into two parts. All the results reported here
are the average of 100 runs. For each data set we reported the predictive accuracy, the complexity
of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary
of these experiments.
Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem. (The panels plot predictive accuracy, tree complexity, and learning time against the relative size (%) of the training examples.)
4.4 Experiments With Small Size, Complex and Noise-Free Problems: MONK-2
The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be
easily described as a DNF expression using its original attributes). The problem is described in a
similar way to the MONK-1 problem. The data consists of two decision classes, Positive and
Negative, and six attributes: x1, head-shape (values are octagonal, square or round); x2, body-shape
(values are octagonal, square or round); x3, is-smiling (values are yes or no); x4, holding
(values are sword, flag or balloon); x5, jacket-color (values are red, yellow, green or blue); and
x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training
examples. These training examples constitute 40% of all the possible
examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and
negative) and the concept to be learned.
Figure 4-19: A visualization diagram of the MONK-2 problem.
Experiments with Subsystem I: The two settings that gave the best results in terms of predictive
accuracy were the same as for the other problems (<Chr, Disj, 10> and <Chr, Intr, 1>). They were
selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules
learned by AQ15c from examples and the predictive accuracy of decision structures learned by
AQDT-2 from these decision rules.
Each value in that table is an average of the predictive accuracy of running either one of the two
programs 100 times on 100 distinct, randomly selected training data sets of the given size. Each of these
runs is tested with a testing example set that represents the complement of the training examples.
Figure 4-20 shows a diagram illustrating the difference in the predictive accuracy between AQ15c
and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj>
means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and
the number is the width of the beam search.
Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem. (Four panels--<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1> and <Intr, Disc, 1>--plot predictive accuracy against the relative sample size (%) of the training data.)
Experiments with Subsystem II: Again, the parameters of Subsystem I are fixed and
selected parameters of Subsystem II are modified. For each data set, the results reported from each
experiment were calculated as the average of 100 runs on different training data for 9 different
sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of
the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam,
1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by
AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained with
the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default
generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to
reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not
improve the predictive accuracy.
Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data. (Two panels--<Disj, Char> and <Intr, Char>--plot predictive accuracy against the relative sample size (%) of the training data.)
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to
their default parameters. The experiments were divided into two parts. All the results reported here
are the average of 100 runs. For each data set we reported the predictive accuracy, the complexity
of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary
of these experiments.
Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem. (The panels plot predictive accuracy, tree complexity, and learning time against the relative size (%) of the training examples.)
4.5 Experiments With Small Size, Simple and Noisy Problems: MONK-3
MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a
similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same
domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a
visualization diagram of the training examples (positive and negative) and the concept to be
learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered
noisy examples. Noisy examples are examples that are assigned the wrong decision class.
Figure 4-23: A visualization diagram of the MONK-3 problem.
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by
AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from
these decision rules. Each value in that table is an average of the predictive accuracy of running
both of the two programs 100 times on 100 distinct, randomly selected training data sets of the
given size. Each of these runs was tested with a testing example set that represented the complement
of the training example set. Figure 4-24 shows diagrams illustrating the difference in the predictive
accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the
learning process) were fixed and selected parameters of Subsystem II (the decision-making
process) were changed. The results reported from each experiment were calculated as the average of
100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in
the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings.
The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results
show that with the MONK-3 data it is usually better to reduce the generalization degree. Also,
increasing the pre-pruning threshold does not improve the predictive accuracy.
Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem. (Four panels--<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1> and <Intr, Disc, 1>--plot predictive accuracy against the relative sample size (%) of the training data.)
Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data. (Two panels plot predictive accuracy against the relative sample size (%) of the training data.)
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by
AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a
summary of the predictive accuracy, the complexity of the learned decision trees, and the learning
time. The reason that there is a drop in the predictive accuracy at some sample sizes is that
the testing data is not fixed for each sample. In other words, one error may represent a
1.01% error rate when testing against 90% of the data, and the same error may represent 10% when
testing against 10% of the data. These curves do not represent the learning curve.
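The arithmetic behind this effect can be made concrete; the test-set sizes below are illustrative (99 and 10 examples mirror the 1.01% versus 10% contrast):

```python
def error_rate_percent(errors, test_size):
    """Error rate, in percent, of a fixed number of errors on a test set."""
    return 100.0 * errors / test_size

# The same single misclassification, weighed against test sets of different sizes:
print(round(error_rate_percent(1, 99), 2))  # large held-out set: small rate
print(round(error_rate_percent(1, 10), 2))  # small held-out set: large rate
```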
Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem. (The panels plot predictive accuracy, tree complexity, and learning time against the relative size (%) of the training examples.)
4.6 Experiments With Large, Complex and Noise-Free Problems: Diagnosing
Breast Cancer
The breast cancer database is concerned with recognizing breast cancer. The data used here are
based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990).
The data has 699 examples represented using ten attributes and grouped into two decision classes
(Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3)
Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial
Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes
except the sample code number had a domain of ten values (they were scaled).
In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the
experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the
results reported here were based on the average of 100 runs. For each data set we reported the
predictive accuracy, the complexity of the learned decision trees, and the time taken for learning.
Figure 4-27 summarizes these experiments.
The reason that there is a drop in the predictive accuracy at some sample sizes is that
the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error
rate when testing against 90% of the data, and the same error may represent 10% when testing
against 10% of the data. These curves do not represent the learning curve.
Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem. (The panels plot predictive accuracy, tree complexity, and learning time against the relative size (%) of the training examples.)
4.7 Experiments With Large, Complex and Noisy Problems: Mushroom
Classification
Learning from the Mushroom database involves classifying mushrooms into edible or
poisonous classes. The data was drawn from the Audubon Society Field Guide to North American
Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected
to perform the experiment. Each example was described by 22 attributes. These attributes are: 1)
Cap-shape, 2) Cap-surface, 3) Cap-color, 4) Bruises, 5) Odor, 6) Gill-attachment, 7) Gill-spacing,
8) Gill-size, 9) Gill-color, 10) Stalk-shape, 11) Stalk-root, 12) Stalk-surface-above-ring, 13) Stalk-surface-below-ring,
14) Stalk-color-above-ring, 15) Stalk-color-below-ring, 16) Veil-type, 17) Veil-color,
18) Ring-number, 19) Ring-type, 20) Spore-print-color, 21) Population, and 22) Habitat.
To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults,
and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5.
All the results reported here are the average of 100 runs. For each data set we reported the
predictive accuracy, the complexity of the learned decision trees, and the time taken for learning.
Figure 4-28 shows a simple summary of these experiments.
In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the
size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees.
The average difference in accuracy is less than 2%, the average difference in tree complexity is
greater than 10 nodes, and the average learning times are about the same.
The reason that there is a drop in the predictive accuracy at some sample sizes is that
the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error
rate when testing against 90% of the data, and the same error may represent 10% when testing
against 10% of the data. These curves do not represent the learning curve.
Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem. (The panels plot predictive accuracy, tree complexity, and learning time against the relative size (%) of the training examples.)
4.8 Experiments With Small, Structured and Noise-Free Problems: East-West
Trains
Learning task-oriented decision structures from structural data: This subsection
briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision
structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was
to classify a set of trains into two classes: eastbound and westbound. The data was structured such
that each train consisted of two to four cars. Each car was described in terms of two main
features: the body of the car and the content of the car. The body of the car was described by 6
different attributes and the load of the car was described by two attributes.
The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts
rules or examples in the form of an array of attribute-value assignments. It can also accept
examples with different numbers of attribute-value pairs (i.e., examples of different length). To
describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was
generated such that they could completely describe any car in the train (see Table 4-7). Each train
was described by one example of varying length. To recognize the number (position) of a given car
in the train, each of the eight attributes was associated with a two-digit code (ij), where the first digit
identifies the location of the car and the second identifies the number of the attribute itself. For example, the
number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second
attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the
shape of the third car.
Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1..4).
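The flattening scheme just described can be sketched as follows. The attribute names and their order are illustrative placeholders (Table 4-7 gives the actual eight attributes); only the xij naming convention follows the text:

```python
# Hypothetical names for the eight per-car attributes of Table 4-7.
CAR_ATTRIBUTES = ["infront", "shape", "length", "roof",
                  "wheels", "sides", "load_shape", "load_number"]

def flatten_train(cars):
    """Flatten a structural train description into xij attribute-value pairs.

    cars: list of dicts, one per car, keyed by the eight attribute names.
    The flat attribute x<i><j> holds attribute j of car i.
    """
    example = {}
    for i, car in enumerate(cars, start=1):
        for j, name in enumerate(CAR_ATTRIBUTES, start=1):
            example[f"x{i}{j}"] = car[name]
    return example

car = dict(zip(CAR_ATTRIBUTES, [0, 2, 1, 0, 2, 1, 1, 1]))
print(flatten_train([car, car])["x22"])  # value of attribute 2 (shape) of car 2
```

Because trains have two to four cars, the flattened examples have between 16 and 32 attribute-value pairs, which is why AQDT-2's support for examples of varying length matters here.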
Decision-making situations: In the first decision-making situation, a decision structure that
classifies any given train as either eastbound or westbound was learned using only attributes
describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out
of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was
hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation
where only attributes describing the second car are used in classifying the trains. It correctly
classified 18 of the trains.
Both decision structures have leaves with multiple decisions, which means there are identical first or
second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using
attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or
more. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were
given lower cost than x31. Both decision structures classified the 14 trains with three or more cars
correctly. These last two decision structures classified any train with three or more cars correctly,
and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).
Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: a) using only descriptions of Car 1; b) using only descriptions of Car 2; c) using only descriptions of Car 3.
49 Experiments With Small Size Simple and Noisy Problems Congressional
Voting Records (1984)
In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the other half in the other class).
Table 4-8 and Figures 4-30-a and b show the results graphically for the Congressional Voting-1984 problem. The results indicate that the decision trees generated by AQDT-2 had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change of the size of the training example set was smaller.
Table 4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data
a) Accuracy of the decision tree as a function of the size of the set of training examples; b) Size of the decision tree as a function of the size of the set of training examples
Figure 4-30 Comparing decision trees for the Congressional Voting-84 data learned by C4.5 & AQDT-2
4.10 Analysis of the Results
This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to
illustrate the relationship between concepts represented by decision rules and concepts represented by the decision tree learned from these rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5 and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type and lower with others, compared with another type of cover), the best cover is determined according to the best width of the beam search and the best rule type.
Table 4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics
It was clear that AQDT-2 works better with characteristic rules than with discriminant rules. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between
the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, the predictive accuracy is considered higher or lower; 2) if the average learning time is within ±0.1 seconds, the learning time is considered the same.
Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparing the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).
Table 4-10 Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better. Same/X means similar performance of both systems; AQDT-2 is better if X=A and C4.5 is better if X=C
Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, while C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except on noisy data. The size of decision trees learned by C4.5 grows relatively faster as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules, while C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be much less than that of C4.5. However, on some data sets it takes more time, because there are some
situations where there is not enough information to reach a decision, and the program goes into a loop of testing all attributes. A probabilistic approach for handling this problem has not been implemented yet.
To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.
Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class; the white areas represent non-positive coverage.
Figure 4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. Shaded cells carrying an error mark indicate false positive errors (AQ15c classifies the cell as positive while it should be negative); non-shaded cells carrying an error mark indicate false negative errors (AQ15c classifies the cell as negative while it should be positive).
Figure 4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, cells with one shading indicate portions of the representation space that were classified as positive by both AQ15c and AQDT-2. Cells with a second shading are portions of the representation space that were classified as positive by AQ15c but as negative by AQDT-2. Cells with a third shading represent portions of the representation space where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with default settings (i.e., with a 10% generalization threshold).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors produced by the AQDT-2 decision tree.
Figure 4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem
Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative errors: one mark indicates portions of the representation space with false positive errors, and another marks portions with false negative errors. Comparing Figures 4-34 and 4-32 shows that more errors occurred because of the over-generalization.
Figure 4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree
Figure 4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%
CHAPTER 5 CONCLUSION
5.1 Summary
This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.
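A minimal sketch of such a structure follows. The representation is hypothetical, not the AQDT-2 internals: internal nodes test an attribute, branches carry sets of attribute values, and a leaf may hold one or several candidate decisions:

```python
# Hypothetical single-parent decision structure (illustrative sketch).

class Node:
    def __init__(self, attribute=None, decisions=None):
        self.attribute = attribute   # None marks a leaf
        self.decisions = decisions   # candidate classes at a leaf
        self.branches = []           # list of (set_of_values, child_node)

    def add_branch(self, values, child):
        self.branches.append((set(values), child))

def classify(node, example):
    """Follow branches whose value sets contain the example's attribute value.
    Assumes every reachable value is covered by some branch."""
    while node.attribute is not None:
        value = example[node.attribute]
        node = next(child for vals, child in node.branches if value in vals)
    return node.decisions

root = Node("x17")
root.add_branch({1}, Node(decisions=["east"]))
root.add_branch({2, 3, 4}, Node(decisions=["east", "west"]))  # multi-decision leaf
print(classify(root, {"x17": 1}))  # -> ['east']
```

Because a branch groups several attribute values and a leaf may carry several candidate decisions, this representation already subsumes an ordinary decision tree; the full (multi-parent) decision structure discussed below generalizes it further.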
The proposed methodology advocates storing decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated online, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: in order to determine a decision structure from examples, it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is
usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.
5.2 Contributions
The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of
the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.
Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program in most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees. The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of the rule learner, it could potentially be applied with other decision rule learning systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), Constructive Induction in Structural Design, Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.
Bergadano, F., Giordana, A., Saitta, L., DeMarchi, D. and Brancadori, F. (1990), Integrated Learning in a Real Domain, Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.
Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System, Machine Learning, Vol. 8, No. 1, pp. 5-43.
Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), AQ17: A Multistrategy Learning System: The Method and User's Guide, Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.
Bohanec, M. and Bratko, I. (1994), Trading Accuracy for Simplicity in Decision Trees, Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.
Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.
Bratko, I. and Kononenko, I. (1986), Learning Diagnostic Rules from Incomplete and Noisy Data, in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.
Clark, P. and Niblett, T. (1987), Induction in Noisy Domains, in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.
Cestnik, B. and Bratko, I. (1991), On Estimating Probabilities in Tree Pruning, Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.
Cestnik, B. and Karalic, A. (1991), The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction, Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.
Gaines, B. (1994), Exception DAGs as Knowledge Structures, Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.
Hart, A. (1984), Experience in the use of an inductive system in knowledge engineering, Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.
Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.
Imam, I.F. and Michalski, R.S. (1993a), Should Decision Trees be Learned from Examples or from Decision Rules?, Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, from the proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.
Imam, I.F. and Michalski, R.S. (1993b), Learning Decision Trees from Decision Rules: A method and initial results from a comparative study, Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. & Zemankova, M. (Eds.), Kluwer Academic Pub., MA.
Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques, Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.
Imam, I.F. and Vafaie, H. (1994), An Empirical Comparison Between Global and Greedy-like Search for Feature Selection, Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam, I.F. and Michalski, R.S. (1994), From Facts to Rules to Decisions: An Overview of the FRD-1 System, Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.
Kohavi, R. (1994), Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations, Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.
Kohavi, R. & Li, C. (1995), Oblivious Decision Trees, Graphs, and Top-Down Pruning, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian, O.L. and Wolberg, W.H. (1990), Cancer Diagnosis via Linear Programming, SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.
Michalski, R.S. (1973), AQVAL/1 - Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition, Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.
Michalski, R.S. (1978), Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams, Technical Report No. 898, Urbana: University of Illinois, March.
Michalski, R.S. (1983), A Theory and Methodology of Inductive Learning, Artificial Intelligence, Vol. 20 (pp. 111-116).
Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains, Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.
Michalski, R.S. (1990), Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Michalski, R.S. and Imam, I.F. (1994), Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System, Lecture Notes in Artificial Intelligence, Springer-Verlag, from the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers, J. (1989a), An Empirical Comparison of Selection Measures for Decision-Tree Induction, Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.
Mingers, J. (1989b), An Empirical Comparison of Pruning Methods for Decision-Tree Induction, Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.
Niblett, T. and Bratko, I. (1986), Learning decision rules in noisy domains, Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.
Quinlan, J.R. (1979), Discovering Rules by Induction from Large Collections of Examples, in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan, J.R. (1983), Learning efficient classification procedures and their application to chess end games, in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.
Quinlan, J.R. (1986), Induction of Decision Trees, Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.
Quinlan, J.R. (1987), Simplifying Decision Trees, International Journal of Man-Machine Studies, 27 (pp. 221-234).
Quinlan, J.R. (1990), Probabilistic decision trees, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.
Smyth, P., Goodman, R.M. and Higgins, C. (1990), A Hybrid Rule-based/Bayesian Classifier, Proceedings of ECAI 90, Stockholm, August.
Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.
Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), The MONK's Problems: A Performance Comparison of Different Learning Algorithms, Technical Report, Carnegie Mellon University, October.
Wnek, J. and Michalski, R.S. (1994), Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments, Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
Vita
Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition on Machine and Human Intelligence (organized by Oxford University): two solutions obtained by that program ranked second and third in one competition, and two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
Dedication
To my mother my brothers and my sister
TABLE OF CONTENTS
TITLE Page
ABSTRACT 1
CHAPTER 1
INTRODUCTION 3
1.1 Motivation and Overview 3
1.2 The Problem Statement 6
CHAPTER 2
RELATED RESEARCH 7
2.1 Learning Decision Trees from Decision Diagrams 7
2.2 Learning Decision Trees from Examples 10
2.2.1 Building Decision Trees Using Information-based Criteria 11
2.2.2 Building Decision Trees Using Statistics-based Criteria 16
2.2.3 Analysis of Attribute Selection Criteria 18
2.3 Learning Decision Structures 19
CHAPTER 3
DESCRIPTION OF THE APPROACH 23
3.1 General Methodology 23
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
3.3 Generating Decision Structures From Decision Rules 28
3.3.1 The AQDT-2 attribute selection method 29
3.3.2 The AQDT-2 algorithm 37
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structure to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decision Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53
CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average Size, Complex and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small Size, Simple and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small Size, Complex and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small Size, Simple and Noisy Problems: MONK-3 79
4.6 Experiments With Large Size, Complex and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large Size, Complex and Noisy Problems: Mushroom classifications 84
4.8 Experiments With Small Size, Structured and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small Size, Simple and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88
CHAPTER 5 CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96
REFERENCES 98
VITA 102
LIST OF TABLES
No. TITLE Page
2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria on four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing a software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using condition of AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90
LIST OF FIGURES
No. TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing a software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1% 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M Fahmi Imam PhD
School of Information Technology and Engineering
George Mason University, Fall 1995
Ryszard S Michalski Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from
decision rules. The philosophy behind this research is that it is more appropriate to learn
knowledge and store it in a declarative form, and then, when a decision-making situation occurs,
generate from this knowledge the decision structure that is most suitable for the given decision-making
situation. Learning decision structures from decision rules was first introduced by
Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam
and Michalski (1993a, b).
This approach separates the function of generating a knowledge base from the function of using
the knowledge base for decision-making. The first function focuses on learning an accurate,
consistent, and complete concept description expressed in a declarative form. The second function
is performed whenever a new decision-making situation occurs: a task-oriented decision structure
is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted
for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam,
1994).
The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from
decision rules or examples. Each decision-making situation is defined by a set of parameters that
controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show
that decision structures learned by it usually outperform, in terms of accuracy and average size of
the decision structures, those learned from examples by other well-known systems. The results
also show that the system does not work very well with noisy data. The system is illustrated and
compared using applications to artificial problems, such as the three MONK's problems (Thrun,
Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also
applied to real-world problems of learning decision structures in the areas of construction
engineering (for determining the best wind bracing design for tall buildings), medical
diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for
learning classification rules for distinguishing between poisonous and non-poisonous
mushrooms), and political data (for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
1.1 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge, but also
to use this knowledge for decision-making. The main step in the development of systems for
decision-making is the creation of a knowledge structure that characterizes the decision-making
process. The form in which knowledge can be easily obtained may, however, differ from the form
in which it is most readily used for decision-making. It is therefore important to identify the form
of knowledge representation that is most appropriate for learning (e.g., due to ease of its
modification) and the form that is most convenient for decision-making.
A simple and effective tool for describing decision processes is a decision structure, which is a
directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to
arrive at a decision about that object. The nodes of the structure are assigned individual tests
(which may correspond to a single attribute, a function of attributes, or a relation); the branches are
assigned possible test outcomes or ranges of outcomes; and the leaves are assigned a specific
decision, a set of candidate decisions with corresponding probabilities, or an undetermined
decision. A decision structure reduces to a familiar decision tree when each node is assigned a
single attribute and has at most one parent, when the branches from each node are assigned single
values of that attribute, and when leaves are assigned single, definite decisions. Thus, the problem
of generating a decision structure is a generalization of the problem of generating a decision tree.
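To make this definition concrete, the following sketch (my own illustration with hypothetical names, not code from the dissertation) represents a decision structure whose branches carry sets of outcomes and whose leaves may hold several candidate decisions with probabilities; a structure in which every branch set is a single value and every leaf holds one decision is an ordinary decision tree:

```python
# Illustrative sketch of a decision structure (all names are my own).
# Nodes hold a test; branches map outcome *sets* to children, so one branch can
# cover a range of outcomes; leaves may hold a distribution over candidate
# decisions rather than a single decision.
from dataclasses import dataclass, field

@dataclass
class Leaf:
    decisions: dict  # decision -> probability; {d: 1.0} gives an ordinary tree leaf

@dataclass
class Node:
    test: str                                     # attribute (or function/relation) name
    branches: dict = field(default_factory=dict)  # frozenset of outcomes -> Leaf | Node

def classify(node, example):
    """Follow the structure's tests until a leaf (or an uncovered outcome) is reached."""
    if isinstance(node, Leaf):
        return node.decisions
    value = example[node.test]
    for outcomes, child in node.branches.items():
        if value in outcomes:
            return classify(child, example)
    return {"undetermined": 1.0}                  # outcome not covered by any branch

# A two-node structure: the branch for x1 in {1, 2} groups two outcomes together,
# and one leaf keeps two candidate decisions with probabilities.
structure = Node("x1", {
    frozenset({0}): Leaf({"A1": 1.0}),
    frozenset({1, 2}): Node("x2", {
        frozenset({0}): Leaf({"A1": 0.8, "A2": 0.2}),
        frozenset({1}): Leaf({"A2": 1.0}),
    }),
})
print(classify(structure, {"x1": 2, "x2": 0}))
```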
Decision trees are typically generated from a set of examples of decisions. The essential
characteristic of any such method is the attribute selection criterion used for choosing attributes to
be assigned to the nodes of the decision tree being built. Such criteria include the entropy
reduction, the gain, and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman
et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
A decision tree/decision structure representation can be an effective tool for describing a decision
process, as long as all the required tests can be performed and the decision-making situations it
was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine
that answers for all symptoms appear in the decision tree). Problems arise when these assumptions
do not hold. For example, in some situations measuring certain attributes may be difficult or costly
(e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the
tools needed are not available). In such situations it is desirable to reformulate the decision
structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the
root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far
away from the root). If an attribute cannot be measured at all, it is useful either to modify the
structure so that it does not contain that attribute or, when this is impossible, to indicate
alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a
significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient
example, the doctor may request a decision structure expressed in a specific set of symptoms,
biased to classify one or more diseases, or specifying a certain order of testing).
A restructuring of a decision structure (or a tree) in order to suit new requirements can be quite
difficult. This is because a decision structure is a procedural representation that imposes an
evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative
representation, such as a set of decision rules: tests (conditions) of rules can be evaluated in any
order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent
decision structures (trees), which differ in the test ordering. Due to the lack of order constraints,
a declarative representation (rules) is much easier to modify to adapt to different situations than a
procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a
decision, one needs to decide in which order tests are evaluated, and thus needs a decision
structure.
An attractive solution to these opposite requirements is to acquire and store knowledge in a
declarative form and transform it to a decision structure when it is needed for decision-making.
This method allows one to create a decision structure that is most appropriate in a given
decision-making situation. Because the number of decision rules per decision class is usually small (each
rule is a generalization of a set of examples), generating a decision structure from decision rules
can potentially be performed much faster than generation from training examples. Thus, this
process could be done on line without any delay noticeable to the user. Such virtual decision
structures are easy to tailor to any given decision-making situation.
This approach allows one to generate a decision structure that avoids or delays evaluating an
attribute that is difficult to measure in some decision-making situation, or that fits well a particular
frequency distribution of decision classes. In other situations it may be unnecessary to generate a
complete decision structure; it may be sufficient to generate only the part of it that concerns
the decision classes of interest. Thus, such an approach has many potential advantages.
This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a
task-oriented decision structure (a decision structure that is adapted to the given decision-making
situation) from decision rules. The decision rules are learned by either the rule learning system AQ15
(Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction
capabilities (Bloedorn et al., 1993).
To associate the decision rules with a given decision-making task, AQDT-2 provides a set of
features, including: 1) enabling the system to include in the decision structure nodes corresponding
to new attributes constructed during the process of learning the decision rules; 2) controlling the
degree of generalization needed during the development of the decision structure; 3) providing four
new criteria for selecting an attribute to be a node in the decision structure, which allow the system to
generate many different but equivalent decision structures from the same set of rules; 4) generating
unknown nodes in situations where there is insufficient information for generating a complete
decision structure; 5) learning decision structures from discriminant rules as well as
characteristic rules; and 6) providing the most likely decision when the decision process stops
due to inability to evaluate an attribute associated with an intermediate node.
To test the methodology of generating decision structures from decision rules, an extensive set of
planned experiments was designed to test different aspects of the approach. The experiments
include testing different combinations of parameters for each sub-function of the approach,
analyzing the relationship between decision rules and the decision structure learned from them, and
comparing decision trees learned by the AQDT-2 system with those of the well-known C4.5 (Quinlan, 1993)
system for learning decision trees from examples. Different experiments were designed to examine
the new features of the AQDT approach for learning task-oriented decision structures. The
experiments were applied to artificial domains as well as real-world domains, including MONK-1,
MONK-2, and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al.,
1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast
Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The
MONK's problems are concerned with learning classification rules for robot-like figures: MONK-1
requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description
(one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns
learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that
classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind
bracing data involves learning conditions for applying different types of wind bracing for tall
buildings. The Mushrooms data is concerned with learning classification rules for distinguishing
between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning
concept descriptions for recognizing breast cancer. The congressional voting data includes voting
records on different issues. AQDT-2 outperformed C4.5 on average with respect to both
predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy
data or with problems that have many rules covering very few examples.
1.2 The Problem Statement
There are many limitations and problems that accompany the use of decision trees for decision-making.
CHAPTER 2 RELATED RESEARCH
2.1 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which
introduced an algorithm for generating decision trees from decision lists. The method proposed
several attribute selection criteria. These criteria are instances, of increasing power, of the main criterion,
the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two
specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree,
based on properties extracted from the decision diagram. In order to better explain the method,
it is necessary to define some terms. These terms will be used later in the dissertation.
Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all
examples of class C and does not include any examples of other decision classes.
Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically
disjoint; in other words, for any two rules there exists a condition with the same attribute but
with different values in each rule.
Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules
among all possible covers.
Definition 2-4: A diagram for a given cover is a table constructed graphically by representing
in a two-dimensional space all possible combinations of attribute values, locating on the
diagram all the condition parts of the given rules, and marking them with the action specified
by each rule.
Michalski's algorithm requires construction of a minimal cover. The minimal cover should be
consistent and complete. The method is based on the fact that, if there are n decision classes,
any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal
decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has
shown that if only one rule is broken by a selected attribute, then instead of having one leaf
(which could potentially represent this rule or the decision class in the tree), there will have to
be at least two leaves representing this rule in the final decision tree.
The attribute selection criterion MAL introduced in (Michalski, 1978) prefers attributes that do
not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can
divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules.
In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] &
[x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1], and x3 breaks three rules: the rule [x4=2]
& [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] &
[x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.
Figure 2-1 An example to illustrate how attributes break rules
One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each
attribute an integer equal to the number of rules broken by that attribute. This criterion is also
called the static cost estimate of an attribute, or the criterion of minimizing added leaves
(MAL).
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the
estimated number of additional nodes in the decision tree being generated over a hypothetical
minimal decision tree. When there is a tie between two attributes, the attribute to be selected is
the one which breaks smaller rules (rules that cover fewer examples, or more specialized
rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).
Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion
(Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but
is more complex because, once an attribute is selected as a node in the tree, some rules and/or
parts of the broken rules at each branch are merged into one rule. DMAL ensures that the
value of the total cost estimate of an attribute is decreased by a value equal to the number of
merged rules minus one.
Example: Learn a decision tree from the following decision table.
The minimal cover consists of the following rules:
A1 <:: [x2=0] v [x1=0][x2=2]    A2 <:: [x2=1] v [x1=2][x2=2]    A3 <:: [x1=1][x2=2]
The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5
for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it will add two
leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of
the decision tree is x2. Then three branches are attached to the root node, and the decision rules
are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is
generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case, x1 has the
minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.
Figure 2-2 A decision tree learned from the decision table in Table 2-1
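The static cost estimate in this example can be sketched in a few lines (my own sketch, not the dissertation's code; the attribute domains for x3 and x4 are assumed, and a rule is treated as broken when it leaves more than one value of the attribute allowed):

```python
# Sketch of the static MAL (first-degree) cost estimate from the example above.
# A rule is a dict mapping attribute -> set of allowed values; an attribute
# absent from a rule is unconstrained. An attribute "breaks" a rule when the
# rule leaves more than one of that attribute's values allowed, so the rule
# would be split (or replicated) across the attribute's branches.

DOMAINS = {"x1": {0, 1, 2}, "x2": {0, 1, 2}, "x3": {0, 1}, "x4": {0, 1}}  # x3, x4 assumed

RULES = [                        # the minimal cover from the example
    {"x2": {0}},                 # A1 <:: [x2=0]
    {"x1": {0}, "x2": {2}},      # A1 <:: [x1=0][x2=2]
    {"x2": {1}},                 # A2 <:: [x2=1]
    {"x1": {2}, "x2": {2}},      # A2 <:: [x1=2][x2=2]
    {"x1": {1}, "x2": {2}},      # A3 <:: [x1=1][x2=2]
]

def mal(attribute, rules, domains):
    """Number of rules broken by the attribute (static cost estimate)."""
    broken = 0
    for rule in rules:
        allowed = rule.get(attribute, domains[attribute])
        if len(allowed) > 1:     # rule would be split across this attribute's branches
            broken += 1
    return broken

scores = {a: mal(a, RULES, DOMAINS) for a in DOMAINS}
print(scores)   # x2 has the lowest estimate, so it becomes the root
```

Running this reproduces the evaluations quoted in the text: 2 for x1, 0 for x2, and 5 for each of x3 and x4.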
2.2 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating a decision tree that classifies a set
of examples according to the decision classes they belong to. The essential aspect of any
inductive decision tree method is the attribute selection criterion. The attribute selection
criterion measures how good the attributes are for discriminating among the given set of
decision classes. The best attribute according to the selection criterion is chosen to be assigned
to a node in the tree. The first algorithm for generating decision trees from examples was
proposed by Hunt, Marin, and Stone (1966). Hunt's algorithm uses a divide-and-conquer
strategy for building decision trees. This algorithm has been subsequently modified by
Quinlan (1979) and applied by many researchers to a variety of learning problems.
Attribute selection criteria can be divided into three categories: logic-based, information-based,
and statistics-based. The logic-based criteria for selecting attributes
use logical relationships between the attributes and the decision classes to determine the best
attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves
(Michalski, 1978), which uses conjunction and disjunction operators. The information-based
criteria are based on information theory. These criteria measure the information conveyed
by dividing the training examples into subsets. Examples of such criteria include the
information measure IM, the entropy reduction measure, and the gain criteria (Quinlan, 1979,
1983), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and
others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The
statistics-based criteria measure the correlation between the decision classes and the attributes.
These criteria use statistical distributions for determining whether or not there is a correlation.
The attribute with the highest correlation is selected to be a node in the tree. Examples of
statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984;
Mingers, 1989a).
Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the
method of learning decision trees to also handle data with noise (by pruning). Handling noise
extended the process of learning decision trees to include the creation of an initial complete
decision tree and tree pruning, which is done by removing subtrees with small statistical validity
and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used
for simplifying decision trees, even for problems without noise (Bohanec & Bratko, 1994).
Pruning decision trees improves their simplicity but reduces their predictive accuracy on the
training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value
problem by exploring probabilities of an example belonging to different classes.
The rest of this section includes a brief description of the attribute selection criterion used by the
C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting
an attribute to be a node in the tree. The section also includes a brief description of the
Chi-square method for attribute selection (Mingers, 1989a), a statistics-based
method for selecting an attribute to be a node in the tree.
2.2.1 Building Decision Trees Using Information-based Criteria
This section presents a description of the inductive decision tree learning system C4.5. The
C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs
for learning decision trees from examples.
Learning decision trees from examples requires a collection of examples, each
represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning
program that induces classification decision trees from a set of given examples. The C4.5
learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on
Hunt's method for constructing decision trees from a set of cases (Hunt, Marin & Stone, 1966).
The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion
calculates the gain in classifying information based on the residual information needed to
classify cases in a set of training examples and the information yielded by the test, based on the
relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is
based on an earlier criterion used by ID3, called the Gain Criterion. The Gain Criterion uses
the frequency of each decision class in the given set of training examples.
Once an attribute is chosen to be a node in the tree, the system generates as many links as the
number of its values and classifies the set of examples based on these values. If all the
examples at a certain node belong to one decision class, the system generates a leaf node and
assigns it to that class. Otherwise, the system searches for another attribute to be a node in the
tree.
The Gain Criterion: The gain criterion is based on information theory; that is, the
information conveyed by a message depends on its probability and can be measured in bits as
minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for
a given problem x1, ..., xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is
any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S
is the number of examples in S that belong to class Ci:

freq(Ci, S) = number of examples in S belonging to Ci (2-1)

Suppose that |S| is the total number of examples in S; the probability that an example selected
at random from S belongs to class Ci is freq(Ci, S) / |S|.
The information conveyed by the message that a selected example belongs to a given decision
class Ci is determined by -log2 (freq(Ci, S) / |S|) bits.
The expected information from such a message stating class membership is given by:

info(S) = - Σ (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|) bits, summed over i = 1, ..., k (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples,
info(T) determines the average amount of information needed to identify the class of an
example in T.
Suppose that we select an attribute X to be the root of the tree, and suppose that X has k
possible values. The training set T will be divided into k subsets, each corresponding to one of
X's values. The expected information of selecting X to partition the training set T, infoX(T), can
be found as the sum, over all subsets, of the information conveyed by each subset weighted
by its probability:

infoX(T) = Σ (|Ti| / |T|) info(Ti), summed over i = 1, ..., k (2-3)

The information gained by partitioning the training examples T into subsets using the attribute
X is given by:

gain(X) = info(T) - infoX(T) (2-4)

The attribute to be selected is the attribute with the maximum gain value.
The Gain Ratio Criterion: This criterion indicates the proportion of information generated by
the split that appears helpful for classification. Quinlan (1993) pointed out that the gain
criterion has a serious deficiency: it is strongly biased toward attributes with many
outcomes (values). For example, for any data that contains attributes such as a social security
number, the gain criterion will select that attribute to be the root of the decision tree. However,
selecting such attributes increases the size of the decision tree. Quinlan provided a solution to
this problem by introducing the gain ratio criterion, which takes the ratio of the information that
is gained by partitioning the initial set of examples T by the attribute X to the potential
information generated by dividing T into n subsets.
Following similar steps, the expected information generated by dividing T into n subsets,
by analogy to equation 2-2, is determined by:

split info(T) = - Σ (|Ti| / |T|) log2 (|Ti| / |T|), summed over i = 1, ..., n (2-5)

The gain ratio is given by:

gain ratio(X) = gain(X) / split info(X) (2-6)

and it expresses the proportion of information generated by the split that is useful for
classification.
Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the
set of training examples.
First, determine the amount of information gained by selecting the attribute outlook to be the
root of the decision tree. This attribute divides the training examples into three subsets:
sunny, with five examples, two of which belong to the class Play; overcast, with four
examples, all of which belong to the class Play; and rain, with five examples, three of
which belong to the class Play. To determine info(T), the average information needed to
identify the class of an example in T: there are 14 training examples and two decision classes;
nine of these examples belong to the class Play, and five belong to the class Don't Play.

info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.940 bits
When using outlook to divide the training examples, the information becomes:

info_outlook(T) = 5/14 (-2/5 log2 (2/5) - 3/5 log2 (3/5))
+ 4/14 (-4/4 log2 (4/4) - 0/4 log2 (0/4))
+ 5/14 (-3/5 log2 (3/5) - 2/5 log2 (2/5)) = 0.694 bits

By substituting in equation 2-4, the gain of information resulting from using the attribute
outlook to split the training examples equals 0.246. The gain of information for windy is
0.048.
Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split
information for outlook is determined as follows:

split info(T) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits

The gain ratio for outlook = 0.246 / 1.577 = 0.156.
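The arithmetic of this worked example can be checked with a short sketch (mine, not Quinlan's code); only the per-subset class counts quoted in the text are used, not the full Table 2-2:

```python
# Sketch reproducing the worked example's arithmetic (equations 2-2 to 2-6).
from math import log2

def entropy(counts):
    """info(S) of eq. 2-2, computed from the class frequency counts of a set S."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 14 training cases: 9 "Play", 5 "Don't Play"
info_T = entropy([9, 5])

# Partition by outlook: sunny (2 Play / 3 Don't), overcast (4/0), rain (3/2)
subsets = [[2, 3], [4, 0], [3, 2]]
info_outlook = sum(sum(s) / 14 * entropy(s) for s in subsets)  # eq. 2-3
gain_outlook = info_T - info_outlook                           # eq. 2-4
split_info = entropy([5, 4, 5])                                # eq. 2-5
gain_ratio = gain_outlook / split_info                         # eq. 2-6

print(info_T, info_outlook, gain_outlook, split_info, gain_ratio)
# compare with the text: 0.940, 0.694, 0.246, 1.577, 0.156 (up to rounding)
```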
Figure 2-3 A decision tree learned using the gain criterion for selecting attributes
The C4.5 system handles discrete values as well as continuous values. To handle an attribute
with continuous values, C4.5 uses a threshold to transform the continuous domain into two
intervals. In other words, for each continuous attribute, C4.5 generates two branches: one
where the value of that attribute is greater than the determined threshold, and the other where the
value is less than or equal to the threshold.
Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by
leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees.
This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is
the number of misclassified examples at a given leaf.
2.2.2 Building Decision Trees Using Statistics-based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a)
in building decision trees. The method uses Chi-square statistics to measure the association
between two attributes. When building decision trees, the method is implemented such that it
determines the association between each attribute and the decision classes. The attribute to be
selected is the one with the greatest value.
To determine the Chi-square value for an attribute, consider aij to be the number of examples in
class number i where the attribute A takes value number j; in other words, aij is the frequency
of the combination of decision class number i and attribute value number j. The Chi-square
value for attribute A is given by:

Chi-square (A) = Σi Σj [ (aij - Eij)^2 / Eij ], summed over i = 1, ..., n and j = 1, ..., m (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,

Eij = (TCi × TVj) / T (2-8)

where TCi and TVj are the total number of examples belonging to the decision class Ci and the total
number of examples where the attribute A takes value vj, respectively, and T is the total number of
examples.
Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequency of different
combinations of values between the decision class and both the outlook and the windy
attributes. Table 2-4 shows the expected values, computed from TCi and TVj, of the frequencies in Table 2-3
of different attribute values for different decision classes.
To determine the association value between the decision classes and both the attribute windy
and the attribute outlook, the observed Chi-square values are:

Chi-square (Windy, Class) = [(3-3.9)^2/3.9] + [(3-2.1)^2/2.1] + [(6-5.1)^2/5.1] + [(2-2.9)^2/2.9]
= 0.21 + 0.39 + 0.25 + 0.25 = 1.1

Chi-square (Outlook, Class) = [(2-3.2)^2/3.2] + [(4-2.6)^2/2.6] + [(3-3.2)^2/3.2] + [(3-1.8)^2/1.8]
+ [(0-1.4)^2/1.4] + [(2-1.8)^2/1.8] = 0.45 + 0.75 + 0.2 + 0.8 + 1.4 + 0.02 = 3.62
Applying the same method to the other attributes, the results will favor the attribute outlook.
Once that attribute is selected to be a node in the tree, the remaining set of examples is divided
into subsets, and the same process is repeated on each subset.
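The selection step of equations 2-7 and 2-8 can be sketched as follows (my own sketch; the expected frequencies are computed exactly from the marginals, so the scores differ slightly from the figures in the text, which round each Eij to one decimal place):

```python
# Sketch of Chi-square attribute selection (eqs. 2-7 and 2-8), using the
# class counts quoted in the text.

def chi_square(table):
    """table[i][j]: count of class i with attribute value j (the a_ij of eq. 2-7)."""
    row = [sum(r) for r in table]            # class totals, T_Ci
    col = [sum(c) for c in zip(*table)]      # value totals, T_Vj
    total = sum(row)                         # T
    score = 0.0
    for i, r in enumerate(table):
        for j, a in enumerate(r):
            e = row[i] * col[j] / total      # expected frequency E_ij (eq. 2-8)
            score += (a - e) ** 2 / e
    return score

# Rows: Play, Don't Play; columns: attribute values.
windy = [[3, 6], [3, 2]]            # windy = true / false
outlook = [[2, 4, 3], [3, 0, 2]]    # sunny / overcast / rain

print(chi_square(windy), chi_square(outlook))
# outlook scores higher, so it is selected, as in the text
```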
Table 2-5 shows a summary of these criteria and their basic evaluation functions.
Table 2-5 Attribute selection criteria and their basic evaluation measures

Info Measure (IM), Gain, G-statistic, and Gain Ratio:
Entropy(S) = - Σ (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)
G-statistic = 2N × IM (N = number of examples)
Chi-square (A, B) = Σi Σj [ (aij - Eij)^2 / Eij ]
2.2.3 Analysis of Attribute Selection Criteria
This subsection briefly introduces the analysis of different selection criteria done by
Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision
tree programs: the Information Measure (IM), Chi-square, the G-statistic, the Gini
index of diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain
Ratio criterion had the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples
(i.e., examples may belong to more than one decision class) to observe how the selected criteria
evaluate the given attributes. The problem has two decision classes and two attributes, X and
Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The
training examples were unevenly spread between the two values of X. Attribute Y has three
values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the
contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split
provided by the six criteria. Mingers noted that the measures that are not based on information
theory give radiation (attribute X here) less weight. This may be because the zero in the first
row of radiation has a greater influence in the log calculation. In the case of using the
Chi-square criterion, the value zero adds the maximum association between any two attributes,
because the Chi-square value of a zero cell is the expected value of this cell.
Now let us examine results from another experiment done by Mingers. In this experiment,
Mingers used four different data sets to generate decision trees with eleven different criteria. In
the final results, he compared the total number of nodes and the total error rate produced by
each criterion over all given problems. Table 2-8 shows the final results for five selected
criteria only.
Table 2-8: Results comparing the total accuracy and size of decision trees for different attribute selection criteria on four data sets
This experiment was performed on four real-world data sets. These data concern profiles of
BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and
recognizing LCD display digits. The data was divided randomly, 70% for training
and 30% for testing. For more details, see Mingers (1989a).
2.3 Learning Decision Structures
Considering the proposed definition of decision structures given above, two related lines of
research are described in this section. Gaines (1994) and Kohavi (1994) proposed two
approaches for generating decision structures that share some of the earlier ideas of Imam and
Michalski (1993b).
In the first approach, Brian Gaines introduces a method for transforming decision rules or
decision trees into exceptional decision structures. The method builds an Exception Directed
Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning
either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion
to the root node, it places it on a temporary conclusion list. It then generates a new child node
and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the
conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if
the rule R1 does not satisfy the conclusion C0, it replaces the temporary conclusion with the new
conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules
that have common conditions with the rule at the root are evaluated. The method then creates a
new child node from the root and repeats the process until all rules are evaluated. In the decision
structure, nodes containing only rules represent conditions common to all of their children.

The main disadvantage of this approach is that it requires discriminant rules to build such a
decision structure. Also, such a structure is more complex than the traditional decision trees
that are used for decision-making.
Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure.
The decision structure can be read as follows: it is Safe, except if x1=1 & x2=1 & x3=1 and
either x4=3 or x5=1, then it is Lost, except if x6=1 it is Safe, except if x7=1 it is Lost.
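This reading can be mimicked by a small interpreter over a nested exception structure. The encoding below is hypothetical (the node layout and field names are mine, not Gaines's format), and it covers only the x4=3 branch of the example:

```python
def evaluate_edag(node, example):
    # Take the node's conclusion, but let any exception child whose condition
    # matches the example override it (recursively, for nested exceptions).
    conclusion = node.get("conclusion")
    for child in node.get("exceptions", []):
        if all(example.get(attr) == val for attr, val in child["condition"].items()):
            deeper = evaluate_edag(child, example)
            if deeper is not None:
                conclusion = deeper
    return conclusion

# Hypothetical encoding of: "Safe, except if x1=1 & x2=1 & x3=1 and x4=3 then
# Lost, except if x6=1 it is Safe, except if x7=1 it is Lost."
edag = {"conclusion": "Safe", "exceptions": [
    {"condition": {"x1": 1, "x2": 1, "x3": 1, "x4": 3}, "conclusion": "Lost",
     "exceptions": [{"condition": {"x6": 1}, "conclusion": "Safe",
                     "exceptions": [{"condition": {"x7": 1},
                                     "conclusion": "Lost"}]}]}]}
```

Each level of nesting corresponds to one "except if" clause in the verbal reading.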
The second approach, introduced by Ronny Kohavi, learns decision structures from examples
using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing
Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a
decision graph where each attribute occurs at most once along any computational path. In other
words, on each path from the root to the leaves of the decision structure, an attribute may occur
as a node at most once. However, there may be more than one node with the same attribute in
the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are
partitioned into a sequence of pairwise disjoint sets, such that outgoing edges from each level
terminate at the next level. An oblivious decision graph is a decision graph where all nodes at
a given level are labeled by the same attribute.
Safe <:: [x1=2]
Safe <:: [x2=2]
Safe <:: [x3=2]
Safe <:: [x4=1] & [x5=2]
Safe <:: [x4=1] & [x5=3]
Safe <:: [x6=1] & [x7=2]
Safe <:: [x6=1] & [x7=3]
Safe <:: [x4=2] & [x5=2]
Safe <:: [x4=2] & [x5=3]
Lost <:: [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a
nondeterministic method to select an attribute to build a new level of the decision structure. This
attribute is removed from the data, and the data is divided into subsets, each corresponding to a
combination of that attribute's values. For each subset the process is repeated until all
the examples of a given subset belong to one decision class. For example, suppose the selected
attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The
data is divided into two subsets: the first subset contains examples where A takes value 0 and
belong to class C0, or takes value 1 and belong to class C1. The second subset is the set of
examples where A takes value 0 and belong to class C1, or takes value 1 and belong to class C0.
The number of nodes at the first level (after the leaf nodes) is expected to be less than or
equal to k^n, where k is the number of decision classes and n is the number of values of the
selected attribute, and it can increase exponentially before that number is reduced
exponentially to one.
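The two-class, binary-attribute split just described can be sketched as follows. This is a simplified illustration of the bottom-up step, not Kohavi's implementation; the function and class names are mine:

```python
def split_by_pairing(examples, attr):
    # Two decision classes (C0, C1) and a binary attribute (values 0, 1):
    # the "straight" subset pairs value 0 with class C0 and value 1 with C1;
    # the "crossed" subset holds the opposite value/class pairing.
    straight, crossed = [], []
    for attrs, cls in examples:
        if (attrs[attr], cls) in ((0, "C0"), (1, "C1")):
            straight.append((attrs, cls))
        else:
            crossed.append((attrs, cls))
    return straight, crossed

examples = [({"A": 0, "B": 0}, "C0"), ({"A": 1, "B": 0}, "C1"),
            ({"A": 0, "B": 1}, "C1"), ({"A": 1, "B": 1}, "C0")]
straight, crossed = split_by_pairing(examples, "A")
```

With k classes and n attribute values there are k^n such value-to-class pairings, which is why the level width can blow up before it collapses.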
The reader can easily identify some major disadvantages of this approach.
The average size of such decision structures is estimated to be very large, especially when there
is no similarity (i.e., strong patterns) or logical relationship in the data. The time needed to learn
such a decision structure is relatively high compared to systems for learning decision trees
from examples. Finally, it could be better to search for an attribute which reduces the number
of generated subsets of the data, instead of nondeterministically selecting an attribute to build a
new level of the decision structure. Kohavi provided a comparison between the C4.5 system
and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a
deterministic method for attribute selection which minimizes the width of the penultimate level
of the graph.
Table 2-9 shows a comparison between the proposed approach and these two approaches. The
EDAG and HOODG systems are unreleased prototype systems.
Table 2-9 (surviving row): AQDT: decision structures are easy to understand. EDAG: decision structures are difficult to read. HOODG: decision structures are easy to understand.
CHAPTER 3 DESCRIPTION OF THE APPROACH
3.1 General Methodology
In the proposed approach, the function of learning or discovery is separated from the function of
using the discovered knowledge for decision-making. The first function is performed by an
inductive learning program that searches for knowledge relevant to a given class of decisions
and stores the learned knowledge in the form of decision rules. The second function is performed
when there is a need for assigning a decision to new data points in the database (e.g., a
classification decision), by a program that transforms the obtained knowledge into a decision
structure optimized according to the given decision-making situation.
The Learning Task
Given:
- A set of training examples describing the concept to be learned
- A learning goal, which specifies the decision classes to be learned from the training examples
- Background knowledge to control the learning process
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal

The Decision-making Task
Given:
- A set of decision rules in conjunctive form
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.)
- One or more examples that need to be tested under the given decision-making situation
- A set of parameters to control the learning process
Determine:
- A decision structure that suits the given decision-making situation
The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17
(Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of
declarative knowledge are that they do not impose any order on the evaluation of the attributes,
and, due to the lack of order constraints, decision rules can be evaluated in many different
ways, which increases the flexibility of adapting them to different decision-making tasks
(Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller
than the number of examples per class, generating a decision structure from decision rules can
potentially be done on-line.

Such virtual decision structures can be tailored to any given decision-making situation. The
needed decision rules have to be generated only once, and then they can be used many times for
generating decision structures according to the changing requirements of decision-making tasks. The
method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures
from decision rules. Decision structures represent a procedural form of knowledge, which makes
them easy to implement but also harder to change. Consequently, decision structures can be quite
effective and useful as long as they are used in decision-making situations for which they are
optimized, and the attributes specified by the decision structure can be measured without much
cost. Figure 3-1 shows an architecture of the proposed methodology.
(Figure content: the data flows from the database through two components: learning knowledge from the database, and the decision-making process, which produces the decision.)
Figure 3-1: Architecture of the AQDT approach
It is assumed that the database is not static but is regularly updated. A decision-making problem
arises when there is a case, or a set of cases, to which the system has to assign a decision based on
the knowledge discovered. Each decision-making situation is defined by a set of attribute-values.
Some attribute-values may be missing or unknown. A new decision structure is obtained such
that it suits the given decision-making problem. The learned decision structure associates the
new set of cases with the proper decisions.
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs

The decision rules are generated from examples by an AQ-type inductive learning system,
specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of
this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions, using
the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology,
called AQ, starts with a seed example of a given decision class and generates a set of the
most general conjunctive descriptions of the seed (alternative decision rules for the seed
example). Such a set is called the star of the seed example. The algorithm selects from the star
a description that optimizes a criterion reflecting the needs of the problem domain. If the
criterion is not defined, the program uses a default criterion that selects the description that
covers the largest number of positive examples (to minimize the total number of rules needed)
and, with second priority, that involves the smallest number of attributes (to minimize the
number of attributes needed for arriving at a decision).

If the selected description does not cover all examples of a given decision class, a new seed is
selected from the uncovered examples, and the process continues until a complete class description
is generated. The algorithm can work with few examples or with many examples, and can
optimize the description according to a variety of easily-modifiable hypothesis quality criteria.
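A toy rendition of this covering loop is given below. It is only a sketch: real AQ stars contain maximally general conjunctive complexes produced by extending the seed against the negative examples, whereas here the candidate descriptions are restricted to single attribute-value conditions of the seed:

```python
def aq_cover(positives, negatives, attributes):
    # Repeatedly pick a seed from the uncovered positives, build a "star" of
    # one-condition candidate rules consistent with (covering none of) the
    # negatives, keep the candidate covering the most positives, and iterate.
    uncovered = list(positives)
    learned = []
    while uncovered:
        seed = uncovered[0]
        star = [(a, seed[a]) for a in attributes
                if all(neg.get(a) != seed[a] for neg in negatives)]
        if star:
            best = max(star, key=lambda c: sum(1 for p in uncovered
                                               if p[c[0]] == c[1]))
            rule = [best]
        else:
            # fall back to the fully specific description of the seed itself
            rule = [(a, seed[a]) for a in attributes]
        learned.append(rule)
        uncovered = [p for p in uncovered
                     if not all(p[a] == v for a, v in rule)]
    return learned

rules = aq_cover([{"x": 1, "y": 0}, {"x": 1, "y": 1}],
                 [{"x": 0, "y": 0}], ["x", "y"])
```

Here the single rule [x=1] covers both positive examples, so one iteration suffices.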
The learned descriptions are represented as a set of decision rules expressed in an
attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A
distinctive feature of this representation is that it employs, in addition to standard logic operators,
the internal disjunction operator (a disjunction of values of the same attribute in a condition) and
the range operator (to express conditions involving a range of discrete or continuous values).
These operators help to simplify rules involving multivalued discrete attributes; the second
operator is also used for creating logical expressions involving continuous attributes.

AQ15 can generate decision rules that represent either characteristic or discriminant concept
descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic
description states properties that are true for all objects in the concept. The simplest
characteristic concept description is in the form of a single conjunctive rule (in general, it can be
a set of such rules). The most desirable is the maximal characteristic description, that is, a rule
with the longest condition part, i.e., stating as many common properties of objects of the given
class as can be determined. A discriminant description states properties that discriminate a given
concept from a fixed set of other concepts. The most desirable is the minimal discriminant
description, that is, a rule with the shortest condition part. For example, to distinguish a given
set of tables from a set of chairs, one may only need to indicate that tables have a large flat top.
A characteristic description of the tables would also include properties such as "have four legs,"
"have no back," "have four corners," etc. Discriminant descriptions are usually much shorter than
characteristic descriptions.
Another option provided in AQ15 controls the relationship among the generated descriptions
(rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode,
rulesets of different classes may logically intersect over areas of the description space in which
there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different
classes are logically disjoint. The DC mode descriptions are usually more complex, both in the
number of rules and the number of conditions. There is also a DL mode (a Decision List mode,
also called VL mode, for variable-valued logic mode), in which the program generates rulesets
that are linearly ordered. To assign a decision to an example using such rulesets, the program
evaluates them in order. If ruleset i is satisfied by the example, then the decision is made;
otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes,
rulesets can be evaluated in any order.
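The DL-mode evaluation order can be sketched as a classic decision-list loop. The rule encoding is mine (each rule maps a tested attribute to the set of values allowed by its internally disjoined condition), not AQ15's file format:

```python
def classify(rulesets, example, default=None):
    # DL/VL mode: rulesets are linearly ordered; return the class of the first
    # ruleset containing a rule whose conditions the example satisfies.
    for decision_class, rules in rulesets:
        for rule in rules:
            if all(example.get(attr) in allowed for attr, allowed in rule.items()):
                return decision_class
    return default

# Hypothetical ordered rulesets; {1, 2} encodes the internal disjunction x=1 v 2.
ordered = [("C1", [{"x": {1, 2}, "y": {0}}]),
           ("C2", [{"x": {3}}])]
```

In IC or DC mode the same loop would give identical answers regardless of the order of the rulesets; in DL mode the order is part of the semantics.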
Alternatively, the system can use rules from the AQ17-DCI program for Data-driven
Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for
generating additional attributes. These attributes are various logical or mathematical
combinations of the original attributes. The program generates a large number of potential new
attributes and selects from them the most promising ones, based on an attribute quality criterion.
To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is
shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form
expression) describes the voting record of Democratic Representatives in the US Congress.
Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational
statement. For example, the condition [State = northeast v northwest] states that the attribute
State (of the Representative) should take the value northeast or northwest to satisfy the
condition.
R1: [Gas_con_ban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"
The above rules were generated from examples of the voting records. For illustration, below is
an example of a voting record of a Democratic representative:

Draft registration=no, Ban of aid to Nicaragua=no, Cut expenditure on MX missiles=yes, Federal subsidy to nuclear power stations=yes, Subsidy to national parks in Alaska=yes, Fair housing bill=yes, Limit on PAC contributions=yes, Limit on food stamp program=no, Federal help to education=no, State=northeast, State population=large, Occupation=unknown, Cut in social security spending=no, Federal help to Chrysler corp=not registered
By expressing elementary statements in the example as conditions and linking the conditions by
conjunction, examples can be re-expressed as decision rules. Thus, decision rules and
examples formally differ only in their degree of generality.
3.3 Generating Decision Structures/Trees from Decision Rules

This section describes the AQDT-1 system for learning decision structures (trees) from decision
rules (Imam & Michalski, 1993a, b). A description of the AQDT-2 method for learning task-oriented
decision structures from decision rules is also included, and finally the methodology is
illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning,
due to their simplicity. Decision trees built this way can be quite efficient, as long as they are
used in decision-making situations for which they are optimized, and these situations remain
relatively stable. Problems arise when these situations change significantly and the assumptions
under which the tree was built no longer hold. For example, in some situations it may be
difficult to determine the value of the attribute assigned to some node. One would like to avoid
measuring this attribute and still be able to classify the example, if this is potentially possible
(Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure
the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also
desirable if there is a significant change in the frequency of occurrence of examples from
different classes. Restructuring a decision tree to suit the above requirements is, however,
difficult to do. The reason is that a decision tree is a form of decision structure
representation that imposes constraints on the evaluation order of the attributes that are not
logically necessary.
One problem in developing a method for generating decision structures from decision rules is to
design an attribute selection criterion that is based on the properties of the rules, rather than of
the training examples. A decision rule normally describes a number of possible examples; only
some of them are examples that have actually been observed, i.e., training examples. An attribute
selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based
on counting the numbers of training examples covered by each attribute-value and the
frequency of decision classes in the training examples, as is done in learning decision trees from
examples, because the training examples are assumed to be unavailable.

Another problem in learning decision trees from decision rules stems from the fact that decision
rules constitute a more powerful knowledge representation than decision trees. They can directly
represent a description in an arbitrary disjunctive normal form, while decision trees can directly
represent only descriptions in disjoint disjunctive normal form, in which all
conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary
decision rules into a decision tree, one faces the additional problem of handling logically
intersecting rules.
The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2
system is based on the earlier work by Michalski (1978), which introduced a general method for
generating decision trees from decision rules. The method aimed at producing decision trees
with the minimum number of nodes or the minimum cost (where the cost was defined as the
total cost of classifying unknown examples, given the cost of measuring individual attributes and
the expected probability distribution of examples of different decision classes). More
explanations are provided in the following section.
3.3.1 The AQDT-2 attribute selection method
This section describes the AQDT-2 method for building a decision structure from decision rules.
The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples. The major difference is that it assigns tests
(attributes) to the nodes using criteria based on the properties of the decision rules (including
statistics about the examples covered by each rule, when the rules were learned from examples),
rather than statistics characterizing the frequency of training examples per decision class, per
attribute-value, or per conjunctions of both. Other differences are that the branches may be
assigned an internal disjunction of values (not only a single value, as in a typical decision tree),
and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be
attributes, or names standing for logical or mathematical expressions that involve several
attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably
(to distinguish between an attribute and a name standing for an expression, the latter is called a
constructed attribute).
At each step, the method chooses, from the available set of tests, the test that has the highest utility
(see below) for the given set of decision rules. This test is assigned to the current node. The branches
stemming from this node are assigned test values or disjoint groups of values (in the form of a
logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each
branch is associated with a reduced set of rules, determined by removing conditions in which the
selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset
indicate the same decision class, a leaf node is created and assigned this decision class. The
process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further,
because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of
candidate decisions with associated probabilities (see Sec. 4.2).
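The expansion loop just described can be skeletonized as follows. This is a simplified sketch, not the AQDT-2 implementation: rules are (class, condition) pairs whose conditions map attributes to sets of allowed values, the test-selection function (which would embody the criteria defined below) is supplied by the caller, and value grouping and probabilistic leaves are omitted:

```python
def reduce_rules(rules, attr, value):
    # Reduce a ruleset for the branch attr=value: keep rules whose condition on
    # attr admits the value (or that do not test attr), dropping that condition.
    reduced = []
    for cls, conds in rules:
        if attr not in conds:
            reduced.append((cls, conds))
        elif value in conds[attr]:
            rest = {a: v for a, v in conds.items() if a != attr}
            reduced.append((cls, rest))
    return reduced

def build_structure(rules, choose_test):
    # Recursive skeleton: stop with a leaf when one class remains,
    # otherwise branch on the test chosen by the caller's criterion.
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:
        return classes.pop()
    attr, values = choose_test(rules)
    return (attr, {v: build_structure(reduce_rules(rules, attr, v), choose_test)
                   for v in values})

rules = [("C0", {"x": {0}}), ("C1", {"x": {1}})]
tree = build_structure(rules, lambda rs: ("x", [0, 1]))
```

A caller would plug the LEF-based ranking in as `choose_test`.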
The test (attribute) utility is a combination of one or more of the following elementary criteria: 1)
cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which
captures the effectiveness of the test in discriminating among decision rules for different decision
classes; 3) importance, which determines the importance of a test in the rules; 4) value
distribution, which characterizes the distribution of the test importance over its set of values; and 5)
dominance, which measures the test's presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.
Disjointness: The disjointness of a test is defined as the sum of the class disjointnesses, i.e., the
disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and
decision rulesets for these classes have been determined. Given a test A, let V1, V2, ..., Vm denote the
sets of the values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm,
respectively. If the ruleset for some class, say Ck, contains a rule that does not involve test A, then
Vk is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is
the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for
Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is
defined by:

                   0,  if Vi = Vj
   D(A, Ci, Cj) =  1,  if Vi ⊂ Vj or Vi ⊃ Vj                          (3-1)
                   2,  if Vi ∩ Vj ≠ φ, and ≠ Vi, and ≠ Vj
                   3,  if Vi ∩ Vj = φ
where φ denotes the empty set. Note that exchanging the second and third conditions of equation
(3-1) may seem to give an improved criterion. However, it would not clearly distinguish between the
two cases (i.e., in both situations the disjointness would be similar). The current equation is better
because it gives higher scores to attributes that classify different subsets of the two decision
classes than to attributes that classify only a subset of one decision class.
Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the
sum of the degrees of class disjointness of each decision class:

   Disjointness(A) = Σ (i=1..m) D(A, Ci),  where  D(A, Ci) = Σ (j=1..m, j≠i) D(A, Ci, Cj)     (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are
all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test
values. If two tests have the same disjointness value, the attribute with the smaller number of
values is selected.
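Equations (3-1) and (3-2) translate directly into code. In this sketch, value_sets[i] is the set Vi of values of test A appearing in the ruleset of class Ci:

```python
def degree(vi, vj):
    # Degree of disjointness between the value sets of a test in two rulesets (Eq. 3-1).
    if vi == vj:
        return 0
    if vi < vj or vi > vj:   # proper subset or superset
        return 1
    if vi & vj:              # overlapping, but neither is a subset
        return 2
    return 3                 # disjoint value sets

def disjointness(value_sets):
    # Disjointness(A): sum of D(A, Ci, Cj) over all ordered class pairs (Eq. 3-2).
    m = len(value_sets)
    return sum(degree(value_sets[i], value_sets[j])
               for i in range(m) for j in range(m) if i != j)
```

With m = 2 classes the maximum is 3m(m-1) = 6, reached when the two value sets are disjoint.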
Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree
is defined as the average number of tests (attributes) to be examined, from the root of the tree to
any leaf node, in order to reach a decision.

Definition: A decision structure is a one-node-per-level decision structure if at each level there is
only one node and zero or more leaves.

Such a decision structure can be generated by combining together all branches whose associated
sets of decision rules belong to more than one decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two
decision classes. The disjointness criterion ranks first the attributes that add the minimum number of
tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci
and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not a
subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the
same set of values in both classes is a trivial one). Assume that branches leading to a subset
with the same decision class are combined into one branch. In the first case, there will be only two
branches. The first leads to a leaf node, and the other leads to an intermediate node where
another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three
branches should be created. Two branches lead to leaf nodes, where all values at each branch
belong to only one (and a different) decision class. The third branch leads to an intermediate node
where another attribute should be selected to further classify the decision classes. The minimum
ANT in this case is 6/4. In the third case, only two branches will be generated, where each leads
to a leaf node with a different decision class. In this case, the minimum ANT is 1.
Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that if
more than one attribute-value on the branches leads to leaves belonging to one decision
class, they will be combined into one branch in the decision structure. The symbol "1" means
that another attribute is needed to classify the two decision classes. In such cases there will be at least
two additional paths.
D(A, Ci) = 0, D(A, Cj) = 1    D(A, Ci) = 2, D(A, Cj) = 2    D(A, Ci) = 3, D(A, Cj) = 3
Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is determined
in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks
highly the attributes that reduce the average number of tests required for decision-making. The
theorem can be proved similarly in the general case.

(Figure content: the three decision trees, with ANT = 3/2, ANT = 5/3, and ANT = 1.)
"1" means at least one attribute is needed to complete the decision tree.
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute for classifying
the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes,
A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship
between any two decision classes, this means that there are more decision classes where D(A,
Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness
for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj)
than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci,
Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute
are 0, 2, 4, or 6. For all positive values of D(B) = 2, 4, or 6, it is clear that attribute B should have a
smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) <
D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.
Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is
a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the
importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained
rules, each test is assigned a score that represents the total number of training examples that
are covered by the rules involving this test. Decision rules learned by an AQ learning program
are accompanied by information on their strength. Rule strength is characterized by the t-weight
and the u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of
that class covered by the rule. The importance score of a test is the aggregation of the total-weights
of all rules that contain that test in their condition part. Given a set of decision rules for
m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated
with class Ci denoted by ri, the importance score is defined as follows:
Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

   IS(Aj) = Σ (i=1..m) IS(Aj, Ci)                                     (3-3.1)

   where IS(Aj, Ci) = Σ (k=1..ri) Rik(Aj)                             (3-3.2)

and Rik(Aj), the weight of the test Aj in rule Rik of class Ci, is given by:

   Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik;              (3-4)
             0, otherwise

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
The importance score method has been separately compared, as a feature selection method, with
a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method
produced equal or higher accuracy on three real-world problems than that reported for the
GA method, while selecting fewer attributes. In addition, the IS method was significantly faster
than the GA.
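Definition 3-3 amounts to a simple aggregation of t-weights. In this sketch each rule is given as the set of tests appearing in its condition part together with its t-weight (an invented encoding, chosen only for brevity):

```python
def importance_scores(rules):
    # IS(Aj): sum of the t-weights of all rules whose condition part mentions Aj.
    scores = {}
    for tests_in_rule, t_weight in rules:
        for test in tests_in_rule:
            scores[test] = scores.get(test, 0) + t_weight
    return scores

# Hypothetical rules: the first covers 5 training examples, the second covers 3.
scores = importance_scores([({"x1", "x2"}, 5), ({"x1"}, 3)])
```

Here x1 appears in both rules, so IS(x1) = 5 + 3 = 8, while IS(x2) = 5.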
Value distribution: The third elementary criterion, value distribution, concerns the number of
legal values of tests. Given two tests with equal importance scores, this criterion prefers the test
with the smaller number of legal values. Experiments have shown that this criterion is especially
useful when using discriminant decision rules.

Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by:

   VD(Aj) = IS(Aj) / vj                                               (3-5)

where vj is the number of legal values of Aj.
Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large numbers of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore, for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3]&[x4=1] is multiplied out to two rules with condition parts [x3=1]&[x4=1] and [x3=3]&[x4=1].
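The multiplying-out step can be sketched as below; this is illustrative code under the same assumed rule representation as before, not the actual AQDT-2 routine.

```python
# Illustrative sketch of the dominance count: internal disjunctions are
# multiplied out before counting rules that contain a given attribute.
from itertools import product

def multiply_out(conditions):
    """Expand a condition part with internal disjunction into the list of
       disjunction-free condition parts. E.g. [x3=1 v 3]&[x4=1] yields
       [x3=1]&[x4=1] and [x3=3]&[x4=1]."""
    attrs = sorted(conditions)
    return [dict(zip(attrs, combo))
            for combo in product(*(sorted(conditions[a]) for a in attrs))]

def dominance(rules, attr):
    """Number of multiplied-out rules whose condition part contains attr."""
    return sum(len(multiply_out(r["conditions"]))
               for r in rules if attr in r["conditions"])

rule = {"class": "T1", "t_weight": 1, "conditions": {"x3": {1, 3}, "x4": {1}}}
print(multiply_out(rule["conditions"]))
# [{'x3': 1, 'x4': 1}, {'x3': 3, 'x4': 1}]
print(dominance([rule], "x3"))  # 2
```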
The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed in percent. The criteria are applied to tests in the order defined by LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (from the top value). The default LEF is

    <Cost, t1; Disjointness, t2; Importance, t3; Value distr., t4; Dominance, t5>    (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percent); their default values are 0. The default value of the cost of each test is 1.
The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the next (importance) criterion. If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the third criterion, the value distribution (normalized IS), is used, and then, similarly, the fourth criterion (dominance). If there is still a tie, the method selects the best attribute randomly.
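The lexicographic filtering with tolerances can be sketched as follows. This is one plausible reading of LEF for illustration, not AQDT-2's exact code; criterion functions and the example scores are assumptions.

```python
# Illustrative sketch of LEF selection: each criterion is a
# (score_function, tolerance_percent, maximize_flag) triple.

def lef_select(candidates, criteria):
    """At each step keep only candidates whose score is within tolerance%
       of the best score, then pass the survivors to the next criterion."""
    survivors = list(candidates)
    for score_fn, tol, maximize in criteria:
        scores = {c: score_fn(c) for c in survivors}
        best = max(scores.values()) if maximize else min(scores.values())
        margin = abs(best) * tol / 100.0
        if maximize:
            survivors = [c for c in survivors if scores[c] >= best - margin]
        else:
            survivors = [c for c in survivors if scores[c] <= best + margin]
        if len(survivors) == 1:
            break
    return survivors[0]  # ties beyond the last criterion: pick arbitrarily

# Hypothetical scores for three attributes:
cost = {"x1": 1, "x2": 1, "x3": 1}
disjointness = {"x1": 11, "x2": 9, "x3": 11}
importance = {"x1": 6, "x2": 5, "x3": 8}
best = lef_select(["x1", "x2", "x3"],
                  [(cost.get, 0, False),         # minimize cost
                   (disjointness.get, 0, True),  # maximize disjointness
                   (importance.get, 0, True)])   # maximize importance
print(best)  # x3: ties x1 on cost and disjointness, wins on importance
```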
If there is a non-uniform frequency distribution of examples across the decision classes, then the selection criterion uses a modified definition of the disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified into a given class:

    Disjointness(A) = Σ_{i=1..m} D(A, Ci) · Frq(Ci)    (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, and it is assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF

    <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.
3.3.2 The AQDT-2 algorithm
The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting, at each step, the best test according to the ranking criteria described above and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is also connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, the number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first, in descending order of the number of rules that contain the attribute, and second, in ascending order of the number of the attribute's legal values.
The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.
To generate decision structures from rules, the AQDT-2 method prefers either characteristic or discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The AQDT algorithm is:

The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent this attribute.
Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it the group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing a condition [A = i v ...]. If a branch is assigned values i v j, then associate with it all rules containing a condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with a given branch constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree end in leaf nodes, stop. Otherwise, repeat steps 1 to 4 for each branch that has no leaf.
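Steps 1-4 can be sketched compactly as below. This is an illustrative sketch, not the AQDT-2 implementation: the rule format is assumed as before, and the attribute selection in Step 1 is a placeholder (most-frequent attribute) standing in for the full LEF ranking.

```python
# Illustrative sketch of Steps 1-4 (standard mode): recursively pick an
# attribute, branch on its values, and distribute the rules to the branches.

def build_tree(rules, domains):
    if not rules:
        return "unknown"                      # no rule applies to this branch
    classes = {r["class"] for r in rules}
    if len(classes) == 1:                     # Step 4: leaf node
        return classes.pop()
    # Step 1 (placeholder for LEF): attribute occurring in most rules
    counts = {}
    for r in rules:
        for a in r["conditions"]:
            counts[a] = counts.get(a, 0) + 1
    attr = max(counts, key=counts.get)
    node = {"attribute": attr, "branches": {}}
    for value in domains[attr]:               # Step 2: one branch per value
        group = []
        for r in rules:                       # Step 3: distribute the rules
            cond = r["conditions"]
            if attr not in cond:              # consensus law: goes to all branches
                group.append(r)
            elif value in cond[attr]:         # condition satisfied: remove it
                rest = {a: v for a, v in cond.items() if a != attr}
                group.append({"class": r["class"], "conditions": rest})
        node["branches"][value] = build_tree(group, domains)
    return node

rules = [
    {"class": "P", "conditions": {"x1": {1}, "x2": {1}}},
    {"class": "P", "conditions": {"x1": {2}, "x2": {2}}},
    {"class": "N", "conditions": {"x1": {1}, "x2": {2}}},
    {"class": "N", "conditions": {"x1": {2}, "x2": {1}}},
]
tree = build_tree(rules, {"x1": [1, 2], "x2": [1, 2]})
print(tree)
```

On this small XOR-like ruleset the sketch yields a two-level tree testing x1 and then x2.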
To select an attribute to be a node in the decision tree (steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all the decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute-values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF. The second iteration evaluates each attribute's disjointness for each decision class against the other decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute-value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

    r = Σ_{i=1..m} Ri    (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as

    Cmpx(Iter1) = O(r · s)
In the second iteration, the disjointness is calculated between the decision classes over all attributes. The complexity of the second iteration can be given by

    Cmpx(Iter2) = O(n · m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

    l = max {m, r}    (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node in the decision tree, the node complexity NC(AQDT), is given by

    NC(AQDT) = O(l · n)

Usually l equals the number of rules associated with the given node. Thus, the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.
At each level of the decision tree, these two iterations are repeated for each non-leaf node. The level complexity of the AQDT algorithm, LC(AQDT), can be given by

    LC(AQDT) < O(l · n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (i.e., ⌊r/2⌋). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm to be (l · s · o), where o is the number of non-leaf nodes at the given level. In such cases, either (l · o ≤ r) or (l · s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by

    LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes (a: per one level; b: per one path)
Note also that after an attribute is selected to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structures of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch are not tested again.

Since the disjointness criterion selects the attribute which minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels per decision tree should be less than or equal to the minimum of the number of attributes and the number of rules. With k denoting the number of levels in a given decision tree,

    k ≤ min {n, r}    (3-10)
Two cases represent the most complex situations, shown in Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels is a function of the logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by

    Complexity(AQDT) = O(l · n · log r)    (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels per decision tree equals one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, it is unlikely to obtain such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In this case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus, the level complexity of this decision tree is estimated as

    LC(AQDT) = O(l · log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT algorithm in such cases is given by

    Complexity(AQDT) = O(l · k · log n)    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT algorithm is determined by

    Cmplx(AQDT) = O(r · k · log l)    (3-13)
3.3.3 An example illustrating the algorithm
The following simple example illustrates how AQDT-2 is used in selecting an optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the selection process

    T1 <= [x1=2] & [x2=2]  v  [x1=3] & [x3=1 v 3] & [x4=1]
    T2 <= [x1=1 v 2] & [x2=3 v 4]  v  [x1=3] & [x3=1 v 2] & [x4=2]
    T3 <= [x1=1] & [x2=1]  v  [x1=4] & [x3=2 v 3] & [x4=3]
Figure 3-6: Decision rules for selecting the best tool for testing software

These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phase, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phase, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phase, and you need a semi-automated tool.
Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.

From the rules in Figure 3-6 we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: 1) determine, for each attribute, the sets of values that the attribute takes in the individual decision rules, and 2) remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1} and {4} (Figure 3-6). Value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned the individual values of the domain of x1. For attribute x2, the value sets are {2}, {3, 4}, {1} and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2} and {3, 4}.
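The subsumption filtering used in compact mode can be sketched as below; this is illustrative code following the x1/x2 example, not the AQDT-2 implementation.

```python
# Illustrative sketch of compact-mode branch grouping: collect the value sets
# an attribute takes in individual rules and drop any set that subsumes
# (is a strict superset of) another set.

def branch_value_sets(value_sets):
    """Remove every set that is a strict superset of some other set; the
       remaining sets label the branches stemming from the node."""
    sets = [frozenset(s) for s in value_sets]
    kept = [s for s in sets if not any(other < s for other in sets)]
    return sorted(set(kept), key=sorted)

# x1's value sets in the individual rules of Figure 3-6:
print(branch_value_sets([{2}, {3}, {1, 2}, {1}, {4}]))
# {1, 2} is dropped (it subsumes {1} and {2}); branches get 1, 2, 3, 4.
# x2's value sets ({1, 2, 3, 4} comes from the rules not mentioning x2):
print(branch_value_sets([{2}, {3, 4}, {1}, {1, 2, 3, 4}]))
# branches get {1}, {2} and {3, 4}.
```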
Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf T3. Rules containing the other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in making decisions on which tools to use for testing a given piece of software.
Figure 3-7: A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)
Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3 and 4). Rules are represented by collections of cells at the intersections of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules. Rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].
Figure 3-8: Diagrammatic visualization (a: decision rules; b: derived decision tree)
Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, in which the type of the tool was removed from the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.
Figure 3-9: Decision trees learned ignoring the supporting metric (a) and the type of the testing tool (b)
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10: A decision tree learned without the cost attribute
3.4 Tailoring Decision Structures to the Decision-making Situation
Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure with the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequencies of the different decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm was implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all the other programs applied to the same data.
Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.
3.4.1 Learning Cost-Dependent Decision Structures
As described in Sec. 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) when developing a decision structure. In the default setting of LEF, the test's cost is the first criterion and its tolerance is 0%. This means that only the least expensive attributes pass to the next step of evaluation, involving the other elementary criteria. If an attribute has a high cost or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute, if possible.
3.4.2 Assigning Decisions Under Insufficient Information
In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution over the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequencies at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have

    P(Ci | b1, ..., bk) = P(Ci) · P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequencies of training examples from the different classes, we have

    P(Ci) = twi / Σ_{j=1..m} twj    (3-10)

    P(b1, ..., bk | Ci) = wi / twi    (3-11)

    P(b1, ..., bk) = Σ_{j=1..m} wj / Σ_{j=1..m} twj    (3-12)

By substituting (3-10), (3-11) and (3-12) into (3-9), we obtain

    P(Ci | b1, ..., bk) = wi / Σ_{j=1..m} wj    (3-13)
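The consequence of (3-13) is that the class distribution at a node reduces to the counts wi of training examples of each class that reached the node. A minimal illustrative helper (not part of AQDT-2 itself):

```python
# Illustrative sketch of estimating P(Ci | b1, ..., bk) per equation (3-13):
# the conditional class probabilities at a node are just the normalized counts
# of training examples of each class that reached the node.

def node_distribution(reached):
    """reached: {class_name: wi}, examples of each class that passed the
       tests on the path to this node. Returns {class_name: probability}."""
    total = sum(reached.values())
    return {c: w / total for c, w in reached.items()}

dist = node_distribution({"T1": 6, "T2": 3, "T3": 1})
print(dist)  # {'T1': 0.6, 'T2': 0.3, 'T3': 0.1}
```

With this estimate, the most probable decision at an incomplete node is simply the class with the largest count wi.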
A related method for handling the problem of the unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.
3.4.3 Coping with noise in training data
The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the idea of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than decision tree pruning methods, because truncation decisions are based solely on the importance of a given rule or condition for the decision-making, regardless of the evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose which attributes to prune). Examples are presented in Section 4.
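The truncation step itself is a simple filter on the rules' t-weights. A minimal sketch under the rule representation assumed earlier (illustrative, with a hypothetical threshold):

```python
# Illustrative sketch of rule truncation for noisy data: drop rules whose
# t-weight (number of training examples covered) falls below a threshold
# reflecting the expected noise level.

def truncate_rules(rules, min_t_weight):
    """Keep only rules covering at least min_t_weight training examples."""
    return [r for r in rules if r["t_weight"] >= min_t_weight]

rules = [
    {"class": "P", "t_weight": 40, "conditions": {"x1": {1}}},
    {"class": "P", "t_weight": 1,  "conditions": {"x2": {3}}},  # likely noise
    {"class": "N", "t_weight": 25, "conditions": {"x1": {2}}},
]
print(len(truncate_rules(rules, 2)))  # 2: the single-example rule is removed
```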
3.5 Analysis of the AQDT-2 Attribute Selection Criteria
This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion ranks that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes, and contains ambiguous examples (i.e., examples belonging to more than one decision class).

For the first problem, the following are the disjoint rules learned by AQ15c from the given data:

    Play <= [outlook = overcast]
    Play <= [outlook = sunny] & [humidity ≤ 75]
    Play <= [outlook = rain] & [windy = false]
For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute for the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how each criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y); however, the gain ratio criterion gave X a higher score than all the other criteria did.

Table 3-5 shows the performance of the AQDT-2 attribute selection criteria on Mingers' first problem. The criteria were tested when applied both to the examples and to the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered examples that belong to C only.
The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when applied both to the original examples and to the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion performs well when evaluating the training examples directly. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria are applied to the learned rules, they provide very good results.
The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute with the most balanced appearance of its values across different rules. The dominance criterion prefers attributes that occur in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
3.6 Decision Structures vs. Decision Trees
This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.
Table 3-7: A comparison between decision structures and decision trees
Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11; both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes. This decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.
Figure 3-11: Decision structures learned by AQDT-2 using different criteria (a: using the disjointness criterion, P = Positive, N = Negative, 5 nodes; b: using the importance score criterion, 7 nodes, 9 leaves)
To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the Imam's example, that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example rests on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.

The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class P and "-" means the example belongs to class N. Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.
~ ~
-~ t-) r- shy~
-I-shy
t-) t-)
I-shy~
1~_1_1~~_1__3~__~2__~1 11 a) Training examples b) The optimal decision tree
Figure 3-12: The Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples
AQ15c learned the following rules from this data:
P <= [x1=1][x2=1] v [x1=2][x2=2]      N <= [x1=1][x2=2] v [x1=2][x2=1]
From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
An example of a problem in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a; Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:
P <= [x1=2] v [x2=2]      N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]
Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10/9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and .85 for the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).
When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined using the new attribute "x1=2 v x2=2", with values 0 for "no" and 1 for "yes".
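A minimal sketch of how the constructed attribute collapses the rules above into a single test (the function name and encoding are illustrative):

```python
# Constructed attribute t = 1 ("yes") iff x1 = 2 or x2 = 2.
def classify(x1, x2):
    t = 1 if (x1 == 2 or x2 == 2) else 0
    return 'P' if t == 1 else 'N'

# One test on t reproduces P <= [x1=2] v [x2=2] and
# N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3] over the 3x3 attribute space.
print([classify(a, b) for a in (1, 2, 3) for b in (1, 2, 3)])
```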
[Figure omitted: a) the training data; b) the correct decision tree]
Figure 3-13: An example where decision rules are simpler than decision trees
CHAPTER 4 Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.
The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes); MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracing for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.
In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
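The sampling scheme just described can be sketched as follows (function and parameter names are my own, not from the dissertation's tooling):

```python
import random

def learning_curve_splits(n_examples, fractions=tuple(f / 10 for f in range(1, 10)),
                          samples_per_size=100, seed=0):
    """Yield (fraction, train_indices, test_indices): for each relative size
    10%..90%, draw `samples_per_size` random training samples; the complement
    of each sample serves as the testing set."""
    rng = random.Random(seed)
    for frac in fractions:
        k = int(n_examples * frac)
        for _ in range(samples_per_size):
            train = set(rng.sample(range(n_examples), k))
            test = [i for i in range(n_examples) if i not in train]
            yield frac, sorted(train), test

splits = list(learning_curve_splits(335))  # e.g. the wind bracing data
print(len(splits))                         # 900 train/test pairs
frac, train, test = splits[0]
print(frac, len(train), len(test))         # 0.1 33 302
```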
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.
Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (the best path from top to bottom), in terms of accuracy, time, and complexity, were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.
Figure 4-1: Design of a complete experiment
For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. Analysis of some experiments included visualization of the training examples and the concept learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.
For each problem (i.e., database), 9 different relative sample sizes of training examples are selected (10%, ..., 90%); 100 random samples of each size are drawn from the original data for training; the 100 example sets which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).
162 different parametrical experiments per training dataset (18 x 9)
16,200 experiments per sample size (9 sample sizes)
145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2)
199,800 experiments per problem (first portion + C4.5 + constructive induction)
999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
73 days (estimated running time)
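The quoted counts are mutually consistent, as a quick check shows (the 18 x 9 factorization is my reading of the stated parameter grid):

```python
per_dataset = 18 * 9            # 162 parametrical experiments per training dataset
per_size = per_dataset * 100    # 16,200 experiments per sample size (100 samples)
first_portion = per_size * 9    # 145,800 experiments over the 9 sample sizes
print(per_dataset, per_size, first_portion)  # 162 16200 145800
```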
The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsequent subsection describes a partial or full experimental analysis of one of the other problems.
4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision
structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all legal values of A.
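This convention is easy to state in code (a hedged sketch; the rule representation as attribute-to-value-set dicts is my own):

```python
# Before the disjointness of attribute A is evaluated, a rule silent on A
# is treated as allowing every legal value of A.
def expand_for_disjointness(rule, domains, attr):
    expanded = dict(rule)
    expanded.setdefault(attr, set(domains[attr]))
    return expanded

domains = {'x6': {1, 2, 3, 4}}
rule = {'x1': {1}, 'x2': {1, 2}}                           # does not mention x6
print(expand_for_disjointness(rule, domains, 'x6')['x6'])  # {1, 2, 3, 4}
```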
Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1,3] (t:18, u:18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t:3, u:3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t:2, u:2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t:2, u:2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t:2, u:2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t:2, u:2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t:2, u:2)
Decision class C2:
1. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t:28, u:19)
2. [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t:17, u:6)
3. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t:10, u:4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t:10, u:2)
5. [x1=3,5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t:9, u:4)
6. [x1=2][x2=1,2][x3=1,2][x5=1,2,3][x4=1][x6=1][x7=1] (t:7, u:6)
7. [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t:6, u:4)
8. [x1=3,5][x2=2][x3=1][x7=1][x4=1,2][x5=1,2,3][x6=1,3] (t:5, u:5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t:4, u:4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1,3] (t:4, u:4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t:4, u:2)
Decision class C3:
1. [x1=2..5][x2=1,2][x3=1,2][x7=1..4][x4=1,2][x5=1,3][x6=2,4] (t:41, u:32)
2. [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2,4] (t:27, u:20)
3. [x1=1,3][x2=1][x3=1,2][x7=1..4][x4=2][x5=1,2][x6=2,3] (t:19, u:6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t:13, u:8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2,4] (t:5, u:5)
Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1,4] (t:4, u:4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t:1, u:1)
Figure 4-2: Decision rules determined by AQ15c from the wind bracing data
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest and all other attributes are beyond the tolerance threshold, no other attributes were considered). Branches stemming from the root are marked by values of x6 (in general, they could be groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for each branch until all rules assigned to the branch are of the same class. That class is then assigned to the leaf.
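The branch-and-assign loop described above can be sketched roughly as follows. This is an illustrative reconstruction, not AQDT-2 itself: the attribute-selection function is left abstract (AQDT-2 combines disjointness, importance, and other criteria), and value groups are not merged.

```python
def build_structure(rules, domains, choose_attr):
    """rules: list of (class, conds), where conds maps attribute -> set of
    values and a missing attribute matches any value. Returns a leaf class
    or a pair (attribute, {value: subtree})."""
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:
        return classes.pop()                      # all rules agree: make a leaf
    attr = choose_attr(rules, domains)            # e.g. highest disjointness
    rest = {a: vs for a, vs in domains.items() if a != attr}
    branches = {}
    for v in domains[attr]:
        subset = [(cls, conds) for cls, conds in rules
                  if v in conds.get(attr, domains[attr])]
        branches[v] = build_structure(subset, rest, choose_attr)
    return (attr, branches)

# The XOR-like rules from Figure 3-12 (P iff x1 = x2):
rules = [('P', {'x1': {1}, 'x2': {1}}), ('P', {'x1': {2}, 'x2': {2}}),
         ('N', {'x1': {1}, 'x2': {2}}), ('N', {'x1': {2}, 'x2': {1}})]
domains = {'x1': {1, 2}, 'x2': {1, 2}}
tree = build_structure(rules, domains, lambda r, d: sorted(d)[0])
print(tree)
```

On these rules the sketch reproduces the two-level x1/x2 tree of Figure 3-12-b.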
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).
Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated, but only for those rules which contain [x6=1] as one of their conjunctions. In this example, x1 has the highest importance score, so it was selected as the next node in the structure. This process is repeated for each subset of rules until the decision structure is completed.
For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some unclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatched.
[Figure omitted: a decision tree rooted at x6; complexity: 17 nodes, 43 leaves]
Figure 4-3: A decision tree learned by C4.5 for the wind bracing data
Figure 4-4 shows a decision structure learned, under the default settings of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked as indefinite represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the indefinite decision. Such leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.
[Figure omitted; complexity: 5 nodes, 9 leaves]
Figure 4-4: A decision structure learned from the AQ15c wind bracing rules
[Figure omitted; complexity: 6 nodes, 8 leaves]
Figure 4-5: A decision structure that does not contain attribute x1
Figure 4-6 presents the decision structure from Figure 4-5 in which leaves were assigned candidate decisions with decision class probability estimates. Let us consider node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (11), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be approximated as P(C1)=0.66, P(C2)=0.23, P(C3)=0, and P(C4)=0.11.
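Assuming equation (11) simply normalizes the example weights w_i across the classes at the node (an interpretation that reproduces the figures quoted above), the computation is:

```python
w = {'C1': 31, 'C2': 11, 'C3': 0, 'C4': 5}   # weights at node x2, from the text
total = sum(w.values())                       # 47
probs = {c: round(wi / total, 2) for c, wi in w.items()}
print(probs)  # {'C1': 0.66, 'C2': 0.23, 'C3': 0.0, 'C4': 0.11}
```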
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that rules whose combined t-weight represented 10% or less coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
[Figure omitted; complexity: 5 nodes, 7 leaves]
Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves
[Figure omitted; complexity: 3 nodes, 5 leaves]
Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules covering less than 10% of the training examples per decision class
To demonstrate changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the costs of different attributes. Starting with the decision structure in Figure 4-4: in the first situation, x5 was given a high cost; AQDT-2 generated a decision structure with four nodes and six leaves, whose predictive accuracy was 86.1%. In the second decision-making situation, x1 was given a high cost; AQDT-2 learned a decision structure with five nodes and seven leaves, whose predictive accuracy was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified, using only the four attributes used in building the initial decision trees. The visualization diagram indicates different decision classes with different shades. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7).
Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.
Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data (white cells mean the system cannot produce a decision without the missing attribute)
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>; see Table 4-2) were selected for experiments with Subsystem II.
These experiments were performed for four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with the testing examples that form the complement of the training example set.
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint mode. For each data set, the result reported from each experiment is the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the
threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure (tree).
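As a rough sketch of how such a generalization threshold could act during structure building (the function name and the exact stopping rule are my assumptions, not AQDT-2's published definition):

```python
def generalize_to_leaf(class_counts, degree=0.10):
    """Return the majority class if examples of all other classes make up at
    most `degree` of the examples at this node; otherwise None (keep growing)."""
    total = sum(class_counts.values())
    majority = max(class_counts, key=class_counts.get)
    minority = total - class_counts[majority]
    return majority if minority <= degree * total else None

print(generalize_to_leaf({'C1': 95, 'C2': 5}))   # 'C1': stop and make a leaf
print(generalize_to_leaf({'C1': 60, 'C2': 40}))  # None: keep splitting
```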
[Figure omitted: four accuracy-vs-training-size plots comparing AQDT-2 and AQ15c under <Disj Char 1>, <Intr Char 1>, <Disj Disc 1>, and <Intr Disc 1>]
Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem
Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means the predictive accuracy obtained under the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.
Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are averages of 100 runs. For
each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the learning time. Figure 4-11 shows a simple summary of these experiments.
[Figure omitted: two accuracy plots, <Disj Char> and <Intr Char>]
Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data
[Figure omitted: three plots comparing AQDT-2, AQ15c, and C4.5 on accuracy, complexity, and learning time versus relative training size]
Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data
4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1
This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures; MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no).
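The attribute domains above determine the size of the example space:

```python
domain_sizes = {'x1': 3, 'x2': 3, 'x3': 2, 'x4': 3, 'x5': 4, 'x6': 2}
space = 1
for size in domain_sizes.values():
    space *= size
print(space)  # 432 possible robot descriptions
```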
The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus, the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c for the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.
[Figure omitted]
Figure 4-12: A visualization diagram of the MONK-1 problem
The AQDT-2 program, running in its default mode with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion ranked first), produced a decision tree with
41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem.
Positive rules:
1. [x5 = 1]
2. [x1 = 3][x2 = 3]
3. [x1 = 2][x2 = 2]
4. [x1 = 1][x2 = 1]
Negative rules:
1. [x1 = 1][x2 = 2,3][x5 = 2..4]
2. [x1 = 2][x2 = 1,3][x5 = 2..4]
3. [x1 = 3][x2 = 1,2][x5 = 2..4]
Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem
The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% of the number of examples and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using the window size of 725). This tree is presented in Figure 4-14. Also in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and the value F otherwise. These rules were:
Pos <= [x5=1] v [x1=x2] and Neg <= [x5≠1] & [x1≠x2]
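These compact rules can be checked directly against the MONK-1 target concept (the figure is positive when the jacket is red or head-shape equals body-shape). The encoding below is my own: shapes as 1..3 and jacket-color 1 = red.

```python
def target(x1, x2, x5):
    """MONK-1 target: jacket-color is red, or head-shape equals body-shape."""
    return 'Pos' if x5 == 1 or x1 == x2 else 'Neg'

def aq17_dci_rules(x1, x2, x5):
    eq = x1 == x2                    # the new attribute constructed by AQ17-DCI
    return 'Pos' if x5 == 1 or eq else 'Neg'

agree = all(target(a, b, c) == aq17_dci_rules(a, b, c)
            for a in (1, 2, 3) for b in (1, 2, 3) for c in (1, 2, 3, 4))
print(agree)  # True: the rules reproduce the target over all (x1, x2, x5) combinations
```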
Table 4-3: Evaluation of the attribute selection criteria for the MONK-1 problem
From these rules, the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 on the AQ15c rules, a simpler decision structure was produced (Figure 4-15-a).
[Figure omitted: a tree rooted at x5 with x1 subtrees; complexity: 13 nodes, 28 leaves; P = Positive, N = Negative]
Figure 4-14: The decision tree for the MONK-1 problem generated by C4.5
[Figures omitted: a) compact decision structure for the AQ15 rules (5 nodes, 7 leaves); b) compact decision structure for the AQ17 rules (2 nodes, 3 leaves); P = Positive, N = Negative]
Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem
Experiments with Subsystem I: As mentioned earlier, the initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>) were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.
Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing example set that forms the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.
[Figure omitted: four accuracy plots, <Disj Char 1>, <Intr Char 1>, <Disj Disc 1>, and <Intr Disc 1>]
Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem
Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint mode. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means the predictive accuracy obtained under the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree
is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
[Figure omitted: two accuracy plots, MONK-1 <Disj Char> and MONK-1 <Intr Char>]
Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data
Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are averages of 100 runs. For each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the learning time. Figure 4-18 shows a simple summary of these experiments.
[Figure omitted: three plots comparing AQDT-2, AQ15c, and C4.5 on accuracy, complexity, and learning time versus relative training size]
Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem
4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2
The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using the original attributes). The problem is described in a similar way to the MONK-1 problem: the data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.
[Figure omitted]
Figure 4-19: A visualization diagram of the MONK-2 problem
Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>); they were selected for the experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules.
Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with the testing examples that form the complement of the training examples.
Figure 4-20 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
[Figure omitted: four accuracy plots, <Disj Char 1>, <Intr Char 1>, <Disj Disc 1>, and <Intr Disc 1>]
Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem
Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means the predictive accuracy obtained under the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
[Figure 4-21: Two plots (MONK-2 <Disj, Char> and MONK-2 <Intr, Char>) of predictive accuracy versus the relative size (%) of the training data, for different pre-pruning thresholds and generalization degrees.]
Figure 4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.
[Figure 4-22: Three plots for the MONK-2 problem comparing AQDT-2, AQ15c, and C4.5 on predictive accuracy, decision tree complexity, and learning time, versus the relative size (%) of the training examples.]
Figure 4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem
4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded areas and the plus signs in the unshaded areas are considered noisy examples. Noisy examples are examples that are assigned the wrong decision class.
Figure 4-23 A visualization diagram of the MONK-3 problem
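Noise of this kind, an example carrying the wrong class label, can be simulated by flipping a fraction of the labels. A minimal sketch (the class names and data are illustrative, not the MONK-3 attributes):

```python
import random

def inject_label_noise(examples, noise_level, classes=("positive", "negative"), seed=0):
    """Return a copy of (features, label) pairs in which a `noise_level`
    fraction of the examples is reassigned to a wrong decision class."""
    rng = random.Random(seed)
    noisy = list(examples)
    n_flip = int(noise_level * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        features, label = noisy[i]
        wrong = [c for c in classes if c != label]
        noisy[i] = (features, rng.choice(wrong))
    return noisy

# Toy data: 10 examples, then mislabel 20% of them.
clean = [((x,), "positive" if x < 5 else "negative") for x in range(10)]
noisy = inject_label_noise(clean, noise_level=0.2)
changed = sum(a[1] != b[1] for a, b in zip(clean, noisy))
print(changed)  # 2 of the 10 examples now carry the wrong class
```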
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in the table is the average predictive accuracy of running both programs 100 times on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested on the set of testing examples that is the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.
[Figure 4-24: Four plots (<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>) of predictive accuracy versus the relative size (%) of the training data.]
Figure 4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem
[Figure 4-25: Two plots of predictive accuracy versus the relative size (%) of the training data for MONK-3 under different AQDT-2 parameter settings.]
Figure 4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
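A quick arithmetic check of this effect (assuming, for illustration, a pool of 100 examples and a single misclassified test example):

```python
def error_rate_percent(n_errors, n_test):
    """Error rate (%) for a fixed number of misclassified test examples."""
    return 100.0 * n_errors / n_test

pool = 100  # hypothetical total number of examples

# Training on 10% leaves 90 testing examples: one error is about 1.1%.
print(round(error_rate_percent(1, int(0.90 * pool)), 1))  # 1.1

# Training on 90% leaves 10 testing examples: the same error is 10%.
print(error_rate_percent(1, int(0.10 * pool)))  # 10.0
```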
[Figure 4-26: Three plots for the MONK-3 problem comparing AQDT-2 and C4.5 on predictive accuracy, decision tree complexity, and learning time, versus the relative size (%) of the training examples.]
Figure 4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem
4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3) Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).
In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are based on the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.
The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
[Figure 4-27: Three plots for the breast cancer problem comparing AQDT-2, AQ15c, and C4.5 on predictive accuracy, decision tree complexity, and learning time, versus the relative size (%) of the training examples.]
Figure 4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem
4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification

Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from The Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples. A random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes. These attributes are: 1) cap-shape, 2) cap-surface, 3) cap-color, 4) bruises, 5) odor, 6) gill-attachment, 7) gill-spacing, 8) gill-size, 9) gill-color, 10) stalk-shape, 11) stalk-root, 12) stalk-surface-above-ring, 13) stalk-surface-below-ring, 14) stalk-color-above-ring, 15) stalk-color-below-ring, 16) veil-type, 17) veil-color, 18) ring-number, 19) ring-type, 20) spore-print-color, 21) population, and 22) habitat.
To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.

In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees.
The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning time is about the same.

The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
[Figure 4-28: Three plots for the mushroom problem comparing AQDT-2 and C4.5 on predictive accuracy, decision tree complexity, and learning time, versus the relative size (%) of the training examples.]
Figure 4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem
4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains

Learning task-oriented decision structures from structural data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes: eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.
The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was generated such that they could completely describe any car in the train; see Table 4-7. Each train was described by one example of varying length. To encode the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, in the attribute name x32, the number 3 refers to the third car and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
Table 4-7 The set of attributes and their values used in the trains problem (i stands for the car number, 1-4)
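The two-digit (ij) naming scheme described above can be sketched directly in code. Only "car shape" as attribute 2 is taken from the text; the attribute bounds follow the description (cars 1-4, attributes 1-8).

```python
def make_label(car, attribute):
    """Build an attribute label such as 'x32' from the (i, j) code:
    i = car position (1-4), j = attribute number (1-8)."""
    assert 1 <= car <= 4 and 1 <= attribute <= 8
    return f"x{car}{attribute}"

def decode_label(label):
    """Split a label such as 'x32' back into (car position, attribute number)."""
    return int(label[1]), int(label[2])

print(make_label(3, 2))     # x32  (the shape of the third car)
print(decode_label("x32"))  # (3, 2)
```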
Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only the attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.
Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).
[Figure 4-29: Decision structures learned for different decision-making situations: a) using only descriptions of Car 1, b) using only descriptions of Car 2, c) using only descriptions of Car 3.]
Figure 4-29 Decision structures learned by AQDT-2 for different decision-making situations
4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the other half in the other class).
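The default window size quoted above, the larger of 20% of the examples and twice the square root of their number, can be computed directly. A small sketch of that formula as described here:

```python
import math

def default_window_size(n_examples):
    """C4.5's default initial window, as described above: the larger of
    20% of the examples and twice the square root of their number."""
    return max(0.20 * n_examples, 2 * math.sqrt(n_examples))

# For the 216-example Congressional Voting data:
print(round(default_window_size(216), 1))  # 43.2 (the 20% term dominates)
```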
Table 4-8 and Figures 4-30a and 4-30b show the results graphically for the Congressional Voting-1984 problem. The results indicate that the decision trees generated by AQDT-2 had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation in the size of AQDT-2's trees with changes in the size of the training example set was smaller.
Table 4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data

[Figure 4-30: a) Accuracy of the decision tree as a function of the size of the set of training examples; b) Size of the decision tree as a function of the size of the set of training examples.]
Figure 4-30 Comparing decision trees for the Congressional Voting-84 data learned by C4.5 & AQDT-2
4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from these rules. This section also includes some examples describing different decision-making situations and the task-oriented decision structures learned for each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others), the best cover is determined according to the best width of the beam search and the best rule type.
Table 4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics
It was clear that AQDT-2 works better with characteristic rules than with discriminant rules. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracies of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, the predictive accuracy is considered higher or lower; 2) if the average learning time is within ±0.1 seconds, the learning time is considered the same.
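These two tie-breaking heuristics can be stated compactly in code. This is a sketch of the summarization rules as described above, not of the original analysis scripts:

```python
def compare_accuracy(acc_aqdt2, acc_c45, tolerance=2.0):
    """Heuristic 1: accuracies within +/-2% are considered the same."""
    if abs(acc_aqdt2 - acc_c45) <= tolerance:
        return "Same"
    return "AQDT-2" if acc_aqdt2 > acc_c45 else "C4.5"

def compare_time(time_aqdt2, time_c45, tolerance=0.1):
    """Heuristic 2: learning times within +/-0.1 s are considered the same."""
    if abs(time_aqdt2 - time_c45) <= tolerance:
        return "Same"
    return "AQDT-2" if time_aqdt2 < time_c45 else "C4.5"  # lower time wins

print(compare_accuracy(94.5, 93.1))  # Same
print(compare_accuracy(96.0, 91.0))  # AQDT-2
print(compare_time(0.35, 0.30))      # Same
```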
Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparisons of the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value "Same" to indicate which one has the advantage (e.g., "C" for C4.5 and "A" for AQDT-2).

Table 4-10 Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same-X means similar performance of both systems, where AQDT-2 has the advantage if X=A and C4.5 if X=C
Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but accurate decision trees, while C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of decision trees learned by C4.5 grows relatively larger as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules and C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is supposed to be much less than that of C4.5. However, on some data sets it takes more time, because there are situations where there is not enough information to reach a decision and the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not implemented yet.
To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.
Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class. The white areas represent non-positive coverage.
Figure 4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. All marked cells in the shaded areas indicate false positive errors (AQ15c classifies the cell as positive when it should be negative). Likewise, all marked cells in the non-shaded areas indicate false negative errors (AQ15c classifies the cell as negative when it should be positive).
Figure 4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31
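The two error types shown in the diagram can be counted with a simple routine. A sketch with hypothetical cell names and the positive/negative classes from the text:

```python
def error_cells(predicted, actual):
    """Classify each cell of the representation space by error type.

    False positive: predicted positive but actually negative.
    False negative: predicted negative but actually positive.
    """
    false_pos = [c for c in predicted
                 if predicted[c] == "positive" and actual[c] == "negative"]
    false_neg = [c for c in predicted
                 if predicted[c] == "negative" and actual[c] == "positive"]
    return false_pos, false_neg

# Toy representation space of three cells.
predicted = {"c1": "positive", "c2": "positive", "c3": "negative"}
actual    = {"c1": "positive", "c2": "negative", "c3": "positive"}
fp, fn = error_cells(predicted, actual)
print(fp, fn)  # ['c2'] ['c3']
```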
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, cells shaded in one pattern indicate portions of the representation space that were classified as positive by both AQ15c and AQDT-2. Cells shaded in a second pattern are portions of the representation space that were classified as positive by AQ15c but as negative by AQDT-2. Cells shaded in a third pattern represent portions of the representation space where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.

Figure 4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem
Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative errors: some marked cells indicate portions of the representation space with false positive errors, and others represent portions with false negative errors. Comparing Figures 4-34 and 4-32, more errors occurred because of the over-generalization.

Figure 4-34 A visualization diagram showing the testing errors of the AQDT-2 decision tree
Figure 4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%
CHAPTER 5 CONCLUSION

5.1 Summary

This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.
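The idea of a conditional order of tests can be illustrated with a minimal sketch. This is not the AQDT-2 implementation; the node layout, attribute names, and class labels are illustrative, and the structure shown is single-parent (tree-shaped), with leaves allowed to carry one or more candidate decisions.

```python
class Node:
    """An internal test node: an attribute and a branch per attribute value."""
    def __init__(self, attribute, branches):
        self.attribute = attribute  # attribute tested at this node
        self.branches = branches    # attribute value -> Node or Leaf

class Leaf:
    """A terminal node carrying one or more candidate decisions."""
    def __init__(self, decisions):
        self.decisions = decisions

def classify(node, example):
    """Follow the conditional order of tests until a leaf is reached."""
    while isinstance(node, Node):
        node = node.branches[example[node.attribute]]
    return node.decisions

# Toy structure: test x1 first; x2 is measured only when x1 == 1.
structure = Node("x1", {
    1: Node("x2", {1: Leaf(["east"]), 2: Leaf(["west"])}),
    2: Leaf(["west"]),
})

print(classify(structure, {"x1": 1, "x2": 1}))  # ['east']
print(classify(structure, {"x1": 2}))           # ['west'] (x2 never tested)
```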
The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated on-line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: that in order to determine a decision structure from examples it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than the decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.
5.2 Contributions

The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.

Another advantage of the AQDT-2 method is that the decision structures obtained this way tend to be simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.

Bergadano, F., Giordana, A., Saitta, L., DeMarchi, D. and Brancadori, F. (1990), "Integrated Learning in Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.

Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.

Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.

Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.

Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.

Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.

Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.

Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.

Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.

Hart, A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.

Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.

Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer Verlag.
100
Imam, I.F. and Michalski, R.S. (1993b). Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study. Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. & Zemankova, M. (Eds.), Kluwer Academic Pub., MA.
Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993). Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques. Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.
Imam, I.F. and Vafaie, H. (1994). An Empirical Comparison Between Global and Greedy-like Search for Feature Selection. Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam, I.F. and Michalski, R.S. (1994). From Fact to Rules to Decisions: An Overview of the FRD-1 System. Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.
Kohavi, R. (1994). Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations. Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.
Kohavi, R. & Li, C. (1995). Oblivious Decision Trees, Graphs, and Top-Down Pruning. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian, O.L. and Wolberg, W.H. (1990). Cancer Diagnosis via Linear Programming. SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994). International East-West Challenge. Oxford University, UK.
Michalski, R.S. (1973). AQVAL/1--Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition. Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.
Michalski, R.S. (1978). Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams. Technical Report No. 898, Urbana: University of Illinois, March.
Michalski, R.S. (1983). A Theory and Methodology of Inductive Learning. Artificial Intelligence, Vol. 20 (pp. 111-161).
Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986). The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains. Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.
Michalski, R.S. (1990). Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation. In Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Michalski, R.S. and Imam, I.F. (1994). Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System. Lecture Notes in Artificial Intelligence, Springer Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers, J. (1989a). An Empirical Comparison of Selection Measures for Decision-Tree Induction. Machine Learning, Vol. 3, No. 4 (pp. 319-342), Kluwer Academic Publishers.
Mingers, J. (1989b). An Empirical Comparison of Pruning Methods for Decision-Tree Induction. Machine Learning, Vol. 4, No. 2 (pp. 227-243), Kluwer Academic Publishers.
Niblett, T. and Bratko, I. (1986). Learning Decision Rules in Noisy Domains. Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.
Quinlan, J.R. (1979). Discovering Rules by Induction from Large Collections of Examples. In D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan, J.R. (1983). Learning Efficient Classification Procedures and Their Application to Chess End Games. In R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.
Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.
Quinlan, J.R. (1987). Simplifying Decision Trees. International Journal of Man-Machine Studies, 27 (pp. 221-234).
Quinlan, J.R. (1990). Probabilistic Decision Trees. In Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, California.
Smyth, P., Goodman, R.M. and Higgins, C. (1990). A Hybrid Rule-based/Bayesian Classifier. Proceedings of ECAI-90, Stockholm, August.
Sokal, R. and Rohlf, F. (1981). Biometry. Freeman Pub., San Francisco.
Thrun, S.B., Mitchell, T. and Cheng, J. (Eds.) (1991). The MONK's Problems: A Performance Comparison of Different Learning Algorithms. Technical Report, Carnegie Mellon University, October.
Wnek, J. and Michalski, R.S. (1994). Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments. Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
Vita
Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence: two solutions obtained by that program ranked second and third in one competition, and two other solutions ranked sixth and seventh among 65 entries from around the world in another competition.
Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive
systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
TABLE OF CONTENTS
TI1LE Page
ABSTRACT 1
CHAPTER 1
INTRODUCTION 3
11 Motivation and Overview 3
12 The Problem Statement 6
CHAPTER 2
RELATED RESEARCH 7
21 Learning Decision Trees from Decision Diagrams 7
22 Learning Decision Trees from Examples 10
221 Building Decision Trees Using Information-based Criteria 11
222 Building Decision Trees Using Statistics-based Criteria 16
223 Analysis of Attribute Selection Criteria 18
23 Learning Decision Structures 19
CHAPTER 3
DESCRIPTION OF THE APPROACH 23
31 General Methodology 23
32 A Brief Description of the AQ15 and AQ17 Rule Learning Systems 25
33 Generating Decision Structures From Decision Rules 28
331 The AQDT-2 attribute selection method 29
332 The AQDT-2 algorithm 37
333 An example illustrating the algorithm 42
34 Tailoring Decision Structure to a decision-making situation 47
341 Learning Cost-Dependent Decision Structures 49
342 Assigning Decision Under Insufficient Information 49
343 Coping with noise in training data 50
35 Analysis of the AQDT-2 Attribute Selection Criteria 51
36 Decision Structures vs Decision Trees 53
CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
41 Description of the Experimental Analysis 59
42 Experiments With Average Size Complex and Noise-Free Problems
Wind Bracings 60
43 Experiments With Small Size Simple and Noise-Free Problems
MONK-1 69
44 Experiments With Small Size Complex and Noise-Free Problems
MONK-2 76
45 Experiments With Small Size Simple and Noisy Problems
MONK-3 79
46 Experiments With Large Size Complex and Noise-Free Problems
Diagnosing Breast Cancer 83
47 Experiments With Large Size Complex and Noisy Problems
Mushroom classifications 84
48 Experiments With Small Size Structured and Noise-Free Problems
East-West Trains 85
49 Experiments With Small Size Simple and Noisy Problems
Congressional Voting Records (1984) 87
410 Analysis of the Results 88
CHAPTER 5 CONCLUSIONS 95
51 Summary 95
52 Contributions 96
REFERENCES 98
VITA 102
LIST OF TABLES
No TITLE Page
2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees
provided by different attribute selection criteria from four problems 19
2-9 Comparing the AQDT approach with EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using condition of AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Tree 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained
by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach
with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90
LIST OF FIGURES
No TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept Voting pattern of
Democratic Representatives 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees)
from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the
assumption of 10 classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision
making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the wind bracing problem 68
4-10 Analyzing different parameter setting of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-1 Problem 74
4-17 Analyzing different parameter setting of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-2 Problem 78
4-21 Analyzing different parameter setting of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-3 Problem 82
4-25 Analyzing different parameter setting of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram shows the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram shows testing errors of AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing
the generalization degree to 1 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M. Fahmi Imam, Ph.D.
School of Information Technology and Engineering
George Mason University, Fall 1995
Ryszard S. Michalski, Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from
decision rules. The philosophy behind this research is that it is more appropriate to learn
knowledge and store it in a declarative form, and then, when a decision-making situation occurs,
generate from this knowledge the decision structure that is most suitable for the given decision-making
situation. Learning decision structures from decision rules was first introduced by
Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam
and Michalski (1993a, b).
This approach separates the function of generating a knowledge base from the function of using
the knowledge base for decision-making. The first function focuses on learning accurate,
consistent, and complete concept descriptions expressed in a declarative form. The second function
is performed whenever a new decision-making situation occurs: a task-oriented decision structure
is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted
for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam,
1994).
The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from
decision rules or examples. Each decision-making situation is defined by a set of parameters that
controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show
that decision structures learned by it usually outperform, in terms of accuracy and average size of
the decision structures, those learned from examples by other well-known systems. The results
also show that the system does not work very well with noisy data. The system is illustrated and
compared using applications to artificial problems, such as the three MONK's problems (Thrun,
Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also
applied to real-world problems of learning decision structures in the areas of construction
engineering (for determining the best wind bracing design for tall buildings), medical
diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for
learning classification rules for distinguishing between poisonous and non-poisonous
mushrooms), and political data (for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
11 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge but also
to use this knowledge for decision-making The main step in the development of systems for
decision-making is the creation of a knowledge structure that characterizes the decision-making
process The form in which knowledge can be easily obtained may however differ from the form
in which it is most readily used for decision-making It is therefore important to identify the form
of knowledge representation that is most appropIiate for learning (eg due to ease of its
modification) and the form that is most convenient for decision making
A simple and effective tool for describing decision processes is a decision structure, which is a
directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to
arrive at a decision about that object. The nodes of the structure are assigned individual tests
(which may correspond to a single attribute, a function of attributes, or a relation); the branches are
assigned possible test outcomes or ranges of outcomes; and the leaves are assigned a specific
decision, a set of candidate decisions with corresponding probabilities, or an undetermined
decision. A decision structure reduces to a familiar decision tree when each node is assigned a
single attribute and has at most one parent, when the branches from each node are assigned single
values of that attribute, and when leaves are assigned single definite decisions. Thus, the problem
of generating a decision structure is a generalization of the problem of generating a decision tree.
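As an illustration, the components just described can be rendered as a small data type. The following is a hypothetical Python sketch written for this explanation only (the names Node, Leaf, and classify are not part of any of the systems discussed); a branch may be labeled with a single outcome or a tuple of outcomes, and a leaf may hold one decision, a probability-weighted set of candidate decisions, or None for an undetermined decision:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    test: str                    # the test assigned to this node, e.g. an attribute name
    branches: dict = field(default_factory=dict)  # outcome (or tuple of outcomes) -> subtree

@dataclass
class Leaf:
    decision: object             # one decision, {decision: probability}, or None

def classify(root, obj):
    """Follow matching branches from the root until a leaf is reached."""
    node = root
    while isinstance(node, Node):
        outcome = obj[node.test]
        node = next(child for label, child in node.branches.items()
                    if outcome == label
                    or (isinstance(label, tuple) and outcome in label))
    return node.decision

# The structure reduces to an ordinary decision tree when every branch label
# is a single value and every leaf holds a single definite decision.
structure = Node("x2", {0: Leaf("A1"),
                        (1, 2): Leaf({"A2": 0.7, "A3": 0.3})})
print(classify(structure, {"x2": 2}))   # a set of candidate decisions with probabilities
```

The grouped branch label (1, 2) and the probability-weighted leaf are exactly the two features that distinguish this structure from a plain decision tree.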
Decision trees are typically generated from a set of examples of decisions. The essential
characteristic of any such method is the attribute selection criterion used for choosing attributes to
be assigned to the nodes of the decision tree being built. Such criteria include the entropy
reduction, the gain, and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman
et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
A decision tree/decision structure representation can be an effective tool for describing a decision
process, as long as all the required tests can be performed and the decision-making situations it
was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine
that answers for all symptoms appear in the decision tree). Problems arise when these assumptions
do not hold. For example, in some situations measuring certain attributes may be difficult or costly
(e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the
tools needed are not available). In such situations, it is desirable to reformulate the decision
structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the
root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far
away from the root) If an attribute cannot be measured at all it is useful to either modify the
structure so that it does not contain that attribute or-when this is impossible-to indicate
alternative candidate decisions and their probabilities A restructuring is also desirable if there is a
significant change in the frequency of occurrence of different decisions (eg in the doctor-patient
example the doctor may request a decision structure expressed in a specific set of symptoms
biased to classify one or more diseases or specify a certain order of testing)
A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite
difficult This is because a decision structure is a procedural representation that imposes an
evaluation order on the tests In contrast no evaluation order is imposed by a declarative
representation such as a set of decision rules Tests (conditions) of rules can be evaluated in any
order Thus for a given set of rules one can usually build a huge number of logically equivalent
decision structures (trees) which differ in the test ordering Due to the lack of order constraints
a declarative representation (rules) is much easier to modify to adapt to different situations than a
procedural one (a decision structure or a tree) On the other hand to apply decision rules to make a
decision one needs to decide in which order tests are evaluated and thus needs a decision
structure
An attractive solution to these opposite requirements is to acquire and store knowledge in a
declarative form and transform it to a decision structure when it is needed for decision-making
This method allows one to create a decision structure that is most appropriate in a given decision-making
situation. Because the number of decision rules per decision class is usually small (each
rule is a generalization of a set of examples) generating a decision structure from decision rules
can potentially be performed much faster than generation from training examples. Thus, this
process could be done on line without any delay noticeable to the user Such virtual decision
structures are easy to tailor to any given decision-making situation
This approach allows one to generate a decision structure that avoids or delays evaluating an
attribute that is difficult to measure in some decision-making situation or that fits well a particular
frequency distribution of decision classes. In other situations, it may be unnecessary to generate a
complete decision structure; it may be sufficient to generate only the part of it that concerns
the decision classes of interest. Thus, such an approach has many potential advantages.
This dissertation presents a new system called AQDT-2. The AQDT-2 system generates a task-oriented
decision structure (a decision structure that is adapted to the given decision-making
situation) from decision rules. The decision rules are learned by either the rule learning system AQ15
(Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction
capabilities (Bloedorn et al., 1993).
To associate the decision rules with a given decision-making task, AQDT-2 provides a set of
features, including: 1) enabling the system to include in the decision structure nodes corresponding
to new attributes constructed during the process of learning the decision rules; 2) controlling the
degree of generalization needed during the development of the decision structure; 3) providing four
new criteria for selecting an attribute to be a node in the decision structure, which allow the system to
generate many different but equivalent decision structures from the same set of rules; 4) generating
"unknown" nodes in situations where there is insufficient information for generating a complete
decision structure; 5) learning decision structures from discriminant rules as well as
characteristic rules; and 6) providing the most likely decision when the decision process stops
due to an inability to evaluate an attribute associated with an intermediate node.
To test the methodology of generating decision structures from decision rules, an extensive set of
planned experiments has been designed to test different aspects of the approach. The experiments
include testing different combinations of parameters for each sub-function of the approach,
analyzing the relationship between decision rules and the decision structures learned from them, and
comparing decision trees learned by the AQDT-2 system with the well-known C4.5 (Quinlan, 1993)
system for learning decision trees from examples. Different experiments were designed to examine
the new features of the AQDT approach for learning task-oriented decision structures. The
experiments were applied to artificial domains as well as real-world domains, including MONK-1,
MONK-2, and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al.,
1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast
Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The
MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1
requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description
(one that cannot be easily described as a DNF rule using the original attributes). MONK-3 concerns
learning a DNF rule from noisy data The East-West trains dataset is a structural domain that
classifies two sets of trains (Eastbound and Westbound) The Engineering Design-wind
bracing data involves learning conditions for applying different types of wind bracing for tall
buildings The Mushrooms data is concerned with learning classification rules for distinguishing
between poisonous and non-poisonous mushrooms The Breast Cancer data involves learning
concept descriptions for recognizing breast cancer The congressional voting data includes voting
records on different issues. AQDT-2 outperformed C4.5 on average with respect to both the
predictive accuracy and the tree size for most problems. AQDT-2 did not work very well with noisy
data or with problems that have many rules covering very few examples.
12 The Problem Statement
There are many limitations and problems that accompany the use of decision trees for decision-making.
CHAPTER 2 RELATED RESEARCH
21 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which
introduced an algorithm for generating decision trees from decision lists. The method proposed
several attribute selection criteria. These criteria are versions, of increasing power, of the main
criterion, the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two
specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree,
based on properties extracted from the decision diagram. In order to better explain the method,
it is necessary to define some terms. These terms will be used later in the dissertation.
Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all
examples of class C and does not include any examples of other decision classes.
Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically
disjoint; in other words, for any two rules there exists a condition with the same attribute but
with different values in each rule.
Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules
among all possible covers.
Definition 2-4: A diagram for a given cover is a table constructed graphically by representing,
in a two-dimensional space, all possible combinations of attribute values, locating on the
diagram the condition parts of the given rules, and marking them with the action specified
by each rule.
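To make the disjointness condition of Definition 2-2 concrete, it can be sketched in a few lines of code. This is a hypothetical Python rendering written for this explanation (not code from any of the cited systems); a rule is encoded as a mapping from attribute names to the set of values its conditions allow:

```python
from itertools import combinations

def disjoint(rule1, rule2):
    """Per Definition 2-2: two rules are logically disjoint when some shared
    attribute is constrained by the two rules to non-overlapping value sets."""
    shared = rule1.keys() & rule2.keys()
    return any(not (rule1[a] & rule2[a]) for a in shared)

def is_disjoint_cover(rules):
    """Per Definition 2-2, a cover is disjoint when all rules are pairwise disjoint."""
    return all(disjoint(r, s) for r, s in combinations(rules, 2))

# A two-rule cover of the form [x2=0] v [x1=0][x2=2]: the rules
# constrain x2 to the disjoint value sets {0} and {2}.
cover = [{"x2": {0}}, {"x1": {0}, "x2": {2}}]
print(is_disjoint_cover(cover))
```

Note that two rules with no shared attribute are not disjoint under this test, matching the definition's requirement of a condition on the same attribute in both rules.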
Michalski's algorithm requires construction of a minimal cover. The minimal cover should be
consistent and complete. The method is based on the fact that if there are n decision classes,
any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal
decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has
shown that if only one rule is broken by a selected attribute, then instead of having one leaf
(which could potentially represent this rule or the decision class in the tree), there will have to
be at least two leaves representing this rule in the final decision tree.
The attribute selection criterion MAL introduced in (Michalski, 1978) prefers attributes that do
not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can
divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules.
In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] &
[x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1]; x3 breaks three rules: the rule [x4=2]
& [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] &
[x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.
Figure 2-1 An example to illustrate how attributes break rules
One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each
attribute an integer equal to the number of rules broken by that attribute. This criterion is also
called the static cost estimate of an attribute, or the criterion of minimizing added leaves
(MAL).
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the
estimated number of additional nodes in the decision tree being generated over a hypothetical
minimal decision tree. When there is a tie between two attributes, the attribute to be selected is
the one which breaks smaller rules (rules that cover fewer examples, or more specialized
rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).
Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion
(Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but
is more complex, because once an attribute is selected as a node in the tree, some rules and/or
parts of the broken rules at each branch are merged into one rule. DMAL ensures that the
value of the total cost estimate of an attribute is decreased by a value equal to the number of
merged rules minus one.
Example: Learn a decision tree from the following decision table (Table 2-1).
The minimal cover consists of the following rules:
A1 <- [x2=0] v [x1=0][x2=2];  A2 <- [x2=1] v [x1=2][x2=2];  A3 <- [x1=1][x2=2]
The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5
for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two
leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of
the decision tree is x2. Three branches are then attached to the root node, and the decision rules
are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is
generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case, x1 has the
minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.
Figure 2-2 A decision tree learned from the decision table in Table 2-1
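The static cost estimate lends itself to a short sketch. The snippet below is illustrative only (not the AQDT implementation); it assumes a hypothetical representation in which each rule is a dict of attribute-value conditions, and reproduces the MAL values of the example above.

```python
# Sketch of the MAL (Minimizing Added Leaves) static cost estimate:
# an attribute "breaks" a rule if the rule does not test that attribute,
# since splitting on it would duplicate the rule across branches.
# Rule representation (dict of conditions per rule) is assumed for
# illustration; it is not the dissertation's internal format.

def mal_cost(rules, attribute):
    """First-degree cost: number of rules broken by `attribute`."""
    return sum(1 for rule in rules if attribute not in rule)

def best_attribute(rules, attributes):
    """Pick the attribute that breaks the fewest rules."""
    return min(attributes, key=lambda a: mal_cost(rules, a))

# Minimal cover from the example above (Table 2-1):
rules = [
    {"x2": 0},            # A1
    {"x1": 0, "x2": 2},   # A1
    {"x2": 1},            # A2
    {"x1": 2, "x2": 2},   # A2
    {"x1": 1, "x2": 2},   # A3
]
attributes = ["x1", "x2", "x3", "x4"]
costs = {a: mal_cost(rules, a) for a in attributes}
print(costs)                              # {'x1': 2, 'x2': 0, 'x3': 5, 'x4': 5}
print(best_attribute(rules, attributes))  # x2
```

The printed costs match the MAL values given in the text (2 for x1, 0 for x2, 5 for x3 and x4), so x2 is chosen as the root.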
2.2 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating decision trees that classify a set
of examples according to the decision classes they belong to. The essential aspect of any
inductive decision tree method is the attribute selection criterion. The attribute selection
criterion measures how good the attributes are for discriminating among the given set of
decision classes. The best attribute according to the selection criterion is chosen to be assigned
to a node in the tree. The first algorithm for generating decision trees from examples was
proposed by Hunt, Marin, and Stone (1966). Hunt's algorithm uses a divide-and-conquer
strategy for building decision trees. This algorithm has been subsequently modified by
Quinlan (1979) and applied by many researchers to a variety of learning problems.
Attribute selection criteria can be divided into three categories: logic-based,
information-based, and statistics-based. The logic-based criteria for selecting attributes
use logical relationships between the attributes and the decision classes to determine the best
attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves
(Michalski, 1978), which uses conjunction and disjunction operators. The information-based
criteria are based on information theory. These criteria measure the information conveyed
by dividing the training examples into subsets. Examples of such criteria include the
information measure IM, the entropy reduction measure, and the gain criterion (Quinlan, 1979,
1983), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and
others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The
statistics-based criteria measure the correlation between the decision classes and the attributes.
These criteria use statistical distributions for determining whether or not there is a correlation.
The attribute with the highest correlation is selected to be a node in the tree. Examples of
statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984;
Mingers, 1989a).
Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the
method of learning decision trees to also handle data with noise (by pruning). Handling noise
extended the process of learning decision trees to include the creation of an initial complete
decision tree and tree pruning, which is done by removing subtrees with small statistical validity
and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used
for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994).
Pruning decision trees improves their simplicity but reduces their predictive accuracy on the
training examples. Quinlan (1990) also proposed a method to handle the unknown
attribute-value problem by exploring probabilities of an example belonging to different classes.
The rest of this section includes a brief description of the attribute selection criterion used by the
C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting
an attribute to be a node in the tree. The section also includes a brief description of the
Chi-square method for attribute selection (Mingers, 1989a), a statistics-based
method for selecting an attribute to be a node in the tree.
2.2.1 Building Decision Trees Using Information-based Criteria
This section presents a description of the inductive decision tree learning system C4.5. The
C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs
for learning decision trees from examples.
Learning decision trees from examples requires a collection of examples. Each example is
represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning
program that induces classification decision trees from a set of given examples. The C4.5
learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on
Hunt's method for constructing decision trees from a set of cases (Hunt, Marin & Stone, 1966).
The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion
calculates the gain in classifying information based on the residual information needed to
classify cases in a set of training examples and the information yielded by the test, based on the
relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is
based on an earlier criterion used by ID3, called the Gain Criterion, which uses
the frequency of each decision class in the given set of training examples.
Once an attribute is chosen to be a node in the tree, the system generates as many links as the
number of its values and partitions the set of examples based on these values. If all the
examples at a certain node belong to one decision class, the system generates a leaf node and
assigns it to that class. Otherwise, the system searches for another attribute to be a node in the
tree.
The Gain Criterion: The gain criterion is based on information theory; that is, the
information conveyed by a message depends on its probability and can be measured in bits as
minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for
a given problem x1, ..., xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is
any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S
is the number of examples in S that belong to class Ci:
freq(Ci, S) = number of examples in S belonging to Ci    (2-1)
Suppose that |S| is the total number of examples in S; the probability that an example selected
at random from S belongs to class Ci is freq(Ci, S) / |S|.
The information conveyed by the message that a selected example belongs to a given decision
class Ci is determined by -log2 (freq(Ci, S) / |S|) bits.
The expected information from such a message stating class membership is given by
info(S) = - Σi (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|) bits    (2-2)
info(S) is also known as the entropy of the set S. When S is the initial set of training examples,
info(T) determines the average amount of information needed to identify the class of an
example in T.
Suppose that we select an attribute X to be the root of the tree, and suppose that X has k
possible values. The training set T will be divided into k subsets T1, ..., Tk, each corresponding to one of
X's values. The expected information of selecting X to partition the training set T, info_X(T), can
be found by summing, over all subsets, the information conveyed by each subset weighted
by its probability:
info_X(T) = Σi (|Ti| / |T|) info(Ti)    (2-3)
The information gained by partitioning the training examples T into subsets using the attribute
X is given by
gain(X) = info(T) - info_X(T)    (2-4)
The attribute selected is the attribute with the maximum gain value.
The Gain Ratio Criterion: This criterion indicates the proportion of information generated by
the split that appears helpful for classification. Quinlan (1993) pointed out that the gain
criterion has a serious deficiency: it is strongly biased toward attributes with many
outcomes (values). For example, for any data that contains an attribute such as social security
number, the gain criterion will select that attribute to be the root of the decision tree. However,
selecting such attributes increases the size of the decision tree. Quinlan provided a solution to
this problem by introducing the gain ratio criterion, which takes the ratio of the information that
is gained by partitioning the initial set of examples T by the attribute X to the potential
information generated by dividing T into n subsets.
Following steps similar to those used to obtain the information conveyed by dividing T into n
subsets, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is
determined by
split info(X) = - Σi (|Ti| / |T|) log2 (|Ti| / |T|)    (2-5)
The gain ratio is given by
gain ratio(X) = gain(X) / split info(X)    (2-6)
and it expresses the proportion of information generated by the split that is useful for
classification.
Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the
set of training examples.
First, determine the amount of information gained by selecting the attribute "outlook" to be the
root of the decision tree. This attribute divides the training examples into three subsets:
"sunny", with five examples, two of which belong to the class Play; "overcast", with four
examples, all of which belong to the class Play; and "rain", with five examples, three of
which belong to the class Play. To determine info(T), the average information needed to
identify the class of an example in T: there are 14 training examples and two decision classes;
nine of these examples belong to the class Play and five belong to the class Don't Play.
info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.94 bits
When using "outlook" to divide the training examples, the information becomes
info_outlook(T) = 5/14 (-2/5 log2 (2/5) - 3/5 log2 (3/5))
+ 4/14 (-4/4 log2 (4/4) - 0/4 log2 (0/4))
+ 5/14 (-3/5 log2 (3/5) - 2/5 log2 (2/5)) = 0.694 bits
(taking 0 log2 0 = 0). By substituting in equation 2-4, the gain of information resulting from using the attribute
"outlook" to split the training examples equals 0.246. The information gain for "windy" is
0.048.
Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split
information for "outlook" is determined as follows:
split info(T) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits
The gain ratio for "outlook" = 0.246/1.577 = 0.156
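These hand computations can be checked mechanically. The sketch below assumes only the standard definitions in equations 2-2 through 2-6 (it is not C4.5 code); its unrounded results agree with the rounded figures above.

```python
# Verifying the worked example: entropy, gain, and gain ratio for the
# "outlook" split of Quinlan's 14-example weather data.
from math import log2

def entropy(counts):
    """info(S) for a list of per-class example counts (equation 2-2)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_and_ratio(subsets):
    """subsets[j] = per-class counts in branch j (equations 2-3 to 2-6)."""
    total = sum(sum(s) for s in subsets)
    info_t = entropy([sum(cls) for cls in zip(*subsets)])   # info(T)
    info_x = sum(sum(s) / total * entropy(s) for s in subsets)  # info_X(T)
    split = -sum(sum(s) / total * log2(sum(s) / total) for s in subsets)
    gain = info_t - info_x
    return gain, gain / split

# outlook branches: sunny (2 Play, 3 Don't), overcast (4, 0), rain (3, 2)
gain, ratio = gain_and_ratio([[2, 3], [4, 0], [3, 2]])
print(round(gain, 3), round(ratio, 3))   # 0.247 0.156
```

The unrounded gain is 0.2467 (printed as 0.247, versus 0.246 in the text's rounding) and the gain ratio is 0.156, matching the example.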
Figure 2-3 A decision tree learned using the gain criterion for selecting attributes
The C4.5 system handles discrete values as well as continuous values. To handle an attribute
with continuous values, C4.5 uses a threshold to transform the continuous domain into two
intervals. In other words, for each continuous attribute, C4.5 generates two branches: one
where the value of that attribute is greater than the determined threshold, and the other where the
value is less than or equal to the threshold.
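A minimal sketch of this binarization step follows. It is a simplification, not C4.5's exact threshold-selection procedure (midpoints between successive observed values are one common choice of candidate thresholds), and the temperature values are hypothetical.

```python
# Sketch of binarizing a continuous attribute: each candidate threshold t
# yields the two branches described above (value <= t and value > t).
# Midpoint candidates are a common simplification, not C4.5's exact rule.

def candidate_thresholds(values):
    """Midpoints between successive distinct observed values."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

def split(examples, attr, t):
    """The two branches induced by threshold t on attribute attr."""
    left = [e for e in examples if e[attr] <= t]
    right = [e for e in examples if e[attr] > t]
    return left, right

temps = [64, 65, 68, 70, 71]        # hypothetical temperature readings
print(candidate_thresholds(temps))  # [64.5, 66.5, 69.0, 70.5]
```

Each candidate would then be scored with the gain ratio of the resulting two-way split, and the best-scoring threshold kept.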
Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by
leaves. The C4.5 system uses the Laplace ratio for estimating the error rate of different subtrees.
This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is
the number of misclassified examples at a given leaf.
2.2.2 Building Decision Trees Using Statistics-based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a)
in building decision trees. The method uses the Chi-square statistic to measure the association
between two attributes. When building decision trees, the method is implemented such that it
determines the association between each attribute and the decision classes. The attribute
selected is the one with the greatest value.
To determine the Chi-square value for an attribute, let aij be the number of examples in
class number i where the attribute A takes value number j. In other words, aij is the frequency
of the combination of decision class number i and attribute value number j. The Chi-square
value for attribute A is given by
Chi-square(A) = Σi=1..n Σj=1..m [ (aij - Eij)² / Eij ]    (2-7)
where n is the number of decision classes and m is the number of values of a given attribute. Also,
Eij = (TCi × TVj) / T    (2-8)
where TCi and TVj are the total number of examples belonging to decision class Ci and the total
number of examples where the attribute A takes value vj, respectively, and T is the total number of
examples.
Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of different
combinations of values between the decision class and both the "outlook" and the "windy"
attributes. Table 2-4 shows the expected values, computed from TCi and TVj, of the frequencies in Table 2-3
for the different attribute values and decision classes.
To determine the association value between the decision classes and both the attribute "windy"
and the attribute "outlook", the observed Chi-square values (with expected frequencies rounded to one decimal) are:
Chi-square(Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/5.1] + [(2-2.9)²/2.9]
= 0.21 + 0.39 + 0.16 + 0.28 = 1.04
Chi-square(Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2] + [(3-1.8)²/1.8]
+ [(0-1.4)²/1.4] + [(2-1.8)²/1.8] = 0.45 + 0.75 + 0.01 + 0.8 + 1.4 + 0.02 = 3.43
Applying the same method to the other attributes, the results favor the attribute "outlook".
Once that attribute is selected to be a node in the tree, the remaining set of examples is divided
into subsets, and the same process is repeated on each subset.
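The computation in equations 2-7 and 2-8 is easy to reproduce. The sketch below is illustrative (not Hart's or Mingers' code) and uses unrounded expected frequencies, so its totals differ slightly from the hand computation above, which rounds each Eij to one decimal; the ranking of the attributes is unchanged.

```python
# Sketch of the Chi-square attribute selection measure (equations 2-7
# and 2-8): rows of the table are decision classes, columns are
# attribute values.

def chi_square(table):
    """table[i][j] = count of examples in class i with attribute value j."""
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]                       # TCi
    col_totals = [sum(table[i][j] for i in range(n_rows))          # TVj
                  for j in range(n_cols)]
    total = sum(row_totals)                                        # T
    value = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            e_ij = row_totals[i] * col_totals[j] / total           # eq. 2-8
            value += (table[i][j] - e_ij) ** 2 / e_ij              # eq. 2-7
    return value

# Contingency tables for Quinlan's weather data (rows: Play, Don't Play):
windy   = [[3, 6], [3, 2]]          # columns: windy = true, false
outlook = [[2, 4, 3], [3, 0, 2]]    # columns: sunny, overcast, rain

print(round(chi_square(windy), 2))    # 0.93
print(round(chi_square(outlook), 2))  # 3.55
```

With exact expected frequencies the values are about 0.93 and 3.55; "outlook" still scores far higher than "windy", so it is selected.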
Table 2-5 shows a summary of these criteria and their basic evaluation functions.
Table 2-5 Attribute selection criteria and their basic evaluation measures
  Info Measure (IM), Gain, and Gain Ratio:
      Entropy(S) = - Σi (freq(Ci, S)/|S|) log2 (freq(Ci, S)/|S|)
  G-statistic:
      G = 2N × IM   (N = number of examples)
  Chi-square:
      Chi-square(A, B) = Σi Σj [ (aij - Eij)² / Eij ]
2.2.3 Analysis of Attribute Selection Criteria
This subsection briefly introduces the analysis of different selection criteria that was done by
Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision
tree programs: the Information Measure (IM), Chi-square, G-statistic, Gini
index of diversity, Marshall correction, and Gain Ratio. The overall results show that the Gain
Ratio criterion had the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples
(i.e., examples may belong to more than one decision class) to observe how the selected criteria
evaluate the given attributes. The problem has two decision classes and two attributes, X and
Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The
training examples were unevenly spread between the two values of X. Attribute Y has three
values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the
contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split
provided by the six criteria. Mingers noted that the measures that are not based on information
theory give radiation (attribute X here) less weight. This may be because the zero in the first
row of radiation has a greater influence in the log calculation. In the case of the Chi-square
criterion, a zero cell adds the maximum association between any two attributes,
because the Chi-square value of a zero cell is the expected value of that cell.
Now let us consider results from another experiment done by Mingers. In this experiment,
Mingers used four different data sets to generate decision trees for eleven different criteria. In
the final results, he compared the total number of nodes and the total error rate provided by
each criterion over all given problems. Table 2-8 shows the final results for five selected
criteria only.
Table 2-8 Results comparing the total accuracy and size of decision trees for different attribute selection criteria on four problems
This experiment was performed on four real-world data sets. These data are concerned with
profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types
of iris, and recognizing LCD display digits. The data was divided randomly: 70% for training
and 30% for testing. For more details, see Mingers (1989a).
2.3 Learning Decision Structures
Considering the proposed definition of decision structures given above, two related lines of
research are described in this section. Gaines (1994) and Kohavi (1994) proposed two
approaches for generating decision structures that share some of the earlier ideas of Imam and
Michalski (1993b).
In the first approach, Brian Gaines introduced a method for transforming decision rules or
decision trees into exceptional decision structures. The method builds an Exception Directed
Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning
either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion
to the root node, it places it on a temporary conclusion list. Then it generates a new child node
and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the
conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if
the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new
conclusion C1, which is satisfied by the rule R0. The same process is repeated until all rules
that have common conditions with the rule at the root are evaluated. The method then creates a
new child node from the root and repeats the process until all rules are evaluated. In the decision
structure, nodes containing rules only represent conditions common to all of their children.
The main disadvantage of this approach is that it requires discriminant rules to build such a
decision structure. Also, such a structure is more complex than the traditional decision trees
that are used for decision-making.
Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure.
The decision structure can be read as follows: it is Safe, except if x1=1 & x2=1 & x3=1 and
either x4=3 or x5=1, then it is Lost; except if x6=1, it is Safe; except if x7=1, it is Lost.
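Read this way, the structure behaves like nested exception clauses. A hypothetical rendering of that reading in code (not Gaines's EDAG machinery) might be:

```python
# Hypothetical sketch of evaluating the exception reading of Figure 2-4:
# a default conclusion, with exceptions that may have their own exceptions.

def classify(ex):
    decision = "Safe"                               # default conclusion
    if (ex["x1"] == 1 and ex["x2"] == 1 and ex["x3"] == 1
            and (ex["x4"] == 3 or ex["x5"] == 1)):
        decision = "Lost"                           # first exception
        if ex["x6"] == 1:
            decision = "Safe"                       # exception to the exception
            if ex["x7"] == 1:
                decision = "Lost"                   # and one level deeper
    return decision

print(classify({"x1": 2, "x2": 1, "x3": 1, "x4": 3,
                "x5": 1, "x6": 2, "x7": 2}))        # Safe  (x1=2)
print(classify({"x1": 1, "x2": 1, "x3": 1, "x4": 3,
                "x5": 2, "x6": 2, "x7": 2}))        # Lost
```

Note how the nesting compresses the flat rule set of Figure 2-4: each "except" level shares the conditions accumulated above it.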
The second approach, introduced by Ronny Kohavi, learns decision structures from examples
using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing
Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a
decision graph where each attribute occurs at most once along any computational path. In other
words, in each path from the root to the leaves of the decision structure, an attribute may occur
as a node at most once. However, there may be more than one node with the same attribute in
the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are
partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level
terminate at the next level. An oblivious decision graph is a decision graph where all nodes at
a given level are labeled by the same attribute.
Safe <:: [x1=2]
Safe <:: [x2=2]
Safe <:: [x3=2]
Safe <:: [x4=1] & [x5=2]
Safe <:: [x4=1] & [x5=3]
Safe <:: [x6=1] & [x7=2]
Safe <:: [x6=1] & [x7=3]
Safe <:: [x4=2] & [x5=2]
Safe <:: [x4=2] & [x5=3]
Lost <:: [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a
nondeterministic method to select an attribute to build a new level of the decision structure. This
attribute is removed from the data, and the data is divided into subsets, each corresponding to a
complex combination of that attribute's values. For each subset the process is repeated until all
the examples of a given subset belong to one decision class. For example, suppose the selected
attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The
data is divided into two subsets: the first subset contains examples where A takes value 0 and
belong to class C0, or A takes value 1 and belong to class C1. The second subset of examples is
the set where A takes value 0 and belong to class C1, or A takes value 1 and belong to class C0.
The number of nodes of the first level (after the leaf nodes) is expected to be less than or
equal to k^n, where k is the number of decision classes and n is the number of values of the
selected attribute, and it can increase exponentially before that number is reduced
exponentially to one.
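One bottom-up step might be sketched as follows (an illustrative reconstruction, not Kohavi's implementation): after the chosen attribute is removed, projected examples that induce the same value-to-class mapping fall into the same node of the new level.

```python
# Sketch of one bottom-up HOODG step: remove the chosen attribute and
# group the projected examples by the value -> class mapping they induce;
# each distinct mapping becomes a node at the new level. Illustrative
# reconstruction only, not Kohavi's code.
from collections import defaultdict

def build_level(examples, attr):
    """examples: list of (dict_of_attribute_values, decision_class)."""
    mappings = defaultdict(dict)   # projected tuple -> {attr value: class}
    for attrs, cls in examples:
        rest = tuple(sorted((a, v) for a, v in attrs.items() if a != attr))
        mappings[rest][attrs[attr]] = cls
    # group projected tuples that induce the same mapping
    groups = defaultdict(list)
    for rest, mapping in mappings.items():
        groups[tuple(sorted(mapping.items()))].append(rest)
    return groups

# The two-value, two-class situation described in the text:
examples = [
    ({"A": 0, "B": 0}, "C0"), ({"A": 1, "B": 0}, "C1"),  # B=0: {0->C0, 1->C1}
    ({"A": 0, "B": 1}, "C1"), ({"A": 1, "B": 1}, "C0"),  # B=1: {0->C1, 1->C0}
]
level = build_level(examples, "A")
print(len(level))   # 2 distinct mappings -> 2 subsets, as in the text
```

The two groups correspond exactly to the two subsets described above: one for the mapping {A=0 to C0, A=1 to C1} and one for its reversal.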
It is easy for the reader to figure out some major disadvantages of such an approach. The
average size of such decision structures is estimated to be very large, especially when there
is no similarity (i.e., strong patterns) or logical relationship in the data. The time needed to learn
such a decision structure is relatively very high compared to systems for learning decision trees
from examples. Finally, it could be better to search for an attribute which reduces the number
of generated subsets of the data, instead of nondeterministically selecting an attribute to build a
new level of the decision structure. Kohavi provided a comparison between the C4.5 system
and his system, which can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a
deterministic method for attribute selection which minimizes the width of the penultimate level
of the graph.
Table 2-9 shows a comparison between the proposed approach and these two approaches. The
EDAG and HOODG systems are unreleased prototype systems. Among the compared properties, for example, decision structures produced by the proposed approach are easy to understand, EDAGs are difficult to read, and HOODG decision structures are easy to understand.
CHAPTER 3 DESCRIPTION OF THE APPROACH
3.1 General Methodology
In the proposed approach, the function of learning or discovery is separated from the function of
using the discovered knowledge for decision-making. The first function is performed by an
inductive learning program that searches for knowledge relevant to a given class of decisions
and stores the learned knowledge in the form of decision rules. The second function is performed
when there is a need for assigning a decision to new data points in the database (e.g., a
classification decision), by a program that transforms the obtained knowledge into a decision
structure optimized according to the given decision-making situation.
The Learning Task
Given:
  A set of training examples describing the concept to be learned.
  A learning goal, which specifies the decision classes to be learned from the training examples.
  Background knowledge to control the learning process.
Determine:
  A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given:
  A set of decision rules in conjunctive form.
  A description of the new decision-making situation (e.g., attribute costs and order, preference, importance or frequency of decision classes, etc.).
  One or more examples that need to be tested under the given decision-making situation.
  A set of parameters to control the learning process.
Determine:
  A decision structure that suits the given decision-making situation.
The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17
(Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of
declarative knowledge are that they do not impose any order on the evaluation of the attributes,
and, due to the lack of order constraints, decision rules can be evaluated in many different
ways, which increases the flexibility of adapting them to the different tasks of decision-making
(Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller
than the number of examples per class, generating a decision structure from decision rules can
potentially be done on line.
Such virtual decision structures can be tailored to any given decision-making situation. The
needed decision rules have to be generated only once, and then they can be used many times for
generating decision structures according to the changing requirements of decision-making tasks. The
method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures
from decision rules. Decision structures represent a procedural form of knowledge, which makes
them easy to implement but also harder to change. Consequently, decision structures can be quite
effective and useful as long as they are used in decision-making situations for which they are
optimized, and the attributes specified by the decision structure can be measured without much
cost. Figure 3-1 shows an architecture of the proposed methodology.
Figure 3-1 Architecture of the AQDT approach
It is assumed that the database is not static but is regularly updated. A decision-making problem
arises when there is a case, or a set of cases, to which the system has to assign a decision based on
the knowledge discovered. Each decision-making situation is defined by a set of attribute-values;
some attribute-values may be missing or unknown. A new decision structure is obtained such
that it suits the given decision-making problem. The learned decision structure associates the
new set of cases with the proper decisions.
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs
The decision rules are generated from examples by an AQ-type inductive learning system,
specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of
this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions, using
the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology,
called AQ, starts with a "seed" example of a given decision class and generates a set of the
most general conjunctive descriptions of the seed (alternative decision rules for the seed
example). Such a set is called the "star" of the seed example. The algorithm selects from the star
a description that optimizes a criterion reflecting the needs of the problem domain. If the
criterion is not defined, the program uses a default criterion that selects the description that
covers the largest number of positive examples (to minimize the total number of rules needed)
and, with second priority, involves the smallest number of attributes (to minimize the
number of attributes needed for arriving at a decision).
If the selected description does not cover all examples of a given decision class, a new seed is
selected from the uncovered examples, and the process continues until a complete class description
is generated. The algorithm can work with few examples or with many examples, and can
optimize the description according to a variety of easily-modifiable hypothesis quality criteria.
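The covering loop just described can be caricatured in a few lines. This is a drastic simplification (real AQ generates a star of alternative maximally general complexes for each seed and selects among them with the quality criterion); here a single rule is greedily generalized by dropping conditions while it stays consistent with the negative examples.

```python
# Caricature of the AQ covering loop: pick an uncovered seed, generalize
# it against the negatives, add the rule, repeat. Not AQ15 itself.

def covers(rule, example):
    """A rule (dict of conditions) covers an example if all conditions hold."""
    return all(example[a] == v for a, v in rule.items())

def learn_class(pos, neg):
    """Greedy covering: returns a list of rules covering all of `pos`."""
    rules, uncovered = [], list(pos)
    while uncovered:
        seed = uncovered[0]
        rule = dict(seed)                 # maximally specific description
        for attr in list(rule):           # drop conditions one by one...
            trial = {a: v for a, v in rule.items() if a != attr}
            if trial and not any(covers(trial, n) for n in neg):
                rule = trial              # ...while no negative is covered
        rules.append(rule)
        uncovered = [e for e in uncovered if not covers(rule, e)]
    return rules

pos = [{"x1": 0, "x2": 0}, {"x1": 1, "x2": 0}]
neg = [{"x1": 0, "x2": 1}, {"x1": 1, "x2": 1}]
print(learn_class(pos, neg))   # [{'x2': 0}]
```

Dropping conditions from the seed corresponds to generalization; the consistency check against the negatives plays the role of the star's boundary, and the covering loop ensures a complete class description.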
The learned descriptions are represented in the form of a set of decision rules expressed in an
attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A
distinctive feature of this representation is that it employs, in addition to standard logic operators,
the internal disjunction operator (a disjunction of values of the same attribute in a condition) and
the range operator (to express conditions involving a range of discrete or continuous values).
These operators help to simplify rules involving multivalued discrete attributes; the second
operator is also used for creating logical expressions involving continuous attributes.
AQ15 can generate decision rules that represent either characteristic or discriminant concept
descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic
description states properties that are true for all objects in the concept. The simplest
characteristic concept description is in the form of a single conjunctive rule (in general, it can be
a set of such rules). The most desirable is the maximal characteristic description, that is, a rule
with the longest condition part, i.e., stating as many common properties of objects of the given
class as can be determined. A discriminant description states properties that discriminate a given
concept from a fixed set of other concepts. The most desirable is the minimal discriminant
description, that is, a rule with the shortest condition part. For example, to distinguish a given
set of tables from a set of chairs, one may only need to indicate that tables "have large flat top".
A characteristic description of the tables would also include properties such as "have four legs",
"have no back", "have four corners", etc. Discriminant descriptions are usually much shorter than
characteristic descriptions.
Another option provided in AQ15 controls the relationship among the generated descriptions
(rulesets or "covers") of different decision classes. In the "IC" (Intersecting Covers) mode,
rulesets of different classes may logically intersect over areas of the description space in which
there are no training examples. In the "DC" (Disjoint Covers) mode, descriptions of different
classes are logically disjoint. The DC mode descriptions are usually more complex, both in the
number of rules and the number of conditions. There is also a "DL" mode (a Decision List mode,
also called "VL" mode, for variable-valued logic mode), in which the program generates rulesets
that are linearly ordered. To assign a decision to an example using such rulesets, the program
evaluates them in order. If ruleset i is satisfied by the example, then the decision is made;
otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes,
rulesets can be evaluated in any order.
Alternatively, the system can use rules from the AQ17-DCI program for Data-driven
Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for
generating additional attributes. These attributes are various logical or mathematical
combinations of the original attributes. The program generates a large number of potential new
attributes and selects from them those most promising, based on an attribute quality criterion.
To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is
shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form
expression) describes a voting record of Democratic Representatives in the US Congress.
Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational
statement. For example, the condition [State = northeast v northwest] states that the attribute
State (of the Representative) should take the value "northeast" or "northwest" to satisfy the
condition.
R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]
Figure 3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"
The above rules were generated from examples of the voting records. For illustration, below is
an example of a voting record of a Democratic representative:
Draft registration=no, Ban of aid to Nicaragua=no, Cut expenditure on MX missiles=yes, Federal subsidy to nuclear power stations=yes, Subsidy to national parks
in Alaska=yes, Fair housing bill=yes, Limit on PAC contributions=yes, Limit on food stamp program=no, Federal help to education=no, StateFrom=northeast, State Population=large, Occupation=unknown, Cut in social security spending=no, Federal help to Chrysler corp=not registered
By expressing elementary statements in the example as conditions and linking the conditions by
conjunction, the example can be re-expressed as a decision rule. Thus, decision rules and
examples formally differ only in their degree of generality.
3.3 Generating Decision Structures/Trees from Decision Rules
This section describes the AQDT-1 system for learning decision structures (trees) from decision
rules (Imam & Michalski, 1993a, b). Also included is a description of the AQDT-2 method for learning
task-oriented decision structures from decision rules; finally, the methodology is
illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning
due to their simplicity Decision trees built this way can be quite efficient as long as they are
used in decision-making situations for which they are optimized and these situations remain
relatively stable Problems arise when these situations significantly change and the assumptions
under which the tree was built do not hold anymore For example in some situations it may be
difficult to determine the value of the attribute assigned to some node One would like to avoid
measuring this attribute and still be able to classify the example if this is potentially possible
(Quinlan 1990) If the cost of measuring various attributes changes it is desirable to restructure
the tree so that the inexpensive attributes are evaluated first A tree restructuring is also
desirable if there is a significant change in the frequency of occurrence of examples from
different classes A restructuring of a decision tree to suit the above requirements is however
difficult to do. The reason for this is that a decision tree is a form of decision structure
representation that imposes constraints on the evaluation order of the attributes which are not
logically necessary.
One problem in developing a method for generating decision structures from decision rules is to
design an attribute selection criterion that is based on the properties of the rules rather than of
the training examples A decision rule normally describes a number of possible examples Only
some of them are examples that have actually been observed ie training examples An attribute
selection criterion is needed to analyze the role of each attribute in the rules It cannot be based
on counting the numbers of training examples covered by each attribute-value and the
frequency of decision classes in the training examples as is done in learning decision trees from
examples because the training examples are assumed to be unavailable
Another problem in learning decision trees from decision rules stems from the fact that decision
rules constitute a more powerful knowledge representation than decision trees They can directly
represent a description in an arbitrary disjunctive normal form while decision trees can represent
directly only descriptions in the disjoint disjunctive normal form In such descriptions all
conjunctions are mutually logically disjoint Therefore when transforming a set of arbitrary
decision rules into a decision tree one faces an additional problem of handling logically
intersecting rules
The solution to both problems (attribute selection and logically intersected rules) in the AQDT-2
system is based on the earlier work by Michalski (1978) which introduced a general method for
generating decision trees from decision rules The method aimed at producing decision trees
with the minimum number of nodes or the minimum cost (where the cost was defined as the
total cost of classifying unknown examples given the cost of measuring individual attributes and
the expected probability distribution of examples of different decision classes) More
explanations are provided in the following section
3.3.1 The AQDT-2 attribute selection method
This section describes the AQDT-2 method for building a decision structure from decision rules
The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples The major difference is that it assigns tests
(attributes) to the nodes using criteria based on the properties of the decision rules (this includes
statistics about the examples covered by each rule in the case of learning rules from examples)
rather than statistics characterizing the frequency of training examples per decision classes per
attribute-values or per conjunctions of both Other differences are that the branches may be
assigned an internal disjunction of values (not only a single value as in a typical decision tree)
and leaves may be assigned a set of alternative decisions with probabilities Also the tests can be
attributes or names standing for logical or mathematical expressions that involve several
attributes or variables In the following we use the terms test and attribute interchangeably
(to distinguish between an attribute and a name standing for an expression the latter is called a
constructed attribute)
At each step the method chooses the test from an available set of tests that has the highest utility
(see below) for the given set of decision rules This test is assigned to the node The branches
stemming from this node are assigned test values or disjoint groups of values (in the form of
logical disjunction if such occur in the rules subsumed groups of values are removed) Each
branch is associated with a reduced set of rules determined by removing conditions in which the
selected attribute assumes value(s) assigned to this branch If all rules in the reduced ruleset
indicate the same decision class a leaf node is created and assigned this decision class The
process continues until all nodes are leaf nodes If it is not possible to reduce further the rule set
because some attribute is declared as unavailable (infinite cost) then the leaf is assigned a set of
candidate decisions with associated probabilities (see Sec 42)
The test (attribute) utility is a combination of one or more of the following elementary criteria 1)
cost which indicates the cost of using each attribute for making decision 2) disjointness which
captures the effectiveness of the test in discriminating among decision rules for different decision
classes 3) importance which determines the importance of a test in the rules 4) value
distribution, which characterizes the distribution of the test importance over its set of values, and 5)
dominance, which measures the test's presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.
Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, that is,
the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm and
decision rule sets for these classes have been determined. Given a test A, let V1, V2, ..., Vm denote the
sets of values (outcomes) of A that are present in the rule sets for classes C1, C2, ..., Cm,
respectively. If the ruleset for some class, say Ci, contains a rule that does not involve test A, then
Vi is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is
the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for
Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is
defined by

                 0   if Vi ⊆ Vj
D(A, Ci, Cj) =   1   if Vi ⊃ Vj                                              (3-1)
                 2   if Vi ∩ Vj ≠ φ and Vi ∩ Vj ≠ Vi and Vi ∩ Vj ≠ Vj
                 3   if Vi ∩ Vj = φ
where φ denotes the empty set. Note that exchanging the second and third conditions of equation
(3-1) may seem to give an improved criterion; however, it would not clearly distinguish between the two
cases (i.e., in both situations the disjointness would be similar). The current equation is better
because it gives higher scores to attributes that separate different subsets of the two decision
classes than to attributes that separate only a subset of one decision class.
Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the
sum of the degrees of class disjointness of each decision class:

Disjointness(A) = Σ(i=1..m) D(A, Ci),  where  D(A, Ci) = Σ(j=1..m, j≠i) D(A, Ci, Cj)      (3-2)
The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are
all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test
values. If two tests have the same disjointness value, the attribute to be selected is the one with
the smaller number of values.
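For illustration, equations (3-1) and (3-2) can be sketched as a short procedure. The value sets below are hypothetical; the encoding as Python sets is an assumption of this sketch, not part of the AQDT-2 implementation.

```python
# Sketch of the disjointness criterion over the value sets V1, ..., Vm
# that a test takes in the rulesets of each decision class.
def pair_disjointness(vi, vj):                  # equation (3-1)
    if vi <= vj:
        return 0                                # Vi equal to, or subset of, Vj
    if vi > vj:
        return 1                                # Vi proper superset of Vj
    return 2 if vi & vj else 3                  # partial overlap / strictly disjoint

def disjointness(value_sets):                   # equation (3-2)
    return sum(pair_disjointness(vi, vj)
               for i, vi in enumerate(value_sets)
               for j, vj in enumerate(value_sets) if i != j)

# Hypothetical value sets of a test A in the rulesets of classes C1, C2, C3:
V = [{2, 3}, {1, 2, 3}, {1, 4}]
print(disjointness(V))   # 0+3 + 1+2 + 3+2 = 11
```

Note that the measure is asymmetric for subset pairs (0 in one direction, 1 in the other), so each ordered pair of classes contributes separately to the sum.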
Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree
is defined as the average number of tests (attributes) to be examined from the root of the tree to
any leaf node in order to reach a decision.
Definition: A decision structure is a one-node-per-level decision structure if at each level there is
only one node and zero or more leaves.
Such a decision structure can be generated by combining together all branches whose associated
sets of decision rules belong to more than one decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two
decision classes. The disjointness criterion ranks first the attributes that add the minimum number of
tests to the decision tree.
Proof: Consider all possible distributions of an attribute's values within two decision classes Ci
and Cj. There are three cases: 1) subset (equivalently, superset); 2) non-empty intersection but not a
subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the
same set of values in both classes is a trivial one). Assume that branches leading to one subset
with the same decision class are combined into one branch. In the first case there will be only two
branches. The first leads to a leaf node and the other leads to an intermediate node where
another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case three
branches should be created. Two branches lead to leaf nodes, where all values at each branch
belong to only one (and a different) decision class. The third branch leads to an intermediate node
where another attribute should be selected to further classify the decision classes. The minimum
ANT in this case is 6/4. In the third case only two branches are generated, each leading
to a leaf node with a different decision class. In this case the minimum ANT is 1.
Figure 3-4 shows decision trees equivalent to selecting the corresponding attribute. Note that in
case more than one attribute-value leads via its branch to leaves belonging to one decision
class, those branches are combined into one branch in the decision structure. The symbol "1" means
that an attribute is needed to classify the two decision classes. In such cases there will be at least
two additional paths.
D(A, Ci) = 0, D(A, Cj) = 1      D(A, Ci) = 2, D(A, Cj) = 2      D(A, Ci) = 3, D(A, Cj) = 3
Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is determined
in Figure 3-4. It is clear that in the case of two decision classes the disjointness criterion ranks
highest the attributes that reduce the average number of tests required for decision-making. The
theorem can be proved similarly in the general case.
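The ANT values used in the proof can be checked mechanically. The sketch below is an illustration, not part of AQDT-2: trees are represented as nested dicts (strings are leaves), and ANT is the average number of tests over all root-to-leaf paths.

```python
from fractions import Fraction

# ANT = average, over all leaves, of the number of attribute tests on the
# path from the root to that leaf.
def _walk(tree, depth=1):
    """Return (total tests summed over leaves, number of leaves)."""
    total, leaves = 0, 0
    for subtree in tree.values():
        if isinstance(subtree, dict):
            t, l = _walk(subtree, depth + 1)
            total, leaves = total + t, leaves + l
        else:
            total, leaves = total + depth, leaves + 1
    return total, leaves

def average_tests(tree):
    total, leaves = _walk(tree)
    return Fraction(total, leaves)

# Case 1 (subset): one branch is a leaf, the other needs a second test.
case1 = {"v1": "Ci", "v2": {"w1": "Ci", "w2": "Cj"}}
# Case 2 (overlap): two leaf branches plus one branch needing a second test.
case2 = {"v1": "Ci", "v2": "Cj", "v3": {"w1": "Ci", "w2": "Cj"}}
# Case 3 (disjoint): both branches are leaves.
case3 = {"v1": "Ci", "v2": "Cj"}

print(average_tests(case1), average_tests(case2), average_tests(case3))
# 5/3 3/2 1
```

The three results reproduce the minimum ANT values of the three cases in the proof.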
ANT = 3/2      ANT = 5/3      ANT = 1
("1" means at least one attribute is needed to complete the decision tree)
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify
the decision classes.
Proof: Suppose that the number of decision classes is m. Assume also that there are two attributes
A and B where D(A) < D(B). Since the disjointness criterion considers the mutual relationship
between any two decision classes, this means that there are more decision classes where D(A,
Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness
for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj)
than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci,
Cj), then B classifies the decision classes better than A.
For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute
are 0, 1, 4, or 6. For all positive values of D(B) = 1, 4, or 6, it is clear that attribute B should have a
smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) <
D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.
Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is
a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the
importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained
rules, each test is assigned a score that represents the total number of training examples that
are covered by the rules involving this test. Decision rules learned by an AQ learning program
are accompanied by information on their strength. Rule strength is characterized by t-weight
and u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of
that class covered by the rule. The importance score of a test is the aggregation of the total-weights
of all rules that contain that test in their condition part. Given a set of decision rules for
m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated
with class Ci denoted by ri, the importance score is defined as follows.
Definition 3-3: The importance score IS(Aj) of the test Aj is determined by

IS(Aj) = Σ(i=1..m) IS(Aj, Ci)                                   (3-3.1)

where

IS(Aj, Ci) = Σ(k=1..ri) Rik(Aj)                                 (3-3.2)

and Rik(Aj), the weight of a test Aj in the rule Rik of class Ci, is given by

Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0 otherwise      (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n
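For illustration, the aggregation in equations (3-3) and (3-4) can be sketched as follows. The ruleset and t-weights below are hypothetical, and the dict encoding of rules is an assumption of this sketch.

```python
# Each rule carries a t-weight (the number of training examples it covers);
# IS(A) sums the t-weights of all rules whose condition part mentions A.
rules = [
    {"class": "C1", "t_weight": 10, "conditions": {"x1": {2}, "x2": {2}}},
    {"class": "C1", "t_weight": 4,  "conditions": {"x1": {3}, "x4": {1}}},
    {"class": "C2", "t_weight": 7,  "conditions": {"x2": {3, 4}}},
]

def importance_score(attr, rules):
    return sum(r["t_weight"] for r in rules if attr in r["conditions"])

print(importance_score("x1", rules), importance_score("x2", rules))  # 14 17
```

Here x2 scores higher than x1 because the rules mentioning it cover more training examples in total.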
The importance score method has been separately compared, as a feature selection method, with
a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method
produced an equal or higher accuracy on three real-world problems than that reported for the
GA method, while selecting fewer attributes. In addition, the IS method was significantly faster
than the GA.
Value distribution: The third elementary criterion, value distribution, concerns the number of
legal values of tests. Given two tests with equal importance scores, this criterion prefers the test
with the smaller number of legal values. Experiments have shown that this criterion is especially
useful when using discriminant decision rules.
Definition 3-4: The value distribution VD(Aj) of a test Aj is defined by

VD(Aj) = IS(Aj) / vj                                            (3-5)

where vj is the number of legal values of Aj.
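Equation (3-5) is a simple normalization; a minimal sketch with hypothetical numbers:

```python
# Value distribution normalizes the importance score by the number of
# legal values of the test (hypothetical scores below).
def value_distribution(importance_score, num_legal_values):
    return importance_score / num_legal_values

# Two tests with equal IS = 12: the one with fewer legal values is preferred.
print(value_distribution(12, 3), value_distribution(12, 4))  # 4.0 3.0
```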
Dominance: The fourth elementary criterion, dominance, prefers tests that appear in a large
number of rules, as this indicates their high relevance for discriminating among the rulesets of the
given decision classes. Since some conditions in the rules have values linked by internal
disjunction, counting such rules directly would not properly reflect their relevance. Therefore,
for computing the dominance the rules are counted as if they were converted to rules that do not
have internal disjunction Such a conversion is done by multiplying out the condition parts of
the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is
multiplied out to two rules with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
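The multiplying-out step can be sketched with a Cartesian product over the disjoined value sets (an illustrative sketch; the dict encoding of conditions is an assumption):

```python
from itertools import product

# Expand a rule whose conditions contain internal disjunctions into rules
# with single-valued conditions, so they can be counted for dominance.
def multiply_out(conditions):
    """conditions: dict mapping attribute -> set of disjoined values."""
    attrs = sorted(conditions)
    return [dict(zip(attrs, combo))
            for combo in product(*(sorted(conditions[a]) for a in attrs))]

# The document's example: [x3=1 v 3] & [x4=1] multiplies out to two rules.
print(multiply_out({"x3": {1, 3}, "x4": {1}}))
# [{'x3': 1, 'x4': 1}, {'x3': 3, 'x4': 1}]
```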
The above criteria are combined into one general test measure using the lexicographic
evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of
the above elementary criteria, each associated with a "tolerance threshold" in percentage. The
criteria are applied to tests in the order defined by LEF. A test passes to the next criterion only if
it scores on the previous criterion within the range defined by the tolerance (from the top value).
The default LEF is

<Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>      (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percentages); their default values are 0.
The default value of the cost of each test is 1.
The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost.
If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two
or more attributes still share the same top score, or their scores differ by less than the assumed
tolerance threshold t2, the method evaluates these attributes using the next (importance)
criterion. If again two or more attributes share the same top score, or their scores differ by less than
the tolerance threshold t3, then the value distribution criterion (the normalized IS) is used, and then
similarly the fourth criterion (dominance). If there is still a tie, the method selects among the tied
attributes randomly.
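One possible reading of this ranking procedure is sketched below. The scores, the percentage-tolerance filter, and the tie-breaking are illustrative assumptions, not the system's exact implementation.

```python
# Apply criteria in order; after each criterion keep only the candidates
# whose score lies within the tolerance (a fraction of the top score).
def lef_rank(candidates, criteria):
    """criteria: list of (score_function, tolerance, maximize) triples."""
    survivors = list(candidates)
    for score, tol, maximize in criteria:
        best = max(survivors, key=score) if maximize else min(survivors, key=score)
        cutoff = score(best) * (1 - tol) if maximize else score(best) * (1 + tol)
        survivors = [c for c in survivors
                     if (score(c) >= cutoff if maximize else score(c) <= cutoff)]
        if len(survivors) == 1:
            break
    return survivors[0]   # a remaining tie would be broken randomly

# Hypothetical scores: cost is minimized, the other criteria are maximized.
tests = {"x1": {"cost": 1, "disjointness": 11, "importance": 14},
         "x2": {"cost": 1, "disjointness": 0,  "importance": 17}}
winner = lef_rank(tests, [(lambda t: tests[t]["cost"], 0.0, False),
                          (lambda t: tests[t]["disjointness"], 0.0, True),
                          (lambda t: tests[t]["importance"], 0.0, True)])
print(winner)  # x1
```

With zero tolerances the procedure degenerates to strict lexicographic ordering; non-zero tolerances let near-ties fall through to the next criterion.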
If there is a non-uniform frequency distribution of examples of different classes, then the
selection criterion uses a modified definition of the disjointness. Namely, the previously defined
disjointness for each class is multiplied by the frequency of the class occurrence. The class
occurrence is the expected number of future examples that are to be classified to a given class:

Disjointness(A) = Σ(i=1..m) D(A, Ci) · Frq(Ci)                  (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, which is assumed to be given
by the user. The attribute ranking criterion in this case is defined by the LEF

<Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>      (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other
elementary criteria are treated the same way as in the default version.
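A minimal sketch of equation (3-7), with hypothetical class disjointness values and user-supplied expected class occurrences:

```python
# Frequency-weighted disjointness: each class disjointness D(A, Ci) is
# scaled by the expected number of future examples of that class.
def weighted_disjointness(class_disjointness, frequencies):
    return sum(class_disjointness[c] * frequencies[c] for c in class_disjointness)

d = {"C1": 3, "C2": 3, "C3": 5}      # hypothetical D(A, Ci) values
frq = {"C1": 5, "C2": 3, "C3": 2}    # expected class occurrences
print(weighted_disjointness(d, frq))  # 15 + 9 + 10 = 34
```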
3.3.2 The AQDT-2 algorithm
The AQDT-2 algorithm constructs a decision structure from decision rules by recursively
selecting at each step the best test according to the ranking criteria described above and
assigning it to a new node. The process stops when the algorithm creates terminal branches that
are assigned decision classes. To facilitate such a process, the system creates a special data
structure for each concept description (ruleset). This structure has fields such as the number of
rules, the number of decision classes, and the number of attributes present in the rules. A set of
pointers connects this data structure to a set of data structures, each representing one decision class.
The decision class structure contains fields with information on the number of rules belonging to
that class, the frequency of the decision class, etc. It is also connected to a set of data structures
representing the decision rules within each decision class. The system independently creates a set
of data structures, each corresponding to one attribute. Each attribute description contains the
attribute's name, domain, type, the number of legal values, a list of the values, the number of
rules that contain that attribute, and the values of that attribute for each rule. The attributes are
arranged in an array in lexicographic order: first in descending order of the number of rules
that contain the attribute, and second in ascending order of the number of the attribute's
legal values.
The system can work in two modes. In the standard mode, the system generates standard
decision trees in which each branch has a specific attribute-value assigned. In the compact
mode, the system builds a decision structure that may contain:
A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this
leads to simpler structures. For example, if a node assigned attribute A has a branch marked by
values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The
program creates "or" branches on the basis of the analysis of the value sets Vi while computing
the degree of attribute disjointness.
B) nodes that are assigned derived attributes, that is, attributes that are certain logical or
mathematical combinations of the original attributes. To produce decision structures with
derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The
AQ17 rules may contain conditions involving attributes constructed by the program rather than
those originally given.
To generate decision structures from rules, the AQDT-2 method prefers either characteristic or
discriminant disjoint rule descriptions (given by an expert or learned by a system). Disjoint rules
are more suitable for building decision structures. Assume that the description of each class is in
the form of a ruleset and that this set is the initial ruleset context. The AQDT-2 algorithm is as follows.
The AQDT-2 Algorithm
Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking
measure. Select the highest ranked attribute. Let A represent this highest-ranked
attribute.
Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and
assign to it the attribute A. In standard mode, create as many branches from the node as
there are legal values of the attribute A, and assign these values to the branches. In
compact mode (decision structures), create as many branches as there are disjoint value
sets of this attribute in the decision rules, and assign these sets to the branches.
Step 3: For each branch, associate with it a group of rules from the ruleset context that contain a
condition satisfied by the value(s) assigned to this branch. For example, if a branch is
assigned value i of attribute A, then associate with it all rules containing condition [A =
i v ...]. If a branch is assigned values i v j, then associate with it all rules containing
condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in
the ruleset context that do not contain attribute A, add these rules to all rule groups
associated with the branches stemming from the node assigned attribute A. (This step is
justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming
that a and b are the only legal values of y.) All rules associated with the given branch
constitute the ruleset context for this branch.
Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf
node and assign that class to it. If all branches of the tree have leaf nodes, stop.
Otherwise, repeat steps 1 to 4 for each branch that has no leaf.
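For illustration, Steps 1 to 4 can be sketched as a short recursive procedure. The sketch below is a simplification under stated assumptions: it works in standard mode only, ranks attributes by the disjointness criterion alone (the full method combines five criteria through LEF), and encodes rules as Python dicts, which is not AQ15's actual format. The rules used are those of Figure 3-6 (Section 3.3.3).

```python
# Standard-mode sketch of the AQDT-2 loop, using only disjointness
# (equations 3-1 and 3-2) to select the test for each node.
DOMAINS = {"x1": {1, 2, 3, 4}, "x2": {1, 2, 3, 4}, "x3": {1, 2, 3}, "x4": {1, 2, 3}}

RULES = [("T1", {"x1": {2}, "x2": {2}}),
         ("T1", {"x1": {3}, "x3": {1, 3}, "x4": {1}}),
         ("T2", {"x1": {1, 2}, "x2": {3, 4}}),
         ("T2", {"x1": {3}, "x3": {1, 2}, "x4": {2}}),
         ("T3", {"x1": {1}, "x2": {1}}),
         ("T3", {"x1": {4}, "x3": {2, 3}, "x4": {3}})]

def pair_disj(vi, vj):                      # equation (3-1)
    if vi <= vj: return 0
    if vi > vj:  return 1
    return 2 if vi & vj else 3

def disjointness(attr, rules):              # equation (3-2)
    vsets = {}
    for cls, conds in rules:                # a rule omitting attr spans the whole domain
        vsets.setdefault(cls, set()).update(conds.get(attr, DOMAINS[attr]))
    return sum(pair_disj(vsets[ci], vsets[cj])
               for ci in vsets for cj in vsets if ci != cj)

def build(rules):
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:                   # Step 4: uniform class -> leaf
        return classes.pop()
    attrs = {a for _, conds in rules for a in conds}
    if not attrs:                           # indeterminate leaf (cf. Section 3.3.3)
        return "/".join(sorted(classes))
    best = max(attrs, key=lambda a: disjointness(a, rules))      # Step 1
    node = {}
    for value in sorted(DOMAINS[best]):     # Step 2: one branch per legal value
        branch = []
        for cls, conds in rules:            # Step 3: reduce the ruleset context
            if best not in conds:           # consensus law: rule joins every branch
                branch.append((cls, conds))
            elif value in conds[best]:
                branch.append((cls, {a: v for a, v in conds.items() if a != best}))
        if branch:
            node[value] = build(branch)
    return (best, node)

tree = build(RULES)
print(tree[0])      # x1 is chosen for the root, as in Figure 3-7
print(tree[1][4])   # the x1=4 branch ends in leaf T3
```

Because this sketch runs in standard mode, it produces more leaves than the compact structure of Figure 3-7, which merges branches carrying internal disjunctions of values.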
To select an attribute to be a node of the decision tree (steps 1 and 2 of the algorithm), the
algorithm performs two independent iterations. In the first iteration it parses all decision
rules and determines information about each attribute. This information includes the importance
score of each attribute, the number of rules containing a given attribute, the disjoint value sets of
each attribute, and the attribute values used in describing each decision class. The second
iteration is performed only if the disjointness criterion is ranked first in the LEF. The
second iteration evaluates each attribute's disjointness for each decision class against the other
decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions
formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is
the total number of decision rules (in all decision classes):

r = Σ(i=1..m) Ri      (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be
determined as

Cmpx(Iter1) = O(r · s)
In the second iteration, the disjointness is calculated between the decision classes for all
attributes. The complexity of the second iteration can be given by

Cmpx(Iter2) = O(n · m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at
this node and the number of rules associated with this node:

l = max{m, r}      (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm
for building one node in the decision tree, say the Node Complexity NC(AQDT), is given by

NC(AQDT) = O(l · n)

Usually l equals the number of rules associated with the given node. Thus the AQDT
complexity for building one node is a function of the number of attributes multiplied by the
number of rules associated with this node.
At each level of the decision tree, these two iterations are repeated for each non-leaf node. The
Level Complexity of the AQDT algorithm, LC(AQDT), can be given by

LC(AQDT) < O(l · n)

which is less than the complexity of generating the root of the decision tree. To explain this,
consider that the maximum possible number of non-leaf nodes at one level is half the number of the
initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at
the lowest level each node classifies only two rules, each belonging to a different decision class. This
decision tree could be a multi-valued tree. However, the number of non-leaf nodes at each level
will be twice or more the number of non-leaf nodes at the previous level. Consider the level
complexity of the AQDT algorithm to be (l · s · o), where o is the number of non-leaf
nodes at the given level; in such cases either (l · o ≤ r) or (l · s < r). In Figure 3-5-a the
complexity of the AQDT algorithm at any lower level is given by

LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)
a) per one level      b) per one path
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes
Note also that after selecting an attribute to be the root of the decision structure, this attribute
and all conditions containing it are removed from the data structure of the algorithm.
Also, if a leaf node is generated, all rules belonging to the corresponding branch are not
tested again.
Since the disjointness criterion selects the attribute which minimizes the average number of tests,
ANT, the AQDT algorithm generates decision trees with the least number of levels.
The number of levels in a decision tree is supposed to be less than or equal to the minimum of
both the number of attributes and the number of rules. Let k be the number of levels in a
given decision tree:

k ≤ min{n, r}      (3-10)

There are two cases representing the most complex situations, Figure 3-5-a and 3-5-b. In the first
case, where the decision rules are divided evenly, the number of levels will be a function of the
logarithm of the number of rules. In such a case the complexity of the AQDT algorithm for
generating a decision tree from a set of decision rules is given by

Complexity(AQDT) = O(l · n · log r)      (3-11)
The other situation is when the generated decision tree has the maximum number of levels. The
maximum possible number of levels in a decision tree equals one less than the number of
decision rules (Figure 3-5-b). Using the disjointness criterion it is unlikely to obtain such a decision
tree, because it has the maximum average number of tests (ANT) that can be determined from the
same set of nodes and leaves. However, such a decision tree can be generated if the number of
decision classes is one less than the number of attributes. In such a case any disjoint decision rules
should have a maximum length that is less than or equal to the floor of the logarithm of the number of
attributes. Thus the level complexity of this decision tree is estimated as

LC(AQDT) = O(l · log n)

The maximum number of levels in such a decision tree is k-1. Thus the complexity of the AQDT
algorithm in such cases is given by

Complexity(AQDT) = O(l · k · log n)      (3-12)
Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12)
r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT
algorithm is determined by

Cmplx(AQDT) = O(r · k · log l)      (3-13)
3.3.3 An example illustrating the algorithm
The following simple example illustrates how AQDT is used in selecting an optimal set of
testing resources for testing software. Suppose there are three tools for testing software: 1)
modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four
different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the
metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool
(automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible
values.
Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6
shows a sample of these rules in AQ15c format.
Table 3-1: The available tools and the factors that affect the selection process
T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]
Figure 3-6 Decision rules for selecting the best tool for testing software
These rules can be interpreted as
Rule 1: Use the first tool for testing if you need average cost and the tool is
supported by the requirement metric.
Rule 2: Use the first tool for testing if you can afford high cost for testing, either
in the requirement or the analysis phase, and you need an automated tool.
Rule 3: Use the second tool for testing if the cost limit is low or average and the
tool is supported by the system usage metric or the intractability metric.
Rule 4: Use the second tool for testing if you can afford high cost for testing,
either in the requirement or the design phase, and you need a manual tool.
Rule 5: Use the third tool for testing if you are limited to low cost and the
tool is supported by the error rate metric.
Rule 6: Use the third tool for testing if you can afford very high cost for testing,
either in the requirement or the system usage phase, and you need a semi-automated
tool.
Table 3-2 presents information on these rules and the disjointness values for all attributes. For
each class, the row marked "Values" lists the values occurring in the ruleset for this class. For
evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not
contain attribute A is treated as having an additional condition [A = a v b v ...], where a, b, ...
are all legal values of A.
The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1
has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume
the tolerances for each elementary criterion equal 0.
From the rules in Figure 3-6 we can also determine the disjoint groupings of attribute-values used
in the compact mode of the algorithm. This is done as follows: 1) determine for each attribute the
sets of values that the attribute takes in the individual decision rules, and 2) remove those value sets
that subsume other value sets. The remaining value sets are assigned to branches stemming from
the node marked by the given attribute. For example, x1 has the following value sets in the
individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). The value set {1, 2} is
removed, as it subsumes {2} and {1}. In this case branches are assigned the individual values of the
domain of x1. For attribute x2 the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case
branches are assigned the value sets {1}, {2}, and {3, 4}.
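The subsumption-removal step just described can be sketched as follows (an illustrative sketch; the list-of-sets encoding is an assumption):

```python
# Collect the value sets an attribute takes in individual rules, then drop
# any set that subsumes (is a proper superset of) another remaining set.
def disjoint_groups(value_sets):
    unique = []
    for vs in value_sets:
        if vs not in unique:
            unique.append(vs)
    return [vs for vs in unique
            if not any(other < vs for other in unique)]

# x1's value sets from Figure 3-6: {1, 2} subsumes {2} and {1}, so it is removed.
print(disjoint_groups([{2}, {3}, {1, 2}, {1}, {4}]))
# [{2}, {3}, {1}, {4}]
```

Applying the same procedure to x2's value sets removes {1, 2, 3, 4}, leaving {1}, {2}, and {3, 4}, as stated above.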
Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the
tree. Four branches are created, each corresponding to one of x1's possible values. Since all
rules containing [x1=4] belong to class T3, the branch marked by 4 ends with a leaf T3. Rules
containing other values of x1 belong to more than one class. This process is repeated for each
subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned
by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used for
making decisions on which tools to use for testing a given software system.
(Complexity: No. of nodes: 4; No. of leaves: 7)
Figure 3-7: A decision structure learned for classifying software testing tools
Figure 3-8-a shows the diagrammatic visualization of the decision rules and Figure 3-8-b shows
the visualization of the derived decision tree Each diagram in Figure 3-8 consists of cells
representing one combination of attribute-values Attributes and their legal values are shown on
scales surrounding the diagram (eg the horizontal scale for xl shows values 1 2 3 and 4)
Rules are represented by collections of cells in the intersection of the rows and columns
corresponding to the conditions in the rules
The shaded areas correspond to decision rules. Rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31, and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 the first rule of class T3, i.e., [x1=1]
& [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].
Figure 3-8: a) Decision rules; b) Derived decision tree
Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2, and T3 than the original rules. Let us assume that it is very costly to know which metrics support the required tools. In other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T3. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4. For the value 1 of x4, the recommended tool is T1, and for the value 2 of x4, the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.
Figure 3-9: Decision trees learned ignoring a) the supporting metric and b) the type of the testing tool
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.
[Figure: decision tree rooted at x4.]
Figure 3-10: A decision tree learned without the cost attribute
3.4 Tailoring Decision Structures to the Decision-Making Situation
Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which the attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a
decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy, but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are specified by an expert. An efficient algorithm was developed and implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.
Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.
3.4.1 Learning Cost-Dependent Decision Structures
As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0%. This means that only the least expensive attributes pass to the next step of evaluation, involving the other elementary criteria. If an attribute has a high cost, or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.
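A minimal sketch of this first LEF filtering step follows; the names are assumptions, and the real LEF chains several elementary criteria, each with its own tolerance:

```python
import math

def lef_cost_filter(attributes, cost, tolerance=0.0):
    """Pass only attributes whose measurement cost is within `tolerance`
    (a relative margin) of the cheapest finite-cost attribute.
    Attributes with infinite cost never pass."""
    finite = [a for a in attributes if math.isfinite(cost[a])]
    if not finite:
        return []
    cheapest = min(cost[a] for a in finite)
    return [a for a in finite if cost[a] <= cheapest * (1.0 + tolerance)]

# With x1 unavailable (infinite cost) and tolerance 0, only the
# cheapest measurable attributes survive to the next criterion.
costs = {'x1': math.inf, 'x2': 1.0, 'x3': 1.0, 'x4': 5.0}
```

The attributes that pass this filter are then ranked by the remaining elementary criteria (disjointness, importance, etc.).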
3.4.2 Assigning Decisions Under Insufficient Information
In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained, but a decision has to be made, it is useful to know the probability distribution over the different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have

   P(Ci | b1, ..., bk) = P(Ci) P(b1, ..., bk | Ci) / P(b1, ..., bk)        (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from the different classes, we have
   P(Ci) = twi / Σ_{j=1..m} twj                                            (3-10)

   P(b1, ..., bk | Ci) = wi / twi                                          (3-11)

   P(b1, ..., bk) = Σ_{j=1..m} wj / Σ_{j=1..m} twj                         (3-12)

By substituting (3-10), (3-11), and (3-12) into (3-9), we obtain

   P(Ci | b1, ..., bk) = wi / Σ_{j=1..m} wj                                (3-13)
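Equation (3-13) is straightforward to compute at a node. The sketch below (the function name is hypothetical) reproduces the node-x2 estimates worked out later in Section 4.2:

```python
def class_probabilities(w):
    """Estimate P(Ci | b1,...,bk) at a node from wi, the number of
    training examples of each class Ci that passed the tests leading
    to the node, using equation (3-13): wi / sum_j wj."""
    total = sum(w.values())
    return {c: wi / total for c, wi in w.items()}

# Node x2 of the wind bracing structure (Section 4.2): w1=31, w2=11,
# w3=0, w4=5 give approximately 0.66, 0.23, 0.0, and 0.11.
p = class_probabilities({'C1': 31, 'C2': 11, 'C3': 0, 'C4': 5})
```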
A related method for handling the problem of the unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.
3.4.3 Coping with Noise in Training Data
The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than decision tree pruning, because truncation decisions are based solely on the importance of the given rule or condition for decision-making, regardless of the evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
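The truncation step can be sketched as follows, under the assumption that a rule's t-weight is compared with the expected noise share of its class's training examples (names are illustrative):

```python
def truncate_rules(rules, class_size, noise_level=0.10):
    """Remove rules whose t-weight (number of training examples the
    rule covers) is at or below the assumed noise level for the rule's
    class; the surviving rules are used to build the structure."""
    return [(cond, cls, t) for cond, cls, t in rules
            if t > noise_level * class_size[cls]]

# With 31 training examples of class C1 and a 10% noise level, a rule
# covering only 2 examples is truncated, while one covering 18 survives.
```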
3.5 Analysis of the AQDT-2 Attribute Selection Criteria
This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes. The best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes. The problem has ambiguous examples (i.e., examples belonging to more than one decision class).
In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

   Play <:: [outlook = overcast]
   Play <:: [outlook = sunny] & [humidity <= 75]
   Play <:: [outlook = rain] & [windy = false]
For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.
Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute for the root of the tree. The goal of this experiment is first to
select the correct attribute, and then to test how each criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all the other criteria did.
Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.
The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion performs better in the case of evaluating the training examples. This is because these two criteria depend on the relationship between the attributes and the decision rules,
which cannot be measured from examples. When these criteria were applied to the learned rules, they provided very good results.
The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the most balanced appearance of its values in the different rules. The dominance criterion prefers attributes that occur in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
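As an illustration of the first criterion, disjointness can be scored from the unions of an attribute's value sets per class. The pairwise scoring below (3 for disjoint sets, 2 for partial overlap, 1 for strict containment, 0 for equality) is an assumed reconstruction for illustration, not necessarily AQDT-2's exact formula:

```python
from itertools import combinations

def disjointness(value_sets_by_class):
    """Score an attribute from the union of its value sets in each
    class's rules: the more disjoint the per-class sets, the better
    the attribute discriminates between the decision classes."""
    score = 0
    for a, b in combinations(map(set, value_sets_by_class.values()), 2):
        if not (a & b):
            score += 3          # disjoint: perfect discrimination
        elif a == b:
            score += 0          # identical: no discrimination
        elif a < b or b < a:
            score += 1          # one contains the other
        else:
            score += 2          # partial overlap
    return score
```

Under this scoring, an attribute whose value sets separate the classes completely receives the maximum score and would be ranked first by a disjointness-led LEF.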
3.6 Decision Structures vs. Decision Trees
This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex, and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.
Table 3-7: A comparison between decision structures and decision trees
Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11. Both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes. This decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.
[Figure: two decision structures. a) Using the disjointness criterion (root x5); P = Positive, N = Negative; number of nodes: 5. b) Using the importance score criterion (root x1); P = Positive, N = Negative; number of nodes: 7, number of leaves: 9.]
Figure 3-11: Decision structures learned by AQDT-2 using different criteria
To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the Imam's example, that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example rests on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.
The concept to learn is: P if x1 = x2, and N otherwise.

The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "-" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per each value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select
either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.
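The effect can be reproduced with a small information-gain computation. The dataset below is a hypothetical miniature in the spirit of the Imam's example (not the dissertation's actual 24 examples): the target is P iff x1 = x2, while the irrelevant x3 merely leans toward the class:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    labels = [e['class'] for e in examples]
    gain = entropy(labels)
    for value in {e[attr] for e in examples}:
        sub = [e['class'] for e in examples if e[attr] == value]
        gain -= len(sub) / len(examples) * entropy(sub)
    return gain

rows = [  # class = P iff x1 == x2; x3 only loosely tracks the class
    (1, 1, 1, 'P'), (1, 1, 1, 'P'), (2, 2, 1, 'P'), (2, 2, 2, 'P'),
    (1, 2, 2, 'N'), (1, 2, 2, 'N'), (2, 1, 2, 'N'), (2, 1, 1, 'N'),
]
examples = [dict(zip(('x1', 'x2', 'x3', 'class'), r)) for r in rows]
# info_gain over x1 and x2 is exactly zero, yet the irrelevant x3 shows
# positive gain, so a gain-based learner picks x3 even though the
# correct root test involves x1 or x2.
```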
Figure 3-12: The Imam's example: a) training examples; b) the optimal decision tree. An example where learning decision structures (trees) from rules is better than learning them from examples
AQ15c learned the following rules from this data:

   P <:: [x1=1][x2=1] v [x1=2][x2=2]
   N <:: [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
An example of problems in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

   P <:: [x1=2] v [x2=2]
   N <:: [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]
Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10:9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and 8.5 for
the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).

When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using the new attribute "x1=2 v x2=2", with values 0 for "no" and 1 for "yes".
Figure 3-13: An example where decision rules are simpler than decision trees: a) the training data; b) the correct decision tree
CHAPTER 4: Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.
The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.
In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
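This sampling scheme can be sketched as follows (the helper name is an assumption; the dissertation's actual scripts are not shown):

```python
import random

def learning_curve_splits(examples, fractions=(0.1, 0.2, 0.3, 0.4, 0.5,
                                               0.6, 0.7, 0.8, 0.9),
                          samples_per_size=100, seed=0):
    """For each relative training size, draw random training samples
    and pair each with its complementary testing set."""
    rng = random.Random(seed)
    n = len(examples)
    for frac in fractions:
        k = round(n * frac)
        for _ in range(samples_per_size):
            chosen = set(rng.sample(range(n), k))
            train = [examples[i] for i in sorted(chosen)]
            test = [examples[i] for i in range(n) if i not in chosen]
            yield frac, train, test
```

Each yielded pair keeps the training and testing sets disjoint, so every accuracy figure is measured on examples the learner never saw.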
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.

Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (the best path from top to bottom), in terms of accuracy, time, and complexity, were used as the default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.
Figure 4-1: Design of a complete experiment
For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. The analysis of some experiments included visualization of the training examples and the concepts learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.
For each problem (i.e., database): 9 different relative sample sizes of training examples were selected (10%, ..., 90%); 100 random samples of each size were drawn from the original data for training, and the 100 complementary sets remaining after drawing the training data were used for testing (900 samples for training and their 900 complementary samples for testing).

162 different parametrical experiments per training dataset (18 x 9); 16,200 experiments per sample size (9 samples); 145,800 experiments per first portion of a problem (AQ15c followed by AQDT-2); 199,800 experiments per problem (first portion + C4.5 + constructive induction); 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.); 73 days (estimated running time).
The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection after it describes a partial or full experimental analysis of one of the other problems.
4.2 Experiments with an Average-Size, Complex, and Noise-Free Problem: Wind Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision
structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.
Decision class C1:
1 [x1=1][x6=1][x2=1..2][x3=1..2][x4=1..3][x5=1..2][x7=1..3] (t: 18, u: 18)
2 [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1..3][x7=1 v 3 v 4] (t: 3, u: 3)
3 [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2..3] (t: 2, u: 2)
4 [x1=1][x6=1][x2=2][x3=1..2][x4=3][x5=1..2][x7=4] (t: 2, u: 2)
5 [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1..2] (t: 2, u: 2)
6 [x1=1][x3=1][x6=1][x2=2][x4=1..3][x7=1..3][x5=3] (t: 2, u: 2)
7 [x1=2][x5=2][x2=1][x6=1][x3=1..2][x4=3][x7=4] (t: 2, u: 2)

Decision class C2:
1 [x1=2..4][x2=1..2][x3=1..2][x4=3][x5=2..3][x6=1][x7=2..3] (t: 28, u: 19)
2 [x1=2..4][x2=2][x3=1..2][x4=3][x5=1..2][x6=1][x7=3..4] (t: 17, u: 6)
3 [x1=2..4][x2=1..2][x3=1..2][x4=3][x5=1][x6=1][x7=3..4] (t: 10, u: 4)
4 [x1=1 v 3 v 5][x2=1..2][x3=1..2][x4=3][x5=3][x6=1][x7=2..4] (t: 10, u: 2)
5 [x1=3..5][x2=1..2][x3=1..2][x4=3][x5=2..3][x6=1][x7=1..4] (t: 9, u: 4)
6 [x1=2][x2=1..2][x3=1..2][x5=1..3][x4=1][x6=1][x7=1] (t: 7, u: 6)
7 [x1=3..4][x2=2][x3=2][x4=1..3][x5=1..3][x6=1][x7=1..2] (t: 6, u: 4)
8 [x1=3..5][x2=2][x3=1][x7=1][x4=1..2][x5=1..3][x6=1..3] (t: 5, u: 5)
9 [x1=1][x2=1][x6=1][x3=1..2][x4=3][x5=1..2][x7=4] (t: 4, u: 4)
10 [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1..2][x7=1..3] (t: 4, u: 4)
11 [x1=1..2][x2=1][x6=1][x3=1..2][x4=1..3][x5=3][x7=1..4] (t: 4, u: 2)

Decision class C3:
1 [x1=2..5][x2=1..2][x3=1..2][x7=1..4][x4=1..2][x5=1..3][x6=2..4] (t: 41, u: 32)
2 [x1=1..4][x2=1..2][x3=1..2][x4=2][x5=2][x6=2..3][x7=2..4] (t: 27, u: 20)
3 [x1=1..3][x2=1][x3=1..2][x7=1..4][x4=2][x5=1..2][x6=2..3] (t: 19, u: 6)
4 [x1=1 v 2 v 4][x2=1..2][x3=1..2][x4=2][x5=2..3][x6=3..4][x7=1] (t: 13, u: 8)
5 [x1=5][x2=2][x4=2][x5=2][x3=1..2][x6=3][x7=2..4] (t: 5, u: 5)

Decision class C4:
1 [x1=5][x2=2][x3=2][x4=1..3][x5=1][x6=1][x7=1..4] (t: 4, u: 4)
2 [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t: 1, u: 1)

Figure 4-2: Decision rules determined by AQ15c from the wind bracing data
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by the values of x6 (in general, these could be groups of values), according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for a branch until all rules assigned to the branch are of the same class. That class is then assigned to the leaf.
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 misclassified).

Since all rules containing [x6=4] belong to class C3, the branch marked by 4 is ended by a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated only for those rules which contain [x6=1] as one of their conditions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.
For comparison, the program C4.5 for learning decision trees from examples was also applied to the same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window
setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some unclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatches.
[Figure: decision tree rooted at x6. Complexity: 17 nodes, 43 leaves.]
Figure 4-3: A decision tree learned by C4.5 for the wind bracing data
Figure 4-4 shows a decision structure learned, in the default setting of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the indefinite "?" decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.
[Figure: decision structure. Complexity: 5 nodes, 9 leaves.]
Figure 4-4: A decision structure learned from the AQ15c wind bracing rules
[Figure: decision structure containing node x2. Complexity: 6 nodes, 8 leaves.]
Figure 4-5: A decision structure that does not contain attribute x1
Figure 4-6 presents the decision structure from Figure 4-5 in which the leaves were assigned candidate decisions with decision class probability estimates. Let us consider the node x2. The example frequencies were w1=31, tw1=45, w2=11, tw2=139, w3=0, tw3=169, and w4=5, tw4=5. Using equation (3-13), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be approximated as P(C1)=0.66, P(C2)=0.23, P(C3)=0, and P(C4)=0.11.
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that rules whose combined t-weight represented 10% or less coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
[Figure: decision structure with candidate decisions and their probability estimates assigned to the leaves. Complexity: 5 nodes, 7 leaves.]
Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves
[Figure: decision structure. Complexity: 3 nodes, 5 leaves.]
Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules covering less than 10% of the training examples per decision class
To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram uses different shades for different decision classes. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.
Figure 4-8: Diagrammatic visualization of the decision trees learned for different decision-making situations for the wind bracing data ("?" means the system cannot produce a decision without the missing attribute)
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings for AQ15c: two types of decision rules (characteristic or discriminant); three coverage modes (intersecting, disjoint, or ordered, i.e., decision lists); and three beam search widths (1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Dsj, 10> and <Chr, Int, 1>; see Table 4-2) were selected for the experiments with Subsystem II.
These experiments were performed on four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of the rules learned by AQ15c from examples and the predictive accuracy of the decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either of the two programs on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with the testing examples that represent the complement of the training example set.
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in the intersecting or disjoint modes. For each dataset, the result reported for each experiment is calculated as the average of 100 runs on different training data, for 9 different sample sizes. The parameters changed in this experiment were the
threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is a threshold on the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
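The dissertation does not reproduce the exact formula at this point, so the following is a minimal sketch of one plausible reading of this stopping rule: a node stops being expanded (and is generalized to the majority class) when the examples of non-majority classes at the node fall below the generalization degree. All names are mine.

```python
def should_generalize(class_counts, generalization_degree=0.10):
    """Stop expanding a node and assign the majority class when the
    examples covered by rules of non-majority classes make up no more
    than `generalization_degree` of the examples at the node.

    class_counts: mapping from decision class to the number of examples
    covered at this node.
    """
    total = sum(class_counts.values())
    if total == 0:
        return True
    minority = total - max(class_counts.values())
    return minority / total <= generalization_degree
```

Under this reading, lowering the degree from 10% to 3% forces the tree to keep splitting nodes that are only mildly impure, which matches the observed accuracy gains on the wind bracing data.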
[Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem.]
Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For
each data set we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.
[Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data.]
[Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data.]
4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1
This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no).
The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c from the MONK-1 problem. Table 4-3 shows a comparison of the evaluations of the AQDT-2 attribute selection criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.
[Figure 4-12: A visualization diagram of the MONK-1 problem.]
The AQDT-2 program, running in its default mode with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion ranked first), produced a decision tree with 41 nodes. For comparison, the C4.5 program for learning decision trees from examples was also applied to the same problem.
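The disjointness criterion itself is defined formally elsewhere in the thesis; the sketch below conveys only its intuition, scoring an attribute by how many pairs of rules from different decision classes it separates outright. The scoring shown is my simplification, not the published formula.

```python
def disjointness(attribute, rules):
    """Score an attribute by how cleanly its values separate decision
    classes: for every pair of rules from different classes, add 1 when
    their conditions on `attribute` share no common value.

    rules: list of (decision_class, conditions) pairs, where conditions
    maps attribute -> set of admissible values (absent = any value).
    """
    score = 0
    for i, (ci, condi) in enumerate(rules):
        for cj, condj in rules[i + 1:]:
            if ci == cj:
                continue
            vi = condi.get(attribute)
            vj = condj.get(attribute)
            if vi is not None and vj is not None and not (vi & vj):
                score += 1
    return score
```

Ranking candidate attributes by such a score and splitting on the winner is what lets AQDT-2 place a highly discriminating attribute such as x5 near the root.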
Positive rules:
1. [x5 = 1]
2. [x1 = 3][x2 = 3]
3. [x1 = 2][x2 = 2]
4. [x1 = 1][x2 = 1]
Negative rules:
1. [x1 = 1][x2 = 2, 3][x5 = 2..4]
2. [x1 = 2][x2 = 1, 3][x5 = 2..4]
3. [x1 = 3][x2 = 1, 2][x5 = 2..4]
[Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.]
The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% of the examples and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 725). This tree is presented in Figure 4-14. In the same experiment, AQ17-DCI (Bloedorn et al., 1993) was also used to derive decision rules using constructive induction. AQ17-DCI generated a new attribute that takes the value T when the value of x1 equals the value of x2, and the value F otherwise. The resulting rules were:

Pos <= [x5 = 1] v [x1 = x2]     and     Neg <= [x5 ≠ 1] & [x1 ≠ x2]
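These rules can be checked directly, since the MONK-1 target concept is exactly "jacket-color is red or head-shape equals body-shape" (with value 1 of x5 denoting red). A minimal sketch:

```python
def classify_monk1(x1, x2, x3, x4, x5, x6):
    """Apply the AQ17-DCI rules for MONK-1: Pos <= [x5 = 1] v [x1 = x2].
    Attributes x3, x4, x6 are irrelevant to the target concept."""
    return "Positive" if x5 == 1 or x1 == x2 else "Negative"
```

Because the constructed attribute [x1 = x2] captures the relational part of the concept in a single test, the structure derived from these rules needs only two internal nodes.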
[Table 4-3: Evaluation of the AQDT-2 attribute selection criteria for the MONK-1 problem.]
From these rules the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). Running AQDT-2 on the AQ15c rules produced a simpler decision structure (Figure 4-15-a).
[Figure 4-14: The decision tree for the MONK-1 problem (complexity: 13 nodes, 28 leaves).]
[Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: (a) from the AQ15 rules (5 nodes, 7 leaves); (b) from the AQ17 rules (2 nodes, 3 leaves).]
Experiments with Subsystem I: As mentioned earlier, the initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings for AQ15c (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>) were selected for experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from
the decision rules. Each value in that table is an average of the predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each run was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.
[Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem.]
Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average over 100 runs of different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
[Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data.]
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.
[Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem.]
4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2
The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot easily be described as a DNF expression using the original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values: octagonal, square, or round); x2, body-shape (values: octagonal, square, or round); x3, is-smiling (values: yes or no); x4, holding (values: sword, flag, or balloon); x5, jacket-color (values: red, yellow, green, or blue); and x6, has-tie (values: yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.
[Figure 4-19: A visualization diagram of the MONK-2 problem.]
Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>). They were selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average of the predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each run was tested with a testing set that represents the complement of the training examples.
Figure 4-20 shows diagrams illustrating the difference in the predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
[Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem.]
Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average over 100 runs of different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
[Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data.]
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.
[Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem.]
4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3
MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded areas and the plus signs in the unshaded areas are noisy examples, i.e., examples that were assigned the wrong decision class.
[Figure 4-23: A visualization diagram of the MONK-3 problem.]
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is an average of the predictive accuracy of running each of the two programs 100 times on 100 distinct, randomly selected training data sets of the given size. Each run was tested with a testing set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments the parameters of Subsystem I (the learning process) were fixed and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average over 100 runs of different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.
[Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem.]
[Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data.]
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The drop in the predictive accuracy at some sample sizes is due to the fact that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves therefore do not represent the learning curve.
[Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem.]
4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer
The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number; 2) Clump Thickness; 3) Uniformity of Cell Size; 4) Uniformity of Cell Shape; 5) Marginal Adhesion; 6) Single Epithelial Cell Size; 7) Bare Nuclei; 8) Bland Chromatin; 9) Normal Nucleoli; and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).
In this experiment the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by AQDT-2 and C4.5. All the results reported here are based on the average of 100 runs. For each data set we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.
The drop in the predictive accuracy at some sample sizes is due to the fact that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
[Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem.]
4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification
Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes: 1) Cap-shape; 2) Cap-surface; 3) Cap-color; 4) Bruises; 5) Odor; 6) Gill-attachment; 7) Gill-spacing; 8) Gill-size; 9) Gill-color; 10) Stalk-shape; 11) Stalk-root; 12) Stalk-surface-above-ring; 13) Stalk-surface-below-ring; 14) Stalk-color-above-ring; 15) Stalk-color-below-ring; 16) Veil-type; 17) Veil-color; 18) Ring-number; 19) Ring-type; 20) Spore-print-color; 21) Population; and 22) Habitat.
To perform the experiment, the parameters of AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments. In this problem C4.5 produced better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average difference in learning time is about the same.
The drop in the predictive accuracy at some sample sizes is due to the fact that the testing data is not fixed for each sample. In other words, one error may represent a 1.01% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
[Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem.]
4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains
Learning Task-Oriented Decision Structures from Structural Data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes, eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and its content. The body of the car was described by 6 different attributes and the load of the car by two attributes.
The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To describe the train problem in a format suitable for AQDT-2, a set of eight attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To identify the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, in the attribute name x32, the number 3 refers to the third car and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
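Under this coding scheme, a variable-length train flattens into a single attribute-value example; a minimal sketch (the attribute values shown are illustrative, not taken from the actual data set):

```python
def encode_train(cars):
    """Flatten a train (a list of per-car attribute-value lists) into a
    dictionary keyed by the two-digit codes xij, where i is the car
    position (1-4) and j the attribute number (1-8)."""
    example = {}
    for i, car in enumerate(cars, start=1):
        for j, value in enumerate(car, start=1):
            example[f"x{i}{j}"] = value
    return example

# A two-car train, each car described by 8 attribute values:
train = [[2, 1, 3, 1, 1, 2, 4, 1],
         [3, 2, 1, 2, 1, 1, 1, 2]]
```

A train with only two cars simply has no x3j or x4j entries, which is how examples of different length coexist in one table.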
[Table 4-7: The set of attributes and their values used in the trains problem; i stands for the car number (1-4).]
Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only the attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 of the 20 trains. The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains. Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three or more cars. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).
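Flexible matching assigns an example that matches no rule exactly to the class of the best partially matching rule. A common formulation scores each rule by the fraction of its conditions the example satisfies; the sketch below is a simplification of the scheme in (Michalski et al., 1986), with all names mine.

```python
def flexible_match(example, rules):
    """Return the class of the best partially matching rule.

    example: mapping attribute -> value
    rules: list of (decision_class, conditions), where conditions maps
    attribute -> set of admissible values.
    """
    def score(conditions):
        if not conditions:
            return 0.0
        satisfied = sum(1 for attr, values in conditions.items()
                        if example.get(attr) in values)
        return satisfied / len(conditions)

    best_class, _ = max(((cls, score(cond)) for cls, cond in rules),
                        key=lambda pair: pair[1])
    return best_class
```

This is how a two-car train can still be classified by a structure whose tests refer to a (missing) third car: the partially satisfied rule with the highest degree of match decides the class.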
[Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: (a) using only descriptions of Car 1 (4 nodes, 9 leaves); (b) using only descriptions of Car 2; (c) using only descriptions of Car 3 (6 leaves).]
4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984)
In the Congressional Voting problem each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. The experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the training example sets were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of examples (216 examples in total; half of the examples were in one class and half in the other).
Table 4-8 and Figures 4-30-a and 4-30-b show the results graphically for the Congressional Voting-1984 problem. The results indicate that the decision trees generated by AQDT-2 had a higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the size of the training example set was smaller.
[Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.]
[Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2: (a) accuracy of the decision tree as a function of the size of the training example set; (b) size of the decision tree as a function of the size of the training example set.]
4.10 Analysis of the Results
This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to illustrate the relationship between concepts represented by decision rules and concepts represented by the decision trees learned from these rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others), the best cover is determined according to the best width of the beam search and the best rule type.
Table 4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics
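The beam-width heuristic above can be sketched as follows (a minimal illustration; the function name and the sample accuracy values are hypothetical, not taken from the tables):

```python
def best_beam_width(results):
    """results: list of (beam_width, accuracy%) pairs.
    Prefer a smaller beam width unless a larger one gains more than 2%."""
    ordered = sorted(results)               # smallest width first
    best_width, best_acc = ordered[0]
    for width, acc in ordered[1:]:
        if acc - best_acc > 2.0:            # a real gain, not a tie
            best_width, best_acc = width, acc
    return best_width

# Illustrative accuracies: only the widest beam improves by more than 2%.
print(best_beam_width([(1, 92.5), (5, 93.8), (10, 95.2)]))  # 10
```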
It was clear that AQDT-2 works better with characteristic rules than with discriminant rules. In most
problems, when changing the width of the beam search of the AQ15c system, the changes in the
predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better
than intersected rules for learning decision trees. Generally, decision trees learned from intersected
rules were slightly bigger than those learned from disjoint rules.
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of
heuristics was used to summarize the results. These heuristics are: 1) if the difference between
the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is
considered to be the same; otherwise, the predictive accuracy is considered higher or lower; 2) if the
average learning time is within ±0.1 seconds, the learning time is considered the same.
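The tie-breaking heuristics can be expressed as a small routine (an illustrative sketch; the function name and sample inputs are invented for this example):

```python
def compare(acc_a, acc_c, time_a, time_c):
    """Summarize AQDT-2 vs. C4.5 the way the heuristics above do:
    accuracies within +/-2% count as the same, and learning times
    within +/-0.1 s count as the same."""
    if abs(acc_a - acc_c) <= 2.0:
        accuracy = "Same"
    else:
        accuracy = "AQDT-2" if acc_a > acc_c else "C4.5"
    if abs(time_a - time_c) <= 0.1:
        time = "Same"
    else:
        time = "AQDT-2" if time_a < time_c else "C4.5"  # less time is better
    return accuracy, time

print(compare(95.1, 92.0, 0.35, 0.40))  # ('AQDT-2', 'Same')
```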
Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary
includes comparisons of the predictive accuracy, the size of the learned decision trees, and the learning
time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated
with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).
Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same-X means similar performance of both, with AQDT-2 slightly better if X=A and C4.5 slightly better if X=C.
Some conclusions can be drawn from these comparisons. When the training data represents a small
portion of the representation space, AQDT-2 produces bigger but more accurate decision trees,
whereas C4.5 produces smaller but less accurate decision trees. When the training data
represents a very large portion of the representation space, AQDT-2 usually produces smaller
decision trees with better accuracy, except with noisy data. The size of decision trees learned by
C4.5 grows relatively larger as the training data increases. Also, C4.5 works better than AQDT-2
with noisy data. The reasons for this are that AQDT-2 overgeneralizes the decision rules and that
C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is supposed to be
much less than that of C4.5. However, on some data sets it takes more time, because in some
situations where there is not enough information to reach a decision, the program goes into a loop of
testing all attributes. The probabilistic approach for handling this problem is not yet implemented.
To explain the relationship between the input to and the output from AQDT-2, and to explain some
of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of
diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.
Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2
system. The experiment contains 169 training examples covering both the positive and negative decision
classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The
shaded areas represent decision rules of the positive decision class; the white areas represent
non-positive coverage.
Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. All shaded cells
marked with an error symbol indicate false positive errors (AQ15c classifies the cell as positive while it should be
negative); all marked non-shaded cells indicate false negative errors (AQ15c classifies the
cell as negative while it should be positive).
Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31
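The kind of error counting these diagrams visualize can be reproduced programmatically. The sketch below enumerates the full 432-cell MONK-2 representation space (attribute domains as in Thrun, Mitchell & Cheng, 1991; the target concept is "exactly two attributes take their first value") and counts the false positives and false negatives of a hypothetical, deliberately overgeneralized hypothesis:

```python
from itertools import product

# Domains of the six MONK attributes (Thrun, Mitchell & Cheng, 1991).
DOMAINS = [(1, 2, 3), (1, 2, 3), (1, 2), (1, 2, 3), (1, 2, 3, 4), (1, 2)]

def monk2(example):
    """MONK-2 target concept: exactly two attributes take their first value."""
    return sum(1 for v in example if v == 1) == 2

def count_errors(hypothesis):
    """Walk the whole representation space and count false positives and
    false negatives, as the marked cells do in the diagrams."""
    fp = fn = 0
    for cell in product(*DOMAINS):
        predicted, actual = hypothesis(cell), monk2(cell)
        fp += predicted and not actual
        fn += actual and not predicted
    return fp, fn

# A hypothetical overgeneralized hypothesis: "at least two first values".
# It covers every positive cell, so it yields only false positives.
fp, fn = count_errors(lambda e: sum(1 for v in e if v == 1) >= 2)
```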
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this
diagram, one shading indicates portions of the representation space that were classified as
positive by both AQ15c and AQDT-2; another marks portions of the representation
space that were classified as positive by AQ15c but as negative by AQDT-2; a third
represents portions of the representation space where AQDT-2 overgeneralized decision rules
belonging to the positive decision class. The decision tree shown by this diagram was learned with
default settings (i.e., with a 10% generalization threshold).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the
MONK-1 problem, overgeneralizing the concept of the MONK-2 problem reduces the accuracy.
This can be seen in Figure 4-34, which shows the errors produced by the AQDT-2 decision tree.
Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem
Figure 4-34 is similar to Figure 4-33, but with illustration of the false positive and false negative
errors. Cells marked with one symbol indicate portions of the representation space with false
positive errors; cells marked with the other represent portions with false negative errors. Comparing
Figures 4-34 and 4-32 shows that more errors occurred because of the overgeneralization.
Figure 4-34: A visualization diagram showing testing errors of the AQDT-2 decision tree
Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2
after reducing the generalization degree to 1%
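One plausible reading of the generalization threshold, sketched as code (this is an assumption about the mechanism for illustration, not AQDT-2's actual implementation): a node is turned into a leaf when the dominant class covers all but at most the threshold percentage of its examples, so a lower threshold demands purer nodes and generalizes less.

```python
def make_leaf(class_counts, threshold_pct=10):
    """Hypothetical sketch: if the dominant class accounts for at least
    (100 - threshold) percent of the node's examples, assign it as a
    leaf and ignore the minority; otherwise keep splitting."""
    total = sum(class_counts.values())
    label, count = max(class_counts.items(), key=lambda kv: kv[1])
    if 100.0 * count / total >= 100 - threshold_pct:
        return label       # generalize: cut off growth here
    return None            # not pure enough: continue expanding

print(make_leaf({"pos": 19, "neg": 1}))      # pos  (95% >= 90%)
print(make_leaf({"pos": 19, "neg": 1}, 1))   # None (95% < 99%)
```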
CHAPTER 5 CONCLUSION
5.1 Summary
This thesis introduces the concept of a decision structure and describes a methodology for
efficiently determining a single-parent structure from a set of decision rules. A decision structure is
an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a
given object or situation. Having higher expressive power than the familiar decision tree, a
decision structure is able to represent some decision processes in a much simpler way than a
decision tree.
The proposed methodology advocates storing the decision knowledge in the declarative form of
decision rules, which are determined by induction from examples or by an expert. A decision
structure is generated on-line, when needed, and in the form most suitable for the given
decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this
methodology that, in order to determine a decision structure from examples, it is necessary to go
through two levels of processing, while there exist methods that produce decision trees efficiently
and directly from examples. Putting aside the issue that decision structures are more general than
decision trees, it is argued here that this methodology has many advantages that fully justify it. The
main advantages include: 1) decision structures produced by the method in the experiments
conducted had higher predictive accuracy and were simpler (sometimes significantly so) than
decision trees produced from the same data; 2) decision structures produced from rules can be
easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive
attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in
the declarative form of modular decision rules, the methodology makes it easy to modify decision
knowledge to account for new facts or changing conditions; 4) the process of deriving a decision
structure from a set of rules is very fast and efficient, because the number of rules per class is
usually much smaller than the number of examples per class; and 5) the presented method produces
decision structures whose nodes can be original attributes or constructed attributes that extend the
original knowledge representation (this is due to the application of the constructive induction programs
AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate
decision rules first and then create decision structures from them. In the AQDT-2 method, this first
phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast
(Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further.
First of all, there is a need for further testing of the method. Although the experiments conducted so
far have produced more accurate and simpler decision structures than decision trees obtained in a
standard way from the same input data, more experiments are necessary to arrive at conclusive
results. A mathematical analysis of the method has not been performed and is highly desirable.
The current method generates only single-parent decision structures (every node has only one
parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in
which a node can have several parents) will make it more powerful. It will enable the method to
represent much more simply decision processes that are difficult to represent by a decision tree
(e.g., a symmetric logical function). The decision structures produced by the method are usually
more general than the decision rules from which they were created (they may assign decisions to
cases that the rules could not classify). Further research is needed to determine the relationship
between the certainty of decision rules and the certainty of decision structures derived from them.
The AQ-based program allows a user to generate both characteristic and discriminant decision rules
(Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating
decision structures from different types of rules.
5.2 Contributions
The dissertation introduces the concept of a decision structure and describes a methodology for
efficiently determining a single-parent structure from a set of decision rules. A major advantage of
the proposed method is that it allows one to efficiently determine a decision structure that is
optimized for any given decision-making situation. For example, when some attribute is difficult
to measure, the method creates a decision structure that shows the situations in which measuring
this attribute can be avoided. The method is quite efficient, and the time of determining a decision
structure from decision rules in the cases investigated was negligible. Therefore, it is easy to
experiment with different criteria for structure generation in order to obtain the most desirable
structure.
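Situation-dependent tailoring of this kind might be sketched as follows (illustrative only; the scoring values, costs, and function name are hypothetical and stand in for AQDT-2's actual attribute-selection criteria):

```python
def choose_test(scores, costs, max_cost):
    """Hypothetical sketch of situation tailoring: prefer the
    highest-scoring attribute whose measurement cost is acceptable in
    the current decision-making situation; fall back to expensive
    attributes only when no affordable one remains, which pushes them
    toward the lowest parts of the structure."""
    affordable = {a: s for a, s in scores.items() if costs.get(a, 0) <= max_cost}
    pool = affordable or scores          # fall back if nothing is affordable
    return max(pool, key=pool.get)

# Illustrative values: x1 scores best but is too expensive to measure.
scores = {"x1": 0.9, "x2": 0.7, "x3": 0.6}
costs = {"x1": 100, "x2": 5, "x3": 1}
print(choose_test(scores, costs, max_cost=10))  # x2: best among affordable
```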
Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be
simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e.,
directly from examples. In the experiments involving artificial problems and real-world problems,
AQDT-2-generated decision structures have outperformed those generated by the well-known C4.5
decision tree learning program in most problems, both in terms of average predictive accuracy and
average simplicity of the generated decision trees.
The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the
method is independent of the rule-learning program, it could potentially be applied also with other
decision rule learning systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski, T., Bloedorn, E., Michalski, R.S., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.
Bergadano, F., Giordana, A., Saitta, L., DeMarchi, D. and Brancadori, F. (1990), "Integrated Learning in Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.
Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.
Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.
Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.
Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.
Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Method, Gower Technical Press.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.
Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.
Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.
Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.
Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.
Hart, A. (1984), "Experience in the use of an inductive system in knowledge engineering," Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.
Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.
Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer Verlag.
Imam, I.F. and Michalski, R.S. (1993b), "Learning Decision Trees from Decision Rules: A method and initial results from a comparative study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. and Zemankova, M. (Eds.), Kluwer Academic Pub., MA.
Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.
Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam, I.F. and Michalski, R.S. (1994), "From Fact to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.
Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.
Kohavi, R. and Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.
Michalski, R.S. (1973), "AQVAL/1--Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.
Michalski, R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.
Michalski, R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).
Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.
Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.
Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.
Niblett, T. and Bratko, I. (1986), "Learning decision rules in noisy domains," Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.
Quinlan, J.R. (1979), "Discovering Rules By Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan, J.R. (1983), "Learning efficient classification procedures and their application to chess end games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.
Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.
Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).
Quinlan, J.R. (1990), "Probabilistic decision trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.
Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI 90, Stockholm, August.
Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.
Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.
Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
Vita
Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in
Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He
received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition.
Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive
systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
3.3.3 An example illustrating the algorithm 42
3.4 Tailoring Decision Structures to a decision-making situation 47
3.4.1 Learning Cost-Dependent Decision Structures 49
3.4.2 Assigning Decisions Under Insufficient Information 49
3.4.3 Coping with noise in training data 50
3.5 Analysis of the AQDT-2 Attribute Selection Criteria 51
3.6 Decision Structures vs. Decision Trees 53
CHAPTER 4
EMPIRICAL ANALYSIS AND COMPARATIVE STUDY OF THE METHOD 58
4.1 Description of the Experimental Analysis 59
4.2 Experiments With Average Size, Complex, and Noise-Free Problems: Wind Bracings 60
4.3 Experiments With Small Size, Simple, and Noise-Free Problems: MONK-1 69
4.4 Experiments With Small Size, Complex, and Noise-Free Problems: MONK-2 76
4.5 Experiments With Small Size, Simple, and Noisy Problems: MONK-3 79
4.6 Experiments With Large Size, Complex, and Noise-Free Problems: Diagnosing Breast Cancer 83
4.7 Experiments With Large Size, Complex, and Noisy Problems: Mushroom Classifications 84
4.8 Experiments With Small Size, Structured, and Noise-Free Problems: East-West Trains 85
4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88
CHAPTER 5 CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96
REFERENCES 98
VITA 102
LIST OF TABLES
No. TITLE Page
2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attribute values for different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers' example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees provided by different attribute selection criteria on four problems 19
2-9 Comparing the AQDT approach with the EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing a software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using condition of AQDT-2 criteria 53
3-7 Comparison between Decision Structures and Decision Trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90
LIST OF FIGURES
No. TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing a software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the assumption of 10% classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1% 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M. Fahmi Imam, Ph.D.
School of Information Technology and Engineering
George Mason University, Fall 1995
Ryszard S. Michalski, Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from
decision rules. The philosophy behind this research is that it is more appropriate to learn
knowledge and store it in a declarative form, and then, when a decision-making situation occurs,
generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by
Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski
(1993a, b).
This approach separates the function of generating a knowledge base from the function of using
the knowledge base for decision-making. The first function focuses on learning an accurate,
consistent, and complete concept description expressed in a declarative form. The second function
is performed whenever a new decision-making situation occurs: a task-oriented decision structure
is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted
to solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam,
1994).
The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from
decision rules or examples. Each decision-making situation is defined by a set of parameters that
controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show
that decision structures learned by it usually outperform, in terms of accuracy and average size of
the decision structures, those learned from examples by other well-known systems. The results
also show that the system does not work very well with noisy data. The system is illustrated and
compared using applications to artificial problems such as the three MONK's problems (Thrun,
Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also
applied to real-world problems of learning decision structures in the areas of construction
engineering (for determining the best wind bracing design for tall buildings), medical
diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for
learning classification rules for distinguishing between poisonous and non-poisonous
mushrooms), and political data (for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
1.1 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.
A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation), the branches are assigned possible test outcomes or ranges of outcomes, and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when leaves are assigned single definite decisions. Thus the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
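The node and leaf types just described can be sketched as a small data structure. This is an illustrative sketch; the names are ours, not AQDT's.

```python
from dataclasses import dataclass, field

# A minimal sketch of the decision-structure representation described above:
# each internal node holds a test, each branch a set of outcomes, and each
# leaf one or more candidate decisions with probabilities.

@dataclass
class Leaf:
    decisions: dict  # decision -> probability; a definite decision has prob 1.0

@dataclass
class Node:
    test: str  # an attribute, a function of attributes, or a relation
    branches: dict = field(default_factory=dict)  # frozenset of outcomes -> child

def is_decision_tree(node) -> bool:
    """A decision structure reduces to a decision tree when every branch is
    assigned a single value and every leaf a single definite decision."""
    if isinstance(node, Leaf):
        return len(node.decisions) == 1
    return all(len(vals) == 1 and is_decision_tree(child)
               for vals, child in node.branches.items())
```

For example, a structure whose branch groups two outcomes, or whose leaf carries a probability distribution over decisions, is a proper decision structure but not a decision tree under this check.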
Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain, and the gain ratio (Quinlan, 1979, 83, 86), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
A decision tree/decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root), and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful to either modify the structure so that it does not contain that attribute or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).
A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation such as a set of decision rules: tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees) which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify to adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.
An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form, and transform it to a decision structure when it is needed for decision-making. This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generation from training examples. Thus, this process could be done on line, without any delay noticeable to the user. Such virtual decision structures are easy to tailor to any given decision-making situation.
This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part that concerns the decision classes of interest. Thus, such an approach has many potential advantages.
This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15 (Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).

To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating unknown nodes in situations when there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to inability to evaluate an attribute associated with an intermediate node.
To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments has been designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structures learned from them, and comparing decision trees learned by the AQDT-2 system with those learned from examples by the well-known C4.5 system (Quinlan, 1993). Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2, and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind bracing data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.
1.2 The Problem Statement
There are many limitations and problems accompanying the use of decision trees for decision-making
CHAPTER 2 RELATED RESEARCH
2.1 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. That work proposed several attribute selection criteria of increasing power, based on the main criterion, the order cost estimate (the nth-order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree, based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.
Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.

Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint. In other words, for any two rules there exists a condition with the same attribute but with different values in each rule.

Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.

Definition 2-4: A diagram for a given cover is a table constructed graphically by representing, in a two-dimensional space, all possible combinations of attribute values, locating on the diagram the condition parts of the given rules, and marking them with the action specified by each rule.
Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has shown that if only one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.
The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1], and x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.
Figure 2-1: An example to illustrate how attributes break rules
One of the criteria defined by Michalski is the first-degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute to be selected is the one which breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).
Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.
Example: Learn a decision tree from the following decision table.
The minimal cover consists of the following rules:
A1 ← [x2=0] v [x1=0][x2=2];  A2 ← [x2=1] v [x1=2][x2=2];  A3 ← [x1=1][x2=2]
The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it will add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of the decision tree is x2. Then three branches are attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.
Figure 2-2: A decision tree learned from the decision table in Table 2-1
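One plausible reading of the static MAL estimate can be sketched as follows: a rule, encoded as a mapping from attributes to sets of allowed values, is broken by an attribute whenever the rule admits more than one of that attribute's values (either through an internal disjunction such as [x1=1v3], or by omitting the attribute altogether). This encoding and the treatment of omitted attributes are our assumptions, not AQDT's internals.

```python
# A sketch of the static (first-degree) MAL cost estimate: count how many
# rules an attribute would break. A rule is a dict mapping attribute -> set of
# allowed values; an attribute omitted from a rule implicitly allows its
# whole domain.

def breaks(attr, rule, domains):
    allowed = rule.get(attr, domains[attr])
    return len(allowed) > 1  # splitting on attr divides the rule's cover

def mal(attr, rules, domains):
    return sum(breaks(attr, rule, domains) for rule in rules)

# Minimal cover from the example above, restricted to x1 and x2 (x3 and x4
# are omitted here since their domains are not shown in the text):
rules = [
    {"x2": {0}},             # A1 <- [x2=0]
    {"x1": {0}, "x2": {2}},  # A1 <- [x1=0][x2=2]
    {"x2": {1}},             # A2 <- [x2=1]
    {"x1": {2}, "x2": {2}},  # A2 <- [x1=2][x2=2]
    {"x1": {1}, "x2": {2}},  # A3 <- [x1=1][x2=2]
]
domains = {"x1": {0, 1, 2}, "x2": {0, 1, 2}}
```

On this cover the sketch reproduces MAL(x1) = 2 (x1 breaks the rules [x2=0] and [x2=1], which leave x1 unconstrained) and MAL(x2) = 0, matching the evaluations stated in the text.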
2.2 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating decision trees that classify a set of examples according to the decision classes they belong to. The essential aspect of any inductive decision tree method is the attribute selection criterion, which measures how good the attributes are for discriminating among the given set of decision classes. The best attribute according to the selection criterion is chosen to be assigned to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has been subsequently modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.
Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory; they measure the information conveyed by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure and the gain criteria (Quinlan, 1979, 83), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes; they use statistical distributions for determining whether or not there is a correlation, and the attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).
Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial complete decision tree, and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring the probabilities of an example belonging to different classes.
The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993), an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), a statistics-based method for selecting an attribute to be a node in the tree.
2.2.1 Building Decision Trees Using Information-based Criteria
This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs for learning decision trees from examples.
Learning decision trees from examples requires a collection of examples, each represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing a decision tree from a set of cases (Hunt, Marin & Stone, 1966).
The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the Gain Criterion, which uses the frequency of each decision class in the given set of training examples.
Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values, and classifies the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.
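The recursive procedure just described can be sketched as follows. This is an illustrative skeleton, not C4.5's actual code; `choose` stands for any attribute selection criterion, and the majority-leaf fallback when attributes run out is our assumption.

```python
from collections import Counter

# A skeletal divide-and-conquer tree builder in the style described above.
# `examples` is a list of (attribute-dict, class) pairs; `choose` is any
# attribute selection criterion (e.g. gain ratio).

def build_tree(examples, attrs, choose):
    classes = [c for _, c in examples]
    if len(set(classes)) == 1:       # all examples in one class -> leaf
        return classes[0]
    if not attrs:                    # no tests left -> majority-class leaf
        return Counter(classes).most_common(1)[0][0]
    best = choose(examples, attrs)   # pick an attribute to test
    children = {}
    for value in {ex[best] for ex, _ in examples}:   # one branch per value
        subset = [(ex, c) for ex, c in examples if ex[best] == value]
        children[value] = build_tree(subset, [a for a in attrs if a != best],
                                     choose)
    return (best, children)
```

With a trivial criterion that always picks the first attribute, two one-attribute examples yield the expected one-node tree.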
The Gain Criterion: The gain criterion is based on information theory. That is, the information conveyed by a message depends on its probability, and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem X1, ..., Xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:

freq(Ci, S) = number of examples in S belonging to Ci    (2-1)

Suppose that |S| is the total number of examples in S; then the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.
The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2 (freq(Ci, S) / |S|) bits.
The expected information from such a message stating class membership is given by:

info(S) = - Σi=1..k (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)  bits    (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.
Suppose that we select an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets T1, ..., Tk, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, infoX(T), can be found as the sum over all subsets of the information conveyed by each subset multiplied by its probability:

infoX(T) = Σi=1..k (|Ti| / |T|) info(Ti)    (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by:

gain(X) = info(T) - infoX(T)    (2-4)

The attribute to be selected is the attribute with the maximum gain value.
The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains an attribute such as a social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.
Following steps similar to those used to obtain the information conveyed by dividing T into n subsets, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by:

split info(X) = - Σi=1..n (|Ti| / |T|) log2 (|Ti| / |T|)    (2-5)

The gain ratio is given by:

gain ratio(X) = gain(X) / split info(X)    (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.
Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the set of training examples.
First, determine the amount of information gained by selecting the attribute outlook to be the root of the decision tree. This attribute divides the training examples into three subsets: sunny, with five examples, two of which belong to the class Play; overcast, with four examples, all of which belong to the class Play; and rain, with five examples, three of which belong to the class Play. To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class Play and five belong to the class Don't Play.

info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.940 bits
When using outlook to divide the training examples, the information becomes:

info_outlook(T) = 5/14 (-2/5 log2 (2/5) - 3/5 log2 (3/5))
               + 4/14 (-4/4 log2 (4/4) - 0/4 log2 (0/4))
               + 5/14 (-3/5 log2 (3/5) - 2/5 log2 (2/5)) = 0.694 bits

By substituting in equation 2-4, the information gain resulting from using the attribute outlook to split the training examples equals 0.246. The information gain for windy is 0.048.
Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for outlook is determined as follows:

split info(outlook) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits

The gain ratio for outlook = 0.246 / 1.577 = 0.156
Figure 2-3: A decision tree learned using the gain criterion for selecting attributes
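The hand computations above can be reproduced with a short script. This is a sketch with our own function names, assuming Quinlan's weather data as summarized in the text (9 Play and 5 Don't Play examples; outlook splitting them into sunny [2, 3], overcast [4, 0], and rain [3, 2]).

```python
from math import log2

# Illustrative implementations of equations 2-2 through 2-6. A class
# distribution is a list of class counts; a split is a list of per-branch
# class-count lists.

def info(counts):
    """Entropy of a class distribution (eq. 2-2)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_x(split):
    """Expected information after partitioning by an attribute (eq. 2-3)."""
    total = sum(sum(branch) for branch in split)
    return sum(sum(branch) / total * info(branch) for branch in split)

def gain(counts, split):
    """Information gain (eq. 2-4)."""
    return info(counts) - info_x(split)

def split_info(split):
    """Potential information of the partition itself (eq. 2-5)."""
    total = sum(sum(branch) for branch in split)
    return -sum(sum(b) / total * log2(sum(b) / total) for b in split)

def gain_ratio(counts, split):
    """Gain ratio (eq. 2-6)."""
    return gain(counts, split) / split_info(split)

# Quinlan's weather data: 9 Play / 5 Don't Play;
# outlook -> sunny [2, 3], overcast [4, 0], rain [3, 2]
outlook = [[2, 3], [4, 0], [3, 2]]
```

Running these on the outlook split reproduces info(T) ≈ 0.940 bits, gain ≈ 0.247 (0.246 after truncation), split info ≈ 1.577 bits, and gain ratio ≈ 0.156.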
The C4.5 system handles discrete values as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the values of that attribute are greater than the determined threshold, and the other where the value is less than or equal to the threshold.
Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
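Both mechanisms can be sketched in a few lines. The midpoint choice of candidate thresholds and the function names are illustrative assumptions; C4.5's actual implementation differs in details (e.g., in which value it finally reports as the threshold).

```python
# A sketch of binary thresholding for a continuous attribute: each candidate
# threshold t splits cases into (<= t) and (> t). Here candidates are taken
# as midpoints between consecutive distinct sorted values.

def candidate_thresholds(values):
    vals = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vals, vals[1:])]

# The Laplace error estimate used for pruning, as given above: (e+1)/(n+2),
# where n is the number of training examples at a leaf and e the number of
# misclassified examples there.

def laplace_error(n, e):
    return (e + 1) / (n + 2)
```

Note that the Laplace estimate never reaches zero: a pure leaf with 6 examples still has estimated error 1/8, and an empty leaf defaults to 1/2, which is what makes it usable for comparing subtrees against their leaf replacements.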
2.2.2 Building Decision Trees Using Statistics-based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses Chi-square statistics to measure the association between two attributes. When building decision trees, the method is implemented so that it determines the association between each attribute and the decision classes. The attribute to be selected is the one with the greatest value.
To determine the Chi-square value for an attribute, let aij be the number of examples in class number i where the attribute A takes value number j. In other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by:

Chi-square(A) = Σi=1..n Σj=1..m [ (aij - Eij)² / Eij ]    (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,

Eij = (TCi × TVj) / T    (2-8)

where TCi and TVj are the total number of examples belonging to the decision class Ci and the total number of examples where the attribute A takes value vj, respectively, and T is the total number of examples.
Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of the different combinations of values between the decision class and both the Outlook and the Windy attributes. Table 2-4 shows the expected values Eij (computed from TCi and TVj) of the frequencies in Table 2-3, for the different attribute values and decision classes.
To determine the association between the decision classes and the attributes Windy and Outlook, the observed Chi-square values are:

Chi-square(Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/5.1] + [(2-2.9)²/2.9]
= 0.21 + 0.39 + 0.25 + 0.25 = 1.1

Chi-square(Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2] + [(3-1.8)²/1.8] + [(0-1.4)²/1.4] + [(2-1.8)²/1.8]
= 0.45 + 0.75 + 0.2 + 0.8 + 1.4 + 0.02 = 3.62
Applying the same method to the other attributes, the results favor the attribute Outlook. Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets and the same process is repeated on each subset.
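Equations 2-7 and 2-8 can be checked with a short script over the contingency tables above. Note that the hand computation in the text rounds the expected frequencies Eij; exact arithmetic gives slightly different totals (about 0.93 for Windy and 3.55 for Outlook), with the same ranking.

```python
# A sketch of the Chi-square attribute score from equations 2-7 and 2-8.
# `table[i][j]` holds a_ij, the number of examples of class i with attribute
# value j; expected frequencies are computed from the row and column totals.

def chi_square(table):
    n_classes, n_values = len(table), len(table[0])
    total = sum(sum(row) for row in table)                       # T
    class_tot = [sum(row) for row in table]                      # TC_i
    value_tot = [sum(table[i][j] for i in range(n_classes))      # TV_j
                 for j in range(n_values)]
    score = 0.0
    for i in range(n_classes):
        for j in range(n_values):
            e = class_tot[i] * value_tot[j] / total              # eq. 2-8
            score += (table[i][j] - e) ** 2 / e                  # eq. 2-7
    return score

# Quinlan's data: rows = classes (Play, Don't Play), columns = attribute values
windy = [[3, 6], [3, 2]]          # Windy: true, false
outlook = [[2, 4, 3], [3, 0, 2]]  # Outlook: sunny, overcast, rain
```

As in the text, Outlook scores higher than Windy and would be selected as the node.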
Table 2-5 shows a summary of these criteria and their basic evaluation measures.
Table 2-5: Attribute selection criteria and their basic evaluation measures

Info Measure (IM), Gain, and Gain Ratio:  Entropy(S) = - Σi (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)
G-statistic:  G = 2N × IM  (N = number of examples)
Chi-square:  Chi-square(A, B) = Σi=1..n Σj=1..m [ (aij - Eij)² / Eij ]
2.2.3 Analysis of Attribute Selection Criteria
This subsection briefly introduces the analysis of different selection criteria done by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, the G statistic, the Gini index of diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of using the Chi-square criterion, the value zero adds the maximum association between any two attributes, because the Chi-square value of a zero cell is the expected value of this cell.
Now let us demonstrate results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees using eleven different criteria. In the final results, he compared the total number of nodes and the total error rate provided by each criterion over all the given problems. Table 2-8 shows the final results for five selected criteria only.
Table 2-8: Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four domains
This experiment was performed on four real-world data sets. These data are concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details, see Mingers (1989a).
2.3 Learning Decision Structures
Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas by Imam and Michalski (1993b).
In the first approach, Brian Gaines introduces a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both, to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes containing rules only represent conditions common to all of their children.
The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees that are used for decision-making.
Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: It is Safe, except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is Lost, except if x6=1 it is Safe, except if x7=1 it is Lost.
The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path. In other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once; however, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets, such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.
Safe <= [x1=2]
Safe <= [x2=2]
Safe <= [x3=2]
Safe <= [x4=1] & [x5=2]
Safe <= [x4=1] & [x5=3]
Safe <= [x6=1] & [x7=2]
Safe <= [x6=1] & [x7=3]
Safe <= [x4=2] & [x5=2]
Safe <= [x4=2] & [x5=3]

Lost <= [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <= [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a
nondeterministic method to select an attribute to build a new level of the decision structure. This
attribute is removed from the data, and the data is divided into subsets, each corresponding to a
distinct combination of that attribute's values. For each subset the process is repeated until all
the examples of a given subset belong to one decision class. For example, suppose the selected
attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The
data is divided into two subsets: the first subset contains examples where A takes value 0 and
belong to class C0, or A takes value 1 and belong to class C1. The second subset contains
examples where A takes value 0 and belong to class C1, or A takes value 1 and belong to class
C0. The number of nodes of the first level (after the leaf nodes) is expected to be less than or
equal to k^n, where k is the number of decision classes and n is the number of values of the
selected attribute; the number of nodes can grow exponentially at intermediate levels before it
shrinks to a single root node.
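The two-subset example above can be illustrated with a small sketch (my own illustration of a single bottom-up splitting step, not Kohavi's implementation; the data format and function names are assumptions): examples that agree on everything except the selected attribute form a cell, and cells inducing the same value-to-class mapping are merged into one node of the new level.

```python
from collections import defaultdict

# One bottom-up HOODG-style splitting step (illustrative sketch).
# examples: list of (feature_dict, decision_class) pairs.
def split_level(examples, attr):
    # 1. Project out `attr`: examples identical on all other attributes
    #    form one cell, recording which class each value of `attr` leads to.
    cells = defaultdict(dict)
    for feats, cls in examples:
        key = tuple(sorted((a, v) for a, v in feats.items() if a != attr))
        cells[key][feats[attr]] = cls
    # 2. Cells that induce the same value->class mapping share one node.
    nodes = defaultdict(list)
    for key, mapping in cells.items():
        nodes[tuple(sorted(mapping.items()))].append(key)
    return nodes

# Two classes, attribute A with values 0/1: the four examples below fall
# into exactly the two subsets described in the text.
data = [({"A": 0, "B": 0}, "C0"), ({"A": 1, "B": 0}, "C1"),
        ({"A": 0, "B": 1}, "C1"), ({"A": 1, "B": 1}, "C0")]
```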
Some major disadvantages of this approach are easy to identify. The average size of such
decision structures is estimated to be very large, especially when there
is no similarity (i.e., no strong patterns) or logical relationship in the data. The time needed to
learn such a decision structure is relatively high compared to systems for learning decision trees
from examples. Finally, it could be better to search for an attribute that reduces the number
of generated subsets of the data instead of nondeterministically selecting an attribute to build a
new level of the decision structure. Kohavi provided a comparison between the C4.5 system
and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a
deterministic method for attribute selection which minimizes the width of the penultimate level
of the graph.
Table 2-9 shows a comparison between the proposed approach and those two approaches The
EDAG and HOODG systems are unreleased prototype systems
Table 2-9 (fragment): decision structures produced by the proposed approach are easy to
understand; EDAGs are difficult to read; HOODG decision structures are easy to understand.
CHAPTER 3 DESCRIPTION OF THE APPROACH
3.1 General Methodology
In the proposed approach, the function of learning or discovery is separated from the function of
using the discovered knowledge for decision-making. The first function is performed by an
inductive learning program that searches for knowledge relevant to a given class of decisions
and stores the learned knowledge in the form of decision rules. The second function is performed
when there is a need for assigning a decision to new data points in the database (e.g., a
classification decision) by a program that transforms the obtained knowledge into a decision
structure optimized according to the given decision-making situation.
The Learning Task
Given: A set of training examples describing the concept to be learned; a learning goal, which
specifies the decision classes to be learned from the training examples; background knowledge
to control the learning process.
Determine: A concept description in a declarative form of knowledge (decision rules) that
satisfies the learning goal.
The Decision-making Task
Given: A set of decision rules in a conjunctive form; a description of the new decision-making
situation (e.g., attribute costs and order preference, importance or frequency of decision classes,
etc.); one or more examples that need to be tested under the given decision-making situation; a
set of parameters to control the learning process.
Determine: A decision structure that suits the given decision-making situation.
The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17
(Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of
declarative knowledge are that they do not impose any order on the evaluation of the attributes,
and, due to the lack of order constraints, decision rules can be evaluated in many different
ways, which increases the flexibility of adapting them to the different tasks of decision-making
(Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller
than the number of examples per class, generating a decision structure from decision rules can
potentially be done on-line.
Such virtual decision structures can be tailored to any given decision-making situation The
needed decision rules have to be generated only once and then they can be used many times for
generating decision structures according to changing requirements of decision-making tasks The
method uses the AQDT-2 system (Imam amp Michalski 1994) for learning decision structures
from decision rules Decision structures represent a procedural form of knowledge which makes
them easy to implement but also harder to change Consequently decision structures can be quite
effective and useful as long as they are used in decision-making situations for which they are
optimized and the attributes specified by the decision structure can be measured without much
cost Figure 3-1 shows an architecture of the proposed methodology
[Figure 3-1: data flows from the database into the learning component (learning knowledge
from the database), and the resulting rules feed the decision-making process.]
Figure 3-1 Architecture of the AQDT approach
It is assumed that the database is not static but is regularly updated A decision-making problem
arises when there is a case or a set of cases to which the system has to assign a decision based on
the knowledge discovered. Each decision-making situation is defined by a set of attribute-values.
Some attribute-values may be missing or unknown A new decision structure is obtained such
that it suits the given decision-making problem The learned decision structure associates the
new set of cases with the proper decisions
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs
The decision rules are generated from examples by an AQ-type inductive learning system,
specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of
this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions using
the STAR methodology (Michalski 1983) The simplest algorithm based on this methodology
called AQ starts with a seed example of a given decision class and generates a set of the
most general conjunctive descriptions of the seed (alternative decision rules for the seed
example) Such a set is called the star of the seed example The algorithm selects from the star
a description that optimizes a criterion reflecting the needs of the problem domain If the
criterion is not defined the program uses a default criterion that selects the description that
covers the largest number of positive examples (to minimize the total number of rules needed)
and with the second priority that involves the smallest number of attributes (to minimize the
number of attributes needed for arriving at a decision)
If the selected description does not cover all examples of a given decision class a new seed is
selected from uncovered examples and the process continues until a complete class description
is generated The algorithm can work with few examples or with many examples and can
optimize the description according to a variety of easily-modifiable hypothesis quality criteria
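The covering loop just described can be sketched as follows. This is an illustrative simplification, not AQ15 itself: the greedy dropping of seed conditions stands in for AQ's star generation and criterion-based selection, and the rule format (attribute mapped to a set of allowed values) is an assumption.

```python
# Illustrative AQ-style covering loop (a sketch, not AQ15).
def covers(rule, example):
    return all(example[a] in values for a, values in rule.items())

def learn_class(pos, neg):
    rules, uncovered = [], list(pos)
    while uncovered:
        seed = uncovered[0]                       # pick a seed example
        rule = {a: {v} for a, v in seed.items()}  # maximally specific rule
        for a in list(rule):                      # greedily drop conditions
            candidate = {k: v for k, v in rule.items() if k != a}
            if not any(covers(candidate, n) for n in neg):
                rule = candidate                  # still consistent: keep drop
        rules.append(rule)
        uncovered = [e for e in uncovered if not covers(rule, e)]
    return rules
```

Dropping conditions generalizes the seed while staying consistent with the negative examples, so each retained rule covers the seed and typically further positives; the loop then repeats on the still-uncovered positives.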
The learned descriptions are represented in the form of a set of decision rules expressed in an
attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A
distinctive feature of this representation is that it employs, in addition to standard logic operators,
the internal disjunction operator (a disjunction of values of the same attribute in a condition) and
the range operator (to express conditions involving a range of discrete or continuous values).
These operators help to simplify rules involving multi-valued discrete attributes; the second
operator is also used for creating logical expressions involving continuous attributes.
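The two operators can be illustrated with a minimal condition evaluator (a hedged reading of VL1-style conditions; representing a condition as an attribute paired with a set or range of admissible values is my own convention):

```python
# A condition holds if the attribute's value is among the admissible ones.
# Internal disjunction: a set of alternatives; range operator: an interval.
def satisfies(condition, example):
    attr, admissible = condition
    return example.get(attr) in admissible

rule = [("x3", {1, 3}), ("x4", {1})]        # [x3=1 v 3] & [x4=1]
wide = ("size", range(2, 6))                # [size=2..5], a range condition
```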
AQ15 can generate decision rules that represent either characteristic or discriminant concept
descriptions depending on the settings of its parameters (Michalski 1983) A characteristic
description states properties that are true for all objects in the concept The simplest
characteristic concept description is in the form of a single conjunctive rule (in general it can be
a set of such rules) The most desirable is the maximal characteristic description that is a rule
with the longest condition part ie stating as many common properties of objects of the given
class as can be determined A discriminant description states properties that discriminate a given
concept from a fixed set of other concepts. The most desirable is the minimal discriminant
description, that is, a rule with the shortest condition part. For example, to distinguish a given
set of tables from a set of chairs, one may only need to indicate that tables have a large flat top.
A characteristic description of the tables would also include properties such as having four legs,
no back, four corners, etc. Discriminant descriptions are usually much shorter than
characteristic descriptions
Another option provided in AQ15 controls the relationship among the generated descriptions
(rulesets or covers) of different decision classes In the IC (Intersecting Covers) mode
rulesets of different classes may logically intersect over areas of the description space in which
there are no training examples In the DC (Disjoint Covers) mode descriptions of different
classes are logically disjoint The DC mode descriptions are usually more complex both in the
number of rules and the number of conditions There is also a DL mode (a Decision List mode
also called VL mode, for variable-valued logic mode), in which the program generates rulesets
that are linearly ordered. To assign a decision to an example using such rulesets, the program
evaluates them in order: if ruleset i is satisfied by the example, then the corresponding decision
is made; otherwise the program proceeds to the evaluation of ruleset i+1. In IC and DC modes,
rulesets can be evaluated in any order.
Alternatively the system can use rules from the AQ17-DCI program for Data-driven
Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for
generating additional attributes. These attributes are various logical or mathematical
combinations of the original attributes. The program generates a large number of potential new
attributes and selects from them the most promising ones, based on an attribute quality criterion.
To illustrate the format of rules generated by AQ15 (or AQ17-DCI) an exemplary ruleset is
shown in Figure 3-2 The ruleset (that can be re-represented as a disjunctive normal form
expression) describes a voting record of Democratic Representatives in the US Congress
Each rule is a conjunction of elementary conditions Each condition expresses a simple relational
statement For example the condition [State = northeast v northwest] states that the attribute
State (of the Representative) should take the value northeast or northwest to satisfy the
condition
R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]

Figure 3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"
The above rules were generated from examples of the voting records For illustration below is
an example of a voting record by a Democratic representative
Draft registration=no, Ban of aid to Nicaragua=no, Cut expenditure on MX missiles=yes, Federal subsidy to nuclear power stations=yes, Subsidy to national parks
in Alaska=yes, Fair housing bill=yes, Limit on PAC contributions=yes, Limit on food stamp program=no, Federal help to education=no, State From=northeast, State Population=large, Occupation=unknown, Cut in social security spending=no, Federal help to Chrysler corp=not registered
By expressing elementary statements in the example as conditions and linking conditions by
conjunction the examples can be re-expressed as decision rules Thus decision rules and
examples formally differ only in the degree of generality
3.3 Generating Decision Structures (Trees) from Decision Rules
This section describes the AQDT-1 system for learning decision structures (trees) from decision
rules (Imam & Michalski, 1993a, 1993b). Also, a description of the AQDT-2 method for learning
task-oriented decision structures from decision rules is included; finally, the methodology is
illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning
due to their simplicity Decision trees built this way can be quite efficient as long as they are
used in decision-making situations for which they are optimized and these situations remain
relatively stable Problems arise when these situations significantly change and the assumptions
under which the tree was built do not hold anymore For example in some situations it may be
difficult to determine the value of the attribute assigned to some node One would like to avoid
measuring this attribute and still be able to classify the example if this is potentially possible
(Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure
the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also
desirable if there is a significant change in the frequency of occurrence of examples from
different classes. A restructuring of a decision tree to suit the above requirements is, however,
difficult to do. The reason for this is that a decision tree is a procedural form of decision
knowledge representation, and it imposes constraints on the evaluation order of the attributes
that are not logically necessary.
One problem in developing a method for generating decision structures from decision rules is to
design an attribute selection criterion that is based on the properties of the rules rather than of
the training examples A decision rule normally describes a number of possible examples Only
some of them are examples that have actually been observed ie training examples An attribute
selection criterion is needed to analyze the role of each attribute in the rules It cannot be based
on counting the numbers of training examples covered by each attribute-value and the
frequency of decision classes in the training examples as is done in learning decision trees from
examples because the training examples are assumed to be unavailable
Another problem in learning decision trees from decision rules stems from the fact that decision
rules constitute a more powerful knowledge representation than decision trees They can directly
represent a description in an arbitrary disjunctive normal form while decision trees can represent
directly only descriptions in the disjoint disjunctive normal form In such descriptions all
conjunctions are mutually logically disjoint Therefore when transforming a set of arbitrary
decision rules into a decision tree one faces an additional problem of handling logically
intersecting rules
The solution to both problems (attribute selection and logically intersected rules) in the AQDT-2
system is based on the earlier work by Michalski (1978) which introduced a general method for
generating decision trees from decision rules The method aimed at producing decision trees
with the minimum number of nodes or the minimum cost (where the cost was defined as the
total cost of classifying unknown examples given the cost of measuring individual attributes and
the expected probability distribution of examples of different decision classes) More
explanations are provided in the following section
3.3.1 The AQDT-2 attribute selection method
This section describes the AQDT-2 method for building a decision structure from decision rules.
The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples The major difference is that it assigns tests
(attributes) to the nodes using criteria based on the properties of the decision rules (this includes
statistics about the examples covered by each rule in the case of learning rules from examples)
rather than statistics characterizing the frequency of training examples per decision class, per
attribute-value, or per conjunction of both. Other differences are that the branches may be
assigned an internal disjunction of values (not only a single value as in a typical decision tree)
and leaves may be assigned a set of alternative decisions with probabilities Also the tests can be
attributes or names standing for logical or mathematical expressions that involve several
attributes or variables In the following we use the terms test and attribute interchangeably
(to distinguish between an attribute and a name standing for an expression the latter is called a
constructed attribute)
At each step the method chooses the test from an available set of tests that has the highest utility
(see below) for the given set of decision rules This test is assigned to the node The branches
stemming from this node are assigned test values or disjoint groups of values (in the form of
logical disjunction if such occur in the rules subsumed groups of values are removed) Each
branch is associated with a reduced set of rules determined by removing conditions in which the
selected attribute assumes value(s) assigned to this branch If all rules in the reduced ruleset
indicate the same decision class a leaf node is created and assigned this decision class The
process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further
because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of
candidate decisions with associated probabilities (see Sec. 4.2).
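The construction loop described above can be condensed into a sketch. This is illustrative only: attribute ranking is abstracted into a `best_test` callback, whereas AQDT-2 uses the LEF criteria described below, and the handling of "or" branches, probabilities, and unavailable attributes is omitted.

```python
# Recursive decision-structure construction from rules (sketch).
# rules: list of (conditions, cls); conditions map attribute -> allowed values.
def build(rules, best_test):
    classes = {cls for _, cls in rules}
    if len(classes) == 1:
        return classes.pop()                      # leaf: one decision class
    attr = best_test(rules)                       # stand-in for LEF ranking
    values = set()
    for conds, _ in rules:
        values |= conds.get(attr, set())
    branches = {}
    for v in values:
        # Keep rules consistent with attr=v, removing the used condition.
        reduced = [({a: s for a, s in conds.items() if a != attr}, cls)
                   for conds, cls in rules
                   if attr not in conds or v in conds[attr]]
        branches[v] = build(reduced, best_test)
    return {"test": attr, "branches": branches}
```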
The test (attribute) utility is a combination of one or more of the following elementary criteria 1)
cost which indicates the cost of using each attribute for making decision 2) disjointness which
captures the effectiveness of the test in discriminating among decision rules for different decision
classes 3) importance which determines the importance of a test in the rules 4) value
distribution, which characterizes the distribution of the test importance over its values, and 5)
dominance, which measures the test's presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.
Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, i.e.,
the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm,
and decision rulesets for these classes have been determined. Given test A, let V1, V2, ..., Vm
denote the sets of values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm,
respectively. If the ruleset for some class, say Ck, contains a rule that does not involve test A, then
Vk is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is
the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for
Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for
Cj is defined by:

                 | 0   if Vi = Vj
  D(A, Ci, Cj) = | 1   if Vi ⊂ Vj or Vi ⊃ Vj                                   (3-1)
                 | 2   if Vi ∩ Vj ≠ ∅ & Vi ∩ Vj ≠ Vi & Vi ∩ Vj ≠ Vj
                 | 3   if Vi ∩ Vj = ∅
where ∅ denotes the empty set. Note that exchanging the second and third conditions of equation
(3-1) may seem to yield an improved criterion; however, it would not clearly distinguish between
the two cases (i.e., in both situations the disjointness would be similar). The current equation is
better because it gives higher scores to attributes that separate different subsets of the two decision
classes than to attributes that separate only a subset of one decision class.
Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the
sum of the degrees of class disjointness of each decision class:
  Disjointness(A) = Σ(i=1..m) D(A, Ci),  where  D(A, Ci) = Σ(j=1..m, j≠i) D(A, Ci, Cj)    (3-2)
The disjointness of a test ranges from 0, when the test values in the rulesets of different classes
are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the
test values. If two tests have the same disjointness value, the one with the smaller number of
values is selected.
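Definitions 3-1 and 3-2 translate directly into code (a straightforward sketch; `value_sets` is assumed to be the list of value sets V1, ..., Vm, one per class):

```python
# Degree of disjointness between two classes' value sets (equation 3-1).
def pair_disjointness(vi, vj):
    if vi == vj:
        return 0
    if vi < vj or vi > vj:      # proper subset or superset
        return 1
    if vi & vj:                 # partial overlap
        return 2
    return 3                    # no common values

# Disjointness of a test over all ordered class pairs (equation 3-2).
def disjointness(value_sets):
    return sum(pair_disjointness(vi, vj)
               for i, vi in enumerate(value_sets)
               for j, vj in enumerate(value_sets) if i != j)
```

With m classes the maximum is 3m(m-1), as stated above, since each ordered pair of classes contributes at most 3.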
Definition: The Average Number of Tests (ANT) required to make a decision from a decision
tree is defined as the average number of tests (attributes) to be examined from the root of the
tree to any leaf node in order to reach a decision.
Definition: A decision structure is a one-node-per-level decision structure if at each level there is
only one node and zero or more leaves. Such a decision structure can be generated by merging
into a single branch all branches whose associated sets of decision rules belong to more than one
decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two
decision classes. The disjointness criterion ranks first the attributes that add the minimum
number of tests to the decision tree.
Proof: Consider all possible distributions of an attribute's values within two decision classes Ci
and Cj. There are three cases: 1) one value set is a subset (equivalently, a superset) of the other;
2) the value sets have a non-empty intersection but neither is a subset of the other; 3) the value
sets do not intersect. Figure 3-3 shows all possible distributions (the case of having the same set
of values in both classes is a trivial one). Assume that branches leading to subsets with the same
decision class are combined into one branch. In the first case there will be only two branches:
the first leads to a leaf node, and the other leads to an intermediate node where another attribute
is to be selected. The minimum ANT in this case is 5/3. In the second case three branches should
be created. Two branches lead to leaf nodes, where all values at each branch belong to only one
(and a different) decision class. The third branch leads to an intermediate node
where another attribute should be selected that further classifies the decision classes. The
minimum ANT in this case is 6/4 (i.e., 3/2). In the third case only two branches will be
generated, each leading to a leaf node with a different decision class. In this case the minimum
ANT is 1.
Figure 3-4 shows the decision trees that correspond to selecting an attribute of each kind. Note
that if more than one attribute-value occurs on branches leading to leaves of one decision class,
those branches are combined into one branch in the decision structure. The symbol "1" means
that at least one more attribute is needed to classify the two decision classes; in such cases there
will be at least two additional paths.
D(A, Ci) = 1, D(A, Cj) = 1    D(A, Ci) = 2, D(A, Cj) = 2    D(A, Ci) = 3, D(A, Cj) = 3
Figure 3-3 Venn diagrams of possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is determined
in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks
highest the attributes that reduce the average number of tests required for decision-making. The
theorem can be proved similarly in the general case.
ANT=5/3    ANT=3/2    ANT=1
("1" means at least one attribute is needed to complete the decision tree.)
Figure 3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify
the decision classes.
Proof: Suppose that the number of decision classes is m. Assume also that there are two attributes
A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship
between any two decision classes, this means that there are more decision classes where
D(A, Ci) < D(B, Ci) than classes where D(A, Ci) > D(B, Ci) (ignoring classes with equal
disjointness for both attributes). Hence, there are more pairs of decision classes where
D(A, Ci, Cj) < D(B, Ci, Cj) than pairs where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that
if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.
For each pair of decision classes Ci and Cj, the possible values of the disjointness of any
attribute (summed over both directions) are 0, 2, 4, or 6. For all positive values of D(B), it is
clear that attribute B should have a smaller ANT than attribute A, which has the lower
disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both
decision classes better than attribute A. Similarly, B is better for classifying more pairs of
decision classes than A. This implies that B is a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the
importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained
rules, each test is assigned a score that represents the total number of training examples that
are covered by the rules involving this test. Decision rules learned by an AQ learning program
are accompanied by information on their strength. Rule strength is characterized by the t-weight
and the u-weight. The t-weight (total weight) of a rule for some class is the number of examples
of that class covered by the rule. The importance score of a test is the aggregation of the
t-weights of all rules that contain that test in their condition part. Given a set of decision rules
for m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules
associated with class Ci denoted by ri, the importance score is defined as follows.
Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by

  IS(Aj) = Σ(i=1..m) IS(Aj, Ci)                                        (3-3.1)

where

  IS(Aj, Ci) = Σ(k=1..ri) Rik(Aj)                                      (3-3.2)

and Rik(Aj), the weight of a test Aj in the rule Rik of class Ci, is given by

  Rik(Aj) = t-weight of Rik,  if Aj belongs to rule Rik                (3-4)
            0,                otherwise

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
The importance score method has been separately compared, as a feature selection method, with
a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method
produced equal or higher accuracy on three real-world problems than the GA method, while
selecting fewer attributes. In addition, the IS method was significantly faster than the GA
method.
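Equations (3-3) and (3-4) amount to summing t-weights over the rules that mention a test, which can be sketched as follows (the ruleset format, mapping each class to (attributes-in-rule, t-weight) pairs, is an assumption for illustration):

```python
# Importance score of a test: sum of t-weights of all rules whose
# condition part contains the test (equations 3-3 and 3-4).
def importance_score(test, rulesets):
    return sum(t_weight
               for rules in rulesets.values()
               for attrs, t_weight in rules
               if test in attrs)

# Hypothetical rulesets: class -> [(attributes used by rule, t-weight), ...]
rulesets = {"C1": [({"x1", "x2"}, 10), ({"x3"}, 5)],
            "C2": [({"x1"}, 7)]}
```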
Value distribution: The third elementary criterion, value distribution, concerns the number of
legal values of tests. Given two tests with equal importance scores, this criterion prefers the test
with the smaller number of legal values. Experiments have shown that this criterion is especially
useful when using discriminant decision rules.
Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by

  VD(Aj) = IS(Aj) / vj                                                 (3-5)

where vj is the number of legal values of Aj.
Dominance The fourth elementary criterion dominance prefers tests that appear in large
numbers of rules as this indicates their high relevance for discriminating among the rulesets of
given decision classes Since some conditions in the rules have values linked by internal
disjunction counting such rules directly would not reflect properly their relevance Therefore
for computing the dominance the rules are counted as if they were converted to rules that do not
have internal disjunction. Such a conversion is done by multiplying out the condition parts of
the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is
multiplied out to two condition parts, [x3=1] & [x4=1] and [x3=3] & [x4=1].
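The multiplied-out count can be computed without actually expanding the rules, since a rule whose conditions allow k1, k2, ... values expands into k1·k2·... single-value rules (a sketch; the rule format is an assumption):

```python
from math import prod
from collections import Counter

# Dominance: count rule occurrences after multiplying out internal
# disjunctions. rules: list of dicts mapping attribute -> allowed values.
def dominance(rules):
    counts = Counter()
    for rule in rules:
        expanded = prod(len(values) for values in rule.values())
        for attr in rule:
            counts[attr] += expanded   # attr occurs in every expanded rule
    return counts
```

For the example above, [x3=1 v 3] & [x4=1] contributes two expanded rules, so both x3 and x4 receive a count of 2.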
The above criteria are combined into one general test measure using the lexicographic
evaluation functional with tolerances (LEF) (Michalski 1973) LEF is a list of some or all of
the above elementary criteria, each associated with a "tolerance threshold" expressed as a percentage. The
criteria are applied to tests in the order defined by LEF A test passes to the next criterion only if
it scores on the previous criterion within the range defined by the tolerance (from the top value)
The default LEF is:

  <Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>    (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percentages); their default values are 0%.
The default value of the cost of each test is 1.
The above LEF ranks attributes in the following way. First, attributes are evaluated on the basis
of their cost. If two or more attributes have the same cost, they are ranked by their degree of
disjointness. If two or more attributes still share the same top score, or their scores differ by
less than the assumed tolerance threshold t2, the method evaluates these attributes using the next
(importance) criterion. If again two or more attributes share the same top score, or their scores
differ by less than the tolerance threshold t3, then the value distribution (normalized IS)
criterion is used, and then, similarly, the dominance criterion. If there is still a tie, the method
selects among the tied attributes randomly.
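The LEF filtering scheme can be sketched as successive elimination (an illustration of the mechanism, not the AQDT-2 code; criteria are given as (score function, tolerance percent, maximize?) triples, and scores are assumed non-negative):

```python
# Lexicographic evaluation with tolerances: each criterion narrows the
# candidate set to those within `tol` percent of the best score.
def lef_select(candidates, criteria):
    for score, tol, maximize in criteria:
        best = max(score(c) for c in candidates) if maximize \
               else min(score(c) for c in candidates)
        if maximize:
            candidates = [c for c in candidates
                          if score(c) >= best * (1 - tol / 100)]
        else:
            candidates = [c for c in candidates
                          if score(c) <= best * (1 + tol / 100)]
        if len(candidates) == 1:
            break
    return candidates[0]           # remaining ties are broken arbitrarily
```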
If there is a non-uniform frequency distribution of examples of different classes, then the
selection criterion uses a modified definition of the disjointness. Namely, the previously defined
disjointness for each class is multiplied by the frequency of the class occurrence. The class
occurrence is the expected number of future examples that are to be classified into a given class:

  Disjointness(A) = Σ(i=1..m) D(A, Ci) · Frq(Ci)                       (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given by the
user. The attribute ranking criterion in this case is defined by the LEF:

  <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other
elementary criteria are treated the same way as in the default version.
3.3.2 The AQDT-2 algorithm
The AQDT-2 algorithm constructs a decision structure from decision rules by recursively
selecting at each step the best test according to the ranking criteria described above and
assigning it to the new node The process stops when the algorithm creates terminal branches that
are assigned decision classes To facilitate such a process the system creates a special data
structure for each concept description (ruleset) This structure has fields such as the number of
rules, the number of decision classes, and the number of attributes present in the rules. A set of
pointers connects this data structure to a set of data structures, each representing one decision
class. The decision class structure contains fields with information on the number of rules
belonging to that class, the frequency of the decision class, etc. It is also connected to a set of
data structures representing the decision rules within each decision class. The system
independently creates a set of data structures, each corresponding to one attribute. Each attribute
description contains the attribute's name, domain, type, number of legal values, a list of the
values, the number of rules that contain that attribute, and the values of that attribute in each
rule. The attributes are arranged in an array in lexicographic order: first, in descending order of
the number of rules
38
that contain that attribute and second in the ascending order of the number of the attributes
legal values
The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:
A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.
B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.
To generate decision structures from rules, the AQDT-2 method prefers either characteristic or discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset and that this set is the initial ruleset context. The AQDT-2 algorithm is:
The AQDT-2 Algorithm
Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent this attribute.
Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of the attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.
Step 3: For each branch, associate with it a group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing the condition [A = i v ...]. If a branch is assigned values i v j, then associate with it all rules containing the condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with a given branch constitute the ruleset context for this branch.
Step 4: If all the rules in a ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree have leaf nodes, stop. Otherwise, repeat steps 1 to 4 for each branch that has no leaf.
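Steps 1 to 4 can be sketched in code roughly as follows. This is an illustrative reconstruction, not the actual AQDT-2 implementation: the rule representation (a class name paired with a dictionary of attribute conditions), the stand-in ranking function, and the tree encoding are all assumptions.

```python
def build_tree(rules, domains, rank_attribute):
    """Recursively build a decision tree from decision rules.
    rules: list of (class_name, {attribute: set_of_allowed_values}).
    domains: {attribute: list of legal values}.
    rank_attribute: stand-in for the LEF attribute ranking measure."""
    classes = {c for c, _ in rules}
    if len(classes) == 1:                      # Step 4: one class -> leaf
        return classes.pop()
    attrs = {a for _, conds in rules for a in conds}
    best = rank_attribute(rules, attrs)        # Step 1: pick best attribute
    node = {}                                  # Step 2: one branch per value
    for value in domains[best]:
        group = []                             # Step 3: distribute the rules
        for cls, conds in rules:
            if best not in conds:              # consensus law: matches all values
                group.append((cls, conds))
            elif value in conds[best]:
                reduced = {a: v for a, v in conds.items() if a != best}
                group.append((cls, reduced))
        if group:
            node[value] = build_tree(group, domains, rank_attribute)
    return (best, node)

# Hypothetical parity-style ruleset for illustration: P if x1 = x2, N otherwise.
PARITY_RULES = [("P", {"x1": {1}, "x2": {1}}), ("P", {"x1": {2}, "x2": {2}}),
                ("N", {"x1": {1}, "x2": {2}}), ("N", {"x1": {2}, "x2": {1}})]
TREE = build_tree(PARITY_RULES, {"x1": [1, 2], "x2": [1, 2]},
                  lambda rules, attrs: sorted(attrs)[0])  # trivial ranking stand-in
```

The sketch uses the standard mode (one branch per legal value); compact mode would instead branch on the disjoint value sets described above.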
To select an attribute to be a node in the decision tree (steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all decision rules and determines information about each attribute. This information includes the importance score of each attribute, the number of rules containing a given attribute, the disjoint value sets of each attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF function; it evaluates each attribute's disjointness for each decision class against the other decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

r = Σ_{i=1..m} Ri    (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be determined as:

Cmpx(Iter1) = O(r · s)

In the second iteration, the disjointness is calculated between the decision classes and all attributes. The complexity of the second iteration can be given by:

Cmpx(Iter2) = O(n · m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

l = max{m, r}    (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node in the decision tree, the node complexity NC(AQDT), is given by:

NC(AQDT) = O(l · n)

Usually l equals the number of rules associated with the given node. Thus, the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.
At each level of the decision tree, these two iterations are repeated for each non-leaf node. The level complexity of the AQDT algorithm, LC(AQDT), can be given by:

LC(AQDT) < O(l · n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be twice or more the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm equal to (l · s · o), where o is the number of non-leaf nodes at the given level. In such cases, either (l · o ≤ r) or (l · s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by:

LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes (a: per one level; b: per one path)
Note also that after selecting an attribute to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structure of the algorithm. Also, if a leaf node is generated, the rules belonging to the corresponding branch are not tested again.
Since the disjointness criterion selects the attribute that minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels per decision tree should be less than or equal to the minimum of the number of attributes and the number of rules. Let k be the number of levels in a given decision tree:

k ≤ min{n, r}    (3-10)
There are two cases that represent the most complex situations (Figures 3-5-a and 3-5-b). In the first case, where the decision rules are divided evenly, the number of levels is a function of the logarithm of the number of rules. In such a case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by:

Complexity(AQDT) = O(l · n · log r)    (3-11)

The other situation is when the generated decision tree has the maximum number of levels. The maximum possible number of levels per decision tree equals one less than the number of decision rules (Figure 3-5-b). Using the disjointness criterion, it is not likely to get such a decision tree, because it has the maximum average number of tests (ANT) that can be determined from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In such a case, any disjoint decision rules should have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus, the level complexity of this decision tree is estimated as:

LC(AQDT) = O(l · log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT algorithm in such cases is given by:

Complexity(AQDT) = O(l · k · log n)    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12) r = n+1, log r is greater than log n. From (3-10), (3-11), and (3-12), the complexity of the AQDT algorithm is determined by:

Cmplx(AQDT) = O(r · k · log l)    (3-13)
3.3.3 An example illustrating the algorithm
The following simple example illustrates how AQDT is used in selecting the optimal set of testing resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of the tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.
Suppose that the domain expert provided a set of rules to be used in testing software; Figure 3-6 shows a sample of these rules in AQ15c format.
Table 3-1: The available tools and the factors that affect the process
T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]
Figure 3-6: Decision rules for selecting the best tool for testing software
These rules can be interpreted as:
Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.
Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phase, and you need an automated tool.
Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.
Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phase, and you need a manual tool.
Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.
Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phase, and you need a semi-automated tool.
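For illustration only, the six rules of Figure 3-6 can be encoded as data and matched against examples as below. This is a hypothetical encoding, not part of AQDT-2; the rule representation and the helper name `matching_classes` are assumptions.

```python
# The six rules of Figure 3-6, written as (class, {attribute: value set}) pairs.
RULES = [
    ("T1", {"x1": {2}, "x2": {2}}),
    ("T1", {"x1": {3}, "x3": {1, 3}, "x4": {1}}),
    ("T2", {"x1": {1, 2}, "x2": {3, 4}}),
    ("T2", {"x1": {3}, "x3": {1, 2}, "x4": {2}}),
    ("T3", {"x1": {1}, "x2": {1}}),
    ("T3", {"x1": {4}, "x3": {2, 3}, "x4": {3}}),
]

def matching_classes(example):
    """Return the set of classes whose rules cover the given example.
    Every condition of a rule must be satisfied for the rule to fire."""
    return {cls for cls, conds in RULES
            if all(example.get(a) in vals for a, vals in conds.items())}
```

For instance, an example with average cost (x1=2) and the requirement metric (x2=2) matches only Rule 1, so only tool T1 is recommended.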
Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ... are all legal values of A.
The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.
From the rules in Figure 3-6, we can also determine the disjoint groupings of attribute-values used for the compact mode of the algorithm. This is done as follows: 1) determine for each attribute the sets of values that the attribute takes in the individual decision rules, and 2) remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). The value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2}, and {3, 4}.
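The subsumption-based grouping just described can be sketched as follows (illustrative code, not the system's implementation; the function name is an assumption):

```python
def disjoint_value_sets(value_sets):
    """Collect the value sets an attribute takes in the individual rules,
    then drop any set that subsumes (is a proper superset of) another
    remaining set; the survivors label the branches in compact mode."""
    sets = [frozenset(s) for s in value_sets]
    kept = [s for s in sets
            if not any(other < s for other in sets)]  # drop proper supersets
    return sorted(set(kept), key=sorted)
```

Run on the value sets from the text, this removes {1, 2} for x1 and {1, 2, 3, 4} for x2, leaving exactly the branch labels given above.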
Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6; this decision structure can be used in making decisions on which tools to use for testing a given piece of software.
Figure 3-7: A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)
Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3, and 4). Rules are represented by collections of cells in the intersection of the rows and columns corresponding to the conditions in the rules.
The shaded areas correspond to decision rules; rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21, R31, and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].
Figure 3-8: a) Decision rules; b) Derived decision tree
Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of concepts T1, T2, and T3 than the original rules. Let us assume that it is very costly to know which metrics support the required tools; in other words, suppose that we would like to select the best tools independently of the metric they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool was ignored in the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.
Figure 3-9: Decision trees learned ignoring a) the support metric and b) the type of the testing tool
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selects x4 to be the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.
Figure 3-10: A decision tree learned without the cost attribute
3.4 Tailoring Decision Structures to the Decision-Making Situation
Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they are built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning programs, or are specified by an expert. An efficient algorithm and a new system, AQDT-2, that transforms decision rules into task-oriented decision structures are presented. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.
Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions with an estimate of the likelihood of their correctness when the needed information cannot be provided.
3.4.1 Learning Cost-Dependent Decision Structures
As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) when developing a decision structure. In the default setting of LEF, the test's cost is the first criterion and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation involving the other elementary criteria. If an attribute has a high cost or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.
3.4.2 Assigning Decisions Under Insufficient Information
In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained but a decision has to be made, it is useful to know the probability distribution for the different candidate decisions (Smyth, Goodman & Higgins, 1990); the most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute-values, and let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has the attribute-values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayes formula, we have:

P(Ci | b1, ..., bk) = P(Ci) · P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from different classes, we have:
P(Ci) = twi / Σ_{j=1..m} twj    (3-10)

P(b1, ..., bk | Ci) = wi / twi    (3-11)

P(b1, ..., bk) = Σ_{j=1..m} wj / Σ_{j=1..m} twj    (3-12)

By substituting (3-10), (3-11), and (3-12) in (3-9), we obtain:

P(Ci | b1, ..., bk) = wi / Σ_{j=1..m} wj    (3-13)
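Formula (3-13) reduces the class distribution at a node to simple counts of the training examples that reached it. A minimal sketch (assuming the wi counts are already available; the function name is an assumption):

```python
def class_distribution(w):
    """Estimate P(Ci | b1, ..., bk) as wi / sum of wj, per (3-13).
    w: {class_name: number of training examples of that class that
        passed the tests leading to this node}."""
    total = sum(w.values())
    return {c: wi / total for c, wi in w.items()}
```

The most probable decision at the node is then simply the class with the largest estimated probability.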
A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.
3.4.3 Coping with noise in training data
The proposed methodology can easily be extended to handle the problem of learning from noisy training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
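A minimal sketch of the truncation step, assuming each rule's t-weight (the number of training examples it covers) is known; this is illustrative only, since the real systems operate on AQ rule structures:

```python
def truncate_rules(rules, t_weights, threshold):
    """Drop rules whose t-weight falls below the noise-level threshold,
    keeping the original rule order for the survivors."""
    return [r for r in rules if t_weights[r] >= threshold]
```

The threshold is chosen to reflect the expected noise level: the higher the expected noise, the more low-coverage rules are discarded before the decision structure is built.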
3.5 Analysis of the AQDT-2 Attribute Selection Criteria
This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993; it has four attributes (see Table 2-2) and two decision classes, and the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989; it has two attributes and two decision classes, and contains ambiguous examples (i.e., examples belonging to more than one decision class).
In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity <= 75]
Play <= [outlook = rain] & [windy = false]
For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.
Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute to be the root of the tree. The goal of this experiment is first to select the correct attribute and then to test how the given criterion evaluates the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y); however, the gain ratio criterion gave X a higher score than all the other criteria.
Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for class C, all ambiguous examples belonging to C are considered as examples that belong to C only.
The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6, including the gain ratio criterion, when applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion performs better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria are applied to the learned rules, they provide very good results.
The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the most balanced appearance of its values in different rules. The dominance criterion prefers attributes that exist in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.
In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
3.6 Decision Structures vs. Decision Trees
This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.
Table 3-7: A comparison between decision structures and decision trees
Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11; both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes and is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.
Figure 3-11: Decision structures learned by AQDT-2 using a) the disjointness criterion and b) the importance score criterion (P: Positive, N: Negative)
To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called "Imam's example," that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example is based on the fact that the information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.
The concept to learn is: P if x1 = x2, and N otherwise.
The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "-" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples per value of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.
Figure 3-12: Imam's example, where learning decision structures (trees) from rules is better than learning them from examples (a: training examples; b: the optimal decision tree)
AQ15c learned the following rules from this data:

P <= [x1=1][x2=1] v [x1=2][x2=2]
N <= [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
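The effect can be checked with a small parity-style sample (a hypothetical balanced dataset, not the exact data of Figure 3-12): for the concept "P if x1 = x2," the information gain of x1 or x2 alone is zero, so gain-based learners have no basis for selecting either attribute as the root.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(examples, attr):
    """Information gain of splitting the examples on a single attribute."""
    labels = [e["class"] for e in examples]
    gain = entropy(labels)
    for v in {e[attr] for e in examples}:
        subset = [e["class"] for e in examples if e[attr] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# Hypothetical balanced parity-style data: class is P exactly when x1 == x2.
DATA = [{"x1": a, "x2": b, "class": "P" if a == b else "N"}
        for a in (1, 2) for b in (1, 2)]
```

Each branch of x1 (or x2) still contains a 50/50 mix of P and N, so the split removes no uncertainty; only the pair of attributes together determines the class, which the AQ rules capture directly.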
An example of problems in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a; Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <= [x1=2] v [x2=2]
N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem, learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10/9); 2) comparing the average number of tests required to make a decision by the decision tree and by the decision rules; 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).
When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using a new attribute representing the condition "x1=2 v x2=2" with values 0 for "no" and 1 for "yes".
Figure 3-13: An example where decision rules are simpler than decision trees (a: the training data; b: the correct decision tree)
CHAPTER 4 Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different
problems using different sizes of training data and applying different settings of the systems
parameters For comparison it also presents results from applying a well-known decision tree
learning system (C45) to the same problems This section also includes some analysis and
visualization of the learned concepts by AQI5c and AQDJ2
The experiments are applied to the following problems MONK-I MONK-2 MONK-3
Engineering Design of Wind Bracing Classification of Mushrooms Diagnosing Breast Cancer
Congressional voting records of 1984 and East-West Trains The MONKs problems are concerned
with learning classification rules for robot-like figures MONK-l requires learning a DNF-type
description MONK-2 requires learning a non-DNF-type description (one that cannot be easily
described in DNF from using the original attributes) MONK-3 involves learning a DNF rule from
noisy data The Engineering Design dataset involves learning conditions for applying different
types of wind bracings for tall buildings Mushrooms is concerned with learning classification
rules for distinguishing between poisonous and non-poisonous mushrooms Breast Cancer
involves learning concept descriptions for recognizing breast cancer Congressional bting
records describes the voting records of republican and democratic US senators for 1984 The
East-West Train characterizes eastbound and westbound trains using structural representation
In order to determine the learning curve, the system was run with different relative sizes of the
training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each
problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training,
that is, for learning a concept description. The remaining examples in each case were used for
testing the obtained descriptions, to determine their prediction accuracy.
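The sampling procedure above can be sketched as follows. This is a minimal illustration, not the original experimental code; the function name and the use of Python's random module are assumptions.

```python
import random

def learning_curve_splits(examples, fractions=(0.1, 0.2, 0.3, 0.4, 0.5,
                                               0.6, 0.7, 0.8, 0.9),
                          samples_per_fraction=100, seed=0):
    """For each relative training size, draw `samples_per_fraction` random
    training samples; the remaining examples form the complementary test set."""
    rng = random.Random(seed)
    for frac in fractions:
        k = max(1, int(round(frac * len(examples))))
        for _ in range(samples_per_fraction):
            train = rng.sample(examples, k)
            # testing examples are the complement of the training sample
            test = [e for e in examples if e not in train]
            yield frac, train, test
```

With 9 sizes and 100 samples per size, this yields the 900 train/test pairs per problem mentioned later in the text.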
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided
into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind
Bracing problem) was used to test and analyze the approach. The second set of problems
(Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for
additional testing and comparison with other systems.
Figure 4-1 shows a description of the planned experiments done on the first set of problems. The
best settings (best path from top down) in terms of accuracy, time, and complexity were used as
default settings for experiments on the second set of problems. Each path from the top of the graph
to the bottom represents a single experiment. For each path, the experiment was repeated over 900
times with different sets of different sizes of training examples.
Figure 4-1: Design of a complete experiment
For each of these experiments, the testing examples were selected as the complementary set of the
training examples. Other experiments were performed in which the learning system AQ17 was used
instead of AQ15c. Analysis of some experiments included visualization of the training examples
and the concept learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different
decision structures learned for different decision-making situations were visualized, as were different
but equivalent decision structures learned for a given set of training examples.
For each problem (i.e., database), 9 different relative sample sizes of training examples are selected (10%, ..., 90%); 100 random samples of each size are drawn from the original data for training; the 100 complementary sets which remain from the original data after drawing the training data are
used for testing (900 samples for training and their 900 complementary samples for testing).
162 different parametrical experiments per training dataset (18 x 9); 16,200 experiments per sample size (9 samples); 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2); 199,800 experiments per problem (first portion + C4.5 + constructive induction); 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.);
73 days (estimated running time).
The following subsection includes a complete experimental analysis of the wind bracing problem.
Each subsection following that describes a partial or full experimental analysis of one of the other
problems.
4.2 Experiments With an Average-Size, Complex, and Noise-Free Problem: Wind
Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for
determining the structural quality of a tall building design. The quality of the design is partitioned
into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is
characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3),
number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of
horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly
selected to serve as training examples and 115 (34%) were used for testing the obtained decision
structure. In the first phase, the training examples were used to determine a set of decision rules. This
was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules
obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents values
of the four elementary criteria for each attribute occurring in the rules, for the step of determining
the root of the decision structure. For each class, the row marked "values" lists the values occurring in
the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the
ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b
v ...], where a, b, ... are all legal values of A.
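The rule-completion step just described can be sketched directly. A minimal sketch, assuming a rule is represented as a mapping from attribute names to sets of allowed values (the representation and function name are illustrative, not AQDT-2's internal format):

```python
def complete_rule(rule, domains):
    """For disjointness evaluation, a rule that does not mention attribute A
    is treated as containing the implicit condition [A = a v b v ...] over
    all legal values of A."""
    # start with every attribute allowing its full domain ...
    completed = {attr: set(values) for attr, values in domains.items()}
    # ... then restrict the attributes the rule actually constrains
    for attr, values in rule.items():
        completed[attr] = set(values)
    return completed
```

For example, with domains x1 in {1, 2, 3} and x5 in {1, 2, 3, 4}, the rule [x5=1] is completed to [x1=1 v 2 v 3][x5=1].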
Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1,3] (t:18, u:18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t:3, u:3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t:2, u:2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t:2, u:2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t:2, u:2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t:2, u:2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t:2, u:2)
Decision class C2:
1. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t:28, u:19)
2. [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t:17, u:6)
3. [x1=2,4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t:10, u:4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t:10, u:2)
5. [x1=3,5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t:9, u:4)
6. [x1=2][x2=1,2][x3=1,2][x5=1,2,3][x4=1][x6=1][x7=1] (t:7, u:6)
7. [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t:6, u:4)
8. [x1=3,5][x2=2][x3=1][x7=1][x4=1,2][x5=1,2,3][x6=1,3] (t:5, u:5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t:4, u:4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1,3] (t:4, u:4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t:4, u:2)
Decision class C3:
1. [x1=2..5][x2=1,2][x3=1,2][x7=1..4][x4=1,2][x5=1,3][x6=2..4] (t:41, u:32)
2. [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2..4] (t:27, u:20)
3. [x1=1,3][x2=1][x3=1,2][x7=1..4][x4=2][x5=1,2][x6=2,3] (t:19, u:6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t:13, u:8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2..4] (t:5, u:5)
Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1..4] (t:4, u:4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t:1, u:1)
Figure 4-2: Decision rules determined by AQ15c from the wind bracing data
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single
highest, and all other attributes are beyond the tolerance threshold, no other attributes are
considered). Branches stemming from the root are marked by values of x6 (in general, they could be
groups of values) according to the way they occur in the decision rules; groups subsumed by
other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the
rules containing these values. The process repeats for a branch until all rules assigned to the
branch are of the same class. That class is then assigned to the leaf.
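The construction loop just described can be sketched recursively. This is a simplified sketch: rules are represented as (conditions, class) pairs with conditions mapping attributes to sets of allowed values, and a frequency-based selector stands in for AQDT-2's actual multi-criterion LEF ranking.

```python
def count_selector(rules, used):
    """Toy stand-in for the LEF: pick the unused attribute mentioned by
    the most rules (AQDT-2 ranks attributes by disjointness etc. instead)."""
    counts = {}
    for conds, _cls in rules:
        for attr in conds:
            if attr not in used:
                counts[attr] = counts.get(attr, 0) + 1
    return max(sorted(counts), key=lambda a: counts[a])

def build_structure(rules, select_attribute, used=frozenset()):
    classes = {cls for _, cls in rules}
    if len(classes) == 1:                 # all assigned rules agree: a leaf
        return classes.pop()
    attr = select_attribute(rules, used)  # choose the node attribute
    node = {}
    values = set().union(*(conds.get(attr, set()) for conds, _ in rules))
    for v in values:
        # a rule follows branch v if it allows value v (or omits the attribute)
        subset = [(c, cls) for c, cls in rules if attr not in c or v in c[attr]]
        node[v] = build_structure(subset, select_attribute, used | {attr})
    return (attr, node)
```

On a toy ruleset resembling the wind bracing case, where all rules with one value of the root attribute belong to a single class, that branch immediately becomes a leaf, while mixed branches are split further.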
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2
(using the default LEF). The structure was evaluated on the testing examples. The prediction
accuracy was 88.7% (102 testing examples were classified correctly and 13 were misclassified).
Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf, C3.
Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are
recalculated only for those rules which contain [x6=1] as one of their conditions. In this
example, x1 has the highest importance score, so it was selected as a node in the structure. This
process is repeated for each subset of rules until the decision structure is completed.
For comparison, the program C4.5 for learning decision trees from examples was also applied to
this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window
setting (the maximum of 20% of the number of examples and twice the square root of the number of
examples) and with the number of trials set to one. C4.5 was chosen for the comparative studies
because it is one of the most accurate and efficient systems for learning decision trees from examples,
and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a
randomly selected subset of the training examples). It starts with a randomly selected window of
examples, generates a trial tree, tests this tree against the remaining examples, adds some
misclassified examples to the original ones, and continues until either all training examples are
classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was
learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97
examples were classified correctly and 18 were misclassified.
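The windowing loop described above can be sketched generically. This is a hedged approximation of the behavior described, not C4.5's actual implementation: `learn` and `classify` are caller-supplied placeholders, and the stopping rules are simplified.

```python
import random

def window_training(examples, learn, classify, initial_size, increment,
                    max_trials=10, seed=0):
    """Grow a training window until the learned model fits all examples,
    or no better model can be produced.
    learn(window) -> model; classify(model, example) -> predicted class;
    each example is a tuple whose last element is its true class."""
    rng = random.Random(seed)
    window = rng.sample(examples, min(initial_size, len(examples)))
    best_model, best_errors = None, len(examples) + 1
    for _ in range(max_trials):
        model = learn(window)
        errors = [e for e in examples if classify(model, e) != e[-1]]
        if len(errors) >= best_errors:
            break                                # cannot produce a better tree
        best_model, best_errors = model, len(errors)
        if not errors:
            break                                # all examples classified correctly
        window = window + errors[:increment]     # add some misclassified examples
    return best_model
```

A toy threshold learner converges in a couple of iterations under this loop, since each misclassified example added to the window corrects the learned boundary.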
Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves)
Figure 4-4 shows a decision structure learned with the default settings of the AQDT-2 parameters from
the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing
examples resulted in 102 examples matched correctly and 13 examples mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition
that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision
cannot be made without knowing x1. This incomplete decision structure was tested on 115 testing
examples from which the value of x1 was removed. The decision structure classified 71 examples
correctly, 14 incorrectly, and 30 were assigned the "?" (indefinite) decision. The "?" leaves can be
replaced by sets of candidate decisions with their corresponding probability distribution.
Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves)
Figure 4-5: A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves)
Figure 4-6 presents the decision structure from Figure 4-5 in which leaves were assigned candidate
decisions with decision class probability estimates. Let us consider node x2. The example
frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using
equation (11), the probability estimates for classes C1, C2, C3 and C4 under node x2 can be
approximated as P(C1) = .66, P(C2) = .23, P(C3) = 0, and P(C4) = .11.
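The quoted estimates are reproduced by the simple proportion of each class's weight to the total weight at the node (31 + 11 + 0 + 5 = 47). Equation (11) itself is defined elsewhere in the thesis, so this sketch is a reconstruction from the quoted figures, and the function name is illustrative:

```python
def leaf_probabilities(weights):
    """Estimate P(Ci) at a node as the class's weight divided by the
    total weight of all classes at that node."""
    total = sum(weights.values())
    return {cls: round(w / total, 2) for cls, w in weights.items()}
```

Applied to node x2's frequencies, this yields P(C1) = .66, P(C2) = .23, P(C3) = 0, and P(C4) = .11, matching the text.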
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were
truncated under the assumption of a 10% noise level (this means that the rules whose combined t-weight
represented 10% or less coverage of the training examples in a given class were removed).
The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89%
for the decision structure in Figure 4-4).
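The truncation step can be sketched as below, under one reading of the text: each rule whose t-weight alone covers 10% or less of its class's total t-weight is dropped. The thesis may instead truncate cumulatively over the lightest rules; the representation and names here are illustrative.

```python
def truncate_rules(rules_by_class, noise_level=0.10):
    """Drop rules whose t-weight is at or below `noise_level` of the
    class's total t-weight coverage.
    rules_by_class: {class: [(rule, t_weight), ...]}"""
    kept = {}
    for cls, rules in rules_by_class.items():
        total = sum(t for _, t in rules)
        # keep only rules covering strictly more than the noise threshold
        kept[cls] = [(r, t) for r, t in rules if t > noise_level * total]
    return kept
```

For instance, a class whose rules have t-weights 18, 3, and 2 (total 23, threshold 2.3) loses only the rule with t-weight 2.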
Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves)
Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules that cover less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves)
To demonstrate changes in the concept description learned by AQDT-2 under different decision-making
situations, four attributes were selected for visualizing the change in the learned concept
after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in
the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes
and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making
situation, x1 was given a high cost. AQDT-2 learned a decision structure with five
nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal
situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified
by using only the four attributes which were used in building the initial decision trees. The visualization
diagram indicates different decision classes with different shades. Another shade is used to illustrate
cells that require another attribute to correctly classify the testing examples (e.g., x3, x4 or x7).
Also, white cells indicate that an accurate decision cannot be derived from the rules without
knowing the value of the removed attribute. In such cases, multiple decisions can be provided with
their appropriate probabilities.
Figure 4-8: Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data (a white cell means the system cannot produce a decision without the missing attribute)
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c
for a set of learning problems with 18 different parameter settings of AQ15c (two types of
decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or
ordered, i.e., decision lists; and three beam search widths: 1, 5 and 10). The two settings that
gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>; see Table
4-2) were selected for experiments with Subsystem II.
These experiments were performed for four learning problems (the three MONKs problems [Thrun,
Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two
parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2.
Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the
predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value
in this table is an average value of predictive accuracy of running either one of the two programs
100 times on 100 distinct, randomly selected training data sets of the given size. Each of these runs was
tested with testing examples that represent the complement of the training example set.
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c
and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting
covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means
discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I are fixed
and selected parameters of Subsystem II are modified. The experiments were performed on
characteristic decision rules that were learned in intersecting or disjoint modes. For each data
set, the results reported from each experiment were calculated as the average of 100 runs of different
training data for 9 different sample sizes. The parameters changed in this experiment were the
threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2
algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples
covered by rules belonging to different decision classes at a given node of the decision
structure/tree.
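Read as above, the generalization degree suggests a stopping test of roughly the following shape: if the examples at a node covered by non-majority classes fall within the degree, the node is closed as a leaf of the majority class. This is a sketch of one plausible reading; AQDT-2's actual test may differ in detail, and the names are illustrative.

```python
def generalize_node(class_example_counts, generalization_degree=0.10):
    """Return the majority class if the node can be generalized to a leaf
    (non-majority coverage within the degree), else None to keep splitting."""
    total = sum(class_example_counts.values())
    majority = max(class_example_counts, key=class_example_counts.get)
    minority = total - class_example_counts[majority]
    if total and minority / total <= generalization_degree:
        return majority        # close the node as a leaf
    return None                # mixed enough: continue building the structure
```

Under this reading, a higher degree (e.g., the 10% default) closes nodes earlier and yields smaller structures, while the 3% setting examined below splits further.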
Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings (<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>) for the wind bracing problem. Each panel plots predictive accuracy (%) against the relative sample size (%) of the training data.
Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2
with different parameter settings. The "default" curve means the predictive accuracy obtained with the
default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization
degree is 10%. The results show that with the wind bracing data it is better to reduce the
generalization degree to 3%. However, changing the pre-pruning degree did not improve the
predictive accuracy.
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems
were set to their default parameters. All the results reported here are the average of 100 runs. For
each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and
the time taken for learning. Figure 4-11 shows a simple summary of these experiments.
Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data (<Disj, Char> and <Intr, Char>). Both panels plot predictive accuracy (%) against the relative sample size (%) of the training data.
Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data. The panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).
4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1
This subsection describes an experimental analysis of the AQDT approach on the MONK-1
problem. The MONKs problems (Thrun, Mitchell & Cheng, 1991) involve learning classification
rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists
of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are
octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling
(values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color
(values are red, yellow, green, or blue); and x6, has-tie (values are yes or no).
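The attribute domains above define a space of 3 x 3 x 2 x 3 x 4 x 2 = 432 possible robot figures, matching the count quoted below. A small sketch (the dictionary and function names are illustrative):

```python
# Attribute domains of the MONKs robot-figure problems, as listed in the text.
MONK_DOMAINS = {
    "x1_head_shape":   ["octagonal", "square", "round"],
    "x2_body_shape":   ["octagonal", "square", "round"],
    "x3_is_smiling":   ["yes", "no"],
    "x4_holding":      ["sword", "flag", "balloon"],
    "x5_jacket_color": ["red", "yellow", "green", "blue"],
    "x6_has_tie":      ["yes", "no"],
}

def space_size(domains):
    """Number of distinct examples: the product of the domain sizes."""
    n = 1
    for values in domains.values():
        n *= len(values)
    return n
```

This also confirms the training density figure quoted below: 124 of 432 examples is about 29% of the space.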
The original problem was to learn a concept from 124 training examples (62 positive and 62
negative). These training examples constitute 29% of all possible examples (432); thus the
density of the training examples is relatively high. Figure 4-12 shows a visualization diagram,
obtained using the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and
negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c
from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2
criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned
when using different criteria.
Figure 4-12: A visualization diagram of the MONK-1 problem
The AQDT-2 program, running in its default mode and with the optimality criterion set to minimize
the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with
41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also
applied to this same problem.
Positive rules:
1. [x5 = 1]
2. [x1 = 3][x2 = 3]
3. [x1 = 2][x2 = 2]
4. [x1 = 1][x2 = 1]
Negative rules:
1. [x1 = 1][x2 = 2,3][x5 = 2..4]
2. [x1 = 2][x2 = 1,3][x5 = 2..4]
3. [x1 = 3][x2 = 1,2][x5 = 2..4]
Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem
The C4.5 program did not produce a consistent and complete decision tree when run with its
default window size (the maximum of 20% and twice the square root of the number of examples), nor with
a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5
produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is
presented in Figure 4-14. Also in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was
used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that
takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These
rules were:
Pos <= [x5=1] v [x1=x2] and Neg <= [x5 =/= 1] & [x1 =/= x2]
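The constructive-induction rules reduce MONK-1 to a two-condition test that can be written directly. A sketch, assuming the integer encoding used in the rules above (x5 = 1 meaning the first jacket color, and shapes numbered so that x1 = x2 means equal head and body shapes); the function name is illustrative:

```python
def classify_monk1(x1, x2, x5):
    """MONK-1 concept via the AQ17-DCI-style derived attribute (x1 == x2):
    Pos <= [x5 = 1] v [x1 = x2]; Neg otherwise."""
    return "Positive" if x5 == 1 or x1 == x2 else "Negative"
```

Because the derived equality attribute collapses the 3 x 3 head/body combinations into a single condition, the resulting decision structure needs only two nodes, as in Figure 4-15-b.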
Table 4-3: Evaluation of the attribute selection criteria for the MONK-1 problem
From these rules, the system produced the compact decision structure presented in Figure 4-15-b.
It should be noted that the decision structures in Figures 4-14, 4-15-a and 4-15-b are all logically
equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they
represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler
decision structure was produced (Figure 4-15-a).
Figure 4-14: The decision tree for the MONK-1 problem (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative)
Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: a) a compact decision structure for the AQ15 rules (5 nodes, 7 leaves); b) a compact decision structure for the AQ17 rules (2 nodes, 3 leaves). P = Positive, N = Negative.
Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments
involved running AQ15c for a set of learning problems with 18 different parameter settings of
AQ15c (two types of decision rules: characteristic or discriminant; three coverage modes:
intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5 and
10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and
<Char, Intr, 1>) were selected for experiments with Subsystem II. These experiments were
performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by
AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from
the decision rules.
Each value in that table is an average value of predictive accuracy of running either one of the two
programs 100 times on 100 distinct, randomly selected training data sets of the given size. Each of these
runs was tested with a testing example set that represented the complement of the training example
set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between
AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers,
<Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant
rules.
Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings (<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>) for the MONK-1 problem. Each panel plots predictive accuracy (%) against the relative sample size (%) of the training data.
Experiments with Subsystem II: The same experiments were performed on the MONK-1
problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were
modified. The experiments were performed on characteristic decision rules that were learned in
intersecting or disjoint modes. For each data set, the results reported from each experiment
were calculated as the average of 100 runs of different training data for 9 different sample sizes.
The parameters to be changed in this experiment were the threshold of pre-pruning of the decision
rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure
4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with
different parameter settings. The "default" curve means the predictive accuracy obtained with the default
setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree
is 10%. The results show that with the MONK-1 data it is slightly better to reduce the
generalization degree to 3%. However, increasing the pre-pruning degree did not improve the
predictive accuracy.
Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data (<Disj, Char> and <Intr, Char>). Both panels plot predictive accuracy (%) against the relative sample size (%) of the training data.
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to
their default parameters. The experiments were divided into two parts. All the results reported here
are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity
of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary
of these experiments.
Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem. The panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).
4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2
The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be
easily described as a DNF expression using its original attributes). The problem is described in a
similar way to the MONK-1 problem. The data consists of two decision classes, Positive and
Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape
(values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding
(values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and
x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training
examples (62 positive and 62 negative). These training examples constitute 40% of all the possible
examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and
negative) and the concept to be learned.
Figure 4-19: A visualization diagram of the MONK-2 problem
Experiments with Subsystem I: The two settings that gave the best results in terms of predictive
accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>). They were
selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules
learned by AQ15c from examples and the predictive accuracy of decision structures learned by
AQDT-2 from these decision rules.
Each value in that table is an average value of predictive accuracy of running either one of the two
programs 100 times on 100 distinct, randomly selected training data sets of the given size. Each of these
runs was tested with testing examples that represent the complement of the training examples.
Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c
and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj>
means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and
the number is the width of the beam search.
Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings (<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>) for the MONK-2 problem. Each panel plots predictive accuracy (%) against the relative sample size (%) of the training data.
Experiments with Subsystem II: Again, the parameters of Subsystem I are fixed and
selected parameters of Subsystem II are modified. For each data set, the results reported from each
experiment were calculated as the average of 100 runs of different training data for 9 different
sample sizes. The parameters to be changed in this experiment were the threshold of pre-pruning of
the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam,
1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by
AQDT-2 with different parameter settings. The "default" curve means the predictive accuracy learned in
the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default
generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to
reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not
improve the predictive accuracy.
Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data (<Disj, Char> and <Intr, Char>). Both panels plot predictive accuracy (%) against the relative sample size (%) of the training data.
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to
their default parameters. The experiments were divided into two parts. All the results reported here
are the average of 100 runs. For each data set, we reported the predictive accuracy, the complexity
of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary
of these experiments.
Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem. The panels plot predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).
4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3
MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a
similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same
domains, and the same decision classes as the first two MONKs problems. Figure 4-23 shows a
visualization diagram of the training examples (positive and negative) and the concept to be
learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered
noisy examples. Noisy examples are examples that are assigned the wrong decision class.
Figure 4-23: A visualization diagram of the MONK-3 problem
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed, and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3 and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.
Figure 4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axes: the relative sample sizes (%) of the training data)
Figure 4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data (x-axes: the relative sample sizes (%) of the training data)
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent
a 1% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
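A quick arithmetic check illustrates the point; the pool size of 100 examples used here is an assumption for illustration only:

```python
def error_rate(errors, test_size):
    """Percentage error contributed by a fixed number of misclassifications."""
    return 100.0 * errors / test_size

POOL = 100  # hypothetical total number of examples

# Training on 10% of the pool leaves 90 test examples;
# training on 90% leaves only 10 test examples.
small_train = error_rate(1, POOL - 10)  # one error among 90 tests: about 1.1%
large_train = error_rate(1, POOL - 90)  # the same single error among 10 tests: 10.0%
```

The same absolute mistake thus produces very different error percentages, which is why accuracy can appear to dip as the training fraction grows.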
Figure 4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem (x-axes: relative size of training examples (%))
4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer
The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3) Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).
In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.
The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1% error
rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
Figure 4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem (x-axes: relative size of training examples (%))
4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification
Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes. These attributes are: 1) cap-shape, 2) cap-surface, 3) cap-color, 4) bruises, 5) odor, 6) gill-attachment, 7) gill-spacing, 8) gill-size, 9) gill-color, 10) stalk-shape, 11) stalk-root, 12) stalk-surface-above-ring, 13) stalk-surface-below-ring, 14) stalk-color-above-ring, 15) stalk-color-below-ring, 16) veil-type, 17) veil-color, 18) ring-number, 19) ring-type, 20) spore-print-color, 21) population, and 22) habitat.
To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.
In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning times are about the same.
The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
Figure 4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem (x-axes: relative size of training examples (%))
4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains
Learning Task-Oriented Decision Structures from Structural Data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes: eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.
The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To
describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To recognize the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
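The two-digit coding can be decoded mechanically. A minimal sketch follows; the attribute-number-to-name mapping is hypothetical except for attribute 2, which the text identifies as the car shape:

```python
# Hypothetical mapping; the text only identifies attribute 2 as the car shape.
ATTRIBUTE_NAMES = {2: "car shape"}

def decode(label):
    """Split a label 'xij' into (car position i, attribute number j)."""
    assert label[0] == "x" and len(label) == 3
    return int(label[1]), int(label[2])

def describe(label):
    car, attr = decode(label)
    return "car %d, %s" % (car, ATTRIBUTE_NAMES.get(attr, "attribute %d" % attr))

# decode("x32") yields (3, 2): the car-shape attribute of the third car.
```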
Table 4-7 The set of attributes and their values used in the trains problem (i stands for the car number, 1-4)
Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.
Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more (14/14). In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were
given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).
Figure 4-29 Decision structures learned by AQDT-2 for different decision-making situations: a) decision structure learned using only descriptions of Car 1; b) decision structure learned using only descriptions of Car 2; c) decision structure learned using only descriptions of Car 3
4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984)
In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the other half in the other class).
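For reference, the default initial window described above can be computed directly; this is a sketch of the formula as stated in the text, not C4.5's actual code:

```python
import math

def default_window(n):
    """Initial window size: the maximum of 20% of the examples and
    twice the square root of the number of examples."""
    return max(0.2 * n, 2 * math.sqrt(n))

# For the 216-example voting data: 0.2 * 216 = 43.2 exceeds
# 2 * sqrt(216) which is about 29.4, so the window holds about 43 examples.
```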
Table 4-8 and Figures 4-30 a and b show the results graphically for the Congressional Voting-1984 problem. The results indicate that AQDT-2-generated decision trees had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change in the size of the training example set was smaller.
Table 4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data
Figure 4-30 Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2: a) accuracy of the decision tree as a function of the size of the set of training examples; b) size of the decision tree as a function of the size of the set of training examples (x-axes: relative size of the training examples (%))
4.10 Analysis of the Results
This section includes an analysis of the results presented in Sections 4-2 to 4-9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to
illustrate the relationship between concepts represented by decision rules and concepts represented by decision trees learned from these rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers changes (i.e., for one type of cover it is higher with some widths of beam search or with a certain rule type, and lower with others, than for another type of cover), the best cover is determined according to the best width of the beam search and the best rule type.
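The first heuristic lends itself to a direct statement in code; a minimal sketch, assuming accuracies are given in percentage points:

```python
def prefer_beam_width(results):
    """Given {beam_width: accuracy}, prefer the smallest width whose
    accuracy is within 2 percentage points of the best observed accuracy."""
    best = max(results.values())
    return min(w for w, acc in results.items() if best - acc < 2.0)

# A narrow beam that is nearly as accurate as a wide one is preferred,
# since it yields cheaper rule learning for the same practical accuracy.
```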
Table 4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics
It was clear that AQDT-2 works better with characteristic rules than with discriminant rules. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between
the average predictive accuracies of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, the predictive accuracy is considered higher or lower; 2) if the average learning time is within ±0.1 seconds, the learning time is considered the same.
Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparing the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).
Table 4-10 Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performs better; Same/X means similar performance of both, where AQDT-2 has the advantage if X=A and C4.5 if X=C.
Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, while C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of decision trees learned by C4.5 grows relatively larger as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules and C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is supposed to be much less than that of C4.5. However, on some data sets it takes more time because there are some
situations where there is not enough information to reach a decision, and the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not yet implemented.
To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.
Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class; the white areas represent non-positive coverage.
Figure 4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. All shaded cells marked with a dot indicate false positive errors (AQ15c classifies the cell as positive while it should be negative). Also, all non-shaded marked cells indicate false negative errors (AQ15c classifies the cell as negative while it should be positive).
Figure 4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, one shading indicates portions of the representation space that were classified as positive by both AQ15c and AQDT-2. A second shading marks portions of the representation space that were classified as positive by AQ15c but as negative by AQDT-2. A third shading represents portions of the representation space where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with default settings (i.e., with a 10% generalization threshold).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.
Figure 4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem
Figure 4-34 is similar to Figure 4-33, but with illustration of the false positive and false negative errors: some marked cells indicate portions of the representation space with false positive errors, and others represent portions with false negative errors. Comparing Figures 4-34 and 4-32 shows that more errors occurred because of the over-generalization.
Figure 4-34 A visualization diagram showing testing errors of the AQDT-2 decision tree
Figure 4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%
CHAPTER 5 CONCLUSION
5.1 Summary
This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.
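To make the distinction concrete, the sketch below (not the AQDT-2 data model) shows a decision structure in which two branches share a test node; a decision tree would have to duplicate that node. The attributes and values are invented for illustration:

```python
class Node:
    """A node in a decision structure: an acyclic graph of attribute tests."""
    def __init__(self, attribute=None, branches=None, decision=None):
        self.attribute = attribute      # attribute tested at this node
        self.branches = branches or {}  # attribute value -> child Node
        self.decision = decision        # set on leaves only

def classify(node, example):
    """Follow the conditional order of tests until a leaf decision is reached."""
    while node.decision is None:
        node = node.branches[example[node.attribute]]
    return node.decision

# Both values of x1 can lead to the same x3 test node; a decision tree
# would have to replicate it (and a multi-parent structure exploits this).
shared = Node("x3", {0: Node(decision="neg"), 1: Node(decision="pos")})
root = Node("x1", {
    0: Node("x2", {0: Node(decision="neg"), 1: shared}),
    1: Node("x2", {0: shared, 1: Node(decision="pos")}),
})
```

Here `shared` has two parents, which is exactly what a single-parent structure (i.e., a tree) forbids; symmetric functions benefit most from such sharing.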
The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated on-line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology that, in order to determine a decision structure from examples, it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is
usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.
5.2 Contributions
The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of
the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.
Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.
The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of the source of the rules, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.
Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.
Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.
Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.
Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning, Vol. 15, No. 3, Kluwer Academic Publishers.
Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.
Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth Int. Group.
Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.
Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.
Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.
Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.
Hart, A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," Research and Developments in Expert Systems, M. Bramer (Ed.), Cambridge: Cambridge University Press.
Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, New York: Academic Press.
Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.
Imam, I.F. and Michalski, R.S. (1993b), "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. and Zemankova, M. (Eds.), Kluwer Academic Publishers, MA.
Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.
Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam, I.F. and Michalski, R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.
Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.
Kohavi, R. and Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.
Michalski, R.S. (1973), "AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.
Michalski, R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, Urbana: University of Illinois, March.
Michalski, R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).
Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.
Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer-Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.
Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.
Niblett, T. and Bratko, I. (1986), "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton, Cambridge: Cambridge University Press.
Quinlan, J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan, J.R. (1983), "Learning Efficient Classification Procedures and Their Application to Chess End Games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Los Altos: Morgan Kaufmann.
Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.
Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).
Quinlan, J.R. (1990), "Probabilistic Decision Trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, San Mateo, CA: Morgan Kaufmann Publishers (pp. 63-111), June.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.
Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI-90, Stockholm, August.
Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.
Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.
Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
Vita
Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in
Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He
received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer on a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition.
Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium FLAIRS-95 and the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled Deriving Task-oriented Decision Structures from Decision Rules, was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on machine learning, intelligent agents and adaptive
systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
Congressional Voting Records (1984) 87
4.10 Analysis of the Results 88
CHAPTER 5 CONCLUSIONS 95
5.1 Summary 95
5.2 Contributions 96
REFERENCES 98
VITA 102
viii
LIST OF TABLES
No TITLE Page
2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C4.5 system 15
2-3 The frequency of different attributes values to different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees
provided by different attribute selection criteria from four problems 19
2-9 Comparing the AQDT approach with EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on Quinlan's example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking, domains, and conditions of use of the AQDT-2 criteria 53
3-7 Comparison between decision structures and decision trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-1 problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-1 problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained
by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach
with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90
LIST OF FIGURES
No TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept Voting pattern of
Democratic Representatives 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees)
from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute xl 64
4-6 A decision structure without Xl with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the
assumption of 10 classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different decision-
making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-1 Problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-2 Problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-3 Problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram shows the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram shows testing errors of the AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing
the generalization degree to 1 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M Fahmi Imam PhD
School of Information Technology and Engineering
George Mason University, Fall 1995
Ryszard S Michalski Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from
decision rules. The philosophy behind this research is that it is more appropriate to learn
knowledge and store it in a declarative form, and then, when a decision-making situation occurs,
generate from this knowledge the decision structure that is most suitable for the given decision-
making situation. Learning decision structures from decision rules was first introduced by
Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam
and Michalski (1993a, b).
This approach separates the function of generating a knowledge base from the function of using
the knowledge base for decision-making. The first function focuses on learning accurate,
consistent, and complete concept descriptions expressed in a declarative form. The second function
is performed whenever a new decision-making situation occurs: a task-oriented decision structure
is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted
to solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam,
1994)
The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from
decision rules or examples. Each decision-making situation is defined by a set of parameters that
controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show
that decision structures learned by it usually outperform in terms of accuracy and average size of
the decision structures those learned from examples by other well known systems The results
show also that the system does not work very well with noisy data The system is illustrated and
compared using applications to artificial problems, such as the three MONK's problems (Thrun,
Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also
applied to real-world problems of learning decision structures in the areas of construction
engineering (for determining the best wind bracing design for tall buildings), medical
diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for
learning classification rules for distinguishing between poisonous and non-poisonous
mushrooms), and political data (for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
1.1 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge but also
to use this knowledge for decision-making The main step in the development of systems for
decision-making is the creation of a knowledge structure that characterizes the decision-making
process The form in which knowledge can be easily obtained may however differ from the form
in which it is most readily used for decision-making. It is therefore important to identify the form
of knowledge representation that is most appropriate for learning (e.g., due to ease of its
modification) and the form that is most convenient for decision-making.
A simple and effective tool for describing decision processes is a decision structure which is a
directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to
arrive at a decision about that object The nodes of the structure are assigned individual tests
(which may correspond to a single attribute a function of attributes or a relation) the branches are
assigned possible test outcomes or ranges of outcomes and the leaves are assigned a specific
decision a set of candidate decisions with corresponding probabilities or an undetermined
decision A decision structure reduces to a familiar decision tree when each node is assigned a
single attribute and has at most one parent when the branches from each node are assigned single
values of that attribute and when leaves are assigned single definite decisions Thus the problem
of generating a decision structure is a generalization of the problem of generating a decision tree
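As a rough sketch (the representation and names below are ours, not the dissertation's), such a structure can be modeled in a few lines of Python, with leaves carrying either a single decision or a set of candidate decisions with probabilities:

```python
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Leaf:
    # A leaf holds a single definite decision ({"A1": 1.0}) or a set of
    # candidate decisions with probabilities ({"A1": 0.7, "A2": 0.3}).
    decisions: Dict[str, float]

@dataclass
class Node:
    # An internal node is assigned a test; here, simply an attribute name.
    test: str
    # Branches map test outcomes (or ranges of outcomes) to subtrees.
    branches: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)

def classify(structure: Union[Node, Leaf], obj: Dict[str, str]) -> Dict[str, float]:
    """Apply the tests in order until a leaf is reached."""
    while isinstance(structure, Node):
        structure = structure.branches[obj[structure.test]]
    return structure.decisions
```

In this representation an ordinary decision tree is simply the special case where every node tests a single attribute, every branch carries a single value, and every leaf maps one decision to probability 1.0.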
Decision trees are typically generated from a set of examples of decisions The essential
characteristic of any such method is the attribute selection criterion used for choosing attributes to
be assigned to the nodes of the decision tree being built Such criteria include the entropy
reduction, the gain and the gain ratio (Quinlan, 1979, 83, 86), the gini index of diversity (Breiman
et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
A decision tree/decision structure representation can be an effective tool for describing a decision
process, as long as all the required tests can be performed and the decision-making situations it
was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine
that answers for all symptoms appear in the decision tree). Problems arise when these assumptions
do not hold For example in some situations measuring certain attributes may be difficult or costly
(eg in the doctor-patient example a brain or blood test is needed which is very expensive or the
tools needed are not available). In such situations it is desirable to reformulate the decision
structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the
root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far
away from the root) If an attribute cannot be measured at all it is useful to either modify the
structure so that it does not contain that attribute or-when this is impossible-to indicate
alternative candidate decisions and their probabilities A restructuring is also desirable if there is a
significant change in the frequency of occurrence of different decisions (eg in the doctor-patient
example the doctor may request a decision structure expressed in a specific set of symptoms
biased to classify one or more diseases or specify a certain order of testing)
A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite
difficult This is because a decision structure is a procedural representation that imposes an
evaluation order on the tests In contrast no evaluation order is imposed by a declarative
representation such as a set of decision rules Tests (conditions) of rules can be evaluated in any
order Thus for a given set of rules one can usually build a huge number of logically equivalent
decision structures (trees) which differ in the test ordering Due to the lack of order constraints
a declarative representation (rules) is much easier to modify to adapt to different situations than a
procedural one (a decision structure or a tree) On the other hand to apply decision rules to make a
decision one needs to decide in which order tests are evaluated and thus needs a decision
structure
An attractive solution to these opposite requirements is to acquire and store knowledge in a
declarative form and transform it to a decision structure when it is needed for decision-making
This method allows one to create a decision structure that is most appropriate in a given
decision-making situation. Because the number of decision rules per decision class is usually small (each
rule is a generalization of a set of examples), generating a decision structure from decision rules
can potentially be performed much faster than by generation from training examples. Thus this
process could be done on line without any delay noticeable to the user Such virtual decision
structures are easy to tailor to any given decision-making situation
This approach allows one to generate a decision structure that avoids or delays evaluating an
attribute that is difficult to measure in some decision-making situation or that fits well a particular
frequency distribution of decision classes. In other situations it may be unnecessary to generate a
complete decision structure; it may be sufficient to generate only the part of it that concerns
the decision classes of interest. Thus such an approach has many potential advantages.
This dissertation presents a new system called AQDT-2. The AQDT-2 system generates a task-
oriented decision structure (a decision structure that is adapted to the given decision-making
situation) from decision rules. The decision rules are learned by either the rule learning system AQ15
(Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction
capabilities (Bloedorn et al., 1993).
To associate the decision rules with a given decision-making task, AQDT-2 provides a set of
features, including: 1) enabling the system to include in the decision structure nodes corresponding
to new attributes constructed during the process of learning the decision rules; 2) controlling the
degree of generalization needed during the development of the decision structure; 3) providing four
new criteria for selecting an attribute to be a node in the decision structure, which allow the system to
generate many different but equivalent decision structures from the same set of rules; 4) generating
unknown nodes in situations where there is insufficient information for generating a complete
decision structure; 5) learning decision structures from discriminant rules as well as
characteristic rules; and 6) providing the most likely decision when the decision process stops
due to inability to evaluate an attribute associated with an intermediate node.
To test the methodology of generating decision structures from decision rules, an extensive set of
planned experiments has been designed to test different aspects of the approach. The experiments
include testing different combinations of parameters for each sub-function of the approach,
analyzing the relationship between decision rules and the decision structures learned from them, and
comparing decision trees learned by the AQDT-2 system with the well-known C4.5 (Quinlan, 1993)
system for learning decision trees from examples. Different experiments were designed to examine
the new features of the AQDT approach for learning task-oriented decision structures. The
experiments were applied to artificial domains as well as real-world domains, including MONK-1,
MONK-2, and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al.,
1994), Engineering Design-wind bracings (Arciszewski et al., 1992), Mushrooms, Breast
Cancer (Mangasarian & Wolberg, 1990), and Congressional Voting Records of 1984. The
MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1
requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description
(one that cannot be easily described as a DNF rule using the original attributes). MONK-3 concerns
learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that
classifies two sets of trains (Eastbound and Westbound). The Engineering Design-wind
bracing data involves learning conditions for applying different types of wind bracing for tall
buildings. The Mushrooms data is concerned with learning classification rules for distinguishing
between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning
concept descriptions for recognizing breast cancer. The congressional voting data includes voting
records on different issues. AQDT-2 outperformed C4.5 on average with respect to both
predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy
data or with problems that have many rules covering very few examples.
1.2 The Problem Statement
There are many limitations and problems associated with using decision trees for decision-making.
CHAPTER 2 RELATED RESEARCH
2.1 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on the earlier work by Michalski (1978), which
introduced an algorithm for generating decision trees from decision lists. The method proposed
several attribute selection criteria. These criteria are of increasing power of the main criterion,
the order cost estimate (the nth order cost estimates, n=1, 2, ...). Michalski also analyzed two
specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree,
based on properties extracted from the decision diagram. In order to better explain the method,
it is necessary to define some terms. These terms will be used later in the dissertation.
Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all
examples of class C and does not include any examples of other decision classes.
Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically
disjoint; in other words, for any two rules there exists a condition with the same attribute but
with different values in each rule.
Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules
among all possible covers.
Definition 2-4: A diagram for a given cover is a table constructed graphically by representing
in a two-dimensional space all possible combinations of attribute values, locating on the
diagram all the condition parts of the given rules, and marking them with the action specified
by each rule.
Michalski's algorithm requires construction of a minimal cover. The minimal cover should be
consistent and complete. The method is based on the fact that if there are n decision classes,
any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal
decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has
shown that if only one rule is broken by a selected attribute, then instead of having one leaf
(which could potentially represent this rule or the decision class in the tree), there will have to
be at least two leaves representing this rule in the final decision tree.
The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do
not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can
divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules.
In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] &
[x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1], and x3 breaks three rules: the rule [x4=2]
& [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] &
[x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.
Figure 2-1 An example to illustrate how attributes break rules
One of the criteria defined by Michalski is the first degree cost estimate, which assigns to each
attribute an integer equal to the number of rules broken by that attribute. This criterion is also
called the static cost estimate of an attribute, or the criterion of minimizing added leaves
(MAL).
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the
estimated number of additional nodes in the decision tree being generated over a hypothetical
minimal decision tree When there is a tie between two attributes the attribute to be selected is
the one which breaks smaller rules (rules that cover fewer examples or more specialized
rules) AQDT-2 uses an approximate version of this criterion (the attribute dominance)
Another criterion introduced by Michalski was the DMAL criterion The DMAL criterion
(Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL but
is more complex, because once an attribute is selected as a node in the tree, some rules and/or
parts of the broken rules at each branch are merged into one rule. The DMAL ensures that the
value of the total cost estimate of an attribute is decreased by a value equal to the number of
merged rules minus one.
Example: Learn a decision tree from the following decision table (Table 2-1).
The minimal cover consists of the following rules:
A1 <= [x2=0] v [x1=0][x2=2]; A2 <= [x2=1] v [x1=2][x2=2]; A3 <= [x1=1][x2=2]
The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5
for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it will add two
leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of
the decision tree is x2. Then three branches are attached to the root node, and the decision rules
are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is
generated. For x2=2, another attribute is selected to be a node in the tree. In this case x1 has the
minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.
Figure 2-2 A decision tree learned from the decision table in Table 2-1
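Under one simplified reading of this example (the representation and the domains of x3 and x4 are our assumptions), the static cost estimate can be sketched in Python, treating an attribute as breaking a rule when the rule admits more than one of its values:

```python
# Each rule maps an attribute to the set of values it allows; an attribute
# absent from a rule is unconstrained.  In this simplified sketch, an
# attribute "breaks" a rule when the rule admits more than one of its
# values, so selecting that attribute splits the rule across branches.
def mal(attribute, rules, domains):
    """Static cost estimate (MAL): the number of rules broken by `attribute`."""
    return sum(1 for rule in rules
               if len(rule.get(attribute, domains[attribute])) > 1)

# The minimal cover above, one entry per disjunct (x3/x4 domains assumed):
domains = {"x1": {0, 1, 2}, "x2": {0, 1, 2}, "x3": {0, 1}, "x4": {0, 1}}
rules = [{"x2": {0}},                  # A1 <= [x2=0]
         {"x1": {0}, "x2": {2}},       # A1 <= [x1=0][x2=2]
         {"x2": {1}},                  # A2 <= [x2=1]
         {"x1": {2}, "x2": {2}},       # A2 <= [x1=2][x2=2]
         {"x1": {1}, "x2": {2}}]       # A3 <= [x1=1][x2=2]

print({a: mal(a, rules, domains) for a in domains})
# → {'x1': 2, 'x2': 0, 'x3': 5, 'x4': 5}
```

These counts match the MAL values quoted in the example; x2, which breaks no rules, is chosen as the root, as in Figure 2-2.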
2.2 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating a decision tree that classifies a set
of examples according to the decision classes they belong to. The essential aspect of any
inductive decision tree method is the attribute selection criterion. The attribute selection
criterion measures how good the attributes are for discriminating among the given set of
decision classes. The best attribute according to the selection criterion is chosen to be assigned
to a node in the tree. The first algorithm for generating decision trees from examples was
proposed by Hunt, Marin, and Stone (1966). Hunt's algorithm uses a divide-and-conquer
approach for building decision trees. This algorithm has been subsequently modified by
Quinlan (1979) and applied by many researchers to a variety of learning problems.
Attribute selection criteria can be divided into three categories: logic-based,
information-based, and statistics-based. The logic-based criteria for selecting attributes
use logical relationships between the attributes and the decision classes to determine the best
attribute to be a node in the decision tree, such as the MAL (minimizing added leaves) criterion
(Michalski, 1978), which uses conjunction and disjunction operators. The information-based
criteria are based on the information theory These criteria measure the information conveyed
by dividing the training examples into subsets. Examples of such criteria include the
information measure IM, the entropy reduction measure, and the gain criteria (Quinlan, 1979,
83), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and
others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The
statistics-based criteria measure the correlation between the decision classes and the attributes.
These criteria use statistical distributions for determining whether or not there is a correlation.
The attribute with the highest correlation is selected to be a node in the tree. Examples of
statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984;
Mingers, 1989a).
Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the
method of learning decision trees to also handle data with noise (by pruning). Handling noise
extended the process of learning decision trees to include the creation of an initial complete
decision tree, and tree pruning, which is done by removing subtrees with small statistical validity
and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used
for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994).
Pruning decision trees improves their simplicity but reduces their predictive accuracy on the
training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-
value problem by exploring probabilities of an example belonging to different classes.
The rest of this section includes a brief description of the attribute selection criterion used by the
C4.5 learning system (Quinlan, 1993). C4.5 uses an information-based criterion for selecting
an attribute to be a node in the tree. The section also includes a brief description of the
Chi-square method for attribute selection (Mingers, 1989a), a statistics-based
method for selecting an attribute to be a node in the tree.
2.2.1 Building Decision Trees Using Information-based Criteria
This section presents a description of the inductive decision tree learning system C4.5. The
C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs
for learning decision trees from examples.
Learning decision trees from examples requires a collection of examples. Each example is
represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning
program that induces classification decision trees from a set of given examples. The C4.5
learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on
Hunt's method for constructing decision trees from a set of cases (Hunt, Marin & Stone, 1966).
The C4.5 system uses an attribute selection criterion called the Gain Ratio. This criterion
calculates the gain in classifying information based on the residual information needed to
classify cases in a set of training examples and the information yielded by the test based on the
relative frequencies of the possible outcomes (decision classes) The gain ratio criterion is
based on an earlier criterion used by ID3 called the Gain Criterion The Gain Criterion uses
the frequency of each decision class in the given set of training examples
Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and partitions the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.
The Gain Criterion: The gain criterion is based on information theory: the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem X1, ..., Xn are the given attributes and C1, ..., Cm are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:
freq(Ci, S) = number of examples in S belonging to Ci    (2-1)
Suppose that |S| is the total number of examples in S; the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.
The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2 (freq(Ci, S) / |S|) bits.
The expected information from such a message stating class membership is given by

info(S) = - Σi (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|) bits    (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.
Suppose that we select an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets T1, ..., Tk, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, infoX(T), is the sum over all subsets of the information conveyed by each subset weighted by its probability:

infoX(T) = Σi (|Ti| / |T|) info(Ti)    (2-3)
The information gained by partitioning the training examples T into subsets using the attribute X is given by

gain(X) = info(T) - infoX(T)    (2-4)

The attribute selected is the one with the maximum gain value.
The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains an attribute such as social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.
By analogy to equation 2-2, the expected information generated by dividing T into n subsets is determined by

split info(T) = - Σi (|Ti| / |T|) log2 (|Ti| / |T|)    (2-5)
The gain ratio is given by

gain ratio(X) = gain(X) / split info(X)    (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.
Example: Consider the following example presented by Quinlan (1993). Table 2-2 shows the set of training examples.
First, determine the amount of information gained by selecting the attribute "outlook" to be the root of the decision tree. This attribute divides the training examples into three subsets: "sunny", with five examples, two of which belong to the class "Play"; "overcast", with four examples, all of which belong to the class "Play"; and "rain", with five examples, three of which belong to the class "Play". To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class "Play" and five belong to the class "Don't Play".

info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.94 bits
When using "outlook" to divide the training examples, the information becomes

info_outlook(T) = 5/14 (-2/5 log2 (2/5) - 3/5 log2 (3/5))
               + 4/14 (-4/4 log2 (4/4) - 0/4 log2 (0/4))
               + 5/14 (-3/5 log2 (3/5) - 2/5 log2 (2/5)) = 0.694 bits

By substituting in equation 2-4, the gain of information resulting from using the attribute "outlook" to split the training examples equals 0.246. The gain of information for "windy" is 0.048.
Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for "outlook" is determined as follows:

split info(T) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits

The gain ratio for "outlook" = 0.246 / 1.577 = 0.156.
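These calculations can be checked with a short script. The sketch below, in plain Python, implements equations 2-2 through 2-6 using the class counts from the worked example (variable names are illustrative):

```python
import math

def entropy(class_counts):
    """info(S) from equation 2-2: expected bits needed to identify
    the class of an example, given counts of examples per class."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

# For each value of "outlook", the counts of (Play, Don't Play) examples.
outlook_partition = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}
total = 14

info_T = entropy((9, 5))                      # equation 2-2, ~0.940 bits
info_X = sum((sum(c) / total) * entropy(c)    # equation 2-3
             for c in outlook_partition.values())
gain = info_T - info_X                        # equation 2-4, ~0.246
split_info = entropy(tuple(sum(c) for c in outlook_partition.values()))  # eq. 2-5
gain_ratio = gain / split_info                # equation 2-6, ~0.156
```

Running this reproduces the numbers of the worked example up to rounding.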
Figure 2-3: A decision tree learned using the gain criterion for selecting attributes
The C4.5 system handles discrete as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.
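C4.5's actual threshold selection includes further refinements, but the core idea can be sketched as follows, assuming candidate thresholds are taken as midpoints between adjacent distinct values and scored by information gain (the function name and data layout are illustrative):

```python
import math

def entropy(labels):
    """Expected bits to identify the class of an example in `labels`."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Sketch of binary splitting of a continuous attribute: try each
    midpoint between adjacent distinct values and keep the one with
    maximum information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # same value: no threshold here
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - info > best[1]:
            best = (t, base - info)
    return best[0]
```

For instance, with temperatures [64, 65, 68, 69, 70, 71] and classes ["P", "P", "P", "P", "N", "N"], the midpoint 69.5 separates the classes perfectly and is selected.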
Tree pruning in C4.5 is a process of replacing subtrees of small classification validity by leaves. The C4.5 system uses the Laplace ratio for estimating the error rate of different subtrees. This ratio is defined as (e+1)/(n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
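As a one-line sketch (the helper name is illustrative), the Laplace ratio smooths the observed error rate so that leaves with few examples are not judged error-free:

```python
def laplace_error(n, e):
    """Laplace ratio described above: estimated error rate at a leaf
    with n training examples, e of them misclassified."""
    return (e + 1) / (n + 2)

# A leaf with no examples gets the uninformative estimate 0.5, while a
# leaf with 10 correctly classified examples gets (0+1)/(10+2) = 1/12.
```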
2.2.2 Building Decision Trees Using Statistics-based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses the Chi-square statistic to measure the association between two attributes. When building decision trees, the method is implemented so that it determines the association between each attribute and the decision classes. The attribute selected is the one with the greatest value.
To determine the Chi-square value for an attribute, let aij be the number of examples in class number i where the attribute A takes value number j. In other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by

Chi-square (A) = Σ(i=1..n) Σ(j=1..m) [ (aij - Eij)² / Eij ]    (2-7)
where n is the number of decision classes and m is the number of values of the given attribute. Also,

Eij = (TCi × TVj) / T    (2-8)

where TCi and TVj are the total number of examples belonging to decision class Ci and the total number of examples where the attribute A takes value Vj, respectively, and T is the total number of examples.
Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequency of different combinations of values between the decision classes and both the "Outlook" and the "Windy" attributes. Table 2-4 shows the expected values, computed from TCi and TVj, of the frequencies in Table 2-3 for the different attribute values and decision classes.
To determine the association between the decision classes and the attributes "Windy" and "Outlook", the observed Chi-square values are

Chi-square (Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/5.1] + [(2-2.9)²/2.9]
                          = 0.21 + 0.39 + 0.25 + 0.25 = 1.1

Chi-square (Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2] + [(3-1.8)²/1.8]
                            + [(0-1.4)²/1.4] + [(2-1.8)²/1.8]
                            = 0.45 + 0.75 + 0.2 + 0.8 + 1.4 + 0.02 = 3.62
Applying the same method to the other attributes, the results favor the attribute "Outlook". Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset.
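The statistic of equations 2-7 and 2-8 can be computed in a few lines of Python. The sketch below uses exact expected values, so the totals differ slightly from the rounded hand computation above, but the ranking of the attributes is the same:

```python
def chi_square(table):
    """Chi-square statistic (equations 2-7 and 2-8) for a contingency
    table, where table[i][j] = number of examples of class i that take
    attribute value j."""
    T = sum(sum(row) for row in table)
    class_totals = [sum(row) for row in table]                # TCi
    value_totals = [sum(col) for col in zip(*table)]          # TVj
    chi = 0.0
    for i, row in enumerate(table):
        for j, a_ij in enumerate(row):
            e_ij = class_totals[i] * value_totals[j] / T      # equation 2-8
            chi += (a_ij - e_ij) ** 2 / e_ij                  # equation 2-7
    return chi

# Rows: classes (Play, Don't Play); columns: attribute values.
windy   = [[3, 6], [3, 2]]        # values: true, false
outlook = [[2, 4, 3], [3, 0, 2]]  # values: sunny, overcast, rain
# Outlook shows the stronger association, so it is selected first.
```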
Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5: Attribute selection criteria and their basic evaluation measures
Info Measure (IM), Gain, G-statistic, and Gain Ratio:
    Entropy(S) = - Σi (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|)
    G-statistic = 2N × IM, where N is the number of examples

Chi-square:
    Chi-square (A, B) = Σi Σj [ (aij - Eij)² / Eij ]
2.2.3 Analysis of Attribute Selection Criteria
This subsection briefly introduces the analysis of different selection criteria done by Mingers (1989a). Mingers compared six attribute selection criteria used in decision tree programs: the Information Measure (IM), Chi-square, G statistic, Gini index of diversity, Marshall correction, and Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples that may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. When using the Chi-square criterion, a zero cell adds the maximum association between any two attributes, because the Chi-square contribution of a zero cell is the expected value of that cell.
Now let us examine results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees under eleven different criteria. In the final results, he compared the total number of nodes and the total error rate produced by each criterion over all the given problems. Table 2-8 shows the final results for five selected criteria only.
Table 2-8: Results comparing the total accuracy and size of decision trees produced by different attribute selection criteria on four data sets
This experiment was performed on four real-world data sets, concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details, see Mingers (1989a).
2.3 Learning Decision Structures
Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas of Imam and Michalski (1993b).
In the first approach, Brian Gaines introduced a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary conclusion with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes containing only rules represent conditions common to all of their children.
The main disadvantages of this approach are that it requires discriminant rules to build such a decision structure, and that such a structure is more complex than the traditional decision trees used for decision-making.
Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is "Safe", except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is "Lost", except if x6=1, when it is "Safe", except if x7=1, when it is "Lost".
The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path. In other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once; however, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.
Safe <- [x1=2]
Safe <- [x2=2]
Safe <- [x3=2]
Safe <- [x4=1] & [x5=2]
Safe <- [x4=1] & [x5=3]
Safe <- [x6=1] & [x7=2]
Safe <- [x6=1] & [x7=3]
Safe <- [x4=2] & [x5=2]
Safe <- [x4=2] & [x5=3]

Lost <- [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to a combination of that attribute's values and decision classes. For each subset the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1) and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains the examples where A takes value 0 and belong to class C0, or A takes value 1 and belong to class C1; the second subset contains the examples where A takes value 0 and belong to class C1, or A takes value 1 and belong to class C0. The number of nodes at the first level (after the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before being reduced to one.
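The two-value, two-class partition just described can be sketched as follows. This is a simplified illustration of the grouping step, not Kohavi's implementation; the function name and the encoding of examples as (value, class) pairs are hypothetical:

```python
def partition_two_values(examples):
    """Split examples of a binary attribute A into the two subsets
    described in the text: one consistent with the mapping
    {A=0 -> C0, A=1 -> C1}, the other with {A=0 -> C1, A=1 -> C0}."""
    straight, crossed = [], []
    for value, cls in examples:        # each example: (value of A, its class)
        if (value, cls) in ((0, "C0"), (1, "C1")):
            straight.append((value, cls))
        else:
            crossed.append((value, cls))
    return straight, crossed
```

Each of the two returned subsets would then seed one node of the new level.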
The reader can easily identify some major disadvantages of this approach: the average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., strong patterns) or logical relationship in the data; the time needed to learn such a decision structure is very high compared to systems for learning decision trees from examples; and finally, it could be better to search for an attribute that reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system, which can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection which minimizes the width of the penultimate level of the graph.
Table 2-9 shows a comparison between the proposed approach and these two approaches. The EDAG and HOODG systems are unreleased prototype systems. Among the differences, decision structures produced by the proposed approach are easy to understand, EDAGs are difficult to read, and HOODG decision graphs are easy to understand.
CHAPTER 3 DESCRIPTION OF THE APPROACH
3.1 General Methodology
In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized for the given decision-making situation.
The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.
The Decision-making Task
Given:
- A set of decision rules in conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.
The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of declarative knowledge are that they do not impose any order on the evaluation of the attributes, and, due to the lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to different decision-making tasks (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on line.
Such "virtual" decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once; they can then be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows the architecture of the proposed methodology.
Figure 3-1: Architecture of the AQDT approach (learning knowledge from a database, and the decision-making process)
It is assumed that the database is not static but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values; some attribute-values may be missing or unknown. A new decision structure is generated to suit the given decision-making problem, and the learned decision structure associates the new set of cases with the proper decisions.
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs
The decision rules are generated from examples by an AQ-type inductive learning system, specifically AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions, using the STAR methodology (Michalski, 1983). The simplest algorithm based on this methodology, called AQ, starts with a "seed" example of a given decision class and generates a set of the most general conjunctive descriptions of the seed (alternative decision rules for the seed example). Such a set is called the "star" of the seed example. The algorithm selects from the star a description that optimizes a criterion reflecting the needs of the problem domain. If the criterion is not defined, the program uses a default criterion that selects the description covering the largest number of positive examples (to minimize the total number of rules needed) and, as a second priority, the one involving the smallest number of attributes (to minimize the number of attributes needed for arriving at a decision).
If the selected description does not cover all the examples of a given decision class, a new seed is selected from the uncovered examples, and the process continues until a complete class description is generated. The algorithm can work with few or many examples and can optimize the description according to a variety of easily modifiable hypothesis quality criteria.
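The covering loop just described can be sketched as follows. The helpers `generate_star` and `choose_best` are hypothetical placeholders for AQ15's star generation and hypothesis-quality criterion, which are far more elaborate:

```python
def aq_cover(positives, negatives, generate_star, choose_best):
    """Sketch of the AQ covering loop: pick a seed, generate its star,
    select the best rule, remove the covered positives, and repeat
    until every positive example is covered."""
    uncovered = list(positives)
    ruleset = []
    while uncovered:
        seed = uncovered[0]                    # a seed example of the class
        star = generate_star(seed, negatives)  # alternative rules covering the seed
        rule = choose_best(star, uncovered)    # apply the preference criterion
        ruleset.append(rule)
        uncovered = [p for p in uncovered if not rule(p)]  # rules act as predicates
    return ruleset

# Toy illustration: rules are predicates; the "star" is a fixed candidate set,
# and the default criterion (maximal positive coverage) picks among them.
rules = aq_cover(
    positives=[1, 2, 3], negatives=[10],
    generate_star=lambda seed, neg: [lambda x: x < 5],
    choose_best=lambda star, unc: max(star, key=lambda r: sum(map(r, unc))))
```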
The learned descriptions are represented as a set of decision rules expressed in an attributional logic calculus, called variable-valued logic 1, or VL1 (Michalski, 1973). A distinctive feature of this representation is that it employs, in addition to standard logic operators, the internal disjunction operator (a disjunction of values of the same attribute in a condition) and the range operator (to express conditions involving a range of discrete or continuous values). These operators help to simplify rules involving multivalued discrete attributes; the second operator is also used for creating logical expressions involving continuous attributes.
AQ15 can generate decision rules that represent either characteristic or discriminant concept descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic description states properties that are true for all objects in the concept. The simplest characteristic concept description is in the form of a single conjunctive rule (in general, it can be a set of such rules). The most desirable is the maximal characteristic description, that is, a rule with the longest condition part, stating as many common properties of objects of the given class as can be determined. A discriminant description states properties that discriminate a given concept from a fixed set of other concepts. The most desirable is the minimal discriminant description, that is, a rule with the shortest condition part. For example, to distinguish a given set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top". A characteristic description of the tables would also include properties such as "have four legs", "have no back", "have four corners", etc. Discriminant descriptions are usually much shorter than characteristic descriptions.
Another option provided in AQ15 controls the relationship among the generated descriptions (rulesets, or "covers") of different decision classes. In the "IC" (Intersecting Covers) mode, rulesets of different classes may logically intersect over areas of the description space in which there are no training examples. In the "DC" (Disjoint Covers) mode, descriptions of different classes are logically disjoint. DC-mode descriptions are usually more complex, both in the number of rules and in the number of conditions. There is also a "DL" mode (a Decision List mode, also called VL mode, for variable-valued logic mode), in which the program generates rulesets that are linearly ordered. To assign a decision to an example using such rulesets, the program evaluates them in order: if ruleset i is satisfied by the example, then the decision is made; otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes, rulesets can be evaluated in any order.
Alternatively, the system can use rules from the AQ17-DCI program for Data-driven Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for generating additional attributes. These attributes are various logical or mathematical combinations of the original attributes. The program generates a large number of potential new attributes and selects from them the most promising ones, based on an attribute quality criterion.
To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form expression) describes a voting record of Democratic Representatives in the US Congress. Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational statement; for example, the condition [State = northeast v northwest] states that the attribute "State" (of the Representative) should take the value "northeast" or "northwest" to satisfy the condition.
R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]
Figure 3-2: A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives"
The above rules were generated from examples of the voting records. For illustration, below is an example of a voting record of a Democratic Representative:

Draft registration = no, Ban on aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State from = northeast, State population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler Corp. = not registered
By expressing the elementary statements in the example as conditions and linking the conditions by conjunction, examples can be re-expressed as decision rules. Thus, decision rules and examples formally differ only in their degree of generality.
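A rule of this form can be evaluated against an example in a few lines of Python. The dictionary encoding below, mapping each attribute in a condition to its set of allowed values (the internal disjunction), is an illustrative assumption, not AQ15's internal representation:

```python
def satisfies(example, rule):
    """An example satisfies an attributional rule if, for every condition,
    the example's value for that attribute is among the allowed values."""
    return all(example.get(attr) in allowed for attr, allowed in rule.items())

# R3 from Figure 3-2: [Chrysler = yes v not registered] & [Income = low]
r3 = {"Chrysler": {"yes", "not registered"}, "Income": {"low"}}
voter = {"Chrysler": "not registered", "Income": "low", "Draft": "no"}
```

Here `satisfies(voter, r3)` holds because each condition admits the voter's value; attributes not mentioned in the rule are unconstrained.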
3.3 Generating Decision Structures/Trees from Decision Rules
This section describes the AQDT-1 system for learning decision structures (trees) from decision rules (Imam & Michalski, 1993a, b). It also describes the AQDT-2 method for learning task-oriented decision structures from decision rules; finally, the methodology is illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning, due to their simplicity. Decision trees built this way can be quite efficient, as long as they are used in decision-making situations for which they are optimized and these situations remain relatively stable. Problems arise when these situations change significantly and the assumptions under which the tree was built no longer hold. For example, in some situations it may be difficult to determine the value of the attribute assigned to some node. One would like to avoid measuring this attribute and still be able to classify the example, if this is potentially possible (Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also desirable if there is a significant change in the frequency of occurrence of examples from different classes. Restructuring a decision tree to suit the above requirements is, however, difficult. The reason is that a decision tree is a form of decision structure representation that imposes constraints on the evaluation order of the attributes which are not logically necessary.
One problem in developing a method for generating decision structures from decision rules is to design an attribute selection criterion that is based on the properties of the rules rather than of the training examples. A decision rule normally describes a number of possible examples, only some of which are examples that have actually been observed, i.e., training examples. An attribute selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based on counting the numbers of training examples covered by each attribute-value and the frequency of decision classes in the training examples, as is done in learning decision trees from examples, because the training examples are assumed to be unavailable.
Another problem in learning decision trees from decision rules stems from the fact that decision rules constitute a more powerful knowledge representation than decision trees. They can directly represent a description in an arbitrary disjunctive normal form, while decision trees can directly represent only descriptions in disjoint disjunctive normal form, in which all conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary decision rules into a decision tree, one faces the additional problem of handling logically intersecting rules.
The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2 system is based on earlier work by Michalski (1978), which introduced a general method for generating decision trees from decision rules. The method aimed at producing decision trees with the minimum number of nodes or the minimum cost (where the cost was defined as the total cost of classifying unknown examples, given the cost of measuring individual attributes and the expected probability distribution of examples of different decision classes). More explanation is provided in the following section.
3.3.1 The AQDT-2 Attribute Selection Method
This section describes the AQDT-2 method for building a decision structure from decision rules. The method for building a single-parent decision structure is similar to that used in standard methods of building a decision tree from examples. The major difference is that it assigns tests (attributes) to the nodes using criteria based on the properties of the decision rules (including statistics about the examples covered by each rule, when the rules are learned from examples), rather than on statistics characterizing the frequency of training examples per decision class, per attribute-value, or per conjunction of both. Other differences are that the branches may be assigned an internal disjunction of values (not only a single value, as in a typical decision tree), and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be attributes, or names standing for logical or mathematical expressions that involve several attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably (to distinguish between an attribute and a name standing for an expression, the latter is called a "constructed attribute").
At each step, the method chooses from the available set of tests the test with the highest utility (see below) for the given set of decision rules. This test is assigned to the node. The branches stemming from this node are assigned test values or disjoint groups of values (in the form of a logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each branch is associated with a reduced set of rules, determined by removing conditions in which the selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset indicate the same decision class, a leaf node is created and assigned this decision class. The process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further because some attribute is declared unavailable (infinite cost), then the leaf is assigned a set of candidate decisions with associated probabilities (see Section 4.2).
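The expansion loop can be sketched recursively. This is a simplified illustration: grouping of branch values, internal disjunctions on branches, and probabilistic leaves are omitted, and `select_test` is a hypothetical stand-in for the utility criteria of AQDT-2:

```python
def build_structure(rules, select_test):
    """Sketch of the node-expansion loop: each rule is (conditions, decision),
    with conditions a dict {attribute: set of allowed values}. Returns a
    decision class (leaf) or an (attribute, {value: subtree}) node."""
    decisions = {d for _, d in rules}
    if len(decisions) == 1:                    # all remaining rules agree
        return decisions.pop()
    attr = select_test(rules)                  # the highest-utility test
    values = set().union(*(c.get(attr, set()) for c, _ in rules))
    branches = {}
    for v in values:
        # keep rules consistent with attr = v; drop the used condition
        reduced = [({a: s for a, s in c.items() if a != attr}, d)
                   for c, d in rules if v in c.get(attr, {v})]
        branches[v] = build_structure(reduced, select_test)
    return (attr, branches)

# Toy ruleset: class A when x = 1, class B when x = 2.
tree = build_structure([({"x": {1}}, "A"), ({"x": {2}}, "B")],
                       select_test=lambda rules: "x")
```

Note that a rule with no condition on the selected attribute is kept on every branch, mirroring the fact that it applies regardless of that attribute's value.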
The test (attribute) utility is a combination of one or more of the following elementary criteria: 1) cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which captures the effectiveness of the test in discriminating among decision rules for different decision classes; 3) importance, which determines the importance of a test in the rules; 4) value distribution, which characterizes the distribution of the test importance over the test's values; and 5) dominance, which measures the test's presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.
Disjointness: The disjointness of a test is defined as the sum of the class disjointnesses, that is, the disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and decision rule sets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote the sets of values (outcomes) of A that are present in the rule sets for classes C1, C2, ..., Cm, respectively. If a ruleset for some class, say Ck, contains a rule that does not involve test A, then Vk is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is defined by:

D(A, Ci, Cj) = 0, if Vi = Vj
               1, if Vi ⊂ Vj or Vi ⊃ Vj                                (3-1)
               2, if Vi ∩ Vj ≠ φ, and Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj
               3, if Vi ∩ Vj = φ

where φ denotes the empty set. Note that exchanging the second and third conditions of equation (3-1) may seem to give an improved criterion. However, it would not clearly distinguish between the two cases (i.e., for both situations the disjointness would be the same). The current equation is better because it gives higher scores to attributes that classify different subsets of the two decision classes than to attributes that classify only a subset of one decision class.
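Equation (3-1) can be sketched directly as a small value-set comparison. This is a minimal illustration, not the AQDT-2 implementation; the function name and the use of Python sets for the value sets Vi, Vj are my own assumptions:

```python
def degree_of_disjointness(vi: set, vj: set) -> int:
    """Equation (3-1): how well test A separates class Ci (values vi) from Cj (values vj)."""
    if vi == vj:
        return 0          # identical value sets: no discrimination
    if vi < vj or vi > vj:
        return 1          # one set strictly contains the other
    if vi & vj:
        return 2          # partial overlap, neither set subsumes the other
    return 3              # disjoint value sets: perfect discrimination
```

For example, `degree_of_disjointness({1}, {2})` returns 3 (the disjoint case), while identical value sets score 0.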
Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the sum of the degrees of class disjointness over all decision classes:

Disjointness(A) = Σ(i=1..m) D(A, Ci), where D(A, Ci) = Σ(j=1..m, j≠i) D(A, Ci, Cj)    (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test values. If two tests have the same disjointness value, the attribute selected is the one with the smaller number of values.
Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree is defined as the average number of tests (attributes) to be examined, from the root of the tree to any leaf node, in order to reach a decision.
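The ANT measure amounts to the mean root-to-leaf path length. The sketch below uses a nested-tuple tree encoding of my own (a leaf is any non-tuple value; an internal node is a pair of an attribute name and a list of subtrees):

```python
def ant(tree) -> float:
    """Average Number of Tests: mean number of attribute nodes on root-to-leaf paths."""
    depths = []

    def walk(node, depth):
        if isinstance(node, tuple):          # internal node: (attribute, [subtrees])
            _, children = node
            for child in children:
                walk(child, depth + 1)       # each internal node costs one test
        else:
            depths.append(depth)             # leaf: record path length

    walk(tree, 0)
    return sum(depths) / len(depths)
```

Under this encoding, a tree with one immediate leaf and one two-leaf subtree (the "subset" case of Theorem 1 below) gives (1 + 2 + 2) / 3 = 5/3, matching the value quoted in the proof.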
Definition: A decision structure is a one-node-per-level decision structure if at each level there is only one node and zero or more leaves. Such a decision structure can be generated by combining together all branches whose associated sets of decision rules belong to more than one decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two decision classes. The disjointness criterion ranks first the attributes that add the minimum number of tests to the decision tree.

Proof: Consider all possible distributions of an attribute's values within two decision classes Ci and Cj. There are three cases: 1) subset (equivalently, superset); 2) non-empty intersection, but not a subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the same set of values in both classes is a trivial one). Assume that branches leading to one subset with the same decision class are combined into one branch. In the first case, there will be only two branches. The first leads to a leaf node, and the other leads to an intermediate node where another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three branches are created. Two branches lead to leaf nodes, where all values at each branch belong to only one (and a different) decision class. The third branch leads to an intermediate node where another attribute must be selected that further classifies the decision classes. The minimum ANT in this case is 6/4. In the third case, only two branches are generated, each leading to a leaf node with a different decision class. In this case, the minimum ANT is 1.
Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that if more than one attribute-value occurs on branches leading to leaves belonging to one decision class, those branches are combined into one branch in the decision structure. The symbol "1" means that another attribute is needed to classify the two decision classes; in such cases there will be at least two additional paths.
[Figure panels, left to right: D(A, Ci) = 0, D(A, Cj) = 1; D(A, Ci) = 2, D(A, Cj) = 2; D(A, Ci) = 3, D(A, Cj) = 3]
Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is shown in Figure 3-4. It is clear that in the case of two decision classes, the disjointness criterion ranks highly the attributes that reduce the average number of tests required for decision-making. The theorem can be proved similarly in the general case.
[Figure panels, left to right: ANT = 3/2; ANT = 5/3; ANT = 1. The symbol "1" means at least one attribute is needed to complete the decision tree.]
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute for classifying the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes, A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship between any two decision classes, this means that there are more decision classes where D(A, Ci) < D(B, Ci) than decision classes where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness for both attributes). Hence, there are more pairs of decision classes where D(A, Ci, Cj) < D(B, Ci, Cj) than pairs where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci, Cj), then B classifies the decision classes better than A.

For each pair of decision classes Ci and Cj, the possible values of the pairwise disjointness of any attribute are 0, 2, 4, or 6. For all positive values D(B) = 2, 4, or 6, it is clear that attribute B has a smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) < D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A. Similarly, B is better at classifying more pairs of decision classes than A. This implies that B is a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained rules, each test is assigned a score that represents the total number of training examples covered by the rules involving this test. Decision rules learned by an AQ learning program are accompanied by information on their strength. Rule strength is characterized by the t-weight and the u-weight. The t-weight (total weight) of a rule for some class is the number of examples of that class covered by the rule. The importance score of a test is the aggregation of the t-weights of all rules that contain that test in their condition part. Given a set of decision rules for m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated with class Ci denoted by ri, the importance score is defined as follows:
Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by

IS(Aj) = Σ(i=1..m) IS(Aj, Ci)                                          (3-3.1)

where

IS(Aj, Ci) = Σ(k=1..ri) Rik(Aj)                                        (3-3.2)

and Rik(Aj), the weight of test Aj in the rule Rik of class Ci, is given by

Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0 otherwise      (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
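Equations (3-3.1) to (3-4) can be sketched as a single aggregation over the rules. The encoding is an assumption of mine (not the AQ15 rule format): each rule is a pair of its t-weight and the set of attribute names appearing in its condition part:

```python
def importance_score(attr, rules_by_class):
    """IS(attr): sum of t-weights of all rules whose condition mentions attr."""
    return sum(
        t_weight
        for rules in rules_by_class.values()   # iterate over decision classes Ci
        for (t_weight, attrs) in rules         # rules Rik of one class
        if attr in attrs                       # rule contributes its t-weight
    )
```

For instance, with `{'T1': [(5, {'x1', 'x2'}), (3, {'x1', 'x3'})], 'T2': [(2, {'x2'})]}`, the score of `x1` is 5 + 3 = 8.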
The importance score method has been separately compared, as a feature selection method, with a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method produced equal or higher accuracy on three real-world problems than that reported for the GA method, while selecting fewer attributes. In addition, the IS method was significantly faster than the GA.
Value distribution: The third elementary criterion, value distribution, concerns the number of legal values of tests. Given two tests with equal importance scores, this criterion prefers the test with the smaller number of legal values. Experiments have shown that this criterion is especially useful when using discriminant decision rules.

Definition 3-4: The value distribution, VD(Aj), of a test Aj is defined by

VD(Aj) = IS(Aj) / vj                                                   (3-5)

where vj is the number of legal values of Aj.
Dominance: The fourth elementary criterion, dominance, prefers tests that appear in a large number of rules, as this indicates their high relevance for discriminating among the rulesets of the given decision classes. Since some conditions in the rules have values linked by internal disjunction, counting such rules directly would not properly reflect their relevance. Therefore, for computing the dominance, the rules are counted as if they were converted to rules that do not have internal disjunction. Such a conversion is done by multiplying out the condition parts of the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is multiplied out to two condition parts: [x3=1] & [x4=1] and [x3=3] & [x4=1].
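The multiplication step described above is a Cartesian product over the value lists of each condition. A minimal sketch, assuming a condition part is modeled as a mapping from attribute to its disjoined values (an encoding of my own):

```python
from itertools import product

def multiply_out(condition: dict) -> list:
    """Expand internal disjunctions: [x3=1 v 3] & [x4=1] -> two plain condition parts."""
    attrs = list(condition)
    return [dict(zip(attrs, combo))                      # one plain rule per combination
            for combo in product(*(condition[a] for a in attrs))]
```

Applied to `{'x3': [1, 3], 'x4': [1]}`, this yields the two condition parts `{'x3': 1, 'x4': 1}` and `{'x3': 3, 'x4': 1}` from the example in the text; the dominance criterion would then count two rules rather than one.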
The above criteria are combined into one general test measure using the lexicographic evaluation functional with tolerances (LEF) (Michalski, 1973). A LEF is a list of some or all of the above elementary criteria, each associated with a "tolerance threshold" expressed as a percentage. The criteria are applied to tests in the order defined by the LEF. A test passes to the next criterion only if it scores on the previous criterion within the range defined by the tolerance (measured from the top value). The default LEF is:

<Cost, t1; Disjointness, t2; Importance, t3; Value distribution, t4; Dominance, t5>    (3-6)

where t1, t2, t3, t4, and t5 are tolerance thresholds (in percentages); their default values are 0. The default value of the cost of each test is 1.
The above LEF ranks attributes as follows. First, attributes are evaluated on the basis of their cost. If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two or more attributes still share the same top score, or their scores differ by less than the assumed tolerance threshold t2, the method evaluates these attributes using the next criterion (importance). If again two or more attributes share the same top score, or their scores differ by less than the tolerance threshold t3, then the value distribution (normalized IS) criterion is used, and then, similarly, the dominance criterion. If there is still a tie, the method selects among the tied attributes randomly.
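The ranking procedure just described can be sketched as a filtering loop: each criterion keeps only the candidates whose score lies within the tolerance of the best score, and the survivors pass to the next criterion. The representation of a criterion as a (score function, tolerance percent, maximize flag) triple is my own assumption:

```python
def lef_select(attributes, criteria):
    """Lexicographic evaluation with tolerances: filter candidates criterion by criterion."""
    candidates = list(attributes)
    for score, tol, maximize in criteria:
        scores = {a: score(a) for a in candidates}
        best = max(scores.values()) if maximize else min(scores.values())
        margin = abs(best) * tol / 100.0            # tolerance measured from the top value
        candidates = [a for a in candidates
                      if abs(scores[a] - best) <= margin]
        if len(candidates) == 1:                    # unique winner: stop early
            break
    return candidates[0]                            # remaining ties broken arbitrarily
```

With cost as the first criterion (minimized, tolerance 0) and disjointness second (maximized), only the cheapest attributes survive the first pass, exactly as in the default LEF above.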
If there is a non-uniform frequency distribution of examples of different classes, then the selection criterion uses a modified definition of disjointness. Namely, the previously defined disjointness for each class is multiplied by the frequency of the class occurrence. The class occurrence is the expected number of future examples that are to be classified into a given class:

Disjointness(A) = Σ(i=1..m) D(A, Ci) * Frq(Ci)                         (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, assumed to be given by the user. The attribute ranking criterion in this case is defined by the LEF:

<Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other elementary criteria are treated the same way as in the default version.
3.3.2 The AQDT-2 algorithm

The AQDT-2 algorithm constructs a decision structure from decision rules by recursively selecting, at each step, the best test according to the ranking criteria described above and assigning it to a new node. The process stops when the algorithm creates terminal branches that are assigned decision classes. To facilitate this process, the system creates a special data structure for each concept description (ruleset). This structure has fields such as the number of rules, the number of decision classes, and the number of attributes present in the rules. A set of pointers connects this data structure to a set of data structures, each representing one decision class. The decision class structure contains fields with information on the number of rules belonging to that class, the frequency of the decision class, etc. It is in turn connected to a set of data structures representing the decision rules within each decision class. The system independently creates a set of data structures, each corresponding to one attribute. Each attribute description contains the attribute's name, domain, type, the number of legal values, a list of the values, the number of rules that contain that attribute, and the values of that attribute in each rule. The attributes are arranged in an array in lexicographic order: first, in descending order of the number of rules that contain the attribute, and second, in ascending order of the number of the attribute's legal values.
The system can work in two modes. In the standard mode, the system generates standard decision trees, in which each branch has a specific attribute-value assigned. In the compact mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this leads to simpler structures. For example, if a node assigned attribute A has a branch marked by values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The program creates "or" branches on the basis of the analysis of the value sets Vi while computing the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or mathematical combinations of the original attributes. To produce decision structures with derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The AQ17 rules may contain conditions involving attributes constructed by the program rather than those originally given.
To generate decision structures from rules, the AQDT-2 method prefers either characteristic or discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules are more suitable for building decision structures. Assume that the description of each class is in the form of a ruleset, and that this set is the initial ruleset context. The AQDT-2 algorithm is:
The AQDT-2 Algorithm

Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking measure. Select the highest-ranked attribute; let A represent this attribute.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and assign to it the attribute A. In standard mode, create as many branches from the node as there are legal values of attribute A, and assign these values to the branches. In compact mode (decision structures), create as many branches as there are disjoint value sets of this attribute in the decision rules, and assign these sets to the branches.
Step 3: For each branch, associate with it the group of rules from the ruleset context that contain a condition satisfied by the value(s) assigned to this branch. For example, if a branch is assigned value i of attribute A, then associate with it all rules containing a condition [A = i v ...]; if a branch is assigned values i v j, then associate with it all rules containing a condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in the ruleset context that do not contain attribute A, add these rules to all rule groups associated with the branches stemming from the node assigned attribute A. (This step is justified by the consensus law: [x=1] == [x=1] & [y=a] v [x=1] & [y=b], assuming that a and b are the only legal values of y.) All rules associated with a given branch constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf node and assign to it that class. If all branches of the tree have leaf nodes, stop. Otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
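The four steps above can be sketched, for standard mode, as a short recursive procedure. This is an illustrative reading of the algorithm, not the AQDT-2 code: rules are modeled as (class label, {attribute: set of values}) pairs, and `select_attribute` stands in for the LEF ranking of Section 3.3.1; all of these names are assumptions:

```python
def build_tree(rules, domains, select_attribute):
    """Steps 1-4 in standard mode; a leaf is a class label, a node is (attr, branches)."""
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:                        # Step 4: one class left -> leaf
        return classes.pop()
    attr = select_attribute(rules)               # Step 1: LEF-ranked attribute
    branches = {}
    for value in domains[attr]:                  # Step 2: one branch per legal value
        subset = []
        for cls, cond in rules:                  # Step 3: build the branch's ruleset context
            if attr not in cond:
                subset.append((cls, cond))       # rule without attr joins every branch
            elif value in cond[attr]:            # condition satisfied by the branch value
                reduced = {a: v for a, v in cond.items() if a != attr}
                subset.append((cls, reduced))    # remove the satisfied condition
        if subset:
            branches[value] = build_tree(subset, domains, select_attribute)
    return (attr, branches)
```

For two rules separable on one attribute, the sketch produces a single node with two leaves, e.g. `build_tree([('T1', {'x1': {1}}), ('T2', {'x1': {2}})], {'x1': [1, 2]}, lambda rs: 'x1')` yields `('x1', {1: 'T1', 2: 'T2'})`.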
To select an attribute to become a node of the decision tree (Steps 1 and 2 of the algorithm), the algorithm performs two independent iterations. In the first iteration, it parses all decision rules and collects information about each attribute. This information includes the importance score of the attribute, the number of rules containing the attribute, the disjoint value sets of the attribute, and the attribute values used in describing each decision class. The second iteration is performed only if the disjointness criterion is ranked first in the LEF. It evaluates each attribute's disjointness for each decision class against the other decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions formed by a single attribute-value in one rule is s (where s ≤ n, the number of attributes), and r is the total number of decision rules (in all decision classes):

r = Σ(i=1..m) Ri     (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration is

Cmpx(Iter1) = O(r * s)

In the second iteration, the disjointness is calculated between the decision classes for all attributes. The complexity of the second iteration is

Cmpx(Iter2) = O(n * m)

Assume that at each node, l is the maximum of the number of decision classes to be classified at this node and the number of rules associated with this node:

l = max {m, r}                                                         (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm for building one node of the decision tree, say the node complexity NC(AQDT), is given by

NC(AQDT) = O(l * n)

Usually l equals the number of rules associated with the given node. Thus, the AQDT complexity for building one node is a function of the number of attributes multiplied by the number of rules associated with this node.
At each level of the decision tree, these two iterations are repeated for each non-leaf node. The level complexity of the AQDT algorithm, LC(AQDT), satisfies

LC(AQDT) < O(l * n)

which is less than the complexity of generating the root of the decision tree. To explain this, consider that the maximum possible number of non-leaf nodes at one level is half the number of the initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at the lowest level each node classifies only two rules, each belonging to a different decision class. This decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level will be at least twice the number of non-leaf nodes at the previous level. Consider the level complexity of the AQDT algorithm to be (l * s * o), where o is the number of non-leaf nodes at the given level. In such cases, either (l * o ≤ r) or (l * s < r). In Figure 3-5-a, the complexity of the AQDT algorithm at any lower level is given by

LC(AQDT) = O(2 * s * r/2) < O(n * l) = NC(AQDT)

Figure 3-5: Decision trees showing the maximum number of non-leaf nodes: a) per one level; b) per one path

Note also that after an attribute is selected to be the root of the decision structure, this attribute and all conditions containing it are removed from the data structure of the algorithm. Also, if a leaf node is generated, all rules belonging to the corresponding branch are not tested again.
Since the disjointness criterion selects the attribute that minimizes the average number of tests (ANT), the AQDT algorithm generates decision trees with the least number of levels. The number of levels of a decision tree is less than or equal to the minimum of the number of attributes and the number of rules. Let k be the number of levels in a given decision tree:

k ≤ min {n, r}                                                         (3-10)

Two cases represent the most complex situations, Figures 3-5-a and 3-5-b. In the first case, where the decision rules are divided evenly, the number of levels is a function of the logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for generating a decision tree from a set of decision rules is given by

Complexity(AQDT) = O(l * n * log r)                                    (3-11)

The other situation occurs when the generated decision tree has the maximum number of levels. The maximum possible number of levels of a decision tree equals one less than the number of decision rules, Figure 3-5-b. Using the disjointness criterion, it is unlikely to obtain such a decision tree, because it has the maximum average number of tests (ANT) that can be obtained from the same set of nodes and leaves. However, such a decision tree can be generated if the number of decision classes is one less than the number of attributes. In that case, any disjoint decision rules have a maximum length that is less than or equal to the floor of the logarithm of the number of attributes. Thus, the level complexity of this decision tree is estimated as

LC(AQDT) = O(l * log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT algorithm in such cases is given by

Complexity(AQDT) = O(l * k * log n)                                    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in this case r = n+1, log r is greater than log n. From (3-10), (3-11), and (3-12), the complexity of the AQDT algorithm is bounded by

Cmplx(AQDT) = O(r * k * log l)                                         (3-13)
3.3.3 An example illustrating the algorithm

The following simple example illustrates how AQDT is used to select the optimal set of resources for testing software. Suppose there are three tools for testing software: 1) modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four different factors that affect the selection of a tool: 1) the cost of using the tool (x1), 2) the metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of the tool (automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6 shows a sample of these rules in AQ15c format.

Table 3-1: The available tools and the factors that affect the selection process

T1 <= [x1=2] & [x2=2]  v  [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4]  v  [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1]  v  [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software
These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is supported by the requirement metric.

Rule 2: Use the first tool for testing if you can afford high cost for testing, either in the requirement or the analysis phase, and you need an automated tool.

Rule 3: Use the second tool for testing if the cost limit is low or average and the tool is supported by the system usage metric or the intractability metric.

Rule 4: Use the second tool for testing if you can afford high cost for testing, either in the requirement or the design phase, and you need a manual tool.

Rule 5: Use the third tool for testing if you are limited to low cost and the tool is supported by the error rate metric.

Rule 6: Use the third tool for testing if you can afford very high cost for testing, either in the requirement or the system usage phase, and you need a semi-automated tool.
Table 3-2 presents information on these rules and the disjointness values for all attributes. For each class, the row marked "Values" lists the values occurring in the ruleset for that class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not contain attribute A is treated as having an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1 has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume the tolerances for each elementary criterion equal 0.
From the rules in Figure 3-6 we can also determine the disjoint groupings of attribute-values used in the compact mode of the algorithm. This is done as follows: 1) determine, for each attribute, the sets of values that the attribute takes in the individual decision rules, and 2) remove those value sets that subsume other value sets. The remaining value sets are assigned to branches stemming from the node marked by the given attribute. For example, x1 has the following value sets in the individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). The value set {1, 2} is removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case, branches are assigned the value sets {1}, {2}, and {3, 4}.
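The grouping procedure just described (collect the value sets, drop any set that subsumes another) can be sketched in a few lines; the function name and set-based encoding are my own:

```python
def branch_value_sets(value_sets):
    """Keep only value sets that do not strictly contain another set; sort for display."""
    sets = [frozenset(s) for s in value_sets]
    kept = [s for s in sets
            if not any(other < s for other in sets)]   # drop strict supersets (subsuming sets)
    return sorted(set(kept), key=sorted)
```

On the x2 value sets from the example, `branch_value_sets([{2}, {3, 4}, {1}, {1, 2, 3, 4}])` drops {1, 2, 3, 4} and returns the branch groups {1}, {2}, and {3, 4}, as in the text.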
Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the tree. Four branches are created, each corresponding to one of x1's possible values. Since all rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf, T3. Rules containing other values of x1 belong to more than one class. This process is repeated for each subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned by AQDT-2 from the rules in Figure 3-6. This decision structure can be used in making decisions on which tools to use for testing a given software system.
[Figure: decision structure rooted at x1. Complexity: number of nodes = 4, number of leaves = 7.]
Figure 3-7: A decision structure learned for classifying software testing tools
Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each representing one combination of attribute-values. Attributes and their legal values are shown on scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3, and 4). Rules are represented by collections of cells at the intersection of the rows and columns corresponding to the conditions in the rules.

The shaded areas correspond to decision rules; rules of the same class have the same type of shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For illustration, collections of cells corresponding to some of the initial rules are marked R11, R21, R31, and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1] & [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] & [x4=3].
Figure 3-8: a) Decision rules; b) Derived decision tree
Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more general description of the concepts T1, T2, and T3 than the original rules. Let us assume that it is very costly to determine which metrics support the required tools; in other words, suppose that we would like to select the best tools independently of the metrics they support (this is indicated to AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be either T1 or T2. However, for the value 3 of x1, one can make a specific decision after measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the recommended tool is T2. Figure 3-9-b shows another decision tree, in which the type of the tool was removed from the data. Such decision trees are called indeterminate, because some of their leaves are assigned a disjunction of two or more class names.
[Figure panels, both rooted at x1: a) ignoring the supporting metric; b) ignoring the type of the tool.]
Figure 3-9: Decision trees learned ignoring the support metric and the type of the testing tool
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot be measured. The algorithm selects x4 as the root of the new decision tree. After continuing the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate situations in which it is possible to make a specific decision without knowing the value of attribute x1.

Figure 3-10: A decision tree learned without the cost attribute (root: x4)
3.4 Tailoring Decision Structures to the Decision-Making Situation

Decision structures are among the simplest structures for organizing a decision-making process. A decision structure specifies explicitly the order in which attributes of an object or situation need to be evaluated in the process of determining a decision. A standard way to generate a decision structure is to learn it from examples of decisions. Such a process usually aims at obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of assigning correct decisions to given situations. There can usually be a large number of logically equivalent decision structures (Michalski, 1990). As such, they may have the same predictive accuracy but differ in the way they organize the decision process, and thus may differ in the cost of arriving at a decision. To minimize the average decision cost, one needs to take into consideration the distribution of the costs of attribute evaluation and the frequency of different decisions. This report presents an approach to building such task-oriented decision structures, which advocates that they be built not from examples but rather from decision rules. Decision rules are learned from examples using the AQ15 or AQ17 inductive learning programs, or are specified by an expert. An efficient algorithm was developed and implemented in a new system, AQDT-2, that transforms decision rules into task-oriented decision structures. The system is illustrated by applying it to the problem of learning decision structures in the area of construction engineering (for determining the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other programs applied to the same data.
Decision-making situations can vary in several respects. In some situations, complete information about a data item is available (i.e., the values of all attributes are specified); in others, the information may be incomplete. To reflect such differences, the user specifies a set of parameters that control the process of creating a decision structure from decision rules. AQDT-2 provides several features for handling different decision-making problems: 1) generating a decision structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown" leaves in situations where there is insufficient information for generating a complete decision structure; 3) providing the user with the most likely decision when performing a required test is impossible; and 4) providing alternative decisions, with an estimate of the likelihood of their correctness, when the needed information cannot be provided.
341 Learning Costmiddot Dependent Decision Structures
As described in Section 2, the LEF criterion enables the system to take into consideration the cost of measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the test cost is the first criterion and its tolerance is 0. This means that only the least expensive attributes pass to the next step of evaluation, involving other elementary criteria. If an attribute has a high cost or is impossible to measure (has an infinite cost), the LEF chooses another, cheaper attribute if possible.
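The cost-first filtering described above can be sketched as follows. This is a minimal illustration of a lexicographic evaluation functional (LEF) with tolerances; the function name and the attribute costs are illustrative assumptions, not AQDT-2's actual implementation.

```python
def lef_select(attributes, criteria):
    """Select attributes by a lexicographic evaluation functional (LEF).

    `criteria` is an ordered list of (score_fn, tolerance) pairs, where
    score_fn maps an attribute to a value for which LOWER is better
    (e.g., measurement cost).  At each step only attributes within
    `tolerance` of the best score survive to be judged by the next
    elementary criterion.
    """
    candidates = list(attributes)
    for score_fn, tolerance in criteria:
        scores = {a: score_fn(a) for a in candidates}
        best = min(scores.values())
        candidates = [a for a in candidates if scores[a] - best <= tolerance]
        if len(candidates) == 1:
            break
    return candidates

# Illustration: cost is the first criterion with tolerance 0, so only the
# cheapest attributes pass; an attribute with infinite cost is never chosen
# while a cheaper alternative exists.  All numbers here are made up.
cost = {"x1": 2.0, "x2": 1.0, "x3": 1.0, "x4": float("inf")}
disjointness = {"x1": 9, "x2": 5, "x3": 7, "x4": 12}
survivors = lef_select(
    ["x1", "x2", "x3", "x4"],
    [(lambda a: cost[a], 0.0),            # least-cost filter first
     (lambda a: -disjointness[a], 0.0)])  # then highest disjointness
print(survivors)  # -> ['x3']
```

Note how x4, despite having the highest disjointness, is eliminated by the cost criterion before the disjointness criterion is ever consulted.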
3.4.2 Assigning Decisions Under Insufficient Information
In decision-making situations in which one or more attributes cannot be measured, the system may not be able to assign a definite decision for some cases. If no more information can be obtained, but a decision has to be made, it is useful to know the probability distribution for different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is then chosen. The probability distribution can be estimated from the class frequency at the given node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of attribute values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m, at that node, given that an example to be classified has attribute values assigned to branches b1, b2, ..., bk in the decision structure. Using the Bayesian formula, we have
P(Ci | b1, ..., bk) = P(Ci) P(b1, ..., bk | Ci) / P(b1, ..., bk)     (3-9)
where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the example has attribute values b1, ..., bk, given that it represents decision class Ci. To approximate these probabilities, let us suppose that wi is the number of training examples of class Ci that passed the tests leading to this node, and twi is the total number of training examples of class Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class probabilities correspond to the frequency of training examples from different classes, we have
P(Ci) = twi / Σj=1..m twj     (3-10)

P(b1, ..., bk | Ci) = wi / twi     (3-11)

P(b1, ..., bk) = Σj=1..m wj / Σj=1..m twj     (3-12)

By substituting (3-10), (3-11), and (3-12) in (3-9), we obtain

P(Ci | b1, ..., bk) = wi / Σj=1..m wj     (3-13)
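The estimate in (3-13) is straightforward to compute. The sketch below uses illustrative class names; the counts are the ones reported for the wind bracing node discussed in Section 4.2 (w1=31, w2=11, w3=0, w4=5).

```python
def class_probabilities(w):
    """Estimate P(Ci | b1,...,bk) at a node as wi / sum_j wj (Eq. 3-13),
    where w[c] is the number of training examples of class c that passed
    the tests leading to the node."""
    total = sum(w.values())
    return {c: wi / total for c, wi in w.items()}

probs = class_probabilities({"C1": 31, "C2": 11, "C3": 0, "C4": 5})
print({c: round(p, 2) for c, p in probs.items()})
# -> {'C1': 0.66, 'C2': 0.23, 'C3': 0.0, 'C4': 0.11}
```

These values match the probability estimates P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11 derived analytically in Section 4.2.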
A related method for handling the problem of unavailability of an attribute is described by Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision at the node associated with such an attribute. The method does not restructure the decision tree appropriately to fit the given decision-making situation (in this case, to avoid measuring x1). AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given decision-making situation. An example is presented in Section 4.2.
3.4.3 Coping with Noise in Training Data
The proposed methodology can be easily extended to handle the problem of learning from noisy training data. This is done based on ideas of rule truncation described in (Niblett & Bratko, 1986; Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data, the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are removed. The rule truncation method seems to result in more accurate decision structures than the decision tree pruning method, because truncation decisions are based solely on the importance of the given rule or condition for the decision-making, regardless of their evaluation order (unlike decision tree pruning, which can only prune attributes within a subtree and thus cannot freely choose attributes to prune). Examples are presented in Section 4.
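A minimal sketch of t-weight-based rule truncation might look as follows; the rule representation and the way the threshold is interpreted (as a fraction of each class's training examples) are illustrative assumptions, not AQDT-2's actual data structures.

```python
def truncate_rules(rules, total_per_class, threshold=0.10):
    """Drop rules whose t-weight (number of training examples covered)
    is at or below `threshold` (a fraction) of the training examples of
    the rule's decision class -- a sketch of rule truncation for noisy
    data."""
    kept = []
    for rule in rules:
        coverage = rule["t"] / total_per_class[rule["cls"]]
        if coverage > threshold:
            kept.append(rule)
    return kept

# Hypothetical rules: class, a condition string (unused here), t-weight.
rules = [
    {"cls": "C4", "conds": "[x1=5][x2=2]...", "t": 4},
    {"cls": "C1", "conds": "[x1=1][x6=1]...", "t": 18},
    {"cls": "C1", "conds": "[x1=5][x2=2]...", "t": 2},   # 2/31 < 10%: removed
]
print([r["t"] for r in truncate_rules(rules, {"C1": 31, "C4": 5})])
# -> [4, 18]
```

With a 10% threshold, the C1 rule covering only 2 of 31 class examples is removed, while a small rule in a small class (4 of 5 C4 examples) survives, since importance is judged relative to the class.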
3.5 Analysis of the AQDT-2 Attribute Selection Criteria
This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The analysis is done using two different problems. Both problems assume that the best attribute to be selected is known, and they test whether or not a given attribute selection criterion will rank that attribute first. The first problem was introduced by Quinlan in 1993. The problem has four attributes (see Table 2-2) and two decision classes; the best attribute to be selected is "Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two attributes and two decision classes. The problem has ambiguous examples (i.e., examples belonging to more than one decision class).
In the first problem, the following are disjoint rules learned by AQ15c from the given data:

Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity ≤ 75]
Play <= [outlook = rain] & [windy = false]
For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for different attributes of the given data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over the other attributes.
Table 3-4 shows the set of examples used in the second experiment. Note that the representation of the data here is different from the representation used by Mingers. The attribute X is better for building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of selecting the correct attribute to be the root of the tree. The goal of this experiment is first to select the correct attribute, and then to test how the given criterion may evaluate the two attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the second (Y). However, the gain ratio criterion gave X a higher score than all other criteria.
Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers' first problem. The criteria were tested when applied to both the examples and the rules learned by AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision rules for a class C, all ambiguous examples belonging to C are considered as examples that belong to C only.
The table shows that the disjointness criterion outperformed all other criteria in Table 2-6, including the gain ratio criterion, when it was applied to both the original examples and the rules learned from these examples. It was clear that neither the importance score nor the value distribution criterion would perform better in the case of evaluating the training examples. This is because the two criteria depend on the relationship between the attributes and the decision rules, which cannot be measured from examples. When these criteria are applied to the learned rules, they provide very good results.
The disjointness criterion selects the attribute that best discriminates between the decision classes. The importance criterion gives the highest score to the attribute that appears in rules covering the largest number of examples. The value distribution criterion ranks first the attribute which has the maximum balanced appearance of its values in different rules. The dominance criterion prefers attributes that exist in large numbers of elementary rules. Table 3-6 shows the possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best ranked first.
In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of an attribute, and R is the total number of elementary rules.
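The idea behind the disjointness criterion can be sketched in code. The pairwise scoring used below (3 for disjoint value sets, 2 for partial overlap, 1 for containment, 0 for identical sets) is an assumption that captures the spirit of the criterion; the exact scores used by AQDT-2 may differ. A rule that does not mention the attribute is treated as allowing all of its legal values, as described in Section 4.2.

```python
def value_sets(rules, attr, all_values):
    """Values of `attr` appearing in each class's rules; a rule that does
    not mention `attr` is treated as allowing all of its legal values."""
    sets = {}
    for cls, conds_list in rules.items():
        s = set()
        for conds in conds_list:
            s |= set(conds.get(attr, all_values))
        sets[cls] = s
    return sets

def disjointness(rules, attr, all_values):
    """Sketch of an attribute-disjointness score: compare the value sets
    of `attr` for every ordered pair of classes."""
    vs = value_sets(rules, attr, all_values)
    score = 0
    for ci in vs:
        for cj in vs:
            if ci == cj:
                continue
            a, b = vs[ci], vs[cj]
            if not a & b:
                score += 3            # disjoint: best discriminator
            elif a == b:
                score += 0            # identical: useless
            elif a <= b or b <= a:
                score += 1            # one contains the other
            else:
                score += 2            # partial overlap
    return score

# Toy rules: x1's value sets for P and N are disjoint; x2's are identical.
rules = {"P": [{"x1": [1]}], "N": [{"x1": [2], "x2": [1, 2]}]}
print(disjointness(rules, "x1", [1, 2]),
      disjointness(rules, "x2", [1, 2]))  # -> 6 0
```

The attribute whose value sets best separate the classes (here x1) receives the highest score, which is exactly the behavior the criterion is designed to reward.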
3.6 Decision Structures vs. Decision Trees
This subsection introduces a comparison between the decision structures proposed in this thesis and traditional decision trees. Even though systems for learning decision structures may be more complex and may take more time to generate decision structures from examples, they have many other advantages. Table 3-7 shows a comparison between the two approaches.
[Table 3-7 Comparison between Decision Structures and Decision Trees]
Another important issue is that simpler decision trees are not necessarily equivalent to simpler decision structures. For example, consider the two decision structures in Figure 3-11. Both structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes; this decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent to a decision tree with 37 nodes.
[Figure 3-10 Decision structures learned by AQDT-2 using different criteria (P = Positive, N = Negative): a) using the disjointness criterion (root x5; 5 nodes); b) using the importance score criterion (root x1; 7 nodes, 9 leaves)]
To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the Imam's example, that represents a class of problems in which the information gain criteria of decision tree learning programs do not work properly. The basic idea behind this example is that the information-based criteria rely on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.
The concept to learn is: "P" if x1 = x2, and "N" otherwise.
The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "−" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per each value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples per each of the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select either x1 or x2. When C4.5 ran with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.
[Figure 3-12 The Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples. a) Training examples; b) the optimal decision tree]
AQ15c learned the following rules from this data:

P <= [x1=1][x2=1] v [x1=2][x2=2]
N <= [x1=1][x2=2] v [x1=2][x2=1]

From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
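The failure of frequency-based criteria on such data can be reproduced in a few lines. The dataset below is a smaller XOR-style stand-in for the Imam's example, not the original 24 examples; the point is only that the information gain of both relevant attributes is zero, so a gain-based learner has no basis for preferring them over an irrelevant attribute.

```python
import math

def entropy(labels):
    n = len(labels)
    counts = {l: labels.count(l) for l in set(labels)}
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(examples, attr):
    """Classic information gain: class entropy minus the expected entropy
    after splitting on `attr`."""
    labels = [e["cls"] for e in examples]
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        part = [e["cls"] for e in examples if e[attr] == v]
        remainder += len(part) / len(examples) * entropy(part)
    return entropy(labels) - remainder

# XOR-style data in the spirit of the Imam's example: the target is
# P iff x1 = x2; x3 is irrelevant but unevenly distributed.
data = [{"x1": a, "x2": b, "x3": c, "cls": "P" if a == b else "N"}
        for a in (1, 2) for b in (1, 2) for c in (1, 1, 2)]
print(round(info_gain(data, "x1"), 3),
      round(info_gain(data, "x2"), 3))  # -> 0.0 0.0
```

Each value of x1 (or x2) alone covers exactly half P and half N examples, so the gain of both jointly relevant attributes is zero, even though together they determine the class perfectly.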
An example of problems in which decision trees may not be an efficient way to represent knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:

P <= [x1=2][x2=2]
N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]

Using different measures of complexity, it is clear that for such a problem, learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples (10:9); 2) comparing the average number of tests required to make a decision (1.3 for the decision tree and .85 for the decision rules); 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).
When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined, using the new attribute "x1=x2=2" with values 0 for "no" and 1 for "yes".
[Figure 3-13 An example where decision rules are simpler than decision trees. a) The training data; b) the correct decision tree]
CHAPTER 4 Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and applying different settings of the system's parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.
The experiments were applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes); MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracing for tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.
In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their prediction accuracy.
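The sampling procedure can be sketched as follows. This is a simplified stand-in for the actual experimental scripts; the function name and the use of a seeded generator are illustrative assumptions.

```python
import random

def learning_curve_splits(examples, fractions=(0.1, 0.2, 0.3, 0.4, 0.5,
                                               0.6, 0.7, 0.8, 0.9),
                          samples_per_size=100, seed=0):
    """For each relative training size, draw `samples_per_size` random
    training sets; the complement of each training set is used for
    testing, as in the experimental design described above."""
    rng = random.Random(seed)
    for frac in fractions:
        k = round(frac * len(examples))
        for _ in range(samples_per_size):
            train = rng.sample(examples, k)
            test = [e for e in examples if e not in train]
            yield frac, train, test

examples = list(range(20))                  # stand-in for a dataset
frac, train, test = next(learning_curve_splits(examples))
print(frac, len(train), len(test))  # -> 0.1 2 18
```

With 9 sizes and 100 samples per size, this yields the 900 training sets (and their 900 complementary testing sets) per problem mentioned below.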
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problems) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.
Figure 4-1 shows a description of the planned experiments done on the first set of problems. The best settings (the best path from top to bottom), in terms of accuracy, time, and complexity, were used as the default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.

Figure 4-1 Design of a complete experiment
For each of these experiments, the testing examples were selected as the complementary set of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. The analysis of some experiments included visualization of the training examples and the concept learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned for a given set of training examples.
For each problem (i.e., database): 9 different relative sample sizes of training examples are selected (10%, ..., 90%); 100 random samples of each size are drawn from the original data for training; the 100 samples which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).

162 different parametrical experiments per training dataset (18 x 9)
16,200 experiments per sample size (9 sample sizes)
145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2)
199,800 experiments per problem (first portion + C4.5 + constructive induction)
999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.)
73 days (estimated running time)
The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection following that will describe a partial or full experimental analysis of one of the other problems.
4.2 Experiments with an Average-Size, Complex, and Noise-Free Problem: Wind Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.
Decision class C1:
1 [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1,3] (t:18, u:18)
2 [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t:3, u:3)
3 [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t:2, u:2)
4 [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t:2, u:2)
5 [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t:2, u:2)
6 [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t:2, u:2)
7 [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t:2, u:2)

Decision class C2:
1 [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t:28, u:19)
2 [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t:17, u:6)
3 [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t:10, u:4)
4 [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t:10, u:2)
5 [x1=3,5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t:9, u:4)
6 [x1=2][x2=1,2][x3=1,2][x5=1,2,3][x4=1][x6=1][x7=1] (t:7, u:6)
7 [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t:6, u:4)
8 [x1=3,5][x2=2][x3=1][x7=1][x4=1,2][x5=1,2,3][x6=1,3] (t:5, u:5)
9 [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t:4, u:4)
10 [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1,3] (t:4, u:4)
11 [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t:4, u:2)

Decision class C3:
1 [x1=2..5][x2=1,2][x3=1,2][x7=1..4][x4=1,2][x5=1,3][x6=2,4] (t:41, u:32)
2 [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2,4] (t:27, u:20)
3 [x1=1,3][x2=1][x3=1,2][x7=1,4][x4=2][x5=1,2][x6=2,3] (t:19, u:6)
4 [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t:13, u:8)
5 [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2,4] (t:5, u:5)

Decision class C4:
1 [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1,4] (t:4, u:4)
2 [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t:1, u:1)

Figure 4-2 Decision rules determined by AQ15c from the wind bracing data
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest, and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by the values of x6 (in general, these could be groups of values), according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for a branch until all rules assigned to that branch are of the same class. That class is then assigned to the leaf.
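The top-down expansion loop just described can be sketched as follows. The rule representation and the attribute-selection callback are illustrative simplifications of AQDT-2 (value grouping, subsumption removal, and the full LEF machinery are omitted); a rule that does not mention the tested attribute matches every branch, per the convention stated earlier.

```python
def build_structure(rules, select_attr):
    """Sketch of the top-down loop: pick a test with `select_attr`,
    branch on its values as they occur in the rules, assign each branch
    the rules consistent with that value, and stop when one class
    remains."""
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:            # leaf: all assigned rules agree
        return classes.pop()
    attr = select_attr(rules)
    node = {"test": attr, "branches": {}}
    values = sorted({v for _, conds in rules for v in conds.get(attr, [])})
    for v in values:
        # A rule without `attr` matches any value; drop the tested
        # attribute from the conditions passed down the branch.
        subset = [(cls, {a: vs for a, vs in conds.items() if a != attr})
                  for cls, conds in rules
                  if v in conds.get(attr, [v])]
        node["branches"][v] = build_structure(subset, select_attr)
    return node

# Toy rules (class, conditions): x6 isolates C3; x1 settles the rest.
rules = [("C1", {"x6": [1], "x1": [1]}),
         ("C2", {"x6": [1], "x1": [2]}),
         ("C3", {"x6": [4]})]
tree = build_structure(
    rules,
    lambda rs: "x6" if any("x6" in conds for _, conds in rs) else "x1")
print(tree["test"], tree["branches"][4])  # -> x6 C3
```

In the real system the `select_attr` callback is the LEF over the four elementary criteria, recalculated at each node for the rules assigned to that branch.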
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 were misclassified).
Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf, C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated, only for those rules which contain [x6=1] as one of their conjunctions. In this example, x1 has the highest importance score, so it was selected to be a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.
For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some unclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatches.
[Figure 4-3 A decision tree learned by C4.5 for the wind bracing data (root x6; complexity: 17 nodes, 43 leaves)]
Figure 4-4 shows a decision structure learned, with the default settings of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the (indefinite) decision "?". The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.
[Figure 4-4 A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves)]

[Figure 4-5 A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves)]
Figure 4-6 presents the decision structure from Figure 4-5 in which the leaves were assigned candidate decisions with decision class probability estimates. Let us consider the node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (3-13), the probability estimates for classes C1, C2, C3, and C4 under the node x2 can be approximated as P(C1) = .66, P(C2) = .23, P(C3) = 0, and P(C4) = .11.
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that the rules whose combined t-weight represented 10% or less of the coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
[Figure 4-6 A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves)]

[Figure 4-7 A decision structure learned from the wind bracing rules after pruning all rules covering less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves)]
To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified by using only the four attributes which were used in building the initial decision trees. The visualization diagram uses different shades for different decision classes. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided, with their appropriate probabilities.
[Figure 4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data; "?" means the system cannot produce a decision without the missing attribute]
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings for AQ15c (two types of decision rules--characteristic or discriminant; three coverage modes--intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths--1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Disj, 10> and <Chr, Intr, 1>; see Table 4-2) were selected for the experiments with Subsystem II.
These experiments were performed on four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of the rules learned by AQ15c from examples and the predictive accuracy of the decision structures learned by AQDT-2 from the decision rules. Each value in this table is an average of the predictive accuracy over 100 runs of either one of the two programs on 100 distinct, randomly selected training datasets of the given size. Each of these runs was tested with the testing examples that represent the complement of the training example set.
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in the intersecting or disjoint modes. For each dataset, the results reported from each experiment are calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
[Figure 4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings (<Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>) for the wind bracing problem; each plot shows predictive accuracy against the relative sample size (%) of the training data]
Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve means the predictive accuracy obtained with the default settings of AQDT-2. The default pre-pruning threshold is 3, and the default generalization degree is 10. The results show that, with the wind bracing data, it is better to reduce the generalization degree to 3. However, changing the pre-pruning degree did not improve the predictive accuracy.
Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For
each data set, we reported the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.
[Figure 4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data; the plots show predictive accuracy against the relative sample size (%) of the training data]
[Figure 4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data; the plots show results against the relative size of the training examples (%)]
4.3 Experiments with Small-Size, Simple, and Noise-Free Problems: MONK

This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no).
The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus, the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c from the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.
[Figure 4-12 A visualization diagram of the MONK-1 problem]
The AQDT-2 program, running in its default mode and with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with 41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem.
Positive rules:
1 [x5 = 1]
2 [x1 = 3][x2 = 3]
3 [x1 = 2][x2 = 2]
4 [x1 = 1][x2 = 1]

Negative rules:
1 [x1 = 1][x2 = 2,3][x5 = 2..4]
2 [x1 = 2][x2 = 1,3][x5 = 2..4]
3 [x1 = 3][x2 = 1,2][x5 = 2..4]

Figure 4-13 Decision rules learned by AQ15c from the MONK-1 problem
The C4.5 program did not produce a consistent and complete decision tree when run with its
default window size (the maximum of 20% of the number of examples and twice the square root of
the number of examples), nor with a 100% window size. After 10 trials with different window sizes,
we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window
size of 72.5%). This tree is presented in Figure 4-14. Also in the same experiment, AQ17-DCI
(Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI
generates a new attribute that takes the value T when the value of x1 equals the value of x2, and
takes the value F otherwise. These rules were:

Pos <= [x5 = 1] v [x1 = x2]    and    Neg <= [x5 ≠ 1] & [x1 ≠ x2]
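As a sanity check (a sketch added here, not part of the original experiments), the logical equivalence between the AQ17-DCI rule and the four positive rules of Figure 4-13 can be verified by enumerating the full 432-example space; attribute values are encoded as small integers as in the rules above.

```python
from itertools import product

# Attribute domains for MONK-1 (values encoded as integers, as in the rules above)
domains = {"x1": [1, 2, 3], "x2": [1, 2, 3], "x3": [1, 2],
           "x4": [1, 2, 3], "x5": [1, 2, 3, 4], "x6": [1, 2]}

def aq17_positive(e):
    # Rule derived via constructive induction: Pos <= [x5=1] v [x1=x2]
    return e["x5"] == 1 or e["x1"] == e["x2"]

def aq15c_positive(e):
    # The four positive rules of Figure 4-13
    return (e["x5"] == 1
            or (e["x1"] == 3 and e["x2"] == 3)
            or (e["x1"] == 2 and e["x2"] == 2)
            or (e["x1"] == 1 and e["x2"] == 1))

space = [dict(zip(domains, vals)) for vals in product(*domains.values())]
assert len(space) == 432                                          # full representation space
assert all(aq17_positive(e) == aq15c_positive(e) for e in space)  # logically equivalent
```

Counting positives over the full space confirms that the concept covers exactly half of it (216 examples).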
Table 4-3 Evaluation of the AQDT-2 attribute selection criteria for the MONK-1 problem
From these rules the system produced the compact decision structure presented in Figure 4-15-b.
It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically
equivalent, and they all have 100% prediction accuracy on the testing examples (which means that
they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a
simpler decision structure was produced (Figure 4-15-a).
[Decision tree diagram: tests on x5 and x1; complexity: 13 nodes, 28 leaves; P = Positive, N = Negative]
Figure 4-14 The decision tree for the MONK-1 problem generated by AQDT-2
[a) complexity: 5 nodes, 7 leaves; b) complexity: 2 nodes, 3 leaves; P = Positive, N = Negative]
a) Compact decision structure for AQ15 rules   b) Compact decision structure for AQ17 rules
Figure 4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem
Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments
involved running AQ15c for a set of learning problems with 18 different parameter settings for
AQ15c (two types of decision rules: characteristic or discriminant; three coverage modes:
intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and
10). The two settings that gave the best results in terms of predictive accuracy (<Ch, Dij, 10> and
<Ch, Int, 1>) were selected for experiments with Subsystem II. These experiments were
performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by
AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from
the decision rules.
Each value in that table is an average predictive accuracy over 100 runs of either program on 100
distinct, randomly selected training data sets of the given size. Each of these runs was tested with
a testing example set that represented the complement of the training example set. Figure 4-16
shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2
using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint
covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.
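The evaluation protocol described above (repeated random training samples, each tested on the complement of the sample) can be sketched as follows; the function and parameter names are illustrative, not from the original systems.

```python
import random

def complement_split(examples, fraction, rng):
    """Draw a random training sample of the given relative size; the
    complement of the sample serves as the testing set."""
    n_train = round(fraction * len(examples))
    train_idx = set(rng.sample(range(len(examples)), n_train))
    train = [examples[i] for i in train_idx]
    test = [examples[i] for i in range(len(examples)) if i not in train_idx]
    return train, test

def average_accuracy(examples, fraction, learn, classify, runs=100, seed=0):
    """Average predictive accuracy over `runs` random splits, mirroring
    the experimental protocol described above (a sketch)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        train, test = complement_split(examples, fraction, rng)
        model = learn(train)
        total += sum(classify(model, x) == y for x, y in test) / len(test)
    return total / runs
```

Any learner can be plugged in through the `learn`/`classify` callables; a trivial constant classifier, for instance, lets the harness itself be checked independently of AQ15c or AQDT-2.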
[Four plots of predictive accuracy vs. relative training sample size (%), one for each setting: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>]
Figure 4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 Problem
Experiments with Subsystem II: The same experiments were performed on the MONK-1
problem. The parameters of Subsystem I were fixed, and selected parameters of Subsystem II were
modified. The experiments were performed on characteristic decision rules that were learned in
intersecting or disjoint modes. For each data set, the results reported from each experiment
were calculated as the average over 100 runs on different training data for 9 different sample sizes.
The parameters changed in this experiment were the threshold of pre-pruning of the decision
rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure
4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with
different parameter settings. The default curve shows the predictive accuracy obtained with the
default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization
degree is 10%. The results show that with the MONK-1 data it is slightly better to reduce the
generalization degree to 3%. However, increasing the pre-pruning degree did not improve the
predictive accuracy.
[Two plots of predictive accuracy vs. relative training sample size (%) for MONK-1, <Disj, Char> and <Intr, Char>, comparing the default setting with different pre-pruning and generalization degrees]
Figure 4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to
their default parameters. The experiments were divided into two parts. All the results reported here
are the average of 100 runs. For each data set we report the predictive accuracy, the complexity
of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary
of these experiments.
[Three plots vs. relative training sample size (%) for MONK-1: predictive accuracy, tree complexity, and learning time of AQDT-2 and C4.5]
Figure 4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem
4.4 Experiments With Small Size, Complex, and Noise-Free Problems: MONK-2

The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be
easily described as a DNF expression using its original attributes). The problem is described in a
similar way to the MONK-1 problem. The data consists of two decision classes, Positive and
Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape
(values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding
(values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and
x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training
examples. These training examples constitute 40% of all the possible
examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and
negative) and the concept to be learned.
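The MONK-2 target concept as commonly stated in the MONK's problems literature (this statement is an external assumption, not taken from the text above) is "exactly two of the six attributes take their first value", a counting condition with no compact DNF over the original attributes. A sketch enumerating the full representation space:

```python
from itertools import product

# Sizes of the value sets of x1..x6 as listed above (3*3*2*3*4*2 = 432)
domains = [3, 3, 2, 3, 4, 2]

def monk2_positive(example):
    # Positive iff exactly two of the six attributes take their first value
    return sum(1 for v in example if v == 1) == 2

space = list(product(*[range(1, d + 1) for d in domains]))
assert len(space) == 432              # full representation space
positives = sum(monk2_positive(e) for e in space)   # 142 positive examples
```

Because the concept depends on how many attributes take a distinguished value rather than on any fixed attribute-value combination, every DNF over the original attributes must enumerate many disjuncts, which is why the text calls it a non-DNF-type description.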
Figure 4-19 A visualization diagram of the MONK-2 problem
Experiments with Subsystem I: The two settings that gave the best results in terms of predictive
accuracy were the same as for the other problems, <Ch, Dij, 10> and <Ch, Int, 1>. They were
selected for experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules
learned by AQ15c from examples and the predictive accuracy of decision structures learned by
AQDT-2 from these decision rules.

Each value in that table is an average predictive accuracy over 100 runs of either program on 100
distinct, randomly selected training data sets of the given size. Each of these runs was tested with a
testing example set that represented the complement of the training example set.
Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c
and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj>
means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and
the number is the width of the beam search.
[Four plots of predictive accuracy vs. relative training sample size (%), one for each setting: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>]
Figure 4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 Problem
Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and
selected parameters of Subsystem II were modified. For each data set, the results reported from each
experiment were calculated as the average over 100 runs on different training data for 9 different
sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of
the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam,
1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by
AQDT-2 with different parameter settings. The default curve shows the predictive accuracy obtained
with the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default
generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to
reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not
improve the predictive accuracy.
[Two plots of predictive accuracy vs. relative training sample size (%) for MONK-2, <Disj, Char> and <Intr, Char>, comparing the default setting with different pre-pruning and generalization degrees]
Figure 4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data
Comparative Study: This subsection presents a comparison between the decision trees
obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to
their default parameters. The experiments were divided into two parts. All the results reported here
are the average of 100 runs. For each data set we report the predictive accuracy, the complexity
of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary
of these experiments.
[Three plots vs. relative training sample size (%) for MONK-2: predictive accuracy, tree complexity, and learning time of AQDT-2 and C4.5]
Figure 4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem
4.5 Experiments With Small Size, Simple, and Noisy Problems: MONK-3

MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a
similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same
domains, and the same decision classes as the first two MONK problems. Figure 4-23 shows a
visualization diagram of the training examples (positive and negative) and the concept to be
learned. The minus signs in the shaded areas and the plus signs in the unshaded areas are considered
noisy examples. Noisy examples are examples that are assigned the wrong decision class.

Figure 4-23 A visualization diagram of the MONK-3 problem
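Class-label noise of the kind described above (examples deliberately assigned the wrong decision class) can be injected with a small helper; this is an illustrative sketch, not the procedure used to build the original MONK-3 data.

```python
import random

def add_class_noise(examples, noise_rate, classes=("Positive", "Negative"), seed=0):
    """Return a copy of (attributes, class) pairs in which roughly a
    `noise_rate` fraction of the examples is assigned a wrong decision class."""
    rng = random.Random(seed)
    noisy = []
    for attrs, label in examples:
        if rng.random() < noise_rate:
            # Replace the correct class with one of the other classes
            label = rng.choice([c for c in classes if c != label])
        noisy.append((attrs, label))
    return noisy
```

With a rate of 0.0 the data is returned unchanged, and with a rate of 1.0 every example is mislabeled, which makes the helper easy to test at the boundaries.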
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by
AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from
these decision rules. Each value in that table is an average predictive accuracy over 100 runs of
both programs on 100 distinct, randomly selected training data sets of the given size. Each of these
runs was tested with a testing example set that represented the complement of the training example
set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c
and AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments the parameters of Subsystem I (the
learning process) were fixed, and selected parameters of Subsystem II (the decision-making
process) were changed. The results reported from each experiment were calculated as the average
over 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes
in the predictive accuracy of decision structures learned by AQDT-2 with different parameter
settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The
results show that with the MONK-3 data it is usually better to reduce the generalization degree.
Also, increasing the pre-pruning threshold does not improve the predictive accuracy.
[Four plots of predictive accuracy vs. relative training sample size (%), one for each setting: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>]
Figure 4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 Problem
[Two plots of predictive accuracy vs. relative training sample size (%) for MONK-3, <Disj, Char> and <Intr, Char>, comparing different parameter settings]
Figure 4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by
AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a
summary of the predictive accuracy, the complexity of the learned decision trees, and the learning
time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the
testing data is not fixed for each sample. In other words, one error may represent only a fraction of
a percent when testing against 90% of the data, while the same error may represent several percent
when testing against only 10% of the data. These curves do not represent the learning curve.
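The dependence of a single misclassification on the size of the testing set can be made concrete; the figure of 432 examples below assumes the MONK representation space discussed in this chapter.

```python
def one_error_rate(train_fraction, total=432):
    """Error rate contributed by a single misclassified example when testing
    on the complement of the training sample (total=432 assumes the MONK
    representation space)."""
    n_test = round((1.0 - train_fraction) * total)
    return 1.0 / n_test

# Training on 10% leaves 389 test examples; training on 90% leaves only 43,
# so the same single error weighs almost ten times as much.
small = one_error_rate(0.10)   # about 0.26%
large = one_error_rate(0.90)   # about 2.33%
```

This is why curves averaged over complement-based test sets of varying size can dip at some sample sizes without reflecting any real loss of learning quality.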
[Three plots vs. relative training sample size (%) for MONK-3: predictive accuracy, tree complexity, and learning time of AQDT-2 and C4.5]
Figure 4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem
4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing
Breast Cancer

The breast cancer database is concerned with recognizing breast cancer. The data used here are
based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990).
The data has 699 examples, represented using ten attributes and grouped into two decision classes
(Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3)
Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial
Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes
except the sample code number had a domain of ten values (they were scaled).

In this experiment the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the
experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the
results reported here were based on the average of 100 runs. For each data set we report the
predictive accuracy, the complexity of the learned decision trees, and the time taken for learning.
Figure 4-27 summarizes these experiments.
The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing
data is not fixed for each sample. In other words, one error may represent only a fraction of a
percent when testing against 90% of the data, while the same error may represent several percent
when testing against only 10% of the data. These curves do not represent the learning curve.
[Three plots vs. relative training sample size (%) for the breast cancer data: predictive accuracy, tree complexity, and learning time of AQDT-2 and C4.5]
Figure 4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem
4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom
Classification

Learning from the Mushroom database involves classifying mushrooms into edible or
poisonous classes. The data was drawn from the Audubon Society Field Guide to North American
Mushrooms. The data consists of 8124 examples. A random sample of 810 examples was selected
to perform the experiment. Each example was described by 22 attributes. These attributes are: 1)
Cap-shape, 2) Cap-surface, 3) Cap-color, 4) Bruises, 5) Odor, 6) Gill-attachment, 7) Gill-spacing,
8) Gill-size, 9) Gill-color, 10) Stalk-shape, 11) Stalk-root, 12) Stalk-surface-above-ring, 13) Stalk-surface-below-ring,
14) Stalk-color-above-ring, 15) Stalk-color-below-ring, 16) Veil-type, 17) Veil-color,
18) Ring-number, 19) Ring-type, 20) Spore-print-color, 21) Population, and 22) Habitat.

To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults,
and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5.
All the results reported here are the average of 100 runs. For each data set we report the
predictive accuracy, the complexity of the learned decision trees, and the time taken for learning.
Figure 4-28 shows a simple summary of these experiments.
In this problem C4.5 produces better accuracy with more complex decision trees (almost twice the
size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees.
The average difference in accuracy is less than 2%, the average difference in tree complexity is
greater than 10 nodes, and the average learning time is about the same.
The reason that there is a drop in the predictive accuracy at some sample sizes is that
the testing data is not fixed for each sample. In other words, one error may represent only a
fraction of a percent when testing against 90% of the data, while the same error may represent
several percent when testing against only 10% of the data. These curves do not represent the learning curve.
[Three plots vs. relative training sample size (%) for the mushroom data: predictive accuracy, tree complexity, and learning time of AQDT-2 and C4.5]
Figure 4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem
4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West
Trains

Learning task-oriented decision structures from structural data: This subsection
briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision
structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was
to classify a set of trains into two classes, eastbound and westbound. The data was structured such
that each train consisted of two to four cars. Each car was described in terms of two main
features: the body of the car and the content of the car. The body of the car was described by 6
different attributes, and the load of the car was described by two attributes.

The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts
rules or examples in the form of an array of attribute-value assignments. It can also accept
examples with different numbers of attribute-value pairs (i.e., examples of different length). To
describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was
generated such that they could completely describe any car in the train (see Table 4-7). Each train
was described by one example of varying length. To recognize the number (position) of a given car
in the train, each of the eight attributes was associated with a two-digit code (ij), where the first
digit identifies the location of the car and the second identifies the number of the attribute itself. For
example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to
the second attribute (the car shape). In other words, attribute x32 is the label of the attribute
describing the shape of the third car.
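The two-digit coding scheme above can be sketched as follows; this is a hypothetical reconstruction for illustration, and the helper names are not from the original AQDT-2 implementation.

```python
# Hypothetical reconstruction of the two-digit attribute coding described
# above: code ij, where i is the car position (1-4) and j is the attribute
# number (1-8).
CARS, ATTRS = 4, 8

def attribute_name(car, attr):
    assert 1 <= car <= CARS and 1 <= attr <= ATTRS
    return f"x{car}{attr}"

def describe_train(cars):
    """Flatten a train (a list of 8-value attribute lists, one list per car)
    into a single variable-length attribute-value example."""
    example = {}
    for i, car in enumerate(cars, start=1):
        for j, value in enumerate(car, start=1):
            example[attribute_name(i, j)] = value
    return example
```

A two-car train then yields 16 attribute-value pairs, a four-car train 32, so examples of different lengths arise naturally, matching the variable-length input format described above.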
Table 4-7 The set of attributes and their values used in the trains problem (i stands for the car number, 1-4)
Decision-making situations: In the first decision-making situation, a decision structure that
classifies any given train as either eastbound or westbound was learned using only attributes
describing the first car (Figure 4-29-a). This decision structure correctly classified 19 trains (out
of 20). The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was
hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation
where only attributes describing the second car are used in classifying the trains. It correctly
classified 18 of the trains.

Both decision structures have leaves with multiple decisions, which means there are identical first or
second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using
attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or
more (14/14). In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were
given lower cost than x31. Both decision structures classified the 14 trains with three or more cars
correctly. These last two decision structures classified any train with three or more cars correctly,
and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).
[Decision structure diagrams: a) learned using only descriptions of Car 1 (4 nodes, 9 leaves); b) learned using only descriptions of Car 2; c) learned using only descriptions of Car 3 (6 leaves)]
Figure 4-29 Decision structures learned by AQDT-2 for different decision-making situations
4.9 Experiments With Small Size, Simple, and Noisy Problems: Congressional
Voting Records (1984)

In the Congressional Voting problem, each example was described in terms of 16 attributes. There
were two decision classes and a total of 216 examples. The experiments tested the change in the
number of nodes and the predictive accuracy when varying the number of training examples used for
generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two
window options: the default option (the maximum of 20% of the number of examples and twice the
square root of the number of examples) and a 100% window size (one trial per setting). In
the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%,
24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of
the examples were in one class and the second half in the other class).
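The default window rule quoted above can be written out directly; the rounding used below is an assumption of this sketch rather than a detail stated in the text.

```python
import math

def default_window_size(n_examples):
    """C4.5's default initial window, as described above: the maximum of
    20% of the number of examples and twice the square root of the number
    of examples (rounding to the nearest integer is an assumption)."""
    return max(round(0.2 * n_examples), round(2 * math.sqrt(n_examples)))
```

For the 216-example Congressional Voting data this gives max(43, 29) = 43 examples in the initial window, so the 20% term dominates once the data set is even moderately large.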
Table 4-8 and Figures 4-30-a and 4-30-b show the results graphically for the Congressional Voting-1984
problem. The results indicate that the decision trees generated by AQDT-2 had higher predictive
accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of
AQDT-2's trees with the change of the size of the training example set was smaller.
Table 4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data

[Two plots vs. relative size of the training examples (%): a) accuracy of the decision tree as a function of the size of the set of training examples; b) size of the decision tree as a function of the size of the set of the training examples]
Figure 4-30 Comparing decision trees for the Congressional Voting-84 data learned by C4.5 & AQDT-2
4.10 Analysis of the Results

This section includes an analysis of the results presented in Sections 4-2 to 4-9. The analysis
covers the relationship between different characteristics of the input data and the learning
parameters for both subfunctions of the approach. A set of visualization diagrams is used to
illustrate the relationship between concepts represented by decision rules and concepts represented
by the decision trees learned from these rules. This section also includes some examples of
describing different decision-making situations and the task-oriented decision structures learned for
each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases.
The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2
from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5,
and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the
difference in predictive accuracy between two widths of the beam search is less than 2%, then the
smaller is better. Another was: if the predictive accuracy of different types of covers varies
(i.e., for one type of covers it is higher with some widths of the beam search or with a certain
rule type, and lower with others), the best cover is determined
according to the best width of the beam search and the best rule type.
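The first heuristic above can be stated as a small selection rule; the generalization from a pairwise comparison to a set of candidate widths is my own, offered only as a sketch.

```python
def prefer_beam_width(acc_by_width, tolerance=2.0):
    """Heuristic 1 from the text, generalized: among beam widths whose
    predictive accuracy (in percent) is within `tolerance` of the best,
    prefer the smallest (simplest) width."""
    best_acc = max(acc_by_width.values())
    candidates = [w for w, acc in acc_by_width.items() if best_acc - acc < tolerance]
    return min(candidates)
```

For example, with accuracies {1: 93.0, 5: 94.1, 10: 94.5} all three widths fall within the 2% tolerance and width 1 is chosen, whereas dropping width 1 to 90.0% would shift the choice to width 5.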
Table 4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics
It was clear that AQDT-2 works better with characteristic rules than with discriminant ones. In most
problems, when changing the width of the beam search of the AQ15c system, the changes in the
predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better
than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting
rules were slightly bigger than those learned from disjoint rules.
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of
heuristics was used to summarize the results. These heuristics are: 1) if the difference between
the average predictive accuracies of the two systems is within ±2%, the predictive accuracy is
considered to be the same; otherwise, the predictive accuracy is considered higher or lower; 2) if the
average learning times are within ±0.1 seconds, the learning time is considered the same.

Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary
includes comparing the predictive accuracy, the size of the learned decision trees, and the learning
time. The value in each cell refers to the system which performed better (possible values are AQDT-2,
C4.5, and Same). When the two systems produced similar or close results, a letter is associated
with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).
Table 4-10 Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same-X means similar performance, where AQDT-2 has the advantage if X=A and C4.5 has the advantage if X=C
Some conclusions can be drawn from this comparison. When the training data represents a small
portion of the representation space, AQDT-2 produces bigger but accurate decision trees,
while C4.5 produces smaller but less accurate decision trees. When the training data
represents a very large portion of the representation space, AQDT-2 usually produces smaller
decision trees with better accuracy, except with noisy data. The size of decision trees learned by
C4.5 grows relatively faster as the training data increases. Also, C4.5 works better than AQDT-2
with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules and
C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is supposed to be
much less than that of C4.5. However, on some data sets it takes more time, because in
situations where there is not enough information to reach a decision the program goes into a loop of
testing all attributes. The probabilistic approach for handling this problem is not yet implemented.

To explain the relationship between the input to and the output from AQDT-2, and to explain some
of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of
diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.
Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2
system. The experiment contains 169 training examples covering both the positive and negative decision
classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The
shaded areas represent decision rules of the positive decision class. The white areas represent
non-positive coverage.

Figure 4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. Marked shaded cells
indicate false positive errors (AQ15c classifies the cell as positive when it should be
negative); marked unshaded cells indicate false negative errors (AQ15c classifies the
cell as negative when it should be positive).

Figure 4-32 A visualization diagram showing the testing errors for the rules in Figure 4-31
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, one shading indicates portions of the representation space that were classified as positive by both AQ15c and AQDT-2, another marks portions that were classified as positive by AQ15c but as negative by AQDT-2, and a third represents portions of the representation space where AQDT-2 overgeneralized decision rules belonging to the positive decision class. The decision tree shown by this diagram was learned with default settings (i.e., with a generalization threshold of 10).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, overgeneralizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.
Figure 4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem
Figure 4-34 is similar to Figure 4-33, but with illustration of the false positive and false negative errors. Cells with one mark indicate portions of the representation space with false positive errors; cells with the other mark represent portions of the representation space with false negative errors. Comparing Figures 4-34 and 4-32 shows that more errors occurred because of the overgeneralization.
Figure 4-34 A visualization diagram showing the testing errors of the AQDT-2 decision tree
Figure 4-35 A visualization diagram of a decision tree learned by AQDT-2
after reducing the generalization degree to 1
CHAPTER 5 CONCLUSION
5.1 Summary
This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.
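The distinction between a decision structure and a decision tree can be made concrete with a small sketch (a hypothetical Python rendering for illustration only; the class names and the attributes x1 and x2 are invented, and this is not the implementation described in the thesis):

```python
# Hypothetical sketch of a decision structure (not the AQDT-2 code).
# Internal nodes test an attribute and map sets of outcomes to children;
# leaves hold one or more candidate decisions with probabilities; a child
# may be shared by several parents, so the whole is a DAG, not a tree.

class Leaf:
    def __init__(self, decisions):
        self.decisions = decisions      # dict: decision -> probability

class Node:
    def __init__(self, attribute, branches):
        self.attribute = attribute
        self.branches = branches        # list of (outcome_set, child) pairs

def classify(node, example):
    """Follow matching branches until a leaf; return its candidate decisions."""
    while isinstance(node, Node):
        value = example[node.attribute]
        node = next(child for outcomes, child in node.branches if value in outcomes)
    return node.decisions

# The leaf below has two parents, something a decision tree cannot express
# without duplicating the subtree.
shared = Leaf({"positive": 1.0})
structure = Node("x1", [
    ({1, 2}, shared),
    ({3}, Node("x2", [
        ({1}, shared),
        ({2, 3}, Leaf({"negative": 0.9, "positive": 0.1})),
    ])),
])

assert classify(structure, {"x1": 2}) == {"positive": 1.0}
assert classify(structure, {"x1": 3, "x2": 2}) == {"negative": 0.9, "positive": 0.1}
```

When every node has one parent, every branch carries a single value, and every leaf a single decision, this structure degenerates into an ordinary decision tree.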
The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated online, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology that, in order to determine a decision structure from examples, it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the methods in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is
usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first, and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.
5.2 Contributions
The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of
the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time of determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.
Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program in most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.
The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of them, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), Constructive Induction in Structural Design, Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.
Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990), Integrated Learning in a Real Domain, Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.
Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System, Machine Learning, Vol. 8, No. 1, pp. 5-43.
Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), AQ17: A Multistrategy Learning System: The Method and User's Guide, Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.
Bohanec, M. and Bratko, I. (1994), Trading Accuracy for Simplicity in Decision Trees, Machine Learning, Vol. 15, No. 3, Kluwer Academic Publishers.
Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.
Bratko, I. and Kononenko, I. (1986), Learning Diagnostic Rules from Incomplete and Noisy Data, in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Wadsworth Int. Group, Belmont, California.
Clark, P. and Niblett, T. (1987), Induction in Noisy Domains, in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.
Cestnik, B. and Bratko, I. (1991), On Estimating Probabilities in Tree Pruning, Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.
Cestnik, B. and Karalic, A. (1991), The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction, Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.
Gaines, B. (1994), Exception DAGs as Knowledge Structures, Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases (pp. 13-24), Seattle, WA.
Hart, A. (1984), Experience in the Use of an Inductive System in Knowledge Engineering, in M. Bramer (Ed.), Research and Developments in Expert Systems, Cambridge University Press, Cambridge.
Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, Academic Press, New York.
Imam, I.F. and Michalski, R.S. (1993a), Should Decision Trees be Learned from Examples or from Decision Rules?, Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.
Imam, I.F. and Michalski, R.S. (1993b), Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study, Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. and Zemankova, M. (Eds.), Kluwer Academic Publishers, MA.
Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques, Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.
Imam, I.F. and Vafaie, H. (1994), An Empirical Comparison Between Global and Greedy-like Search for Feature Selection, Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam, I.F. and Michalski, R.S. (1994), From Fact to Rules to Decisions: An Overview of the FRO-1 System, Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.
Kohavi, R. (1994), Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations, Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.
Kohavi, R. and Li, C. (1995), Oblivious Decision Trees, Graphs, and Top-Down Pruning, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian, O.L. and Wolberg, W.H. (1990), Cancer Diagnosis via Linear Programming, SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.
Michalski, R.S. (1973), AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition, Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.
Michalski, R.S. (1978), Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams, Technical Report No. 898, University of Illinois, Urbana, March.
Michalski, R.S. (1983), A Theory and Methodology of Inductive Learning, Artificial Intelligence, Vol. 20 (pp. 111-116).
Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains, Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.
Michalski, R.S. (1990), Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, Morgan Kaufmann Publishers, San Mateo, CA (pp. 63-111), June.
Michalski, R.S. and Imam, I.F. (1994), Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System, Lecture Notes in Artificial Intelligence, Springer-Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers, J. (1989a), An Empirical Comparison of Selection Measures for Decision-Tree Induction, Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.
Mingers, J. (1989b), An Empirical Comparison of Pruning Methods for Decision-Tree Induction, Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.
Niblett, T. and Bratko, I. (1986), Learning Decision Rules in Noisy Domains, Proceedings of Expert Systems 86, Brighton, Cambridge University Press, Cambridge.
Quinlan, J.R. (1979), Discovering Rules by Induction from Large Collections of Examples, in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan, J.R. (1983), Learning Efficient Classification Procedures and Their Application to Chess End Games, in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, Los Altos.
Quinlan, J.R. (1986), Induction of Decision Trees, Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.
Quinlan, J.R. (1987), Simplifying Decision Trees, International Journal of Man-Machine Studies, 27 (pp. 221-234).
Quinlan, J.R. (1990), Probabilistic Decision Trees, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, Morgan Kaufmann Publishers, San Mateo, CA (pp. 63-111), June.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.
Smyth, P., Goodman, R.M. and Higgins, C. (1990), A Hybrid Rule-based/Bayesian Classifier, Proceedings of ECAI-90, Stockholm, August.
Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.
Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), The MONK's Problems: A Performance Comparison of Different Learning Algorithms, Technical Report, Carnegie Mellon University, October.
Wnek, J. and Michalski, R.S. (1994), Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments, Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
Vita
Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his BSc in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his MS in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conferences, or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition.
Other two solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and the program committee of FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's PhD, titled Deriving Task-oriented Decision Structures from Decision Rules, was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
LIST OF TABLES
No TITLE Page
2-1 An example of a decision table 9
2-2 A set of training examples used to illustrate the C45 system 15
2-3 The frequency of different attributes values to different decision classes 17
2-4 The expected values of the frequency of examples in Table 2-3 17
2-5 Attribute selection criteria and their basic evaluation measure 17
2-6 The contingency tables of Mingers example 18
2-7 Mingers' results for determining the goodness of split 19
2-8 Mingers' results for comparing the total accuracy and size of decision trees
provided by different attribute selection criteria from four problems 19
2-9 Comparing the AQDT approach with EDAG and HOODG approaches 22
3-1 The available tools and the factors that affect the process of testing software 43
3-2 Calculating the disjointness of each attribute 44
3-3 Evaluation of the AQDT-2 criteria on the Quinlans example in Table 2-2 51
3-4 The data used in Mingers' first experiments 52
3-5 The performance of AQDT-2 criteria (compare with the other criteria in Table 2-6) 52
3-6 The possible ranking domains and using condition of AQDT-2 criteria 53
3-7 Comparison between decision structures and decision trees 54
4-1 Evaluation of AQDT-2 attribute selection criteria for the wind bracing problem 62
4-2 The predictive accuracy of AQ15c and AQDT-2 for the wind bracing problem 67
4-3 Evaluation of AQDT-2 attribute selection criteria for the MONK-l problem 71
4-4 The predictive accuracy of AQ15c and AQDT-2 for the MONK-l problem 73
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained
by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach
with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C45 on different problems 90
LIST OF FIGURES
No TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept Voting pattern of
Democratic Representatives 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees)
from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C4.5 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute x1 64
4-6 A decision structure without x1, with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the
assumption of 10 classification error in the training data 65
4-8 Diagrammatic visualization of decision trees learned for different
decision-making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the wind bracing problem 68
4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-1 problem 74
4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-2 problem 78
4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data 79
4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-3 problem 82
4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong. Voting-84 data learned by C4.5 and AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram shows the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram shows testing errors of AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing
the generalization degree to 1 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M. Fahmi Imam, PhD
School of Information Technology and Engineering
George Mason University, Fall 1995
Ryszard S. Michalski, Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from decision rules. The philosophy behind this research is that it is more appropriate to learn knowledge and store it in a declarative form, and then, when a decision-making situation occurs, generate from this knowledge the decision structure that is most suitable for the given decision-making situation. Learning decision structures from decision rules was first introduced by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by Imam and Michalski (1993a, b).
This approach separates the function of generating a knowledge base from the function of using the knowledge base for decision-making. The first function focuses on learning an accurate, consistent and complete concept description expressed in a declarative form. The second function is performed whenever a new decision-making situation occurs: a task-oriented decision structure is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted for solving a given decision-making situation (Imam & Michalski, 1994; Michalski & Imam, 1994).
The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from decision rules or examples. Each decision-making situation is defined by a set of parameters that controls the learning process of the AQDT-2 system. The extensive experiments on AQDT-2 show that decision structures learned by it usually outperform, in terms of accuracy and average size of the decision structures, those learned from examples by other well-known systems. The results also show that the system does not work very well with noisy data. The system is illustrated and compared using applications to artificial problems, such as the three MONK's problems (Thrun, Mitchell & Cheng, 1991) and the East-West Train problem (Michie et al., 1994). It was also applied to real-world problems of learning decision structures in the areas of construction engineering (for determining the best wind bracing design for tall buildings), medical diagnosis (for learning decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification rules for distinguishing between poisonous and non-poisonous mushrooms), and political data (for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
1.1 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge, but also to use this knowledge for decision-making. The main step in the development of systems for decision-making is the creation of a knowledge structure that characterizes the decision-making process. The form in which knowledge can be easily obtained may, however, differ from the form in which it is most readily used for decision-making. It is therefore important to identify the form of knowledge representation that is most appropriate for learning (e.g., due to ease of its modification) and the form that is most convenient for decision-making.
A simple and effective tool for describing decision processes is a decision structure, which is a directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to arrive at a decision about that object. The nodes of the structure are assigned individual tests (which may correspond to a single attribute, a function of attributes, or a relation), the branches are assigned possible test outcomes or ranges of outcomes, and the leaves are assigned a specific decision, a set of candidate decisions with corresponding probabilities, or an undetermined decision. A decision structure reduces to a familiar decision tree when each node is assigned a single attribute and has at most one parent, when the branches from each node are assigned single values of that attribute, and when leaves are assigned single, definite decisions. Thus the problem of generating a decision structure is a generalization of the problem of generating a decision tree.
Decision trees are typically generated from a set of examples of decisions. The essential characteristic of any such method is the attribute selection criterion used for choosing attributes to be assigned to the nodes of the decision tree being built. Such criteria include the entropy reduction, the gain and the gain ratio (Quinlan, 1979, 1983, 1986), the gini index of diversity (Breiman et al., 1984), and others (Cestnik & Bratko, 1991; Cestnik & Karalic, 1991; Mingers, 1989a).
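As an illustration, two of the criteria listed above, the entropy-based gain and the gini index, can be computed as follows (a self-contained sketch; the function names and the toy examples are ours, not taken from any of the cited systems):

```python
# Illustrative implementations of two attribute selection measures:
# entropy reduction (information gain) and the gini index of diversity.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of diversity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(examples, attribute, class_key="class"):
    """Entropy reduction achieved by splitting the examples on an attribute."""
    labels = [e[class_key] for e in examples]
    before = entropy(labels)
    after = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[class_key] for e in examples if e[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

examples = [
    {"color": "red", "class": "pos"},
    {"color": "red", "class": "pos"},
    {"color": "blue", "class": "neg"},
    {"color": "blue", "class": "neg"},
]
# "color" separates the classes perfectly, so its gain equals the full entropy.
assert abs(information_gain(examples, "color") - 1.0) < 1e-9
```

A tree-building method would evaluate each candidate attribute with such a measure and assign the best-scoring one to the current node.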
A decision tree or decision structure representation can be an effective tool for describing a decision process, as long as all the required tests can be performed and the decision-making situations it was designed for remain constant (e.g., in a doctor-patient example, the doctor should determine that answers for all symptoms appear in the decision tree). Problems arise when these assumptions do not hold. For example, in some situations measuring certain attributes may be difficult or costly (e.g., in the doctor-patient example, a brain or blood test is needed which is very expensive, or the tools needed are not available). In such situations it is desirable to reformulate the decision structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far away from the root). If an attribute cannot be measured at all, it is useful either to modify the structure so that it does not contain that attribute or, when this is impossible, to indicate alternative candidate decisions and their probabilities. A restructuring is also desirable if there is a significant change in the frequency of occurrence of different decisions (e.g., in the doctor-patient example, the doctor may request a decision structure expressed in a specific set of symptoms, biased to classify one or more diseases, or specifying a certain order of testing).
A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite difficult. This is because a decision structure is a procedural representation that imposes an evaluation order on the tests. In contrast, no evaluation order is imposed by a declarative representation, such as a set of decision rules. Tests (conditions) of rules can be evaluated in any order. Thus, for a given set of rules, one can usually build a huge number of logically equivalent decision structures (trees), which differ in the test ordering. Due to the lack of order constraints, a declarative representation (rules) is much easier to modify to adapt to different situations than a procedural one (a decision structure or a tree). On the other hand, to apply decision rules to make a decision, one needs to decide in which order tests are evaluated, and thus needs a decision structure.
An attractive solution to these opposite requirements is to acquire and store knowledge in a declarative form and transform it into a decision structure when it is needed for decision-making.
This method allows one to create a decision structure that is most appropriate in a given decision-making situation. Because the number of decision rules per decision class is usually small (each rule is a generalization of a set of examples), generating a decision structure from decision rules can potentially be performed much faster than generation from training examples. Thus this process could be done online, without any delay noticeable to the user. Such virtual decision structures are easy to tailor to any given decision-making situation.
This approach allows one to generate a decision structure that avoids or delays evaluating an attribute that is difficult to measure in some decision-making situation, or that fits well a particular frequency distribution of decision classes. In other situations it may be unnecessary to generate a complete decision structure; it may be sufficient to generate only the part of it that concerns the decision classes of interest. Thus such an approach has many potential advantages.
This dissertation presents a new system, called AQDT-2. The AQDT-2 system generates a task-oriented decision structure (a decision structure that is adapted to the given decision-making situation) from decision rules. The decision rules are learned by either the rule learning system AQ15 (Michalski et al., 1986) or the system AQ17-DCI, which has extensive constructive induction capabilities (Bloedorn et al., 1993).
To associate the decision rules with a given decision-making task, AQDT-2 provides a set of features, including: 1) enabling the system to include in the decision structure nodes corresponding to new attributes constructed during the process of learning the decision rules; 2) controlling the degree of generalization needed during the development of the decision structure; 3) providing four new criteria for selecting an attribute to be a node in the decision structure, which allow the system to generate many different but equivalent decision structures from the same set of rules; 4) generating unknown nodes in situations where there is insufficient information for generating a complete decision structure; 5) learning decision structures from discriminant rules as well as characteristic rules; and 6) providing the most likely decision when the decision process stops due to inability to evaluate an attribute associated with an intermediate node.
To test the methodology of generating decision structures from decision rules, an extensive set of planned experiments was designed to test different aspects of the approach. The experiments include testing different combinations of parameters for each sub-function of the approach, analyzing the relationship between decision rules and the decision structures learned from them, and comparing decision trees learned by the AQDT-2 system with those learned by C4.5 (Quinlan, 1993), the well-known system for learning decision trees from examples. Different experiments were designed to examine the new features of the AQDT approach for learning task-oriented decision structures. The experiments were applied to artificial domains as well as real-world domains, including MONK-1, MONK-2, and MONK-3 (Thrun, Mitchell & Cheng, 1991), East-West trains (Michie et al., 1994), Engineering Design (wind bracings) (Arciszewski et al., 1992), Mushrooms, Breast Cancer (Mangasarian & Wolberg, 1990), and the Congressional Voting Records of 1984. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description; MONK-2 requires learning a non-DNF-type description (one that cannot be easily described as a DNF rule using the original attributes); MONK-3 concerns learning a DNF rule from noisy data. The East-West trains dataset is a structural domain that classifies two sets of trains (Eastbound and Westbound). The Engineering Design (wind bracings) data involves learning conditions for applying different types of wind bracing for tall buildings. The Mushrooms data is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. The Breast Cancer data involves learning concept descriptions for recognizing breast cancer. The Congressional voting data includes voting records on different issues. AQDT-2 outperformed C4.5 on average with respect to both predictive accuracy and tree size for most problems. AQDT-2 did not work very well with noisy data or with problems that have many rules covering very few examples.
1.2 The Problem Statement
There are many limitations and problems that accompany the use of decision trees for decision-making.
CHAPTER 2 RELATED RESEARCH
2.1 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on earlier work by Michalski (1978), which introduced an algorithm for generating decision trees from decision lists. The method proposed several attribute selection criteria of increasing power, based on the main criterion, the order cost estimate (the nth order cost estimates, n = 1, 2, ...). Michalski also analyzed two specific criteria, MAL and DMAL, for selecting the optimal attribute for a node in the tree based on properties extracted from the decision diagram. In order to better explain the method, it is necessary to define some terms. These terms will be used later in the dissertation.
Definition 2-1: A cover of a decision class C is a set of rules whose union includes the set of all examples of class C and does not include any examples of other decision classes.
Definition 2-2: A cover of a decision class C is disjoint if all its rules are pairwise logically disjoint. In other words, for any two rules, there exists a condition with the same attribute but with different values in each rule.
Definition 2-3: A minimal cover is a disjoint cover which has the smallest number of rules among all possible covers.
Definition 2-4: A diagram for a given cover is a table constructed graphically by representing, in a two-dimensional space, all possible combinations of attribute values, locating on the diagram all the condition parts of the given rules, and marking them with the action specified by each rule.
Michalski's algorithm requires construction of a minimal cover. The minimal cover should be consistent and complete. The method is based on the fact that if there are n decision classes, any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal decision tree (any consistent decision tree should have at least n leaves). Michalski (1978) has shown that if only one rule is broken by a selected attribute, then instead of having one leaf (which could potentially represent this rule or the decision class in the tree), there will have to be at least two leaves representing this rule in the final decision tree.
The attribute selection criterion MAL, introduced in (Michalski, 1978), prefers attributes that do not break any rules, or break as few as possible. An attribute breaks a rule if the attribute can divide the rule into two or more sub-rules. Figure 2-1 shows two examples of two sets of rules. In the first, x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1v3] & [x3=2v3] and the rule [x4=3] & [x1=1v2] & [x3=1], and x3 breaks three rules: the rule [x4=2] & [x1=2] & [x3=2v3], the rule [x4=1] & [x1=3] & [x3=1v3], and the rule [x4=3] & [x1=4] & [x3=2v3]). In the diagram on the right, x1 is the only attribute that does not break any rule.
Figure 2-1: An example illustrating how attributes break rules.
One of the criteria defined by Michalski is the first degree cost estimate, which assigns to each attribute an integer equal to the number of rules broken by that attribute. This criterion is also called the static cost estimate of an attribute, or the criterion of minimizing added leaves (MAL).
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the estimated number of additional nodes in the decision tree being generated, over a hypothetical minimal decision tree. When there is a tie between two attributes, the attribute selected is the one which breaks smaller rules (rules that cover fewer examples, or more specialized rules). AQDT-2 uses an approximate version of this criterion (the attribute dominance).
Another criterion introduced by Michalski was the DMAL criterion. The DMAL criterion (Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL, but is more complex, because once an attribute is selected as a node in the tree, some rules and/or parts of the broken rules at each branch are merged into one rule. DMAL ensures that the value of the total cost estimate of an attribute is decreased by a value equal to the number of merged rules minus one.
Example: Learn a decision tree from the following decision table.
The minimal cover consists of the following rules:

A1 <- [x2=0] v [x1=0][x2=2]
A2 <- [x2=1] v [x1=2][x2=2]
A3 <- [x1=1][x2=2]

The evaluations of the MAL criterion for these attributes are: 2 for x1, 0 for x2, 5 for x3, and 5 for x4. The attribute x1 divides the two rules [x2=0] and [x2=1], so selecting it would add two leaves to the optimal number of leaves. It is clear that the attribute to be selected as the root of the decision tree is x2. Then, three branches are attached to the root node, and the decision rules are divided into subsets, each corresponding to one branch. For x2 = 0 or 1, a leaf node is generated. For x2 = 2, another attribute is selected to be a node in the tree; in this case, x1 has the minimum MAL value. Figure 2-2 shows the decision tree obtained using the MAL criterion.
Figure 2-2: A decision tree learned from the decision table in Table 2-1.
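This counting can be sketched in a few lines of Python (an illustration only; the rule encoding and the attribute domains are assumptions, since Table 2-1 is not reproduced here, and only the fact that each domain has more than one value affects the counts):

```python
# Rules of the cover above, encoded as {attribute: set of allowed values}.
# The domains below are assumed for illustration.
domains = {'x1': {0, 1, 2}, 'x2': {0, 1, 2}, 'x3': {0, 1}, 'x4': {0, 1}}
rules = [
    {'x2': {0}},                    # A1 <- [x2=0]
    {'x1': {0}, 'x2': {2}},         # A1 <- [x1=0][x2=2]
    {'x2': {1}},                    # A2 <- [x2=1]
    {'x1': {2}, 'x2': {2}},         # A2 <- [x1=2][x2=2]
    {'x1': {1}, 'x2': {2}},         # A3 <- [x1=1][x2=2]
]

def mal(attr):
    """MAL: the number of rules broken by attr.  A rule is broken when the
    attribute's allowed values under the rule (all domain values if the rule
    does not mention the attribute) span more than one branch."""
    return sum(len(rule.get(attr, domains[attr])) > 1 for rule in rules)

scores = {a: mal(a) for a in domains}   # {'x1': 2, 'x2': 0, 'x3': 5, 'x4': 5}
```

The scores reproduce the MAL evaluations quoted above (2, 0, 5, 5), and taking the attribute with the minimum score selects x2 as the root.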
2.2 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating a decision tree that classifies a set of examples according to the decision classes they belong to. The essential aspect of any inductive decision tree method is the attribute selection criterion. The attribute selection criterion measures how good the attributes are for discriminating among the given set of decision classes. The best attribute according to the selection criterion is chosen to be assigned to a node in the tree. The first algorithm for generating decision trees from examples was proposed by Hunt, Marin and Stone (1966). Hunt's algorithm uses a divide-and-conquer strategy for building decision trees. This algorithm has subsequently been modified by Quinlan (1979) and applied by many researchers to a variety of learning problems.
Attribute selection criteria can be divided into three categories: logic-based, information-based, and statistics-based. The logic-based criteria use logical relationships between the attributes and the decision classes to determine the best attribute to be a node in the decision tree; an example is the MAL criterion, minimizing added leaves (Michalski, 1978), which uses conjunction and disjunction operators. The information-based criteria are based on information theory. These criteria measure the information conveyed by dividing the training examples into subsets. Examples of such criteria include the information measure IM, the entropy reduction measure and the gain criteria (Quinlan, 1979; 1983), the gini index of diversity (Breiman et al., 1984), the gain-ratio measure (Quinlan, 1986), and others (Clark & Niblett, 1987; Bratko & Lavrac, 1987; Cestnik & Karalic, 1991). The statistics-based criteria measure the correlation between the decision classes and the attributes. These criteria use statistical distributions for determining whether or not there is a correlation. The attribute with the highest correlation is selected to be a node in the tree. Examples of statistics-based criteria include Chi-square and the G-statistic (Sokal & Rohlf, 1981; Hart, 1984; Mingers, 1989a).
Niblett and Bratko (1986), Quinlan (1987), and Bratko and Kononenko (1987) extended the method of learning decision trees to also handle data with noise (by pruning). Handling noise extended the process of learning decision trees to include the creation of an initial, complete decision tree, and tree pruning, which is done by removing subtrees with small statistical validity and replacing them by leaf nodes (Mingers, 1989b). More recently, pruning has also been used for simplifying decision trees even for problems without noise (Bohanec & Bratko, 1994). Pruning decision trees improves their simplicity, but reduces their predictive accuracy on the training examples. Quinlan (1990) also proposed a method to handle the unknown attribute-value problem by exploring the probabilities of an example belonging to different classes.
The rest of this section includes a brief description of the attribute selection criterion used by the C4.5 learning system (Quinlan, 1993); C4.5 uses an information-based criterion for selecting an attribute to be a node in the tree. The section also includes a brief description of the Chi-square method for attribute selection (Mingers, 1989a), which is a statistics-based method for selecting an attribute to be a node in the tree.
2.2.1 Building Decision Trees Using Information-based Criteria
This section presents a description of the inductive decision tree learning system C4.5. The C4.5 learning system is considered to be one of the most stable, accurate, and fastest programs for learning decision trees from examples.
Learning decision trees from examples requires a collection of examples. Each example is represented by a fixed number of attribute-value pairs. C4.5 (Quinlan, 1993) is a learning program that induces classification decision trees from a set of given examples. The C4.5 learning system is descended from the learning system ID3 (Quinlan, 1979), which is based on Hunt's method for constructing a decision tree from a set of cases (Hunt, Marin & Stone, 1966).
The C4.5 system uses an attribute selection criterion called the gain ratio. This criterion calculates the gain in classifying information based on the residual information needed to classify cases in a set of training examples and the information yielded by the test, based on the relative frequencies of the possible outcomes (decision classes). The gain ratio criterion is based on an earlier criterion used by ID3, called the gain criterion. The gain criterion uses the frequency of each decision class in the given set of training examples.
Once an attribute is chosen to be a node in the tree, the system generates as many links as the number of its values and classifies the set of examples based on these values. If all the examples at a certain node belong to one decision class, the system generates a leaf node and assigns it to that class. Otherwise, the system searches for another attribute to be a node in the tree.
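This divide-and-conquer loop can be sketched as a short recursion (a simplified illustration, not C4.5's actual code; the example representation and the pluggable `select` criterion are assumptions):

```python
def build_tree(examples, attributes, select):
    """Grow a tree by divide and conquer: emit a leaf when all examples
    share one class, otherwise branch on the attribute chosen by the
    given selection criterion and recurse on each value's subset."""
    classes = {e['class'] for e in examples}
    if len(classes) == 1:
        return classes.pop()                 # leaf: the single class
    attr = select(examples, attributes)      # e.g. maximize the gain ratio
    tree = {}
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        rest = [a for a in attributes if a != attr]
        tree[(attr, value)] = build_tree(subset, rest, select)
    return tree
```

Any of the selection criteria discussed in this section can be passed in as `select`; the recursion itself is the same for all of them.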
The Gain Criterion: The gain criterion is based on information theory; that is, the information conveyed by a message depends on its probability, and can be measured in bits as minus the logarithm (base 2) of that probability. To explain the gain criterion, suppose that for a given problem x1, ..., xn are the given attributes and C1, ..., Ck are the decision classes. Suppose S is any set of cases and T is the initial set of training cases. The frequency of class Ci in the set S is the number of examples in S that belong to class Ci:

freq(Ci, S) = number of examples in S belonging to Ci     (2-1)

Suppose that |S| is the total number of examples in S; the probability that an example selected at random from S belongs to class Ci is freq(Ci, S) / |S|.
The information conveyed by the message that a selected example belongs to a given decision class Ci is determined by -log2(freq(Ci, S) / |S|) bits.
The expected information from such a message stating class membership is given by:

info(S) = - Σi=1..k (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)  bits     (2-2)

info(S) is also known as the entropy of the set S. When S is the initial set of training examples, info(T) determines the average amount of information needed to identify the class of an example in T.
Suppose that we selected an attribute X to be the root of the tree, and suppose that X has k possible values. The training set T will be divided into k subsets, each corresponding to one of X's values. The expected information of selecting X to partition the training set T, info_X(T), can be found as the sum, over all subsets, of the information conveyed by each subset multiplied by its probability:

info_X(T) = Σi=1..k (|Ti| / |T|) info(Ti)     (2-3)

The information gained by partitioning the training examples T into subsets using the attribute X is given by:

gain(X) = info(T) - info_X(T)     (2-4)

The attribute selected is the attribute with the maximum gain value.
The Gain Ratio Criterion: This criterion indicates the proportion of information generated by the split that appears helpful for classification. Quinlan (1993) pointed out that the gain criterion has a serious deficiency: it is strongly biased toward attributes with many outcomes (values). For example, for any data that contains attributes such as a social security number, the gain criterion will select that attribute to be the root of the decision tree. However, selecting such attributes increases the size of the decision tree. Quinlan provided a solution to this problem by introducing the gain ratio criterion, which takes the ratio of the information gained by partitioning the initial set of examples T by the attribute X to the potential information generated by dividing T into n subsets.
Following steps similar to those used to obtain the information conveyed by dividing T into n subsets, the expected information generated by dividing T into n subsets, by analogy to equation 2-2, is determined by:

split info(T) = - Σi=1..n (|Ti| / |T|) log2(|Ti| / |T|)     (2-5)

The gain ratio is given by:

gain ratio(X) = gain(X) / split info(X)     (2-6)

and it expresses the proportion of information generated by the split that is useful for classification.
Example: Consider the following example, presented by Quinlan (1993). Table 2-2 shows the set of training examples.
First, determine the amount of information gained by selecting the attribute outlook to be the root of the decision tree. This attribute divides the training examples into three subsets: sunny, with five examples, two of which belong to the class "Play"; overcast, with four examples, all of which belong to the class "Play"; and rain, with five examples, three of which belong to the class "Play". To determine info(T), the average information needed to identify the class of an example in T: there are 14 training examples and two decision classes; nine of these examples belong to the class "Play" and five belong to the class "Don't Play".

info(T) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94 bits

When using outlook to divide the training examples, the information becomes:

info_outlook(T) = 5/14 (-2/5 log2(2/5) - 3/5 log2(3/5))
                + 4/14 (-4/4 log2(4/4) - 0/4 log2(0/4))
                + 5/14 (-3/5 log2(3/5) - 2/5 log2(2/5)) = 0.694 bits

By substituting in equation 2-4, the information gain resulting from using the attribute outlook to split the training examples equals 0.246. The information gain for windy is 0.048.
Figure 2-3 shows a decision tree learned for this problem using the gain criterion. The split information for outlook is determined as follows:

split info(T) = - 5/14 log2(5/14) - 4/14 log2(4/14) - 5/14 log2(5/14) = 1.577 bits

The gain ratio for outlook = 0.246 / 1.577 = 0.156.
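The hand calculations above can be verified with a short script (a sketch; the class counts are those quoted from Quinlan's weather data):

```python
from math import log2

def entropy(counts):
    """info(S), equation 2-2: expected bits to identify an example's class."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# (Play, Don't Play) counts for the whole set and for each outlook subset.
info_T = entropy([9, 5])                                        # ~0.940 bits
subsets = [[2, 3], [4, 0], [3, 2]]                              # sunny, overcast, rain
info_outlook = sum(sum(s) / 14 * entropy(s) for s in subsets)   # ~0.694 bits
gain = info_T - info_outlook                                    # ~0.246 (eq. 2-4)
split_info = entropy([5, 4, 5])                                 # ~1.577 bits (eq. 2-5)
gain_ratio = gain / split_info                                  # ~0.156 (eq. 2-6)
```

Note that the overcast subset contributes zero to info_outlook, since all four of its examples belong to one class.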
Figure 2-3: A decision tree learned using the gain criterion for selecting attributes.
The C4.5 system handles discrete as well as continuous values. To handle an attribute with continuous values, C4.5 uses a threshold to transform the continuous domain into two intervals. In other words, for each continuous attribute, C4.5 generates two branches: one where the value of that attribute is greater than the determined threshold, and the other where the value is less than or equal to the threshold.
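A minimal sketch of such a binary split search, assuming candidate thresholds are taken at the values observed in the data and scored by information gain (the function names are illustrative):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n)
                for c in (labels.count(l) for l in set(labels)) if c)

def best_threshold(values, labels):
    """Search the binary split 'value <= t' that maximizes information
    gain, trying a candidate threshold at each observed value."""
    base, n = entropy(labels), len(labels)
    best_gain, best_t = 0.0, None
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t
```

For instance, for values 1, 2, 3, 4 with classes a, a, b, b, the search picks the threshold 2, which separates the classes perfectly.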
Tree pruning in C4.5 is a process of replacing subtrees with small classification validity by leaves. The C4.5 system uses the Laplace ratio for determining the error rate of different subtrees. This ratio is defined as (e+1) / (n+2), where n is the number of training examples and e is the number of misclassified examples at a given leaf.
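The estimate is straightforward to state in code (a sketch; the function name is illustrative):

```python
def laplace_error(n, e):
    """Laplace error-rate estimate for a leaf: (e + 1) / (n + 2),
    where n = training examples at the leaf, e = misclassified ones."""
    return (e + 1) / (n + 2)

# For a leaf with 10 examples and 1 error, the estimate 2/12 is more
# pessimistic than the raw rate 1/10, which discourages keeping subtrees
# whose apparent accuracy rests on very few examples.
```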
2.2.2 Building Decision Trees Using Statistics-based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a) in building decision trees. The method uses the Chi-square statistic to measure the association between two attributes. When building decision trees, the method is implemented such that it determines the association between each attribute and the decision classes. The attribute selected is the one with the greatest value.
To determine the Chi-square value for an attribute, let aij be the number of examples in class number i for which the attribute A takes value number j. In other words, aij is the frequency of the combination of decision class number i and attribute value number j. The Chi-square value for attribute A is given by:

Chi-square(A) = Σi=1..n Σj=1..m [ (aij - Eij)^2 / Eij ]     (2-7)

where n is the number of decision classes and m is the number of values of the given attribute. Also,

Eij = (TCi * TVj) / T     (2-8)

where TCi and TVj are the total number of examples belonging to the decision class Ci and the total number of examples where the attribute A takes value vj, respectively, and T is the total number of examples.
Consider Quinlan's example in Table 2-2. Table 2-3 shows the frequencies of the different combinations of values between the decision class and both the outlook and the windy attributes. Table 2-4 shows the expected values (computed from TCi and TVj) of the frequencies in Table 2-3 for the different attribute values and decision classes.
To determine the association value between the decision classes and both the attribute windy and the attribute outlook, the observed Chi-square values are:

Chi-square(Windy, Class) = [(3-3.9)^2/3.9] + [(3-2.1)^2/2.1] + [(6-5.1)^2/5.1] + [(2-2.9)^2/2.9]
                         = 0.21 + 0.39 + 0.16 + 0.28 = 1.04

Chi-square(Outlook, Class) = [(2-3.2)^2/3.2] + [(4-2.6)^2/2.6] + [(3-3.2)^2/3.2]
                           + [(3-1.8)^2/1.8] + [(0-1.4)^2/1.4] + [(2-1.8)^2/1.8]
                           = 0.45 + 0.75 + 0.01 + 0.80 + 1.40 + 0.02 = 3.43
Applying the same method to the other attributes, the results will favor the attribute outlook. Once that attribute is selected to be a node in the tree, the remaining set of examples is divided into subsets, and the same process is repeated on each subset.
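A sketch of the computation, using exact (unrounded) expected frequencies, so the totals differ slightly from the hand-rounded arithmetic above:

```python
def chi_square(table):
    """Chi-square association between decision classes (rows) and the
    values of one attribute (columns), from a table of observed counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # Eij = TCi * TVj / T
            total += (observed - expected) ** 2 / expected
    return total

# Rows: Play, Don't Play.  Columns: the attribute's values.
windy = [[3, 6], [3, 2]]              # windy = true, false
outlook = [[2, 4, 3], [3, 0, 2]]      # sunny, overcast, rain
```

With exact expected frequencies the values come out to about 0.93 for windy and 3.55 for outlook, so outlook is again the attribute selected.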
Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5: Attribute selection criteria and their basic evaluation measures

  Info Measure (IM), Gain, G-statistic, and Gain Ratio:
      Entropy(S) = - Σi=1..k (freq(Ci, S) / |S|) log2(freq(Ci, S) / |S|)
      G-statistic = 2N * IM   (N = number of examples)
  Chi-square:
      Chi-square(A, B) = Σi=1..n Σj=1..m [ (aij - Eij)^2 / Eij ]
2.2.3 Analysis of Attribute Selection Criteria
This subsection briefly introduces the analysis of different selection criteria performed by Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision tree programs: the Information Measure (IM), Chi-square, the G-statistic, the Gini index of diversity, the Marshall correction, and the Gain Ratio. The overall results show that the Gain Ratio criterion had the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples (i.e., examples may belong to more than one decision class) to observe how the selected criteria evaluate the given attributes. The problem has two decision classes and two attributes, X and Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The training examples were unevenly spread between the two values of X. Attribute Y has three values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split provided by the six criteria. Mingers noted that the measures that are not based on information theory give radiation (attribute X here) less weight. This may be because the zero in the first row of radiation has a greater influence in the log calculation. In the case of the Chi-square criterion, the value zero adds the maximum association between any two attributes, because the Chi-square value of a zero cell is the expected value of that cell.
Now let us present results from another experiment done by Mingers. In this experiment, Mingers used four different data sets to generate decision trees using eleven different criteria. In the final results, he compared the total number of nodes and the total error rate produced by each criterion over all given problems. Table 2-8 shows the final results for five selected criteria only.

Table 2-8: Results comparing the total accuracy and size of decision trees for different attribute selection criteria on four problems.
This experiment was performed on four real-world data sets. These data are concerned with profiles of BA Business Studies degree students, recurrence of breast cancer, classifying types of Iris, and recognizing LCD display digits. The data was divided randomly: 70% for training and 30% for testing. For more details, see Mingers (1989a).
2.3 Learning Decision Structures
Considering the proposed definition of decision structures given above, two related lines of research are described in this section. Gaines (1994) and Kohavi (1994) proposed two approaches for generating decision structures that share some of the earlier ideas of Imam and Michalski (1993b).
In the first approach, Brian Gaines introduced a method for transforming decision rules or decision trees into exceptional decision structures. The method builds an Exception Directed Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning either a rule, say R0, or a conclusion, say C0, or both, to the root node. If it assigns a conclusion to the root node, it places it on a temporary conclusion list. Then it generates a new child node and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if the rule R1 does not satisfy the conclusion C0, it replaces the temporary memory with the new conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules that have common conditions with the rule at the root are evaluated. The method then creates a new child node from the root and repeats the process until all rules are evaluated. In the resulting decision structure, nodes containing only rules represent conditions common to all of their children.
The main disadvantage of this approach is that it requires discriminant rules to build such a decision structure. Also, such a structure is more complex than the traditional decision trees that are used for decision-making.
Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure. The decision structure can be read as follows: it is "Safe", except if x1=1 & x2=1 & x3=1 and either x4=3 or x5=1, then it is "Lost", except if x6=1, it is "Safe", except if x7=1, it is "Lost".
The second approach, introduced by Ronny Kohavi, learns decision structures from examples using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a decision graph where each attribute occurs at most once along any computational path. In other words, on each path from the root to the leaves of the decision structure, an attribute may occur as a node at most once; however, there may be more than one node with the same attribute in the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are partitioned into a sequence of pairwise disjoint sets, such that outgoing edges from each level terminate at the next level. An oblivious decision graph is a decision graph where all nodes at a given level are labeled by the same attribute.
Safe <- [x1=2]
Safe <- [x2=2]
Safe <- [x3=2]
Safe <- [x4=1] & [x5=2]
Safe <- [x4=1] & [x5=3]
Safe <- [x6=1] & [x7=2]
Safe <- [x6=1] & [x7=3]
Safe <- [x4=2] & [x5=2]
Safe <- [x4=2] & [x5=3]

Lost <- [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <- [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a nondeterministic method to select an attribute to build a new level of the decision structure. This attribute is removed from the data, and the data is divided into subsets, each corresponding to one combination of that attribute's values with the decision classes. For each subset, the process is repeated until all the examples of a given subset belong to one decision class. For example, suppose the selected attribute, say A, has two values (0 and 1), and there are two decision classes (C0 and C1). The data is divided into two subsets: the first subset contains the examples where A takes value 0 and belong to class C0, or A takes value 1 and belong to class C1; the second subset contains the examples where A takes value 0 and belong to class C1, or A takes value 1 and belong to class C0. The number of nodes in the first level (after the leaf nodes) is expected to be less than or equal to k^n, where k is the number of decision classes and n is the number of values of the selected attribute, and it can increase exponentially before that number is reduced exponentially to one.
Some major disadvantages of this approach are easy to identify. The average size of such decision structures is estimated to be very large, especially when there is no similarity (i.e., strong patterns) or logical relationship in the data. The time used to learn such a decision structure is relatively high compared to systems for learning decision trees from examples. Finally, it could be better to search for an attribute which reduces the number of generated subsets of the data, instead of nondeterministically selecting an attribute to build a new level of the decision structure. Kohavi provided a comparison between the C4.5 system and his system, which can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a deterministic method for attribute selection which minimizes the width of the penultimate level of the graph.
Table 2-9 shows a comparison between the proposed approach and those two approaches. The EDAG and HOODG systems are unreleased prototype systems.
CHAPTER 3 DESCRIPTION OF THE APPROACH
3.1 General Methodology
In the proposed approach, the function of learning or discovery is separated from the function of using the discovered knowledge for decision-making. The first function is performed by an inductive learning program that searches for knowledge relevant to a given class of decisions, and stores the learned knowledge in the form of decision rules. The second function is performed when there is a need for assigning a decision to new data points in the database (e.g., a classification decision), by a program that transforms the obtained knowledge into a decision structure optimized according to the given decision-making situation.
The Learning Task
Given:
- A set of training examples describing the concept to be learned.
- A learning goal, which specifies the decision classes to be learned from the training examples.
- Background knowledge to control the learning process.
Determine:
- A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given:
- A set of decision rules in conjunctive form.
- A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
- One or more examples that need to be tested under the given decision-making situation.
- A set of parameters to control the learning process.
Determine:
- A decision structure that suits the given decision-making situation.
The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17 (Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of declarative knowledge are that they do not impose any order on the evaluation of the attributes and, due to the lack of order constraints, decision rules can be evaluated in many different ways, which increases the flexibility of adapting them to the different tasks of decision-making (Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller than the number of examples per class, generating a decision structure from decision rules can potentially be done on line.
Such "virtual" decision structures can be tailored to any given decision-making situation. The needed decision rules have to be generated only once, and then they can be used many times for generating decision structures according to the changing requirements of decision-making tasks. The method uses the AQDT-2 system (Imam & Michalski, 1994) for learning decision structures from decision rules. Decision structures represent a procedural form of knowledge, which makes them easy to implement, but also harder to change. Consequently, decision structures can be quite effective and useful as long as they are used in decision-making situations for which they are optimized, and the attributes specified by the decision structure can be measured without much cost. Figure 3-1 shows the architecture of the proposed methodology.
Figure 3-1: Architecture of the AQDT approach.
It is assumed that the database is not static, but is regularly updated. A decision-making problem arises when there is a case, or a set of cases, to which the system has to assign a decision based on the knowledge discovered. Each decision-making situation is defined by a set of attribute-values; some attribute-values may be missing or unknown. A new decision structure is obtained such that it suits the given decision-making problem. The learned decision structure associates the new set of cases with the proper decisions.
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs
The decision rules are generated from examples by an AQ-type inductive learning system,
specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of
this subsection includes a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions using
the STAR methodology (Michalski 1983) The simplest algorithm based on this methodology
called AQ starts with a seed example of a given decision class and generates a set of the
most general conjunctive descriptions of the seed (alternative decision rules for the seed
example) Such a set is called the star of the seed example The algorithm selects from the star
a description that optimizes a criterion reflecting the needs of the problem domain If the
criterion is not defined the program uses a default criterion that selects the description that
covers the largest number of positive examples (to minimize the total number of rules needed)
and with the second priority that involves the smallest number of attributes (to minimize the
number of attributes needed for arriving at a decision)
If the selected description does not cover all examples of a given decision class a new seed is
selected from uncovered examples and the process continues until a complete class description
is generated The algorithm can work with few examples or with many examples and can
optimize the description according to a variety of easily-modifiable hypothesis quality criteria
The learned descriptions are represented in the form of a set of decision rules expressed in an
attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A
distinctive feature of this representation is that it employs, in addition to standard logic operators,
the internal disjunction operator (a disjunction of values of the same attribute in a condition) and
the range operator (to express conditions involving a range of discrete or continuous values).
These operators help to simplify rules involving multi-valued discrete attributes; the second
operator is also used for creating logical expressions involving continuous attributes.
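As an illustration, these two operators can be evaluated as follows. This is a hypothetical sketch, not AQ15's actual code; the names `satisfies` and `satisfies_rule`, and the representation of a condition as a pair, are invented for this example.

```python
# Hypothetical sketch of evaluating VL1-style conditions; not AQ15's code.
def satisfies(condition, example):
    """A condition is (attribute, spec): a set of values models the internal
    disjunction operator; an inclusive (lo, hi) tuple models the range operator."""
    attr, spec = condition
    value = example[attr]
    if isinstance(spec, tuple):          # range operator, e.g. [x = 2..4]
        lo, hi = spec
        return lo <= value <= hi
    return value in spec                 # internal disjunction, e.g. [x = 1 v 3]

def satisfies_rule(rule, example):
    """A rule is a conjunction of conditions; all must hold."""
    return all(satisfies(c, example) for c in rule)

# [State = northeast v northwest] & [Income = 2..4]
rule = [("State", {"northeast", "northwest"}), ("Income", (2, 4))]
print(satisfies_rule(rule, {"State": "northeast", "Income": 3}))  # True
print(satisfies_rule(rule, {"State": "south", "Income": 3}))      # False
```

Note that because the conditions impose no evaluation order, the conjunction can be checked in any attribute order, which is the flexibility exploited later when building decision structures.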
AQ15 can generate decision rules that represent either characteristic or discriminant concept
descriptions, depending on the settings of its parameters (Michalski, 1983). A characteristic
description states properties that are true for all objects in the concept. The simplest
characteristic concept description is in the form of a single conjunctive rule (in general, it can be
a set of such rules). The most desirable is the maximal characteristic description, that is, a rule
with the longest condition part, i.e., stating as many common properties of objects of the given
class as can be determined. A discriminant description states properties that discriminate a given
concept from a fixed set of other concepts. The most desirable is the minimal discriminant
description, that is, a rule with the shortest condition part. For example, to distinguish a given
set of tables from a set of chairs, one may only need to indicate that tables "have a large flat top."
A characteristic description of the tables would also include properties such as "have four legs,"
"have no back," "have four corners," etc. Discriminant descriptions are usually much shorter than
characteristic descriptions.
Another option provided in AQ15 controls the relationship among the generated descriptions
(rulesets, or covers) of different decision classes. In the IC (Intersecting Covers) mode,
rulesets of different classes may logically intersect over areas of the description space in which
there are no training examples. In the DC (Disjoint Covers) mode, descriptions of different
classes are logically disjoint. The DC-mode descriptions are usually more complex, both in the
number of rules and the number of conditions. There is also a DL mode (a Decision List mode,
also called VL mode, for variable-valued logic mode), in which the program generates rulesets
that are linearly ordered. To assign a decision to an example using such rulesets, the program
evaluates them in order. If ruleset i is satisfied by the example, then the decision is made;
otherwise, the program proceeds to the evaluation of ruleset i+1. In IC and DC modes,
rulesets can be evaluated in any order.
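The DL-mode evaluation order can be sketched as follows. This is a hypothetical illustration, not AQ15's code; the helper names are invented, and a rule is simplified to a mapping from attributes to sets of allowed values.

```python
def rule_matches(rule, example):
    # rule: {attribute: set of allowed values}, read as a conjunction
    return all(example[a] in allowed for a, allowed in rule.items())

def classify_dl(ordered_rulesets, example):
    """DL mode: try rulesets in their linear order; the first ruleset
    satisfied by the example determines the decision."""
    for decision_class, rules in ordered_rulesets:
        if any(rule_matches(r, example) for r in rules):
            return decision_class
    return None  # no ruleset matched

ordered = [("C1", [{"x": {1}}]),
           ("C2", [{"x": {1, 2}}])]   # overlaps C1's ruleset, but order decides
print(classify_dl(ordered, {"x": 1}))  # C1 (its ruleset is tried first)
print(classify_dl(ordered, {"x": 2}))  # C2
```

In IC or DC mode the same rulesets could be evaluated in any order; in DL mode the linear order itself resolves overlaps, as the x = 1 case above shows.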
Alternatively, the system can use rules from the AQ17-DCI program for Data-driven
Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for
generating additional attributes. These attributes are various logical or mathematical
combinations of the original attributes. The program generates a large number of potential new
attributes and selects from them the most promising, based on an attribute quality criterion.
To illustrate the format of rules generated by AQ15 (or AQ17-DCI), an exemplary ruleset is
shown in Figure 3-2. The ruleset (which can be re-represented as a disjunctive normal form
expression) describes a voting record of Democratic Representatives in the U.S. Congress.
Each rule is a conjunction of elementary conditions. Each condition expresses a simple relational
statement. For example, the condition [State = northeast v northwest] states that the attribute
State (of the Representative) should take the value northeast or northwest to satisfy the
condition.
R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]
Figure 3-2 A ruleset generated by AQ15 for the concept Voting pattern of Democratic Representatives
The above rules were generated from examples of the voting records. For illustration, below is
an example of a voting record of a Democratic representative:
Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on food stamp program = no, Federal help to education = no, State From = northeast, State Population = large, Occupation = unknown, Cut in social security spending = no, Federal help to Chrysler corp. = not registered
By expressing elementary statements in the example as conditions and linking the conditions by
conjunction, the examples can be re-expressed as decision rules. Thus, decision rules and
examples formally differ only in the degree of generality.
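This correspondence can be shown directly. The sketch below is hypothetical (the function name is invented): each attribute-value pair of an example becomes a single-valued condition, yielding a maximally specific conjunctive rule.

```python
def example_to_rule(example):
    # Each attribute-value pair becomes a condition allowing exactly one value,
    # so the example is re-expressed as a maximally specific conjunctive rule.
    return {attr: {value} for attr, value in example.items()}

record = {"Draft_registration": "no", "State_population": "large"}
print(example_to_rule(record))
# {'Draft_registration': {'no'}, 'State_population': {'large'}}
```

A learned rule differs only in allowing more than one value per condition (or omitting attributes entirely), which is exactly the difference in degree of generality noted above.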
3.3 Generating Decision Structures/Trees from Decision Rules
This section describes the AQDT-1 system for learning decision structures (trees) from decision
rules (Imam & Michalski, 1993ab). Also, a description of the AQDT-2 method for learning task-
oriented decision structures from decision rules is included, and finally the methodology is
illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning,
due to their simplicity. Decision trees built this way can be quite efficient, as long as they are
used in decision-making situations for which they are optimized and these situations remain
relatively stable. Problems arise when these situations change significantly and the assumptions
under which the tree was built no longer hold. For example, in some situations it may be
difficult to determine the value of the attribute assigned to some node. One would like to avoid
measuring this attribute and still be able to classify the example, if this is potentially possible
(Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure
the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also
desirable if there is a significant change in the frequency of occurrence of examples from
different classes. A restructuring of a decision tree to suit the above requirements is, however,
difficult to do. The reason for this is that decision trees are a form of decision structure
representation that imposes constraints on the evaluation order of the attributes that are not
logically necessary.
One problem in developing a method for generating decision structures from decision rules is to
design an attribute selection criterion that is based on the properties of the rules rather than of
the training examples. A decision rule normally describes a number of possible examples. Only
some of them are examples that have actually been observed, i.e., training examples. An attribute
selection criterion is needed to analyze the role of each attribute in the rules. It cannot be based
on counting the numbers of training examples covered by each attribute-value and the
frequency of decision classes in the training examples, as is done in learning decision trees from
examples, because the training examples are assumed to be unavailable.
Another problem in learning decision trees from decision rules stems from the fact that decision
rules constitute a more powerful knowledge representation than decision trees. They can directly
represent a description in an arbitrary disjunctive normal form, while decision trees can directly
represent only descriptions in the disjoint disjunctive normal form. In such descriptions, all
conjunctions are mutually logically disjoint. Therefore, when transforming a set of arbitrary
decision rules into a decision tree, one faces an additional problem of handling logically
intersecting rules.
The solution to both problems (attribute selection and logically intersecting rules) in the AQDT-2
system is based on the earlier work by Michalski (1978), which introduced a general method for
generating decision trees from decision rules. The method aimed at producing decision trees
with the minimum number of nodes or the minimum cost (where the cost was defined as the
total cost of classifying unknown examples, given the cost of measuring individual attributes and
the expected probability distribution of examples of different decision classes). More
explanations are provided in the following section.
3.3.1 The AQDT-2 attribute selection method
This section describes the AQDT-2 method for building a decision structure from decision rules.
The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples. The major difference is that it assigns tests
(attributes) to the nodes using criteria based on the properties of the decision rules (this includes
statistics about the examples covered by each rule, in the case of learning rules from examples),
rather than statistics characterizing the frequency of training examples per decision class, per
attribute-value, or per conjunction of both. Other differences are that the branches may be
assigned an internal disjunction of values (not only a single value, as in a typical decision tree),
and leaves may be assigned a set of alternative decisions with probabilities. Also, the tests can be
attributes or names standing for logical or mathematical expressions that involve several
attributes or variables. In the following, we use the terms "test" and "attribute" interchangeably
(to distinguish between an attribute and a name standing for an expression, the latter is called a
"constructed attribute").
At each step, the method chooses, from the available set of tests, the test that has the highest utility
(see below) for the given set of decision rules. This test is assigned to the node. The branches
stemming from this node are assigned test values or disjoint groups of values (in the form of a
logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each
branch is associated with a reduced set of rules, determined by removing conditions in which the
selected attribute assumes the value(s) assigned to this branch. If all rules in the reduced ruleset
indicate the same decision class, a leaf node is created and assigned this decision class. The
process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further
because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of
candidate decisions with associated probabilities (see Sec. 4.2).
The test (attribute) utility is a combination of one or more of the following elementary criteria: 1)
cost, which indicates the cost of using each attribute for making a decision; 2) disjointness, which
captures the effectiveness of the test in discriminating among decision rules for different decision
classes; 3) importance, which determines the importance of a test in the rules; 4) value
distribution, which characterizes the distribution of the test importance over its set of values; and 5)
dominance, which measures the test presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.
Disjointness: The disjointness of a test is defined as the sum of the class disjointness, that is, the
disjointness of the test for each decision class. Suppose decision classes are C1, C2, ..., Cm, and
decision rulesets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote the
sets of the values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm,
respectively. If a ruleset for some class, say Ck, contains a rule that does not involve test A, then
Vk is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is
the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for
Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is
defined by:

    D(A, Ci, Cj) = 0, if Vi ⊇ Vj
                   1, if Vi ⊂ Vj
                   2, if Vi ∩ Vj ≠ ∅, and Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj        (3-1)
                   3, if Vi ∩ Vj = ∅
where ∅ denotes the empty set. Note that exchanging the second and third conditions of equation
(3-1) may seem to yield an improved criterion. However, it would not clearly distinguish between the two
cases (i.e., for both situations the disjointness would be similar). The current equation is better
because it gives higher scores to attributes that classify different subsets of the two decision
classes than to attributes that classify only a subset of one decision class.
Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the
sum of the degrees of class disjointness of each decision class:
    Disjointness(A) = Σ(i=1..m) D(A, Ci),  where  D(A, Ci) = Σ(j=1..m, j≠i) D(A, Ci, Cj)    (3-2)
The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are
all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test
values. If two tests have the same disjointness value, the attribute to be selected is the one with
the smaller number of values.
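A direct reading of Equations (3-1) and (3-2) can be sketched as follows. This is hypothetical code, not the AQDT-2 implementation; `value_sets` maps each class to the set Vi of the test's values appearing in its ruleset.

```python
def degree_of_disjointness(vi, vj):
    """Equation (3-1): degree of disjointness D(A, Ci, Cj) from the value
    sets Vi and Vj of test A in the rulesets of classes Ci and Cj."""
    if vi >= vj:            # Vi is a superset of (or equal to) Vj
        return 0
    if vi < vj:             # Vi is a proper subset of Vj
        return 1
    if vi & vj:             # sets intersect but neither contains the other
        return 2
    return 3                # disjoint value sets

def disjointness(value_sets):
    """Equation (3-2): sum over all ordered pairs of distinct classes."""
    return sum(degree_of_disjointness(value_sets[ci], value_sets[cj])
               for ci in value_sets for cj in value_sets if ci != cj)

print(disjointness({"C1": {1, 2}, "C2": {3, 4}}))  # 6: fully disjoint sets
print(disjointness({"C1": {1, 2}, "C2": {2, 3}}))  # 4: partial overlap
```

For two classes the possible totals are 0, 1, 4, and 6, matching the per-pair values used later in the proof of Theorem 2.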
Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree
is defined as the average number of tests (attributes) to be examined from the root of the tree to
any leaf node in order to reach a decision.
Definition: A decision structure is a one-node-per-level decision structure if at each level there is
only one node and zero or more leaves.
Such a decision structure can be generated by combining into one branch all branches whose
associated sets of decision rules belong to more than one decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two
decision classes. The disjointness criterion ranks first the attributes that add the minimum number of
tests to the decision tree.
Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci
and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not a
subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the
same set of values in both classes is a trivial one). Consider that branches leading to a subset
with the same decision class are combined into one branch. In the first case, there will be only two
branches. The first leads to a leaf node, and the other leads to an intermediate node where
another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three
branches should be created. Two branches lead to leaf nodes, where all values at each branch
belong to only one, and a different, decision class. The third branch leads to an intermediate node
where another attribute should be selected that further classifies the decision classes. The minimum
ANT in this case is 6/4. In the third case, only two branches will be generated, where each leads
to a leaf node with a different decision class. In this case, the minimum ANT is 1.
Figure 3-4 shows decision trees equivalent to selecting the corresponding attribute. Note that in
case more than one attribute-value occurs at branches leading to leaves belonging to one decision
class, they will be combined into one branch in the decision structure. The symbol "1" means
that an attribute is needed to classify the two decision classes. In such cases, there will be at least
two additional paths.
[Figure: Venn diagrams of the three cases, with D(A, Ci) = 0, D(A, Cj) = 1; D(A, Ci) = 2, D(A, Cj) = 2; and D(A, Ci) = 3, D(A, Cj) = 3]
Figure 3-3: Venn diagrams of possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is determined
in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks
highly the attributes that reduce the average number of tests required for decision-making. The
theorem can be proved in the general case.
[Figure: decision trees for the three cases, with ANT = 3/2, ANT = 5/3, and ANT = 1]
"1" means at least one attribute is needed to complete the decision tree.
Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify
the decision classes.
Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes,
A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship
between any two decision classes, this means that there are more decision classes where D(A,
Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness
for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj)
than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci,
Cj), then B classifies the decision classes better than A.
For each pair of decision classes Ci and Cj, the possible values of the disjointness of any attribute
are 0, 1, 4, or 6. For all positive values of D(B) = 1, 4, or 6, it is clear that attribute B should have a
smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) <
D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.
Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is
a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the
importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained
rules, each test is assigned a score that represents the total number of training examples that
are covered by the rules involving this test. Decision rules learned by an AQ learning program
are accompanied by information on their strength. Rule strength is characterized by t-weight
and u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of
that class covered by the rule. The importance score of a test is the aggregation of the total-
weights of all rules that contain that test in their condition part. Given a set of decision rules for
m decision classes C1, ..., Cm, involving n tests A1, ..., An, and with the number of rules associated
with class Ci denoted by ri, the importance score is defined as follows:
Definition 3-3: The importance score IS(Aj) of the test Aj is determined by

    IS(Aj) = Σ(i=1..m) IS(Aj, Ci)                                   (3-3.1)

where

    IS(Aj, Ci) = Σ(k=1..ri) Rik(Aj)                                 (3-3.2)

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by

    Rik(Aj) = t-weight, if Aj belongs to rule Rik; 0, otherwise     (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
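Definition 3-3 amounts to a simple aggregation, sketched below. This is hypothetical code (the function name and data layout are invented); each rule carries its t-weight, and the score of a test is the sum of the t-weights of the rules mentioning it.

```python
def importance_scores(rulesets):
    """IS(Aj) per Definition 3-3: for each test, sum the t-weights of all
    rules whose condition part mentions that test, across all classes."""
    # rulesets: {class: [(conditions, t_weight), ...]},
    # conditions: {attribute: set of allowed values}
    scores = {}
    for rules in rulesets.values():
        for conditions, t_weight in rules:
            for attr in conditions:
                scores[attr] = scores.get(attr, 0) + t_weight
    return scores

rulesets = {"C1": [({"x1": {1}, "x2": {0}}, 5)],   # rule covering 5 examples
            "C2": [({"x1": {2}}, 3)]}              # rule covering 3 examples
print(importance_scores(rulesets))  # {'x1': 8, 'x2': 5}
```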
The importance score method has been compared separately, as a feature selection method, with
a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method
produced an equal or higher accuracy on three real-world problems than that reported for the
GA method, while selecting fewer attributes. In addition, the IS method was significantly faster
than the GA method.
Value distribution: The third elementary criterion, value distribution, concerns the number of
legal values of tests. Given two tests with equal importance scores, this criterion prefers the test
with the smaller number of legal values. Experiments have shown that this criterion is especially
useful when using discriminant decision rules.
Definition 3-4: A value distribution VD(Aj) of a test Aj is defined by

    VD(Aj) = IS(Aj) / vj                                            (3-5)

where vj is the number of legal values of Aj.
Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large
numbers of rules, as this indicates their high relevance for discriminating among the rulesets of
the given decision classes. Since some conditions in the rules have values linked by internal
disjunction, counting such rules directly would not reflect their relevance properly. Therefore,
for computing the dominance, the rules are counted as if they were converted to rules that do not
have internal disjunction. Such a conversion is done by multiplying out the condition parts of
the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is
multiplied out to two rules with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
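The multiplying-out step can be sketched as follows. This is hypothetical code (the function names are invented); rule conditions are again represented as sets of allowed values per attribute.

```python
from itertools import product

def multiply_out(rule):
    """Expand internal disjunctions: a rule {attr: value set} becomes the
    list of single-valued rules obtained by taking one value per attribute."""
    attrs = list(rule)
    return [dict(zip(attrs, combo))
            for combo in product(*(sorted(rule[a]) for a in attrs))]

def dominance(rules, attr):
    """Count expanded rules whose condition part contains the test."""
    return sum(len(multiply_out(r)) for r in rules if attr in r)

rule = {"x3": {1, 3}, "x4": {1}}      # [x3=1 v 3] & [x4=1]
print(multiply_out(rule))              # [{'x3': 1, 'x4': 1}, {'x3': 3, 'x4': 1}]
print(dominance([rule], "x3"))         # 2
```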
The above criteria are combined into one general test measure using the lexicographic
evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of
the above elementary criteria, each associated with a "tolerance threshold" in percentage. The
criteria are applied to tests in the order defined by LEF. A test passes to the next criterion only if
it scores on the previous criterion within the range defined by the tolerance (from the top value).
The default LEF is:

    <Cost, t1; Disjointness, t2; Importance, t3; Value distr., t4; Dominance, t5>    (3-6)

where t1, t2, t3, t4, and t5 are tolerance thresholds (in percentages); their default values are 0.
The default value of the cost of each test is 1.
The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost.
If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two
or more attributes still share the same top score, or their scores differ by less than the assumed
tolerance threshold t2, the method evaluates these attributes using the next (importance)
criterion. If again two or more attributes share the same top score, or their scores differ by less than
the tolerance threshold t3, then the next criterion, the normalized IS (value distribution), is used,
and then, similarly, the fourth criterion (dominance). If there is still a tie, the method selects the
best attribute randomly.
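The tolerance-based filtering can be sketched as follows. This is a hypothetical rendering of the LEF idea, not the AQDT-2 code; each criterion is modeled as a scoring function plus a tolerance, with cost minimized and the other criteria maximized.

```python
def lef_select(attributes, criteria):
    """Keep only attributes scoring within `tol` percent of the best on each
    criterion in turn; stop as soon as a single attribute remains."""
    # criteria: list of (score_fn, tol_percent, maximize)
    pool = list(attributes)
    for score_fn, tol, maximize in criteria:
        scores = {a: score_fn(a) for a in pool}
        best = max(scores.values()) if maximize else min(scores.values())
        margin = abs(best) * tol / 100.0
        if maximize:
            pool = [a for a in pool if scores[a] >= best - margin]
        else:
            pool = [a for a in pool if scores[a] <= best + margin]
        if len(pool) == 1:
            return pool[0]
    return pool[0]  # still tied after all criteria: pick arbitrarily

cost = {"A": 1, "B": 1, "C": 2}
disj = {"A": 4, "B": 6, "C": 6}
criteria = [(cost.get, 0, False),   # cost first, minimized
            (disj.get, 0, True)]    # then disjointness, maximized
print(lef_select(["A", "B", "C"], criteria))  # B
```

With zero tolerances this reduces to strict lexicographic ranking; a positive tolerance lets near-ties survive to be resolved by later criteria.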
If there is a non-uniform frequency distribution of examples of different classes, then the
selection criterion uses a modified definition of the disjointness. Namely, the previously defined
disjointness for each class is multiplied by the frequency of the class occurrence. The class
occurrence is the expected number of future examples that are to be classified to a given class:
    Disjointness(A) = Σ(i=1..m) D(A, Ci) · Frq(Ci)                  (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, and it is assumed to be given
by the user. The attribute ranking criterion in this case is defined by the LEF:
    <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other
elementary criteria are treated the same way as in the default version.
3.3.2 The AQDT-2 algorithm
The AQDT-2 algorithm constructs a decision structure from decision rules by recursively
selecting, at each step, the best test according to the ranking criteria described above and
assigning it to a new node. The process stops when the algorithm creates terminal branches that
are assigned decision classes. To facilitate this process, the system creates a special data
structure for each concept description (ruleset). This structure has fields such as the number of
rules, the number of decision classes, and the number of attributes present in the rules. A set of
pointers connects this data structure to a set of data structures, each representing one decision class.
The decision-class structure contains fields with information on the number of rules belonging to
that class, the frequency of the decision class, etc. It is also connected to a set of data structures
representing the decision rules within each decision class. The system independently creates a set
of data structures, each corresponding to one attribute. Each attribute description contains the
attribute's name, domain, type, the number of legal values, a list of the values, the number of
rules that contain that attribute, and the values of that attribute for each rule. The attributes are
arranged in an array in lexicographic order: first, in descending order of the number of rules
that contain that attribute, and second, in ascending order of the number of the attribute's
legal values.
The system can work in two modes. In the standard mode, the system generates standard
decision trees, in which each branch has a specific attribute-value assigned. In the compact
mode, the system builds a decision structure that may contain:
A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this
leads to simpler structures. For example, if a node assigned attribute A has a branch marked by
values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The
program creates "or" branches on the basis of the analysis of the value sets Vi while computing
the degree of attribute disjointness.
B) nodes that are assigned derived attributes, that is, attributes that are certain logical or
mathematical combinations of the original attributes. To produce decision structures with
derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The
AQ17 rules may contain conditions involving attributes constructed by the program, rather than
those originally given.
To generate decision structures from rules, the AQDT-2 method prefers either characteristic or
discriminant disjoint rule descriptions (given by an expert or learned by a system). Disjoint rules
are more suitable for building decision structures. Assume that the description of each class is in
the form of a ruleset and that this set is the initial ruleset context. The AQDT-2 algorithm is:
The AQDT-2 Algorithm
Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking
measure. Select the highest-ranked attribute. Let A represent this highest-ranked
attribute.
Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch) and
assign to it the attribute A. In standard mode, create as many branches from the node as
there are legal values of the attribute A, and assign these values to the branches. In
compact mode (decision structures), create as many branches as there are disjoint value
sets of this attribute in the decision rules, and assign these sets to the branches.
Step 3: For each branch, associate with it a group of rules from the ruleset context that contain a
condition satisfied by the value(s) assigned to this branch. For example, if a branch is
assigned value i of attribute A, then associate with it all rules containing the condition [A =
i v ...]. If a branch is assigned values i v j, then associate with it all rules containing the
condition [A = i v j v ...]. Remove these conditions from the rules. If there are rules in
the ruleset context that do not contain attribute A, add these rules to all rule groups
associated with the branches stemming from the node assigned attribute A. (This step is
justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming
that a and b are the only legal values of y.) All rules associated with the given branch
constitute a ruleset context for this branch.
Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf
node and assign to it that class. If all branches of the tree have leaf nodes, stop.
Otherwise, repeat steps 1 to 4 for each branch that has no leaf.
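Steps 1-4 can be sketched in standard mode as follows. This is a much simplified, hypothetical rendering, not the AQDT-2 implementation: `select_attribute` stands in for the LEF ranking of Section 3.3.1, and a rule is a pair of a condition dictionary and a decision class.

```python
def build_tree(rules, select_attribute):
    """Recursive sketch of Steps 1-4 (standard mode)."""
    classes = {cls for _, cls in rules}
    if len(classes) == 1:                 # Step 4: single class -> leaf
        return classes.pop()
    attr = select_attribute(rules)        # Step 1: LEF-ranked attribute
    values = set().union(*(cond.get(attr, set()) for cond, _ in rules))
    node = {}
    for v in values:                      # Step 2: one branch per value
        reduced = []
        for cond, cls in rules:           # Step 3: reduce the ruleset context
            if attr not in cond:          # consensus law: rule joins every branch
                reduced.append((cond, cls))
            elif v in cond[attr]:         # condition satisfied: remove it
                rest = {a: s for a, s in cond.items() if a != attr}
                reduced.append((rest, cls))
        node[v] = build_tree(reduced, select_attribute)
    return (attr, node)

rules = [({"x": {1}}, "C1"), ({"x": {2}, "y": {0}}, "C2")]
tree = build_tree(rules, lambda rs: "x")   # always pick "x" for this toy case
print(tree)  # ('x', {1: 'C1', 2: 'C2'})
```

In compact mode, the branches would carry disjoint value sets instead of single values; that variant is omitted here for brevity.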
To select an attribute to be a node in the decision tree (steps 1 and 2 of the algorithm), the
algorithm performs two independent iterations. In the first iteration, it parses through all decision
rules and determines information about each attribute. This information includes the importance
score of each attribute, the number of rules containing a given attribute, the disjoint value sets of
each attribute, and the attribute values used in describing each decision class. The second
iteration is performed only if the disjointness criterion is ranked first in the LEF function. The
second iteration evaluates the attribute's disjointness for each decision class against the other
decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions
formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is
the total number of decision rules (in all decision classes):

    r = Σ(i=1..m) Ri    (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be
determined as:

    Cmpx(Iter1) = O(r · s)
In the second iteration, the disjointness is calculated, for all attributes, between the decision
classes. The complexity of the second iteration can be given by:

    Cmpx(Iter2) = O(n · m)

Assume that, at each node, l is the maximum of the number of decision classes to be classified at
this node and the number of rules associated with this node:

    l = max {m, r}                                                  (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm
for building one node in the decision tree, say the node complexity NC(AQDT), is given by:

    NC(AQDT) = O(l · n)

Usually, l equals the number of rules associated with the given node. Thus, the AQDT
complexity for building one node is a function of the number of attributes multiplied by the
number of rules associated with this node.
At each level of the decision tree, these two iterations are repeated for each non-leaf node. The
level complexity of the AQDT algorithm, LC(AQDT), can be given by:

    LC(AQDT) < O(l · n)

which is less than the complexity of generating the root of the decision tree. To explain this,
consider that the maximum possible number of non-leaf nodes at one level is half the number of the
initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at
the lowest level each node classifies only two rules, each belonging to a different decision class. This
decision tree could be a multi-valued tree. However, the number of non-leaf nodes at each level
will be twice or more the number of non-leaf nodes at the previous level. Consider the level
complexity of the AQDT algorithm to be equal to (l · s · o), where o is the number of non-leaf
nodes at the given level. In such cases, either (l · o ≤ r) or (l · s < r). In Figure 3-5-a, the
complexity of the AQDT algorithm at any lower level is given by:

    LC(AQDT) = O(2 · s · r/2) < O(n · l) = NC(AQDT)
a) per one level b) per one path
Figure 3 -5 Decision trees showing the maximum number of none leaf nodes
Note also that after selecting an attribute to be the root of the decision structure, this attribute
and all conditions containing it are removed from the data structure of the algorithm.
Also, if a leaf node is generated, all rules belonging to the corresponding branch will not be
tested again.
Since the disjointness criterion selects the attribute which minimizes the average number of tests
(ANT), the AQDT algorithm generates decision trees with the least number of levels.
The number of levels per decision tree is supposed to be less than or equal to the minimum of
both the number of attributes and the number of rules. Consider k as the number of levels in a
given decision tree:
k ≤ min {n, r} (3-10)
There are two cases representing the most complex situations, Figure 3-5-a and 3-5-b. In the first
case, where the decision rules were divided evenly, the number of levels will be a function of the
logarithm of the number of rules. In such a case, the complexity of the AQDT algorithm for
generating a decision tree from a set of decision rules is given by
Complexity(AQDT) = O(l · n · log r) (3-11)
The other situation is when the generated decision tree has the maximum number of levels. The
maximum possible number of levels per decision tree equals one less than the number of
decision rules, Figure 3-5-b. Using the disjointness criterion it is not likely to get such a decision
tree, because it has the maximum average number of tests (ANT) that can be determined from the
same set of nodes and leaves. However, such a decision tree can be generated if the number of
decision classes is one less than the number of attributes. In such a case, any disjoint decision rules
should have a maximum length that is less than or equal to the floor of the logarithm of the number of
attributes. Thus the level complexity of this decision tree is estimated as
LC(AQDT) = O(l · log n)
The maximum number of levels in such a decision tree is k-1. Thus the complexity of the AQDT
algorithm in such cases is given by
Complexity(AQDT) = O(l · k · log n) (3-12)
Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12)
r = n + 1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT
algorithm is determined by
Cmplx(AQDT) = O(r · k · log l) (3-13)
3.3.3 An example illustrating the algorithm
The following simple example illustrates how AQDT is used in selecting the optimal set of
testing resources for testing software. Suppose there are three tools for testing software: 1)
modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four
different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the
metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of tool
(automated, semi-automated or manual) (x4). Table 3-1 shows these attributes and their possible
values.
Suppose that the domain expert provided a set of rules to be used in testing software. Figure 3-6
shows a sample of these rules in AQ15c format.
Table 3-1 The available tools and the factors that affect the process
T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]
Figure 3-6 Decision rules for selecting the best tool for testing software
These rules can be interpreted as:
Rule 1: Use the first tool for testing if you need average cost and the tool is
supported by the requirement metric.
Rule 2: Use the first tool for testing if you can afford high cost for testing either
in the requirement or the analysis phases and you need an automated tool.
Rule 3: Use the second tool for testing if the cost limit is low or average and the
tool is supported by the system usage metric or the intractability metric.
Rule 4: Use the second tool for testing if you can afford high cost for testing
either in the requirement or the design phases and you need a manual tool.
Rule 5: Use the third tool for testing if you are limited to low cost and the
tool is supported by the error rate metric.
Rule 6: Use the third tool for testing if you can afford very-high cost for testing
either in the requirement or the system usage phases and you need a semi-automated
tool.
Table 3-2 presents information on these rules and the disjointness values for all attributes. For
each class, the row marked "Values" lists values occurring in the ruleset for this class. For
evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not
contain attribute A is characterized as having an additional condition [A = a v b v ...], where a, b, ...
are all legal values of A.
The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1
has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume
the tolerances for each elementary criterion equal 0.
From the rules in Figure 3-6 we can also determine disjoint groupings of attribute-values used
for the compact mode of the algorithm. This is done as follows: 1) determine for each attribute the
sets of values that the attribute takes in individual decision rules, and 2) remove those value sets
that subsume other value sets. The remaining value sets are assigned to branches stemming from
the node marked by the given attribute. For example, x1 has the following value sets in the
individual decision rules: {2}, {3}, {1, 2}, {1} and {4} (Figure 3-6). Value set {1, 2} is
removed, as it subsumes {2} and {1}. In this case branches are assigned individual values of the
domain of x1. For attribute x2 the value sets are {1}, {2}, {3, 4} and {1, 2, 3, 4}. In this case
branches are assigned value sets {1}, {2} and {3, 4}.
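The grouping step just described can be sketched as follows (an illustrative helper, not the AQDT-2 implementation; the function name and set-based rule representation are assumptions):

```python
def branch_value_sets(value_sets):
    """Given the value sets an attribute takes in individual rules,
    drop every set that strictly contains (subsumes) another value set;
    the survivors label the branches under that attribute's node."""
    sets = [frozenset(v) for v in value_sets]
    kept = []
    for s in sets:
        # remove s if it strictly contains some other value set
        if any(other < s for other in sets):
            continue
        if s not in kept:
            kept.append(s)
    return kept

# Example from the text: x1 occurs with value sets {2}, {3}, {1,2}, {1}, {4};
# {1,2} subsumes {2} and {1}, so branches get the individual values.
print(branch_value_sets([{2}, {3}, {1, 2}, {1}, {4}]))
# For x2 the sets are {1}, {2}, {3,4}, {1,2,3,4}; branches get {1}, {2}, {3,4}.
print(branch_value_sets([{1}, {2}, {3, 4}, {1, 2, 3, 4}]))
```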
Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the
tree. Four branches are created, each one corresponding to one of x1's possible values. Since all
rules containing [x1=4] belong to class T3, the branch marked by 4 is ended by a leaf, T3. Rules
containing other values of x1 belong to more than one class. This process is repeated for each
subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned
by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in
making decisions on which tools can be used for testing a given software system.
Figure 3-7 A decision structure learned for classifying software testing tools (root: x1; complexity: 4 nodes, 7 leaves)
Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows
the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells,
each representing one combination of attribute-values. Attributes and their legal values are shown on
scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3 and 4).
Rules are represented by collections of cells in the intersection of the rows and columns
corresponding to the conditions in the rules.
The shaded areas correspond to decision rules. Rules of the same class have the same type of
shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For
illustration, collections of cells corresponding to some of the initial rules are marked R11, R21,
R31 and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the
first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1]
& [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] &
[x4=3].
a) Decision rules b) Derived decision tree
Figure 3-8 Diagrammatic visualization of the decision rules and the derived decision tree
Comparing the diagrams in Figure 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more
general description of concepts T1, T2 and T3 than the original rules. Let us assume that it is
very costly to know which metrics support the required tools. In other words, suppose that we
would like to select the best tools independently of the metric they support (this is indicated to
AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to
the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision
without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or
T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be
either T1 or T2. However, for the value 3 of x1, one can make a specific decision after
measuring attribute x4: for the value 1 of x4 the tool is T1, and for the value 2 of x4 the
recommended tool is T2. Figure 3-9-b shows another decision tree, where the type of the tool
was ignored from the data. Those decision trees are called indeterminate, because some of their
leaves are assigned a disjunction of two or more class names.
a) Ignoring the supporting metric b) Ignoring the type of the tool
Figure 3-9 Decision trees learned ignoring the support metric and the type of the testing tool
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now
suppose that the cost attribute x1, which was determined as the highest ranked attribute, cannot
be measured. The algorithm selects x4 to be the root of the new decision tree. After continuing
the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate
situations in which it is possible to make a specific decision without knowing the value of
attribute x1.
Figure 3-10 A decision tree learned without the cost attribute (root: x4)
3.4 Tailoring Decision Structures to the Decision-Making Situation
Decision structures are among the simplest structures for organizing a decision-making process.
A decision structure specifies explicitly the order in which attributes of an object or situation
need to be evaluated in the process of determining a decision. A standard way to generate a
decision structure is to learn it from examples of decisions. Such a process usually aims at
obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of
assigning correct decisions to given situations. There can usually be a large number of logically
equivalent decision structures (Michalski, 1990). As such, they may have the same predictive
accuracy, but differ in the way they organize the decision process, and thus may differ in the cost
of arriving at a decision. To minimize the average decision cost, one needs to take into
consideration the distribution of the costs of attribute evaluation and the frequency of different
decisions. This report presents an approach to building such task-oriented decision structures
which advocates that they are built not from examples, but rather from decision rules. Decision
rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are
specified by an expert. An efficient algorithm was developed and implemented in a new system, AQDT-2, that transforms
decision rules into task-oriented decision structures. The system is illustrated by applying it to the
problem of learning decision structures in the area of construction engineering (for determining
the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other
programs applied to the same data.
Decision-making situations can vary in several respects. In some situations, complete
information about a data item is available (i.e., values of all attributes are specified); in others the
information may be incomplete. To reflect such differences, the user specifies a set of parameters
that control the process of creating a decision structure from decision rules. AQDT-2 provides
several features for handling different decision-making problems: 1) generating a decision
structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown"
leaves in situations where there is insufficient information for generating a complete decision
structure; 3) providing the user with the most likely decision when performing a required test is
impossible; 4) providing alternative decisions with an estimate of the likelihood of their
correctness when the needed information cannot be provided.
3.4.1 Learning Cost-Dependent Decision Structures
As described in Sec. 2, the LEF criterion enables the system to take into consideration the cost of
measuring tests (attributes) in developing a decision structure. In the default setting of LEF, the
test's cost is the first criterion and its tolerance is 0. This means that only the least expensive
attributes pass to the next step of evaluation involving other elementary criteria. If an attribute
has high cost or is impossible to measure (has an "infinite" cost), the LEF chooses another,
cheaper attribute if possible.
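The default LEF behavior described here can be sketched as a filtering cascade; the function names, scoring functions, tolerance handling and attribute data below are illustrative assumptions, not the AQDT-2 implementation:

```python
def lef_select(attributes, criteria):
    """Lexicographical Evaluation Functional sketch: each (score_fn, tolerance)
    pair filters the candidates in turn; score_fn is maximized, and a candidate
    survives if it is within `tolerance` (a fraction) of the best score.  With
    the cost criterion first and tolerance 0, only the cheapest attributes
    reach the later criteria (e.g. disjointness)."""
    candidates = list(attributes)
    for score_fn, tolerance in criteria:
        best = max(score_fn(a) for a in candidates)
        candidates = [a for a in candidates
                      if score_fn(a) >= best - tolerance * abs(best)]
        if len(candidates) == 1:
            break
    return candidates[0]

# Hypothetical attributes: (name, cost, disjointness)
attrs = [("x1", 1, 11), ("x2", 3, 9), ("x3", 1, 7), ("x4", 2, 10)]
# Cost is minimized (negated score); tolerance 0 keeps only the cheapest.
pick = lef_select(attrs, [(lambda a: -a[1], 0.0), (lambda a: a[2], 0.0)])
print(pick[0])  # x1: cheapest, and highest disjointness among the cheapest
```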
3.4.2 Assigning Decisions Under Insufficient Information
In decision-making situations in which one or more attributes cannot be measured, the system
may not be able to assign a definite decision for some cases. If no more information can be
obtained but a decision has to be made, it is useful to know the probability distribution for
different candidate decisions (Smyth, Goodman & Higgins, 1990). The most probable decision is
then chosen. The probability distribution can be estimated from the class frequency at the given
node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of
attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m,
at that node, given that an example to be classified has attribute-values assigned to branches b1,
b2, ..., bk in the decision structure. Using the Bayesian formula, we have
P(Ci | b1, ..., bk) = P(Ci) · P(b1, ..., bk | Ci) / P(b1, ..., bk) (3-9)
where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the
example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate
these probabilities, let us suppose that wi is the number of training examples of class Ci that
passed the tests leading to this node, and twi the total number of training examples of class
Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class
probabilities correspond to the frequency of training examples from different classes, we have
P(Ci) = twi / (tw1 + ... + twm) (3-10)
P(b1, ..., bk | Ci) = wi / twi (3-11)
P(b1, ..., bk) = (w1 + ... + wm) / (tw1 + ... + twm) (3-12)
By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain
P(Ci | b1, ..., bk) = wi / (w1 + ... + wm) (3-13)
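Formula (3-13) reduces the estimate to the counts wi alone; a minimal sketch with hypothetical counts (the function name and class labels are assumptions):

```python
def node_distribution(w):
    """Estimate P(Ci | b1,...,bk) at a node from wi, the number of training
    examples of each class Ci that passed the tests leading to the node:
    wi divided by the sum of all wj (formula 3-13)."""
    total = sum(w.values())
    return {c: wi / total for c, wi in w.items()}

# Hypothetical counts at a node reached by branches b1,...,bk:
probs = node_distribution({"C1": 6, "C2": 3, "C3": 1})
print(probs)                    # {'C1': 0.6, 'C2': 0.3, 'C3': 0.1}
print(max(probs, key=probs.get))  # C1 -- the most probable decision
```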
A related method for handling the problem of unavailability of an attribute is described by
Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision
at the node associated with such an attribute. The method does not restructure the decision tree
appropriately to fit the given decision-making situation (in this case, to avoid measuring x1).
AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given
decision-making situation. An example is presented in Section 4.2.
3.4.3 Coping with Noise in Training Data
The proposed methodology can be easily extended to handle the problem of learning from noisy
training data. This is done based on ideas of rule truncation described in (Niblett & Bratko, 1986;
Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data,
the decision rules with t-weight below a certain threshold (reflecting the expected noise level) are
removed. The rule truncation method seems to result in more accurate decision structures than
the decision tree pruning method, because truncation decisions are based solely on the importance of
the given rule or condition for the decision-making, regardless of their evaluation order (unlike
decision tree pruning, which can only prune attributes within a subtree and thus cannot
freely choose attributes to prune). Examples are presented in Section 4.
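A minimal sketch of the truncation step (the rule representation, field names and threshold are illustrative assumptions):

```python
def truncate_rules(rules, min_t_weight):
    """Drop rules whose t-weight (number of training examples the rule
    covers) falls below a threshold reflecting the expected noise level."""
    return [r for r in rules if r["t"] >= min_t_weight]

# Hypothetical ruleset with t-weights, in the spirit of AQ15c output:
rules = [{"class": "C1", "cond": "[x1=1][x6=1]", "t": 18},
         {"class": "C1", "cond": "[x1=5][x2=2]", "t": 2},
         {"class": "C4", "cond": "[x1=5][x2=2][x3=2]", "t": 1}]
print(len(truncate_rules(rules, min_t_weight=2)))  # 2: the t=1 rule is removed
```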
3.5 Analysis of the AQDT-2 Attribute Selection Criteria
This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The
analysis is done using two different problems. Both problems assume that the best attribute to be
selected is known, and they test whether or not a given attribute selection criterion will rank that
attribute first. The first problem was introduced by Quinlan in 1993. The problem has four
attributes (see Table 2-2) and two decision classes. The best attribute to be selected is
"Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two
attributes and two decision classes. The problem has ambiguous examples (i.e., examples
belonging to more than one decision class).
In the first problem, the following are disjoint rules learned by AQ15c from the given data:
Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity <= 75]
Play <= [outlook = rain] & [windy = false]
For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in
Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for different attributes of the given
data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over
the other attributes.
Table 3-4 shows the set of examples used in the second experiment. Note that the representation
of the data here is different from the representation used by Mingers. The attribute X is better for
building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of
selecting the correct attribute to be the root of the tree. The goal of this experiment is first to
select the correct attribute, and then to test how the given criterion may evaluate the two
attributes. In Mingers' experiment, all criteria used preferred the first attribute (X) over the
second (Y). However, the gain ratio criterion gave X a higher score than all other criteria did.
Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers'
first problem. The criteria were tested when applied to both the examples and the rules learned by
AQ15c. The ambiguity parameter in AQ15c was set to "Pos". That is, when learning decision
rules for class C, all ambiguous examples belonging to C are considered as examples that belong
to C only.
The table shows that the disjointness criterion outperformed all other criteria in Table 2-6,
including the gain ratio criterion, when it was applied to both the original examples and the rules
learned from these examples. It was clear that neither the importance score nor the value
distribution criterion would perform better in the case of evaluating the training examples. This is
because the two criteria depend on the relationship between the attributes and the decision rules,
which cannot be measured from examples. When these criteria were applied to the learned rules, they
provided very good results.
The disjointness criterion selects the attribute that best discriminates between the decision
classes. The importance criterion gives the highest score to the attribute that appears in rules
covering the largest number of examples. The value distribution criterion ranks first the attribute
which has the maximum balanced appearance of its values in different rules. The dominance
criterion prefers attributes that exist in large numbers of elementary rules. Table 3-6 shows the
possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best
ranked first.
In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of
an attribute, and R is the total number of elementary rules.
3.6 Decision Structures vs. Decision Trees
This subsection introduces a comparison between the decision structures proposed in this thesis
and traditional decision trees. Even though systems for learning decision structures may be more
complex and may take more time to generate decision structures from examples, they have
many other advantages. Table 3-7 shows a comparison between the two approaches.
Table 3-7 A comparison between Decision Structures and Decision Trees
Another important issue is that simpler decision trees are not necessarily equivalent to simpler
decision structures. For example, consider the two decision structures in Figure 3-11. Both
structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes.
This decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the
decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent
to a decision tree with 37 nodes.
a) Using the disjointness criterion (root: x5; P: Positive, N: Negative; no. of nodes: 5)
b) Using the importance score criterion (root: x1; P: Positive, N: Negative; no. of nodes: 7, no. of leaves: 9)
Figure 3-11 Decision structures learned by AQDT-2 using different criteria
To show another advantage of learning decision structures (trees) from decision rules rather than
from examples, I created an example, called "Imam's example", that represents a class of
problems in which the information gain criteria of decision tree learning programs do not work
properly. The basic idea behind this example rests on the fact that the information-based
criteria depend on the frequency of the training examples per decision class and the frequency
of the training examples over different values of a given attribute.
The concept to learn is: P if x1 = x2, and N otherwise.
The known training examples are shown in Figure 3-12-a, where "+" means the example belongs
to class "P" and "-" means the example belongs to class "N". Figure 3-12-b shows the correct
decision tree to be learned. As the reader can see, the number of examples per class is 12,
and the frequency of the training examples per value of x1 and x2 (the most important
attributes) is 6. However, the frequencies of the training examples per value of x3 and
x4 are different. This combination makes it difficult for criteria using information gain to select
either x1 or x2. When C4.5 was run with different window settings, it was not able to select either
x1 or x2 as the root of the decision tree.
a) Training examples b) The optimal decision tree
Figure 3-12 The Imam's example: a problem where learning decision structures (trees) from rules is better than learning them from examples
AQ15c learned the following rules from this data:
P <= [x1=1][x2=1] v [x1=2][x2=2]
N <= [x1=1][x2=2] v [x1=2][x2=1]
From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
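The failure mode can be checked with a direct information-gain computation. The sketch below uses only the 2x2 core of the equality concept (one example per x1/x2 combination), not the full training set of the figure: the class is perfectly balanced within every value of x1 and x2, so their gain is exactly zero.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Information gain of `attr`: class entropy minus the weighted
    average entropy within each value of the attribute."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(e[attr] for e in examples):
        subset = [l for e, l in zip(examples, labels) if e[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Concept: P if x1 = x2, N otherwise (one example per x1/x2 combination)
examples = [{"x1": a, "x2": b} for a in (1, 2) for b in (1, 2)]
labels = ["P" if e["x1"] == e["x2"] else "N" for e in examples]
print(info_gain(examples, labels, "x1"))  # 0.0 -- x1 alone looks useless
print(info_gain(examples, labels, "x2"))  # 0.0 -- so does x2
```

This is the classic parity effect: each attribute is informative only jointly with the other, which rule-based learning captures but a one-attribute-at-a-time gain criterion does not.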
An example of problems in which decision trees may not be an efficient way to represent
knowledge is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the
given data. The decision rules learned from this data are:
P <= [x1=2] v [x2=2]
N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]
Using different measures of complexity, it is clear that for such a problem learning decision trees
directly from examples is not an efficient method. Examples of these measures are: 1) comparing
the number of nodes in the decision tree to the number of examples (10/9); 2) comparing the
average number of tests required to make a decision for the decision tree and for the
decision rules; 3) comparing the number of nodes to the number of conditions (10 nodes and 10
conditions).
When learning a decision tree from rules learned with constructive induction, a decision tree
with three nodes can be determined, using the new attribute "x1=2 v x2=2" with values 0 for "no"
and 1 for "yes".
a) The training data b) The correct decision tree
Figure 3-13 An example where decision rules are simpler than decision trees
CHAPTER 4 Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different
problems, using different sizes of training data and applying different settings of the system's
parameters. For comparison, it also presents results from applying a well-known decision tree
learning system (C4.5) to the same problems. This section also includes some analysis and
visualization of the concepts learned by AQ15c and AQDT-2.
The experiments are applied to the following problems: MONK-1, MONK-2, MONK-3,
Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer,
Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned
with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type
description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily
described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from
noisy data. The Engineering Design dataset involves learning conditions for applying different
types of wind bracings for tall buildings. Mushrooms is concerned with learning classification
rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer
involves learning concept descriptions for recognizing breast cancer. Congressional Voting
Records describes the voting records of Republican and Democratic US senators in 1984. The
East-West Trains problem characterizes eastbound and westbound trains using structural representation.
In order to determine the learning curve, the system was run with different relative sizes of the
training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each
problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training,
that is, for learning a concept description. The remaining examples in each case were used for
testing the obtained descriptions, to determine their prediction accuracy.
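This sampling protocol can be sketched as follows (illustrative only; the function and parameter names are assumptions, and the learner itself is omitted):

```python
import random

def learning_curve_splits(examples, fractions=range(10, 100, 10),
                          samples=100, seed=0):
    """Yield (fraction, training set, testing set) triples: for each
    relative size 10%..90%, draw `samples` random training sets and
    use the complementary examples for testing."""
    rng = random.Random(seed)
    for pct in fractions:
        k = round(len(examples) * pct / 100)
        for _ in range(samples):
            train = rng.sample(examples, k)
            test = [e for e in examples if e not in train]
            yield pct, train, test

# With 100 available examples this produces 9 * 100 = 900 train/test pairs,
# matching the 900 training samples and 900 complementary testing samples.
splits = list(learning_curve_splits(list(range(100)), samples=100))
print(len(splits))  # 900
```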
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided
into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind
Bracing problems) was used to test and analyze the approach. The second set of problems
(Mushroom, Breast Cancer, Congressional Voting, and the East-West Trains) was used for
additional testing and comparison with other systems.
Figure 4-1 shows a description of the planned experiments done on the first set of problems. The
best settings (best path from top to bottom) in terms of accuracy, time and complexity were used as
default settings for experiments in the second set of problems. Each path from the top of the graph
to the bottom represents a single experiment. For each path, the experiment was repeated over 900
times with different sets of different sizes of training examples.
Figure 4-1 Design of a complete experiment
For each of these experiments, the testing examples were selected as the complementary set of the
training examples. Other experiments were performed where the learning system AQ17 was used
instead of AQ15c. Analysis of some experiments included visualization of the training examples
and the concept learned by each learning program (AQ15c, AQDT-2 and C4.5). Also, the different
decision structures learned for different decision-making situations were visualized, as were different
but equivalent decision structures learned for a given set of training examples.
For each problem (i.e., database), 9 different relative sample sizes of training examples are selected (10%, ..., 90%).
100 random samples of each size are drawn from the original data for training; the 100 samples which remain from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing).
162 different parametrical experiments per training dataset (18 x 9).
16,200 experiments per sample size (9 sample sizes); 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2); 199,800 experiments per problem (first portion + C4.5 + constructive induction); 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.).
73 days (estimated running time).
The following subsection includes a complete experimental analysis of the wind bracing problem.
Each subsection following that will describe a partial or full experimental analysis of one of the other
problems.
4.2 Experiments with Average-Size, Complex and Noise-Free Problems: Wind Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for
determining the structural quality of a tall building design. The quality of the design is partitioned
into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is
characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3),
number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of
horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly
selected to serve as training examples and 115 (34%) were used for testing the obtained decision
structure. In the first phase, the training examples were used to determine a set of decision rules. This
was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules
obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents values
of the four elementary criteria for each attribute occurring in the rules, for the step of determining
the root of the decision structure. For each class, the row marked "values" lists values occurring in
the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the
ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b
v ...], where a, b, ... are all legal values of A.
Decision class C1: 1 [x1=1][x6=1][x2=1..2][x3=1..2][x4=1..3][x5=1..2][x7=1..3] (t:18, u:18) 2 [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1..3][x7=1,3,4] (t:3, u:3) 3 [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2..3] (t:2, u:2) 4 [x1=1][x6=1][x2=2][x3=1..2][x4=3][x5=1..2][x7=4] (t:2, u:2) 5 [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1..2] (t:2, u:2) 6 [x1=1][x3=1][x6=1][x2=2][x4=1..3][x7=1..3][x5=3] (t:2, u:2) 7 [x1=2][x5=2][x2=1][x6=1][x3=1..2][x4=3][x7=4] (t:2, u:2)
Decision class C2: 1 [x1=2..4][x2=1..2][x3=1..2][x4=3][x5=2..3][x6=1][x7=2..3] (t:28, u:19) 2 [x1=2..4][x2=2][x3=1..2][x4=3][x5=1..2][x6=1][x7=3..4] (t:17, u:6) 3 [x1=2..4][x2=1..2][x3=1..2][x4=3][x5=1][x6=1][x7=3..4] (t:10, u:4) 4 [x1=1,3,5][x2=1..2][x3=1..2][x4=3][x5=3][x6=1][x7=2..4] (t:10, u:2) 5 [x1=3..5][x2=1..2][x3=1..2][x4=3][x5=2..3][x6=1][x7=1..4] (t:9, u:4) 6 [x1=2][x2=1..2][x3=1..2][x5=1..3][x4=1][x6=1][x7=1] (t:7, u:6) 7 [x1=3..4][x2=2][x3=2][x4=1..3][x5=1..3][x6=1][x7=1..2] (t:6, u:4) 8 [x1=3..5][x2=2][x3=1][x7=1][x4=1..2][x5=1..3][x6=1..3] (t:5, u:5) 9 [x1=1][x2=1][x6=1][x3=1..2][x4=3][x5=1..2][x7=4] (t:4, u:4) 10 [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1..2][x7=1..3] (t:4, u:4) 11 [x1=1..2][x2=1][x6=1][x3=1..2][x4=1..3][x5=3][x7=1..4] (t:4, u:2)
Decision class C3: 1 [x1=2..5][x2=1..2][x3=1..2][x7=1..4][x4=1..2][x5=1..3][x6=2..4] (t:41, u:32) 2 [x1=1..4][x2=1..2][x3=1..2][x4=2][x5=2][x6=2..3][x7=2..4] (t:27, u:20) 3 [x1=1..3][x2=1][x3=1..2][x7=1..4][x4=2][x5=1..2][x6=2..3] (t:19, u:6) 4 [x1=1..4][x2=1..2][x3=1..2][x4=2][x5=2..3][x6=3..4][x7=1] (t:13, u:8) 5 [x1=5][x2=2][x4=2][x5=2][x3=1..2][x6=3][x7=2..4] (t:5, u:5)
Decision class C4: 1 [x1=5][x2=2][x3=2][x4=1..3][x5=1][x6=1][x7=1..4] (t:4, u:4) 2 [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t:1, u:1)
Figure 4-2 Decision rules determined by AQ15c from the wind bracing data
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single
highest and all other attributes are beyond the tolerance threshold, no other attributes are
considered). Branches stemming from the root are marked by values of x6 (in general, it could be
groups of values) according to the way they occur in the decision rules; groups subsumed by
other groups are removed (Imam & Michalski, 1993b). The branches are assigned subsets of the
rules containing these values. The process repeats for a branch until all rules assigned to each
branch are of the same class. That class is then assigned to the leaf.
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2
(using the default LEF). The structure was evaluated on the testing examples. The prediction
accuracy was 88.7% (102 testing examples were classified correctly and 13 were misclassified).
Since all rules containing [X6=4] belong to class C3 the branch marked by 4 is ended by a leaf C3
Rules containing [x6=1] belong to more than one class In this case the first three criteria are
recalculated only for those rules which contain [X6=l] as one of their conjunctions In this
example Xl has the highest importance score so it was selected to be a node in the structure This
process is repeated for each subset of rules until the decision structure is completed
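The recursive procedure just described can be sketched as follows. This is a simplified illustration, not the actual AQDT-2 implementation: `separation_score` is a hypothetical stand-in for the disjointness and other LEF criteria of Chapter 3, and rules are represented as (class, {attribute: allowed-values}) pairs, where an unmentioned attribute allows any value.

```python
def matches(conds, attr, value, domains):
    # A rule that does not mention `attr` allows any of its values.
    return value in conds.get(attr, domains[attr])

def separation_score(rules, attr, domains):
    # Crude stand-in for the disjointness criterion: count attribute values
    # whose matching rules all belong to a single decision class.
    score = 0
    for v in domains[attr]:
        classes = {cls for cls, conds in rules if matches(conds, attr, v, domains)}
        if len(classes) == 1:
            score += 1
    return score

def build_structure(rules, domains):
    # Recurse until all rules on a branch belong to one class (a leaf).
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:
        return classes.pop()
    best = max(domains, key=lambda a: separation_score(rules, a, domains))
    branches = {}
    for v in domains[best]:
        subset = [r for r in rules if matches(r[1], best, v, domains)]
        sub_domains = {a: vs for a, vs in domains.items() if a != best}
        branches[v] = build_structure(subset, sub_domains) if subset else None
    return (best, branches)

# A toy rule set in the spirit of the MONK-1 rules: positive iff x5 = 1.
rules = [("P", {"x5": {1}}),
         ("N", {"x1": {1}, "x5": {2, 3, 4}}),
         ("N", {"x1": {2}, "x5": {2, 3, 4}})]
domains = {"x1": {1, 2}, "x5": {1, 2, 3, 4}}
tree = build_structure(rules, domains)
```

Here x5 separates the classes perfectly, so it is selected as the root and every branch immediately ends in a leaf, mirroring how [x6=4] produced an immediate C3 leaf above.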
For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples) and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some of the misclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against 115 testing examples, only 97 examples were classified correctly and 18 were mismatches.
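The windowing loop described above can be sketched generically. This is a hedged illustration of the idea only, not Quinlan's actual implementation; the `learn` and `classify` callables are hypothetical placeholders for any tree learner and its classification routine.

```python
import random

def window_training(examples, learn, classify, initial_size, max_rounds=10):
    # Windowing sketch: learn from a random window, test on the remaining
    # examples, grow the window with the misclassified ones, and repeat
    # until no example outside the window is misclassified.
    window = list(random.sample(examples, initial_size))
    model = learn(window)
    for _ in range(max_rounds):
        missed = [e for e in examples
                  if e not in window and classify(model, e) != e[1]]
        if not missed:
            break
        window.extend(missed)
        model = learn(window)
    return model

# Toy usage: a "learner" that simply memorizes (features, label) pairs.
random.seed(0)
examples = [((i,), "A" if i < 5 else "B") for i in range(10)]
learn = lambda w: dict(w)
classify = lambda m, e: m.get(e[0])
model = window_training(examples, learn, classify, initial_size=3)
```

With the memorizing toy learner, every example outside the initial window is missed once, pulled into the window, and learned on the next round.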
Complexity: No. of nodes: 17; No. of leaves: 43
Figure 4-3 A decision tree learned by C4.5 for the wind bracing data
Figure 4-4 shows a decision structure learned with the default setting of the AQDT-2 parameters from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision tree was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the indefinite "?" decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distribution.
Complexity: No. of nodes: 5; No. of leaves: 9
Figure 4-4 A decision structure learned from AQ15c wind bracing rules
Complexity: No. of nodes: 6; No. of leaves: 8
Figure 4-5 A decision structure that does not contain attribute x1
Figure 4-6 presents a decision structure from Figure 4-5 in which leaves were assigned candidate decisions with decision class probability estimates. Let us consider node x2. The example frequencies were w1=31, tw1=45; w2=11, tw2=139; w3=0, tw3=169; and w4=5, tw4=5. Using equation (11), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be approximated as P(C1) = .66, P(C2) = .23, P(C3) = 0, and P(C4) = .11.
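The quoted estimates are consistent with taking each class's share of the w frequencies at the node (31 + 11 + 0 + 5 = 47). The sketch below reproduces those numbers under that assumption; the exact form of equation (11) may involve the tw values as well.

```python
def class_probabilities(weights):
    # Estimate class probabilities at a node as each class's share of the
    # example frequencies (w values) gathered at that node. Assumed reading
    # of equation (11), chosen because it reproduces the quoted numbers.
    total = sum(weights)
    return [w / total for w in weights]

# Frequencies for classes C1..C4 at node x2, from the text.
probs = class_probabilities([31, 11, 0, 5])
```

Rounding the result to two decimals gives the .66, .23, 0, .11 distribution cited for node x2.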
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that rules whose combined t-weight represented 10% or less of the coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
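The truncation step can be sketched as a per-class filter on t-weights. This is an assumed reading of the procedure (the text does not specify whether the threshold is applied per rule or to groups of rules); rule names here are illustrative.

```python
def truncate_rules(rules_by_class, threshold=0.10):
    # Drop, within each decision class, the rules whose t-weight is at or
    # below `threshold` of the class's total training-example coverage.
    kept = {}
    for cls, rules in rules_by_class.items():
        total = sum(t for _, t in rules)
        kept[cls] = [(name, t) for name, t in rules if t > threshold * total]
    return kept

# Illustrative t-weights in the spirit of the class C2 rules of Figure 4-2.
pruned = truncate_rules({"C2": [("r1", 28), ("r2", 17), ("r3", 4)]})
```

Here the class total is 49, so the 10% cutoff is 4.9 and rule r3 (t-weight 4) is removed, which is how low-coverage rules disappear before the structure of Figure 4-7 is built.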
Complexity: No. of nodes: 5; No. of leaves: 7
Figure 4-6 A decision structure without x1, with candidate decisions assigned to leaves
Complexity: No. of nodes: 3; No. of leaves: 5
Figure 4-7 A decision structure learned from the wind bracing rules after pruning all rules covering less than 10% of the training examples per decision class
To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the cost of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified, using only the four attributes which were used in building the initial decision trees. The visualization diagram uses different shades for different decision classes. Another shade is used to mark cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.
Figure 4-8 Diagrammatic visualization of decision trees learned for different decision-making situations for the wind bracing data (white means the system cannot produce a decision without the missing attribute)
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>; see Table 4-2) were selected for the experiments with Subsystem II.
These experiments were performed for four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the wind bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules. Each value in this table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training example set.
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 with different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs with different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the numbers of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
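Under one plausible reading of the generalization degree, a node stops expanding when the examples of all non-majority classes amount to no more than that fraction of the examples at the node. The sketch below encodes this reading; it is an assumption about the stopping rule, not the published AQDT-2 code.

```python
def can_generalize(class_coverage, degree=0.10):
    # Stop expanding a node and assign the majority class when examples of
    # all other classes are at most `degree` of the examples at the node.
    # Assumed interpretation of the generalization-degree parameter.
    total = sum(class_coverage.values())
    majority = max(class_coverage.values())
    return (total - majority) <= degree * total

# With the default 10% degree, a 95/5 split closes the node early,
# while an 80/20 split forces further expansion.
early_stop = can_generalize({"C1": 95, "C2": 5})
keep_going = can_generalize({"C1": 80, "C2": 20})
```

Lowering the degree to 3%, as the experiments below do, makes the stopping rule stricter, so fewer nodes are generalized away.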
[Figure 4-9 here: four panels, <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>, plotting predictive accuracy against the relative sample size (%) of the training data for AQDT-2 and AQ15c.]
Figure 4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem
Figure 4-10 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve means the predictive accuracy obtained with the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.
[Figure 4-10 here: two panels, <Disj, Char> and <Intr, Char>, plotting predictive accuracy against the relative sample size (%) of the training data.]
Figure 4-10 Analyzing different parameter settings of AQDT-2 using the wind bracing data
[Figure 4-11 here: three panels plotting predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]
Figure 4-11 A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data
4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1
This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no).
The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus, the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c for the MONK-1 problem. Table 4-3 shows a comparison between the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.
[Figure 4-12 here: DIAV visualization diagram over attributes x1..x6.]
Figure 4-12 A visualization diagram of the MONK-1 problem
The AQDT-2 program, running in its default mode and with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion is ranked first), produced a decision tree with 41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem.
Positive rules: 1. [x5=1]; 2. [x1=3][x2=3]; 3. [x1=2][x2=2]; 4. [x1=1][x2=1]
Negative rules: 1. [x1=1][x2=2,3][x5=2..4]; 2. [x1=2][x2=1,3][x5=2..4]; 3. [x1=3][x2=1,2][x5=2..4]
Figure 4-13 Decision rules learned by AQ15c for the MONK-1 problem
The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value T when the value of x1 equals the value of x2, and takes the value F otherwise. These rules were:
Pos <= [x5=1] v [x1=x2]    and    Neg <= [x5≠1] & [x1≠x2]
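These two rules can be checked by enumerating the full event space of the six attributes. The sketch below does that count; it assumes values are coded 1..n as in the attribute descriptions above (red = 1 for x5, and x1, x2 share the shape coding).

```python
from itertools import product

# Domain sizes of x1..x6: head-shape, body-shape, is-smiling,
# holding, jacket-color, has-tie.
domains = [3, 3, 2, 3, 4, 2]
events = list(product(*[range(1, n + 1) for n in domains]))

# MONK-1 target concept, as captured by the AQ17-DCI rules:
# positive iff jacket-color is red (x5 = 1) or head-shape = body-shape.
positive = [e for e in events if e[4] == 1 or e[0] == e[1]]
```

The enumeration confirms the 432 possible examples mentioned earlier and shows the concept covers exactly half of the event space, consistent with the balanced 62/62 training sample.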
Table 4-3 Evaluation of the attribute selection criteria for the MONK-1 problem
From these rules, the system produced the compact decision structure presented in Figure 4-15b. It should be noted that the decision structures in Figures 4-14, 4-15a, and 4-15b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15a).
Complexity: No. of nodes: 13; No. of leaves: 28 (P = Positive, N = Negative)
Figure 4-14 The decision tree for the MONK-1 problem generated by AQDT-2
Complexity (a): No. of nodes: 5; No. of leaves: 7. Complexity (b): No. of nodes: 2; No. of leaves: 3. (P = Positive, N = Negative)
a) Compact decision structure for the AQ15 rules   b) Compact decision structure for the AQ17 rules
Figure 4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem
Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments involved running AQ15c for a set of learning problems with 18 different parameter settings (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Char, Disj, 10> and <Char, Intr, 1>) were selected for the experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from the decision rules.
Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.
[Figure 4-16 here: four panels, <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>, plotting predictive accuracy against the relative sample size (%) of the training data for AQDT-2 and AQ15c.]
Figure 4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem
Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs with different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve means the predictive accuracy obtained with the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree
is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
[Figure 4-17 here: two panels, <Disj, Char> and <Intr, Char>, plotting predictive accuracy against the relative sample size (%) of the training data for the default and modified pruning/generalization settings.]
Figure 4-17 Analyzing different parameter settings of AQDT-2 with the MONK-1 data
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.
[Figure 4-18 here: three panels plotting predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]
Figure 4-18 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem
4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2
The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using its original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (values are octagonal, square, or round); x2, body-shape (values are octagonal, square, or round); x3, is-smiling (values are yes or no); x4, holding (values are sword, flag, or balloon); x5, jacket-color (values are red, yellow, green, or blue); and x6, has-tie (values are yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all the possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.
[Figure 4-19 here: DIAV visualization diagram over attributes x1..x6.]
Figure 4-19 A visualization diagram of the MONK-2 problem
Experiments with Subsystem I: The two settings that gave the best results in terms of predictive accuracy were the same as for the other problems (<Char, Disj, 10> and <Char, Intr, 1>). They were selected for the experiments with Subsystem II. Table 4-5 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training examples.
Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
[Figure 4-20 here: four panels, <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>, plotting predictive accuracy against the relative sample size (%) of the training data for AQDT-2 and AQ15c.]
Figure 4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem
Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs with different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default curve means the predictive accuracy obtained with the default setting of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
[Figure 4-21 here: two panels, <Disj, Char> and <Intr, Char>, plotting predictive accuracy against the relative sample size (%) of the training data for the default and modified pruning/generalization settings.]
Figure 4-21 Analyzing different parameter settings of AQDT-2 with the MONK-2 data
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.
[Figure 4-22 here: three panels plotting predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]
Figure 4-22 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem
4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3
MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK's problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded areas and the plus signs in the unshaded areas are considered noisy examples, i.e., examples that were assigned the wrong decision class.
[Figure 4-23 here: DIAV visualization diagram over attributes x1..x6.]
Figure 4-23 A visualization diagram of the MONK-3 problem
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training data sets of the given size. Each of these runs was tested with testing examples that represented the complement of the training example set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I (the learning process) were fixed, and selected parameters of Subsystem II (the decision-making process) were changed. The results reported from each experiment were calculated as the average of 100 runs with different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.
[Figure 4-24 here: four panels, <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, and <Intr, Disc, 1>, plotting predictive accuracy against the relative sample size (%) of the training data for AQDT-2 and AQ15c.]
Figure 4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem
[Figure 4-25 here: two panels, <Disj, Char> and <Intr, Char>, plotting predictive accuracy against the relative sample size (%) of the training data for the default and modified pruning/generalization settings.]
Figure 4-25 Analyzing different parameter settings of AQDT-2 with the MONK-3 data
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample: since each run is tested on the complement of the training set, the same absolute number of errors yields a much smaller error rate when testing against 90% of the data than when testing against only 10% of it. These curves do not represent the learning curve.
[Figure 4-26 here: three panels plotting predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]
Figure 4-26 Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem
4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer
The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number, 2) Clump Thickness, 3) Uniformity of Cell Size, 4) Uniformity of Cell Shape, 5) Marginal Adhesion, 6) Single Epithelial Cell Size, 7) Bare Nuclei, 8) Bland Chromatin, 9) Normal Nucleoli, and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).
In this experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare the decision trees learned by AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.
As before, the drop in the predictive accuracy at some sample sizes is due to the fact that the testing data is not fixed for each sample: the same absolute number of errors yields a much smaller error rate when testing against 90% of the data than when testing against only 10% of it. These curves do not represent the learning curve.
[Figure 4-27 here: three panels plotting predictive accuracy, tree complexity, and learning time against the relative size of the training examples (%).]
Figure 4-27 Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem
4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification
Learning from the mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes: 1) cap-shape, 2) cap-surface, 3) cap-color, 4) bruises, 5) odor, 6) gill-attachment, 7) gill-spacing, 8) gill-size, 9) gill-color, 10) stalk-shape, 11) stalk-root, 12) stalk-surface-above-ring, 13) stalk-surface-below-ring, 14) stalk-color-above-ring, 15) stalk-color-below-ring, 16) veil-type, 17) veil-color, 18) ring-number, 19) ring-type, 20) spore-print-color, 21) population, and 22) habitat.
To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare the decision trees learned by AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.
In this problem C4S produces better accuracy with more complex decision trees (almost twice the
size of decision trees generated by AQDJ2) while taking slightly more time to produce such trees
The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning time is about the same.
The reason that there is a drop in the predictive accuracy at some sample sizes is that the testing data is not fixed for each sample. In other words, one error may represent a 1.1% error rate when testing against 90% of the data, and the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
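The arithmetic behind this effect can be made explicit (a hypothetical sketch, not part of the original experiments): with a fixed number of misclassified examples, the error rate scales inversely with the size of the test set.

```python
def error_rate(num_errors, total_examples, test_fraction):
    """Error rate (%) when a fixed number of misclassified examples
    is measured against a test set holding `test_fraction` of the data."""
    test_size = total_examples * test_fraction
    return 100.0 * num_errors / test_size

# One misclassified example out of an 810-example dataset:
# measured against 90% of the data the rate is small (about 0.14%),
# measured against 10% of the data the same error weighs 9x more.
large_test = error_rate(1, 810, 0.90)
small_test = error_rate(1, 810, 0.10)
```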
[Figure: three panels plotting predictive accuracy (%) and tree size against the relative size of training examples (%) for AQ15c, AQDT-2, and C4.5 on the mushroom problem.]
Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem
4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains
Learning task-oriented decision structures from structural data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes: eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes, and the load of the car was described by two attributes.
The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To
describe the train problem in a format suitable for AQDT-2, a set of eight (8) attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To encode the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij), where the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car, and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
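The two-digit (ij) coding can be sketched as follows. This is a hypothetical re-implementation for illustration only; the attribute values shown are invented, and only the labeling scheme follows the text.

```python
def encode_train(cars):
    """Flatten a variable-length train (a list of per-car attribute-value
    lists) into a list of (attribute_label, value) pairs. Attribute x_ij
    is the j-th attribute of the i-th car (i = car position, j = 1..8)."""
    example = []
    for i, car in enumerate(cars, start=1):       # car position i
        for j, value in enumerate(car, start=1):  # attribute number j
            example.append((f"x{i}{j}", value))
    return example

# A two-car train: trains with fewer cars simply yield shorter examples.
train = encode_train([["long", "rect"], ["short", "oval"]])
```

Here `train[2]` is `("x21", "short")`: the first attribute of the second car, matching the convention that the first digit names the car and the second names the attribute.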
Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1-4)
Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure classified 19 trains (out of 20) correctly. The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.
Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three cars or more (14/14). In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were
given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other six trains correctly using a flexible matching method (Michalski et al., 1986).
[Figure: decision structures learned for the different decision-making situations.]
a) Decision structure learned using only descriptions of Car 1
b) Decision structure learned using only descriptions of Car 2
c) Decision structure learned using only descriptions of Car 3
Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations
4.9 Experiments With Small, Simple, and Noisy Problems: Congressional Voting Records (1984)
In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the second half in the other class).
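The default window rule stated above can be written out explicitly (a sketch of the rule as described here, not C4.5's actual source code):

```python
import math

def default_window(num_examples):
    """C4.5-style initial window size: the larger of 20% of the number
    of examples and twice the square root of the number of examples."""
    return max(0.20 * num_examples, 2.0 * math.sqrt(num_examples))

# For the 216-example Congressional Voting data:
# max(0.20 * 216, 2 * sqrt(216)) = max(43.2, ~29.4) = 43.2
w = default_window(216)
```

For small datasets the square-root term dominates; the 20% term takes over once the dataset is large enough (here, above 100 examples).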
Table 4-8 and Figures 4-30a and b show the results graphically for the Congressional Voting-1984 problem. The results indicate that AQDT-2-generated decision trees had a higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of AQDT-2's trees with the change in the size of the training example set was smaller.
Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data
[Figure: two panels plotting predictive accuracy (%) and number of nodes against the relative size of the training examples (%).]
a) Accuracy of the decision tree as a function of the size of the set of training examples
b) Size of the decision tree as a function of the size of the set of training examples
Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2
4.10 Analysis of the Results
This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to
illustrate the relationship between concepts represented by decision rules and concepts represented by a decision tree learned from these rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5, and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others, than for another type of cover), the best cover is determined according to the best width of the beam search and the best rule type.
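The first heuristic can be sketched as a small selection function (hypothetical code written for this illustration; the 2% tolerance is the one stated above):

```python
def pick_beam_width(results, tolerance=2.0):
    """Given a mapping {beam_width: predictive_accuracy_percent},
    prefer the smallest width whose accuracy is within `tolerance`
    percentage points of the best observed accuracy."""
    best = max(results.values())
    candidates = [w for w, acc in results.items() if best - acc < tolerance]
    return min(candidates)

# Width 5 is only 1.2 points below the best (92.2), so the smaller
# beam width is preferred over widths 10 and 20.
chosen = pick_beam_width({5: 91.0, 10: 91.8, 20: 92.2})
```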
Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics
It was clear that AQDT-2 works better with characteristic rules than with discriminant rules. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between
the average predictive accuracies of the two systems is within ±2%, the predictive accuracy is considered to be the same, otherwise the predictive accuracy is considered higher or lower; 2) if the average learning times are within ±0.1 seconds, the learning time is considered the same.
Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes comparing the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).
Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same/X means similar performance of both systems, with AQDT-2 slightly better if X=A and C4.5 slightly better if X=C
Some conclusions can be drawn from this comparison. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, whereas C4.5 produces smaller but less accurate decision trees. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of the decision trees learned by C4.5 grows relatively quickly as the amount of training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons for this are that AQDT-2 over-generalizes the decision rules and that C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be much less than that of C4.5. However, on some data sets it takes more time, because there are some
situations where there is not enough information to reach a decision and the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not implemented yet.
To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.
Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class. The white areas represent non-positive coverage.
Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. All marked shaded cells indicate false positive errors (AQ15c classifies the cell as positive while it should be negative), and all marked non-shaded cells indicate false negative errors (AQ15c classifies the cell as negative while it should be positive).
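The two error types read off the diagram can be stated as a small generic function (an illustration written for this text, not tied to the visualization tool):

```python
def error_type(predicted_positive, actually_positive):
    """Classify one cell of the representation space."""
    if predicted_positive and not actually_positive:
        return "false positive"   # the rules cover a negative cell
    if not predicted_positive and actually_positive:
        return "false negative"   # the rules miss a positive cell
    return "correct"

def count_errors(cells):
    """cells: iterable of (predicted_positive, actually_positive) pairs.
    Returns (false_positives, false_negatives)."""
    fp = sum(1 for p, a in cells if p and not a)
    fn = sum(1 for p, a in cells if a and not p)
    return fp, fn
```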
Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, one shading marks portions of the representation space that were classified as positive by both AQ15c and AQDT-2. A second shading marks portions of the representation space that were classified as positive by AQ15c but as negative by AQDT-2. A third shading marks portions of the representation space where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors obtained by the AQDT-2 decision tree.
Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem
Figure 4-34 is similar to Figure 4-33, but with an illustration of the false positive and false negative errors. Cells marked with one symbol indicate portions of the representation space with false positive errors; cells marked with the other symbol represent portions of the representation space with false negative errors. Comparing Figures 4-34 and 4-32, more errors occurred because of the over-generalization.
Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree
Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%
CHAPTER 5: CONCLUSION
5.1 Summary
This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.
The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated on-line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: that in order to determine a decision structure from examples it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is
usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first and then create decision structures from them. In the AQDT-2 method, this first phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than the decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply those decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.
5.2 Contributions
The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of
the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.
Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and to have higher predictive accuracy than those obtained in a conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.
The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent of the source of the rules, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), "Constructive Induction in Structural Design," Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.
Bergadano, F., Giordana, A., Saitta, L., DeMarchi, D. and Brancadori, F. (1990), "Integrated Learning in a Real Domain," Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.
Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning, Vol. 8, No. 1, pp. 5-43.
Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), "AQ17: A Multistrategy Learning System: The Method and User's Guide," Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.
Bohanec, M. and Bratko, I. (1994), "Trading Accuracy for Simplicity in Decision Trees," Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.
Bratko, I. and Lavrac, N. (1987) (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow, England.
Bratko, I. and Kononenko, I. (1986), "Learning Diagnostic Rules from Incomplete and Noisy Data," in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Wadsworth Int. Group, Belmont, California.
Clark, P. and Niblett, T. (1987), "Induction in Noisy Domains," in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.
Cestnik, B. and Bratko, I. (1991), "On Estimating Probabilities in Tree Pruning," Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.
Cestnik, B. and Karalic, A. (1991), "The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction," Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.
Gaines, B. (1994), "Exception DAGs as Knowledge Structures," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, pp. 13-24, Seattle, WA.
Hart, A. (1984), "Experience in the Use of an Inductive System in Knowledge Engineering," in M. Bramer (Ed.), Research and Developments in Expert Systems, Cambridge University Press, Cambridge.
Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, Academic Press, New York.
Imam, I.F. and Michalski, R.S. (1993a), "Should Decision Trees be Learned from Examples or from Decision Rules?" Lecture Notes in Artificial Intelligence (689), Komorowski, J. and Ras, Z.W. (Eds.), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer Verlag.
Imam, I.F. and Michalski, R.S. (1993b), "Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study," Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, Kerschberg, L., Ras, Z. and Zemankova, M. (Eds.), Kluwer Academic Pub., MA.
Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), "Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques," Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.
Imam, I.F. and Vafaie, H. (1994), "An Empirical Comparison Between Global and Greedy-like Search for Feature Selection," Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam, I.F. and Michalski, R.S. (1994), "From Facts to Rules to Decisions: An Overview of the FRD-1 System," Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, July, Seattle, Washington.
Kohavi, R. (1994), "Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations," Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.
Kohavi, R. and Li, C. (1995), "Oblivious Decision Trees, Graphs, and Top-Down Pruning," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian, O.L. and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.
Michalski, R.S. (1973), "AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition," Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.
Michalski, R.S. (1978), "Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams," Technical Report No. 898, University of Illinois, Urbana, March.
Michalski, R.S. (1983), "A Theory and Methodology of Inductive Learning," Artificial Intelligence, Vol. 20 (pp. 111-116).
Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), "The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.
Michalski, R.S. (1990), "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, Morgan Kaufmann Publishers, San Mateo, CA (pp. 63-111), June.
Michalski, R.S. and Imam, I.F. (1994), "Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System," Lecture Notes in Artificial Intelligence, Springer Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers, J. (1989a), "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.
Mingers, J. (1989b), "An Empirical Comparison of Pruning Methods for Decision-Tree Induction," Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.
Niblett, T. and Bratko, I. (1986), "Learning Decision Rules in Noisy Domains," Proceedings of Expert Systems 86, Brighton, Cambridge University Press, Cambridge.
Quinlan, J.R. (1979), "Discovering Rules by Induction from Large Collections of Examples," in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan, J.R. (1983), "Learning Efficient Classification Procedures and Their Application to Chess End Games," in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, Los Altos.
Quinlan, J.R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.
Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27 (pp. 221-234).
Quinlan, J.R. (1990), "Probabilistic Decision Trees," in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, Morgan Kaufmann Publishers, San Mateo, CA (pp. 63-111), June.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.
Smyth, P., Goodman, R.M. and Higgins, C. (1990), "A Hybrid Rule-based/Bayesian Classifier," Proceedings of ECAI 90, Stockholm, August.
Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.
Thrun, S.B., Mitchell, T. and Cheng, J. (1991) (Eds.), "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Technical Report, Carnegie Mellon University, October.
Wnek, J. and Michalski, R.S. (1994), "Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments," Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
Vita
Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In Fall of 1991, he joined the graduate program at GMU. Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, and conference or workshop proceedings, as well as 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition.
Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium FLAIRS-95 and on the program committee of the Florida Artificial Intelligence Research Symposium FLAIRS-96. He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IEA/AIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI). Ibrahim's Ph.D., titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.
4-5 The predictive accuracy of AQ15c and AQDT-2 for the MONK-2 problem 77
4-6 The predictive accuracy of AQ15c and AQDT-2 for the MONK-3 problem 81
4-7 The set of attributes and their values used in the trains problem 86
4-8 A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the congressional voting data 88
4-9 Summary of the best parameter settings for the first subfunction of the approach with different data characteristics 89
4-10 Summary of the performance of AQDT-2 and C4.5 on different problems 90
LIST OF FIGURES
No. TITLE Page
2-1 An example to illustrate how attributes break rules 8
2-2 A decision tree learned from the decision table in Table 2-1 10
2-3 A decision tree learned using the gain criterion for selecting attributes 15
2-4 Decision rules and their Exception Directed Acyclic Graph (EDAG) 21
3-1 Architecture of the AQDT approach 24
3-2 A ruleset generated by AQ15 for the concept "Voting pattern of Democratic Representatives" 27
3-3 Venn diagrams of possible combinations of attribute values in two decision classes 33
3-4 Decision trees corresponding to the Venn diagrams in Figure 3-3 33
3-5 Decision trees showing the maximum number of non-leaf nodes 41
3-6 Decision rules for selecting the best tool for testing a software 43
3-7 A decision structure learned for classifying software testing tools 45
3-8 A diagrammatic visualization of the decision rules and the derived decision tree 46
3-9 Decision trees learned ignoring the support metric and the type of the testing tool 47
3-10 A decision tree learned without the cost attribute 47
3-11 Decision structures learned by AQDT-2 using different criteria 55
3-12 Imam's example: a case where learning decision structures (trees) from rules is better than learning them from examples 56
3-13 An example where decision rules are simpler than decision trees 57
4-1 Design of a complete experiment 59
4-2 Decision rules determined by AQ15c from the wind bracing data 61
4-3 A decision tree learned by C45 for the wind bracing data 63
4-4 A decision structure learned from AQ15c wind bracing rules 64
4-5 A decision structure that does not contain attribute xl 64
4-6 A decision structure without Xl with candidate decisions assigned to leaves 65
4-7 A decision structure determined from rules in Figure 4-4 under the
assumption of 10 classification error in the training data 65
4-8 Diagramatic visualization of decision trees learned for different decision
making situations for the wind bracing data 66
4-9 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the wind bracing problem 68
4-10 Analyzing different parameter setting of AQDT-2 using the wind bracing data 69
4-11 A comparison between AQ15c and AQDT-2 against C45 on the wind bracing data 69
4-12 A visualization diagram of the MONK-1 problem 70
4-13 Decision rules learned by AQ15c for the MONK-1 problem 71
4-14 The decision tree for the MONK-1 problem generated both by AQDT-1 72
4-15 Compact decision structures generated by AQDT-2 for the MONK-1 problem 72
4-16 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-1 problem 74
4-17 Analyzing different parameter setting of AQDT-2 with the MONK-1 data 75
4-18 Comparing AQ15c and AQDT-2 against C45 using the MONK-1 problem 75
4-19 A visualization diagram of the MONK-2 problem 76
4-20 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-2 problem 78
4-21 Analyzing different parameter setting of AQDT-2 with the MONK-2 data 79
xii
4-22 Comparing AQ15c and AQDT-2 against C45 using the MONK-2 problem 79
4-23 A visualization diagram of the MONK-3 problem 80
4-24 The accuracy of AQDT-2 and AQ15c with different parameter settings
for the MONK-3 problem 82
4-25 Analyzing different parameter setting of AQDT-2 with the MONK-3 data 82
4-26 Comparing AQ15c and AQDT-2 against C45 using the MONK-3 problem 83
4-27 Comparing AQ15c and AQDT-2 against C45 using the breast cancer problem 84
4-28 Comparing AQ15c and AQDT-2 against C45 using the mushroom problem 85
4-29 Decision structures learned by AQDT-2 for different decision-making situations 87
4-30 Comparing decision trees for the Cong Voting-84 data learned by C45 & AQDT-2 88
4-31 A visualization diagram of decision rules learned by AQ15c for the MONK-2 91
4-32 A visualization diagram shows the testing errors for the rules in Figure 4-31 92
4-33 A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 93
4-34 A visualization diagram shows testing errors of AQDT-2 decision tree 94
4-35 A visualization diagram of a decision tree learned by AQDT-2 after reducing
the generalization degree to 1 94
DERIVING TASK-ORIENTED DECISION STRUCTURES
FROM DECISION RULES
Ibrahim M Fahmi Imam PhD
School of Information Technology and Engineering
George Mason University Fall 1995
Ryszard S Michalski Advisor
ABSTRACT
This dissertation is concerned with research on learning task-oriented decision structures from
decision rules. The philosophy behind this research is that it is more appropriate to learn
knowledge and store it in a declarative form, and then, when a decision-making situation occurs,
generate from this knowledge the decision structure that is most suitable for the given
decision-making situation. Learning decision structures from decision rules was first introduced
by Michalski (1978). The first implementation of this approach, called AQDT-1, was done by
Imam and Michalski (1993a, b).
This approach separates the function of generating a knowledge base from the function of using
the knowledge base for decision-making. The first function focuses on learning accurate,
consistent and complete concept descriptions expressed in a declarative form. The second function
is performed whenever a new decision-making situation occurs: a task-oriented decision structure
is obtained to suit that situation. Task-oriented knowledge is defined as knowledge that is adapted
for solving a given decision-making situation (Imam & Michalski 1994; Michalski & Imam
1994).
The dissertation introduces the system AQDT-2 for learning task-oriented decision structures from
decision rules or examples. Each decision-making situation is defined by a set of parameters that
controls the learning process of the AQDT-2 system. Extensive experiments with AQDT-2 show
that the decision structures it learns usually outperform, in terms of accuracy and average size,
the decision structures learned from examples by other well-known systems. The results also
show that the system does not work very well with noisy data. The system is illustrated and
compared using artificial problems such as the three MONK's problems (Thrun, Mitchell &
Cheng 1991) and the East-West Train problem (Michie et al 1994). It was also applied to
real-world problems of learning decision structures in the areas of construction engineering (for
determining the best wind bracing design for tall buildings), medical diagnosis (for learning
decision rules for recognizing breast cancer), agricultural diagnosis (for learning classification
rules for distinguishing between poisonous and non-poisonous mushrooms), and political data
(for characterizing Democratic and Republican voting records).
CHAPTER 1 INTRODUCTION
11 Motivation and Overview
Learning and discovery systems should be able not only to generate and store knowledge but also
to use this knowledge for decision-making The main step in the development of systems for
decision-making is the creation of a knowledge structure that characterizes the decision-making
process The form in which knowledge can be easily obtained may however differ from the form
in which it is most readily used for decision-making It is therefore important to identify the form
of knowledge representation that is most appropriate for learning (eg due to ease of its
modification) and the form that is most convenient for decision making
A simple and effective tool for describing decision processes is a decision structure which is a
directed acyclic graph that specifies an order of tests to be applied to an object (or a situation) to
arrive at a decision about that object The nodes of the structure are assigned individual tests
(which may correspond to a single attribute a function of attributes or a relation) the branches are
assigned possible test outcomes or ranges of outcomes and the leaves are assigned a specific
decision a set of candidate decisions with corresponding probabilities or an undetermined
decision A decision structure reduces to a familiar decision tree when each node is assigned a
single attribute and has at most one parent when the branches from each node are assigned single
values of that attribute and when leaves are assigned single definite decisions Thus the problem
of generating a decision structure is a generalization of the problem of generating a decision tree
Decision trees are typically generated from a set of examples of decisions The essential
characteristic of any such method is the attribute selection criterion used for choosing attributes to
be assigned to the nodes of the decision tree being built Such criteria include the entropy
reduction the gain and the gain ratio (Quinlan 1979 83 86) the gini index of diversity (Breiman
et al 1984) and others (Cestnik & Bratko 1991 Cestnik & Karalic 1991 Mingers 1989a)
3
4
A decision tree/decision structure representation can be an effective tool for describing a decision
process as long as all the required tests can be performed and the decision-making situations it
was designed for remain constant (eg in a doctor-patient example the doctor should determine
that answers for all symptoms appear in the decision tree) Problems arise when these assumptions
do not hold For example in some situations measuring certain attributes may be difficult or costly
(eg in the doctor-patient example a brain or blood test is needed which is very expensive or the
tools needed are not available) In such situations it is desirable to reformulate the decision
structure so that the inexpensive attributes are evaluated first (assigned to the nodes close to the
root) and the expensive attributes are evaluated only if necessary (by assignment to the nodes far
away from the root) If an attribute cannot be measured at all it is useful to either modify the
structure so that it does not contain that attribute or-when this is impossible-to indicate
alternative candidate decisions and their probabilities A restructuring is also desirable if there is a
significant change in the frequency of occurrence of different decisions (eg in the doctor-patient
example the doctor may request a decision structure expressed in a specific set of symptoms
biased to classify one or more diseases or specify a certain order of testing)
A restructuring of a decision structure (or a tree) in order to suit new requirements could be quite
difficult This is because a decision structure is a procedural representation that imposes an
evaluation order on the tests In contrast no evaluation order is imposed by a declarative
representation such as a set of decision rules Tests (conditions) of rules can be evaluated in any
order Thus for a given set of rules one can usually build a huge number of logically equivalent
decision structures (trees) which differ in the test ordering Due to the lack of order constraints
a declarative representation (rules) is much easier to modify to adapt to different situations than a
procedural one (a decision structure or a tree) On the other hand to apply decision rules to make a
decision one needs to decide in which order tests are evaluated and thus needs a decision
structure
An attractive solution to these opposite requirements is to acquire and store knowledge in a
declarative form and transform it to a decision structure when it is needed for decision-making
5
This method allows one to create a decision structure that is most appropriate in a given
decision-making situation Because the number of decision rules per decision class is usually small (each
rule is a generalization of a set of examples) generating a decision structure from decision rules
can potentially be performed much faster than by generation from training examples Thus this
process could be done on line without any delay noticeable to the user Such virtual decision
structures are easy to tailor to any given decision-making situation
This approach allows one to generate a decision structure that avoids or delays evaluating an
attribute that is difficult to measure in some decision-making situation or that fits well a particular
frequency distribution of decision classes In other situations it may be unnecessary to generate a
complete decision structure it may be sufficient to generate only the part of it that concerns
the decision classes of interest Thus such an approach has many potential advantages
This dissertation presents a new system called AQDT-2 The AQDT-2 system generates a
task-oriented decision structure (a decision structure that is adapted to the given decision-making
situation) from decision rules The decision rules are learned by either the rule learning system AQ15
(Michalski et al 1986) or the system AQ17-DCI which has extensive constructive induction
capabilities (Bloedorn et al 1993)
To associate the decision rules with a given decision-making task AQDT-2 provides a set of
features including 1) enabling the system to include in the decision structure nodes corresponding
to new attributes constructed during the process of learning the decision rules 2) controlling the
degree of generalization needed during the development of the decision structure 3) providing four
new criteria for selecting an attribute to be a node in the decision structure which allow the system to
generate many different but equivalent decision structures from the same set of rules 4) generating
unknown nodes in situations where there is insufficient information for generating a complete
decision structure 5) learning decision structures from discriminant rules as well as
characteristic rules and 6) providing the most likely decision when the decision process stops
due to inability to evaluate an attribute associated with an intermediate node
6
To test the methodology of generating decision structures from decision rules an extensive set of
planned experiments was designed to test different aspects of the approach The experiments
include testing different combinations of parameters for each sub-function of the approach
analyzing the relationship between decision rules and the decision structures learned from them and
comparing decision trees learned by the AQDT-2 system with those learned by the well-known
C45 (Quinlan 1993) system for learning decision trees from examples Different experiments were designed to examine
the new features of the AQDT approach for learning task-oriented decision structures The
experiments were applied to artificial domains as well as real-world domains including MONK-1
MONK-2 and MONK-3 (Thrun Mitchell & Cheng 1991) East-West trains (Michie et al
1994) Engineering Design-wind bracings (Arciszewski et al 1992) Mushrooms Breast
Cancer (Mangasarian & Wolberg 1990) and Congressional Voting Records of 1984 The
MONK's problems are concerned with learning classification rules for robot-like figures MONK-1
requires learning a DNF-type description MONK-2 requires learning a non-DNF-type description
(one that cannot be easily described as a DNF rule using the original attributes) MONK-3 concerns
learning a DNF rule from noisy data The East-West trains dataset is a structural domain that
classifies two sets of trains (Eastbound and Westbound) The Engineering Design-wind
bracing data involves learning conditions for applying different types of wind bracing for tall
buildings The Mushrooms data is concerned with learning classification rules for distinguishing
between poisonous and non-poisonous mushrooms The Breast Cancer data involves learning
concept descriptions for recognizing breast cancer The congressional voting data includes voting
records on different issues AQDT-2 outperformed C45 on average with respect to both
predictive accuracy and tree size for most problems AQDT-2 did not work very well with noisy
data or with problems that have many rules covering very few examples
12 The Problem Statement
There are many limitations and problems that accompany the use of decision trees for decision-making
CHAPTER 2 RELATED RESEARCH
21 Learning Decision Trees from Decision Diagrams
The AQDT-2 method proposed here is based on the earlier work by Michalski (1978) which
introduced an algorithm for generating decision trees from decision lists The method proposed
several attribute selection criteria of increasing power the main criterion being the
order cost estimate (the nth order cost estimate n = 1 2 ) Michalski also analyzed two
specific criteria MAL and DMAL for selecting the optimal attribute for a node in the tree
based on properties extracted from the decision diagram In order to better explain the method
it is necessary to define some terms These terms will be used later in the dissertation
Definition 2-1 A cover of a decision class C is a set of rules whose union includes the set of all
examples of class C and does not include any examples of other decision classes
Definition 2-2 A cover of a decision class C is disjoint if all its rules are pairwise logically
disjoint In other words for any two rules there exists a condition with the same attribute but
with different values in each rule
Definition 2-3 A minimal cover is a disjoint cover which has the smallest number of rules
among all possible covers
Definition 2-4 A diagram for a given cover is a table constructed graphically by representing
in a two-dimensional space all possible combinations of attribute values locating on the
diagram all the condition parts of the given rules and marking them with the action specified
by each rule
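The disjointness test of Definition 2-2 is mechanical. A minimal sketch follows; the dict-based rule encoding (attribute name mapped to its set of allowed values, absent attribute meaning unconstrained) is an illustration, not the dissertation's notation:

```python
# A rule is represented as a dict: attribute -> set of allowed values.
# An attribute absent from a rule is unconstrained.

def logically_disjoint(rule1, rule2):
    """Definition 2-2: two rules are disjoint if some attribute is constrained
    in both rules to value sets that do not overlap."""
    shared = set(rule1) & set(rule2)
    return any(not (rule1[a] & rule2[a]) for a in shared)

def is_disjoint_cover(rules):
    """A cover is disjoint if its rules are pairwise logically disjoint."""
    return all(logically_disjoint(r1, r2)
               for i, r1 in enumerate(rules)
               for r2 in rules[i + 1:])
```

For instance, the rules [x2=0] and [x2=1] are logically disjoint, while [x1=0] and [x2=1] are not, since an example with x1=0 and x2=1 satisfies both.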
Michalskis algorithm requires construction of minimalmiddot cover The minimal cover should be
consistent and complete The method is based on the fact that if there are n decision classes
any decision tree that correctly classifies the given rules and has n distinct leaves is a minimal
7
8
decision tree (any consistent decision tree should have at least n leaves) Michalski (1978) has
shown that if only one rule is broken by a selected attribute then instead of having one leaf
(which could potentially represent this rule or the decision class in the tree) there will have to
be at least two leaves representing this rule in the final decision tree
The attribute selection criterion MAL introduced in (Michalski 1978) prefers attributes that do
not break any rules or break as few as possible An attribute breaks a rule if the attribute can
divide the rule into two or more sub-rules Figure 2-1 shows two examples of two sets of rules
In the first x1 and x3 break at least two rules each (x1 breaks the rule [x4=2] & [x1=1 v 3] &
[x3=2 v 3] and the rule [x4=3] & [x1=1 v 2] & [x3=1] and x3 breaks three rules the rule [x4=2]
& [x1=2] & [x3=2 v 3] the rule [x4=1] & [x1=3] & [x3=1 v 3] and the rule [x4=3] & [x1=4] &
[x3=2 v 3]) In the diagram on the right x1 is the only attribute that does not break any rule
Figure 2-1 An example to illustrate how attributes break rules
One of the criteria defined by Michalski is the first degree cost estimate which assigns to each
attribute an integer equal to the number of rules broken by that attribute This criterion is also
called the static cost estimate of an attribute or the criterion of minimizing added leaves
(MAL)
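The static cost estimate can be sketched directly from this definition. This is an illustration under the assumption that rules are dicts mapping attributes to their sets of allowed values (an absent attribute being unconstrained); it is not Michalski's implementation:

```python
# A rule is a dict: attribute -> set of allowed values; an attribute that is
# absent from the rule is unconstrained (any value in its domain is allowed).

def breaks(attribute, rule, domains):
    """An attribute breaks a rule if the rule admits more than one of the
    attribute's values, so branching on the attribute splits the rule."""
    allowed = rule.get(attribute, domains[attribute])
    return len(allowed) > 1

def mal(attribute, rules, domains):
    """Static cost estimate (MAL): the number of rules the attribute breaks."""
    return sum(breaks(attribute, rule, domains) for rule in rules)
```

Applied to the five rules of the minimal cover in the worked example later in this section, this yields 2 for x1, 0 for x2, and 5 each for x3 and x4, matching the MAL evaluations quoted there (assuming x3 and x4 do not occur in the rules).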
9
The MAL criterion (Minimizing Added Leaves) seeks an attribute that minimizes the
estimated number of additional nodes in the decision tree being generated over a hypothetical
minimal decision tree When there is a tie between two attributes the attribute to be selected is
the one which breaks smaller rules (rules that cover fewer examples or more specialized
rules) AQDT-2 uses an approximate version of this criterion (the attribute dominance)
Another criterion introduced by Michalski was the DMAL criterion The DMAL criterion
(Dynamically Minimizing Added Leaves) is based on a principle similar to that of MAL but
is more complex because once an attribute is selected as a node in the tree some rules andor
parts of the broken rules at each branch are merged into one rule The DMAL ensures that the
value of the total cost estimate of an attribute is decreased by a value equal to the number of
merged rules minus one
Example Learn a decision tree from the following decision table
The minimal cover consists of the following rules
A1 <= [x2=0] v [x1=0][x2=2]    A2 <= [x2=1] v [x1=2][x2=2]    A3 <= [x1=1][x2=2]
The evaluations of the MAL criterion for these attributes are 2 for x1 0 for x2 5 for x3 and 5
for x4 The attribute x1 divides the two rules [x2=0] and [x2=1] so selecting it will add two
leaves to the optimal number of leaves It is clear that the attribute to be selected as a root of
10
the decision tree is x2 Then three branches are attached to the root node and the decision rules
are divided into subsets each corresponding to one branch For x2 = 0 or 1 a leaf node is
generated For x2 = 2 another attribute is selected to be a node in the tree In this case x1 has the
minimum MAL value Figure 2-2 shows the decision tree obtained using the MAL criterion
Figure 2-2 A decision tree learned from the decision table in Table 2-1
22 Learning Decision Trees from Examples
Decision tree learning is a field concerned with generating a decision tree that classifies a set
of examples according to the decision classes they belong to The essential aspect of any
inductive decision tree method is the attribute selection criterion The attribute selection
criterion measures how good the attributes are for discriminating among the given set of
decision classes The best attribute according to the selection criterion is chosen to be assigned
to a node in the tree The first algorithm for generating decision trees from examples was
proposed by Hunt Marin and Stone (1966) Hunt's algorithm uses a divide-and-conquer
strategy for building decision trees This algorithm has been subsequently modified by
Quinlan (1979) and applied by many researchers to a variety of learning problems
Attribute selection criteria can be divided into three categories These categories are logic-based
information-based and statistics-based The logic-based criteria for selecting attributes
use logical relationships between the attributes and the decision classes to determine the best
attribute to be a node in the decision tree such as the MAL criterion minimizing added leaves
(Michalski 1978) which uses conjunction and disjunction operators The information-based
criteria are based on the information theory These criteria measure the information conveyed
11
by dividing the training examples into subsets Examples of such criteria include the
information measure 1M the entropy reduction measure and the gain criteria (Quinlan 1979
83) the gini index of diversity (Breiman et al 1984) Gain-ratio measure (Quinlan 1986) and
others (Clark & Niblett 1987 Bratko & Lavrac 1987 Cestnik & Karalic 1991) The
statistics-based criteria measure the correlation between the decision classes and the attributes
These criteria use statistical distributions for determining whether or not there is a correlation
The attribute with the highest correlation is selected to be a node in the tree Examples of
statistics-based criteria include Chi-square and G-statistic (Sokal & Rohlf 1981 Hart 1984
Mingers 1989a)
Niblett and Bratko (1986) Quinlan (1987) and Bratko and Kononenko (1987) extended the
method of learning decision trees to also handle data with noise (by pruning) Handling noise
extended the process of learning decision trees to include the creation of an initial complete
decision tree tree pruning which is done by removing subtrees with small statistical validity
and replacing them with leaf nodes (Mingers 1989b) More recently pruning has also been used
for simplifying decision trees even for problems without noise (Bohanec amp Bratko 1994)
Pruning decision trees improves their simplicity but reduces their predictive accuracy on the
training examples Quinlan (1990) also proposed a method to handle the unknown attribute-value
problem by exploring probabilities of an example belonging to different classes
The rest of this section includes a brief description of the attribute selection criterion used by
the C45 learning system (Quinlan 1993) C45 uses an information-based criterion for selecting
an attribute to be a node in the tree The section also includes a brief description of the Chi-square
method for attribute selection (Mingers 1989a) The latter is a statistics-based
method for selecting an attribute to be a node in the tree
221 Building Decision Trees Using Information-based Criteria
12
This section presents a description of the inductive decision tree learning system C45 The
C45 learning system is considered to be one of the most stable accurate and fastest programs
for learning decision trees from examples
Learning decision trees from examples requires a collection of examples Each example is
represented by a fixed number of attribute-value pairs C45 (Quinlan 1993) is a learning
program that induces classification decision trees from a set of given examples The C45
learning system is descended from the learning system ID3 (Quinlan 1979) which is based on
Hunt's method for constructing decision trees from a set of cases (Hunt Marin & Stone 1966)
The C45 system uses an attribute selection criterion called the Gain Ratio This criterion
calculates the gain in classifying information based on the residual information needed to
classify cases in a set of training examples and the information yielded by the test based on the
relative frequencies of the possible outcomes (decision classes) The gain ratio criterion is
based on an earlier criterion used by ID3 called the Gain Criterion The Gain Criterion uses
the frequency of each decision class in the given set of training examples
Once an attribute is chosen to be a node in the tree the system generates as many links as the
number of its values and classifies the set of examples based on these values If all the
examples at a certain node belong to one decision class the system generates a leaf node and
assigns it to that class Otherwise the system searches for another attribute to be a node in the
tree
The Gain Criterion The gain criterion is based on the information theory That is the
information conveyed by a message depends on its probability and can be measured in bits as
minus the logarithm (base 2) of that probability To explain the gain criterion suppose that for
a given problem Xu bullbull X are given attributes and Cu c are decision classes Suppose S is
any set of cases and T is the initial set of training cases The frequency of class C1 in the set S
is the number of examples in S that belong to class C i bull
13
freq(Ci, S) = number of examples in S that belong to Ci   (2-1)
Suppose that |S| is the total number of examples in S The probability that an example selected
at random from S belongs to class Ci is freq(Ci, S) / |S|
The information conveyed by the message that a selected example belongs to a given decision
class Ci is determined by -log2 (freq(Ci, S) / |S|) bits
The expected information from such a message stating class membership is given by
info(S) = - Σi=1..k (freq(Ci, S) / |S|) log2 (freq(Ci, S) / |S|) bits   (2-2)
info(S) is also known as the entropy of the set S When S is the initial set of training examples
info(T) determines the average amount of information needed to identify the class of an
example in T
Suppose that we selected an attribute X to be the root of the tree and suppose that X has k
possible values The training set T will be divided into k subsets each corresponding to one of
X's values The expected information of selecting X to partition the training set T, info_X(T), can
be found as the sum over all subsets of multiplying the information conveyed by each subset
by its probability
info_X(T) = Σi=1..k (|Ti| / |T|) info(Ti)   (2-3)
The information gained by partitioning the training examples T into subsets using the attribute
X is given by
gain(X) = info(T) - info_X(T)   (2-4)
The attribute to be selected is the attribute with maximum gain value
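A divide-and-conquer learner driven by the gain criterion can be sketched as follows. This is a simplified illustration, with examples represented as (attribute-value dict, class) pairs; it is not Quinlan's actual implementation:

```python
import math
from collections import Counter

def entropy(examples):
    """info(S) of equation 2-2, from the class frequencies in the set."""
    counts = Counter(cls for _, cls in examples)
    return -sum((n / len(examples)) * math.log2(n / len(examples))
                for n in counts.values())

def partition(examples, attribute):
    """Split the examples into subsets, one per value of the attribute."""
    subsets = {}
    for features, cls in examples:
        subsets.setdefault(features[attribute], []).append((features, cls))
    return subsets

def gain(examples, attribute):
    """Equation 2-4: info(T) minus the expected info after the split."""
    subsets = partition(examples, attribute).values()
    remainder = sum((len(s) / len(examples)) * entropy(s) for s in subsets)
    return entropy(examples) - remainder

def build_tree(examples, attributes):
    """Assign the max-gain attribute to the node, split, and recurse."""
    classes = {cls for _, cls in examples}
    if len(classes) == 1 or not attributes:
        return Counter(cls for _, cls in examples).most_common(1)[0][0]  # leaf
    best = max(attributes, key=lambda a: gain(examples, a))
    rest = [a for a in attributes if a != best]
    return (best, {value: build_tree(subset, rest)
                   for value, subset in partition(examples, best).items()})
```

On the weather data of the example below, gain('outlook') is about 0.247 and gain('windy') about 0.048, so outlook is chosen for the root.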
The Gain Ratio Criterion This criterion indicates the proportion of information generated by
the split that appears helpful for classification Quinlan (1993) pointed out that the gain
criterion has a serious deficiency Basically it is strongly biased toward attributes with many
outcomes (values) For example for any data that contains attributes such as social security
14
number the gain criterion will select that attribute to be the root of the decision tree However
selecting such attributes increases the size of the decision tree Quinlan provided a solution to
this problem by introducing the gain ratio criterion which takes the ratio of the information that
is gained by partitioning the initial set of examples T by the attribute X to the potential
information generated by dividing T into n subsets
Following steps similar to those used to obtain the information conveyed by dividing T into n
subsets the expected information generated by dividing T into n subsets by analogy to
equation 2-2 is determined by
split info(T) = - Σi=1..n (|Ti| / |T|) log2 (|Ti| / |T|)   (2-5)
The gain ratio is given by
gain ratio(X) = gain(X) / split info(X)   (2-6)
and it expresses the proportion of information generated by the split that is useful for
classification
Example Consider the following example presented by Quinlan (1993) Table 2-2 shows the
set of training examples
First determine the amount of information gained by selecting the attribute outlook to be a
root of the decision tree This attribute divides the training examples into three subsets
sunny, with five examples two of which belong to the class Play overcast, with four
examples all of which belong to the class Play and rain, with five examples three of
which belong to the class Play To determine info(T) the average information needed to
identify the class of an example in T note that there are 14 training examples and two decision
classes Nine of these examples belong to the class Play and five belong to the class Don't Play
info(T) = - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.940 bits
When using outlook to divide the training examples the information becomes
15
info_outlook(T) = 5/14 (-2/5 log2 (2/5) - 3/5 log2 (3/5))
+ 4/14 (-4/4 log2 (4/4) - 0/4 log2 (0/4))
+ 5/14 (-3/5 log2 (3/5) - 2/5 log2 (2/5)) = 0.694 bits
By substituting in equation 2-4 the gain of information resulting from using the attribute
outlook to split the training examples equals 0.246 The gain of information for windy is
0.048
Figure 2-3 shows a decision tree learned for this problem using the gain criterion The split
information for outlook is determined as follows
split info(T) = - 5/14 log2 (5/14) - 4/14 log2 (4/14) - 5/14 log2 (5/14) = 1.577 bits
The gain ratio for outlook = 0.246 / 1.577 = 0.156
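The numbers in this worked example can be reproduced mechanically from equations 2-2 through 2-6; a small sketch, with the class counts taken from the text above:

```python
import math

def info(counts):
    """Equation 2-2: entropy of a set, given its class (or subset) counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n)

# (Play, Don't Play) counts in each subset produced by splitting on outlook.
subsets = {'sunny': (2, 3), 'overcast': (4, 0), 'rain': (3, 2)}
total = 14

info_T = info((9, 5))                                                   # ~0.940 bits
info_outlook = sum(sum(c) / total * info(c) for c in subsets.values())  # ~0.694 bits
gain_outlook = info_T - info_outlook                                    # ~0.247
# split info uses the same formula, over the subset sizes (5, 4, 5).
split_info = info(tuple(sum(c) for c in subsets.values()))              # ~1.577 bits
gain_ratio = gain_outlook / split_info                                  # ~0.156
```

The tiny differences from the text's 0.246 come from rounding intermediate values in the hand calculation.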
16
Figure 2-3 A decision tree learned using the gain criterion for selecting attributes
The C45 system handles discrete values as well as continuous values To handle an attribute
with continuous values C45 uses a threshold to transform the continuous domain into two
intervals In other words for each continuous attribute C45 generates two branches one
where the value of that attribute is greater than the determined threshold and the other if the
value is less than or equal to the threshold
Tree pruning in C45 is a process of replacing subtrees with small classification validity by
leaves The C45 system uses Laplace ratio for determining the error rate of different subtrees
This ratio is defined as (e+1) / (n+2) where n is the number of the training examples and e is
the number of misclassified examples at a given leaf
222 Building Decision Trees Using Statistics-based Criteria
The Chi-square method for selecting attributes was used by Hart (1984) and Mingers (1989a)
in building decision trees The method uses Chi-square statistics to measure the association
between two attributes When building decision trees the method is implemented such that it
determines the association between each attribute and the decision classes The attribute to be
selected is the one with the greatest value
To determine the Chi-square value for an attribute let aij be the number of examples in
class number i where the attribute A takes value number j In other words aij is the frequency
of the combination of decision class number i and attribute value number j The Chi-square
value for attribute A is given by
Chi-square (A) = Σi=1..n Σj=1..m [ (aij - Eij)² / Eij ]   (2-7)
where n is the number of decision classes and m is the number of values of a given attribute Also
17
Eij = (TCi × TVj) / T   (2-8)
where TCi and TVj are the total number of examples belonging to decision class Ci and the total
number of examples where the attribute A takes value vj respectively T is the total number of
examples
Consider Quinlan's example in Table 2-2 Table 2-3 shows the frequency of different
combinations of values between the decision classes and both the Outlook and the Windy
attributes Table 2-4 shows the expected values (computed from the totals TCi and TVj) of the
frequencies in Table 2-3 for different attribute values and decision classes
To determine the association value between the decision classes and both the attribute Windy
and the attribute Outlook the observed Chi-square values are
Chi-square (Windy, Class) = [(3-3.9)²/3.9] + [(3-2.1)²/2.1] + [(6-5.1)²/5.1] + [(2-2.9)²/2.9]
= 0.21 + 0.39 + 0.16 + 0.28 = 1.04
Chi-square (Outlook, Class) = [(2-3.2)²/3.2] + [(4-2.6)²/2.6] + [(3-3.2)²/3.2] + [(3-1.8)²/1.8]
+ [(0-1.4)²/1.4] + [(2-1.8)²/1.8] = 0.45 + 0.75 + 0.01 + 0.80 + 1.40 + 0.02 = 3.43
18
Applying the same method to the other attributes the results will favor the attribute Outlook
Once that attribute is selected to be a node in the tree the remaining set of examples are divided
into subsets and the same process is repeated on each subset
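Equations 2-7 and 2-8 can be applied directly to the contingency tables; a sketch, where the list-of-rows table layout is an illustration:

```python
def chi_square(table):
    """Equation 2-7 over a contingency table whose rows are decision classes
    and whose columns are values of the attribute being evaluated."""
    total = sum(sum(row) for row in table)
    class_totals = [sum(row) for row in table]            # the totals TCi
    value_totals = [sum(col) for col in zip(*table)]      # the totals TVj
    chi = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = class_totals[i] * value_totals[j] / total  # equation 2-8
            chi += (observed - expected) ** 2 / expected          # equation 2-7
    return chi

# Observed frequencies (rows: Play, Don't Play).
windy = [[3, 6], [3, 2]]             # columns: windy = true, false
outlook = [[2, 4, 3], [3, 0, 2]]     # columns: sunny, overcast, rain
```

With unrounded expected frequencies this gives chi_square(windy) of about 0.93 and chi_square(outlook) of about 3.55 (the hand calculation in the text rounds the expected values first), so Outlook is again the attribute selected.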
Table 2-5 shows a summary of these criteria and their basic evaluation functions.

Table 2-5: Attribute selection criteria and their basic evaluation measures

Info Measure (IM), Gain, G-statistic, and Gain Ratio:
    Entropy(S) = - Σ(k) (freq(Ck, S) / |S|) log2 (freq(Ck, S) / |S|)
    G-statistic = 2N * IM    (N = number of examples)

Chi-square:
    Chi-square(A, B) = Σ(i=1..n) Σ(j=1..m) (aij - Eij)² / Eij
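To make the entropy-based entries of the table concrete, here is a minimal sketch (my own illustration, not from the dissertation) of entropy and information gain evaluated on Quinlan's weather data:

```python
import math

# Entropy of a class distribution, and the information gain of an attribute
# whose values partition the examples into the given class-count subsets.
def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def information_gain(class_counts, partitions):
    """Gain = Entropy(S) minus the weighted entropy of the subsets
    induced by an attribute's values."""
    total = sum(class_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(class_counts) - remainder

s = [9, 5]                                 # 9 Play / 5 Don't Play examples
outlook_parts = [[2, 3], [4, 0], [3, 2]]   # sunny, overcast, rain
windy_parts = [[3, 3], [6, 2]]             # true, false
print(round(information_gain(s, outlook_parts), 3))  # 0.247
print(round(information_gain(s, windy_parts), 3))    # 0.048
```

As with Chi-square, the gain criterion favors Outlook on this data.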
2.2.3 Analysis of Attribute Selection Criteria
This subsection briefly reviews the analysis of different selection criteria carried out by
Mingers (1989a). Mingers compared six attribute selection criteria that are used in decision
tree programs. These criteria are the Information Measure (IM), Chi-square, G statistic, Gini
index of diversity, Marshall correction, and Gain Ratio. The overall results show that the Gain
Ratio criterion produced the strongest results.
In the first experiment, Mingers tested the six criteria on a problem with ambiguous examples
(i.e., examples may belong to more than one decision class) to observe how the selected criteria
evaluate the given attributes. The problem has two decision classes and two attributes, X and
Y. It was assumed that attribute X is better for classifying the examples than attribute Y. The
training examples were unevenly spread between the two values of X. Attribute Y has three
values, and the examples were spread randomly among them. Table 2-6 (a and b) shows the
contingency tables for both attributes. Table 2-7 shows a summary of the goodness of split
provided by the six criteria. Mingers noted that the measures that are not based on information
theory give radiation (attribute X here) less weight. This may be because the zero in the first
row of radiation has a greater influence in the log calculation. In the case of the Chi-square
criterion, a value of zero adds the maximum association between any two attributes,
because the Chi-square contribution of a zero cell is the expected value of that cell.
Now let us demonstrate results from another experiment done by Mingers In this experiment
Mingers used four different data sets to generate decision trees for eleven different criteria In
the final results he compared the total number of nodes and the total error rate provided by
each criterion over all given problems Table 2-8 shows the final results for five selected
criteria only
Table 2-8: Results comparing the total accuracy and size of the decision trees produced by different attribute selection criteria on four domains
This experiment was performed on four real-world data sets. These data concern profiles
of BA Business Studies degree students, recurrence of breast cancer, classifying types
of Iris, and recognizing LCD display digits. The data was divided randomly: 70% for training
and 30% for testing. For more details, see Mingers (1989a).
2.3 Learning Decision Structures
Considering the proposed definition of decision structures given above, two related lines of
research are described in this section. Gaines (1994) and Kohavi (1994) proposed two
approaches for generating decision structures that share some of the earlier ideas of Imam and
Michalski (1993b).
In the first approach, Brian Gaines introduced a method for transforming decision rules or
decision trees into exceptional decision structures. The method builds an Exception Directed
Acyclic Graph (EDAG) from a set of rules or decision trees. The method starts by assigning
either a rule, say R0, or a conclusion, say C0, or both, to the root node. If it assigns a conclusion
to the root node, it places it on a temporary conclusion list. Then it generates a new child node
and adds to it a new rule, say R1. The method evaluates the new rule R1: if it satisfies the
conclusion C0 in its parent node, it builds a new child and repeats the process. Otherwise, if
the rule R1 does not satisfy the conclusion C0, it replaces the contents of the temporary memory with the new
conclusion C1, which is satisfied by the rule R1. The same process is repeated until all rules
that have common conditions with the rule at the root are evaluated. The method then creates a
new child node from the root and repeats the process until all rules are evaluated. In the decision
structure, nodes containing only rules represent conditions common to all of their children.
The main disadvantage of this approach is that it requires discriminant rules to build such a
decision structure. Also, such a structure is more complex than the traditional decision trees
that are used for decision-making.
Figure 2-4 shows an example of a set of rules and their equivalent exceptional decision structure.
The decision structure can be read as follows: It is Safe, except if x1=1 & x2=1 & x3=1 and
either x4=3 or x5=1, then it is Lost, except if x6=1, when it is Safe, except if x7=1, when it is Lost.
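The reading above can be sketched as a nested exception test (a hedged illustration of how such a structure is evaluated, not Gaines' code; the attribute names follow Figure 2-4):

```python
# Reading of the exception structure: "Safe, except if x1=1 & x2=1 & x3=1
# and either x4=3 or x5=1, then Lost, except if x6=1 it is Safe,
# except if x7=1 it is Lost."
def classify(ex):
    decision = "Safe"                                   # default conclusion
    if ex["x1"] == 1 and ex["x2"] == 1 and ex["x3"] == 1 \
            and (ex["x4"] == 3 or ex["x5"] == 1):
        decision = "Lost"                               # first exception
        if ex["x6"] == 1:
            decision = "Safe"                           # exception to the exception
            if ex["x7"] == 1:
                decision = "Lost"
    return decision

print(classify({"x1": 1, "x2": 1, "x3": 1, "x4": 3, "x5": 2,
                "x6": 1, "x7": 2}))   # Safe
```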
The second approach, introduced by Ronny Kohavi, learns decision structures from examples
using a bottom-up algorithm. The method uses a greedy hill-climbing algorithm for inducing
Oblivious read-Once Decision Graphs (HOODG). A read-once decision graph is defined as a
decision graph where each attribute occurs at most once along any computational path. In other
words, on each path from the root to a leaf of the decision structure, an attribute may occur
as a node at most once. However, there may be more than one node with the same attribute in
the decision graph. Kohavi also defined a leveled decision graph, in which the nodes are
partitioned into a sequence of pairwise disjoint sets such that outgoing edges from each level
terminate at the next level. An oblivious decision graph is a decision graph where all nodes at
a given level are labeled by the same attribute.
Safe <:: [x1=2]
Safe <:: [x2=2]
Safe <:: [x3=2]
Safe <:: [x4=1] & [x5=2]
Safe <:: [x4=1] & [x5=3]
Safe <:: [x6=1] & [x7=2]
Safe <:: [x6=1] & [x7=3]
Safe <:: [x4=2] & [x5=2]
Safe <:: [x4=2] & [x5=3]
Lost <:: [x6=2] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x6=3] & [x5=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x5=1] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x7=1] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x6=2] & [x3=1] & [x2=1] & [x1=1]
Lost <:: [x4=3] & [x6=3] & [x3=1] & [x2=1] & [x1=1]
The algorithm starts by generating a leaf node for each decision class. Then it applies a
nondeterministic method to select an attribute to build a new level of the decision structure. This
attribute is removed from the data, and the data is divided into subsets, each corresponding to a
combination of that attribute's values. For each subset the process is repeated until all
the examples of a given subset belong to one decision class. For example, suppose the selected
attribute, say A, has two values (0 and 1) and there are two decision classes (C0 and C1). The
data is divided into two subsets: the first subset contains the examples where A takes value 0 and
belong to class C0, or takes value 1 and belong to class C1; the second subset contains the examples
where A takes value 0 and belong to class C1, or takes value 1 and belong to class C0.
The number of nodes at the first level (after the leaf nodes) is expected to be less than or
equal to k^n, where k is the number of decision classes and n is the number of values of the
selected attribute, and it can increase exponentially before that number is reduced
exponentially to one.
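The two-value, two-class split just described can be sketched as follows (my own illustration; the parity trick stands in for the general grouping of examples by the value-to-class mapping they are consistent with):

```python
# Subset 0 collects A=0 & C0 together with A=1 & C1 (the mapping A=i -> Ci);
# subset 1 collects the complementary mapping.
def split_two_by_two(examples):
    """examples: list of (a_value, class_index) pairs, both in {0, 1}."""
    subsets = {0: [], 1: []}
    for a, c in examples:
        # (a + c) even  ->  consistent with the mapping A=i -> Ci
        subsets[(a + c) % 2].append((a, c))
    return subsets

data = [(0, 0), (1, 1), (0, 1), (1, 0)]
print(split_two_by_two(data))
# {0: [(0, 0), (1, 1)], 1: [(0, 1), (1, 0)]}
```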
It is easy to identify some major disadvantages of such an approach. The average size of such
decision structures is estimated to be very large, especially when there
is no similarity (i.e., no strong patterns) or logical relationship in the data. The time used to learn
such a decision structure is relatively high compared to systems for learning decision trees
from examples. Finally, it could be better to search for an attribute that reduces the number
of generated subsets of the data, instead of nondeterministically selecting an attribute to build a
new level of the decision structure. Kohavi provided a comparison between the C4.5 system
and his system that can be found in (Kohavi, 1994). Later, Kohavi and Li (1995) used a
deterministic method for attribute selection, which minimizes the width of the penultimate level
of the graph.
Table 2-9 shows a comparison between the proposed approach and these two approaches. The
EDAG and HOODG systems are unreleased prototype systems.
(From Table 2-9, in part: decision structures produced by the proposed approach and by HOODG are easy to understand, while EDAGs are difficult to read.)
CHAPTER 3 DESCRIPTION OF THE APPROACH
3.1 General Methodology
In the proposed approach, the function of learning or discovery is separated from the function of
using the discovered knowledge for decision-making. The first function is performed by an
inductive learning program that searches for knowledge relevant to a given class of decisions
and stores the learned knowledge in the form of decision rules. The second function is performed
when there is a need for assigning a decision to new data points in the database (e.g., a
classification decision), by a program that transforms the obtained knowledge into a decision
structure optimized according to the given decision-making situation.
The Learning Task
Given:
  - A set of training examples describing the concept to be learned.
  - A learning goal, which specifies the decision classes to be learned from the training examples.
  - Background knowledge to control the learning process.
Determine:
  - A concept description in a declarative form of knowledge (decision rules) that satisfies the learning goal.

The Decision-making Task
Given:
  - A set of decision rules in conjunctive form.
  - A description of the new decision-making situation (e.g., attribute costs and order preference, importance or frequency of decision classes, etc.).
  - One or more examples that need to be tested under the given decision-making situation.
  - A set of parameters to control the learning process.
Determine:
  - A decision structure that suits the given decision-making situation.
The decision rules used here are learned by either the AQ15 (Michalski et al., 1986) or AQ17
(Bloedorn et al., 1993) learning systems. The reasons for selecting decision rules as a form of
declarative knowledge are that they do not impose any order on the evaluation of the attributes,
and, due to the lack of order constraints, decision rules can be evaluated in many different
ways, which increases the flexibility of adapting them to the different tasks of decision-making
(Bergadano et al., 1990). Since the number of rules learned per class is typically much smaller
than the number of examples per class, generating a decision structure from decision rules can
potentially be done on-line.
Such virtual decision structures can be tailored to any given decision-making situation The
needed decision rules have to be generated only once and then they can be used many times for
generating decision structures according to changing requirements of decision-making tasks The
method uses the AQDT-2 system (Imam amp Michalski 1994) for learning decision structures
from decision rules Decision structures represent a procedural form of knowledge which makes
them easy to implement but also harder to change Consequently decision structures can be quite
effective and useful as long as they are used in decision-making situations for which they are
optimized and the attributes specified by the decision structure can be measured without much
cost Figure 3-1 shows an architecture of the proposed methodology
Figure 3-1: Architecture of the AQDT approach (learning knowledge from the database feeds the decision-making process).
It is assumed that the database is not static but is regularly updated. A decision-making problem
arises when there is a case, or a set of cases, to which the system has to assign a decision based on
the knowledge discovered. Each decision-making situation is defined by a set of attribute-values.
Some attribute-values may be missing or unknown. A new decision structure is obtained such
that it suits the given decision-making problem. The learned decision structure associates the
new set of cases with the proper decisions.
3.2 A Brief Description of the AQ15 and AQ17 Rule Learning Programs
The decision rules are generated from examples by an AQ-type inductive learning system,
specifically by AQ15 (Michalski et al., 1986) or AQ17-DCI (Bloedorn et al., 1993). The rest of
this subsection gives a brief description of the inductive learning systems AQ15 and AQ17.
AQ15 learns decision rules for a given set of decision classes from examples of decisions using
the STAR methodology (Michalski 1983) The simplest algorithm based on this methodology
called AQ starts with a seed example of a given decision class and generates a set of the
most general conjunctive descriptions of the seed (alternative decision rules for the seed
example) Such a set is called the star of the seed example The algorithm selects from the star
a description that optimizes a criterion reflecting the needs of the problem domain If the
criterion is not defined the program uses a default criterion that selects the description that
covers the largest number of positive examples (to minimize the total number of rules needed)
and with the second priority that involves the smallest number of attributes (to minimize the
number of attributes needed for arriving at a decision)
If the selected description does not cover all examples of a given decision class a new seed is
selected from uncovered examples and the process continues until a complete class description
is generated The algorithm can work with few examples or with many examples and can
optimize the description according to a variety of easily-modifiable hypothesis quality criteria
The learned descriptions are represented in the form of a set of decision rules expressed in an
attributional logic calculus called variable-valued logic 1, or VL1 (Michalski, 1973). A
distinctive feature of this representation is that it employs, in addition to standard logic operators,
the internal disjunction operator (a disjunction of values of the same attribute in a condition) and
the range operator (to express conditions involving a range of discrete or continuous values).
These operators help to simplify rules involving multi-valued discrete attributes; the second
operator is also used for creating logical expressions involving continuous attributes.
AQ15 can generate decision rules that represent either characteristic or discriminant concept
descriptions depending on the settings of its parameters (Michalski 1983) A characteristic
description states properties that are true for all objects in the concept The simplest
characteristic concept description is in the form of a single conjunctive rule (in general it can be
a set of such rules). The most desirable is the maximal characteristic description, that is, a rule
with the longest condition part, i.e., one stating as many common properties of objects of the given
class as can be determined. A discriminant description states properties that discriminate a given
concept from a fixed set of other concepts. The most desirable is the minimal discriminant
description, that is, a rule with the shortest condition part. For example, to distinguish a given
set of tables from a set of chairs, one may only need to indicate that tables have a large flat top.
A characteristic description of the tables would also include properties such as have four legs,
have no back, have four corners, etc. Discriminant descriptions are usually much shorter than
characteristic descriptions.
Another option provided in AQ15 controls the relationship among the generated descriptions
(rulesets or covers) of different decision classes In the IC (Intersecting Covers) mode
rulesets of different classes may logically intersect over areas of the description space in which
there are no training examples In the DC (Disjoint Covers) mode descriptions of different
classes are logically disjoint The DC mode descriptions are usually more complex both in the
number of rules and the number of conditions There is also a DL mode (a Decision List mode
27
also called VL modeM-for variable-valued logic mode) in which the program generates rule sets
that are linearly ordered To assign a decision to an example using such rulesets the program
evaluates them in order If ruleset is satisfied by the example then the decision is made
otherwise the program proceeds to the evaluation of the ruleset +1 In IC and DC modes
rule sets can be evaluated in any order
Alternatively, the system can use rules from the AQ17-DCI program for Data-driven
Constructive Induction. AQ17-DCI differs from AQ15 mainly in that it contains a module for
generating additional attributes. These attributes are various logical or mathematical
combinations of the original attributes. The program generates a large number of potential new
attributes and selects from them those most promising, based on an attribute quality criterion.
To illustrate the format of rules generated by AQ15 (or AQ17-DCI) an exemplary ruleset is
shown in Figure 3-2 The ruleset (that can be re-represented as a disjunctive normal form
expression) describes a voting record of Democratic Representatives in the US Congress
Each rule is a conjunction of elementary conditions Each condition expresses a simple relational
statement For example the condition [State = northeast v northwest] states that the attribute
State (of the Representative) should take the value northeast or northwest to satisfy the
condition
R1: [Gas_concban = yes] & [Soc_sec_cut = no v not registered]
R2: [Draft = yes v not registered] & [Alaska_parks = yes v not registered] &
    [Food_stamp_cap = no] & [State = northeast v northwest]
R3: [Chrysler = yes v not registered] & [Income = low]
R4: [Education = yes] & [Occupation = yes]
Figure 3-2 A ruleset generated by AQ15 for the concept Voting pattern of Democratic Representatives
The above rules were generated from examples of the voting records For illustration below is
an example of a voting record by a Democratic representative
Draft registration = no, Ban of aid to Nicaragua = no, Cut expenditure on MX
missiles = yes, Federal subsidy to nuclear power stations = yes, Subsidy to national parks
in Alaska = yes, Fair housing bill = yes, Limit on PAC contributions = yes, Limit on
food stamp program = no, Federal help to education = no, State From = northeast,
State Population = large, Occupation = unknown, Cut in social security spending = no,
Federal help to Chrysler corp. = not registered
By expressing elementary statements in the example as conditions and linking conditions by
conjunction the examples can be re-expressed as decision rules Thus decision rules and
examples formally differ only in the degree of generality
3.3 Generating Decision Structures (Trees) from Decision Rules
This section describes the AQDT-1 system for learning decision structures (trees) from decision
rules (Imam & Michalski, 1993a, b). Also included is a description of the AQDT-2 method for learning
task-oriented decision structures from decision rules; finally, the methodology is
illustrated by two examples.
Methods for learning decision trees from examples have been very popular in machine learning
due to their simplicity Decision trees built this way can be quite efficient as long as they are
used in decision-making situations for which they are optimized and these situations remain
relatively stable Problems arise when these situations significantly change and the assumptions
under which the tree was built do not hold anymore For example in some situations it may be
difficult to determine the value of the attribute assigned to some node One would like to avoid
measuring this attribute and still be able to classify the example if this is potentially possible
(Quinlan, 1990). If the cost of measuring various attributes changes, it is desirable to restructure
the tree so that the inexpensive attributes are evaluated first. A tree restructuring is also
desirable if there is a significant change in the frequency of occurrence of examples from
different classes. A restructuring of a decision tree to suit the above requirements is, however,
difficult to do. The reason for this is that decision trees are a form of decision structure
representation that imposes constraints on the evaluation order of the attributes which are not
logically necessary.
One problem in developing a method for generating decision structures from decision rules is to
design an attribute selection criterion that is based on the properties of the rules rather than of
the training examples. A decision rule normally describes a number of possible examples. Only
some of them are examples that have actually been observed, i.e., training examples. An attribute
selection criterion is needed to analyze the role of each attribute in the rules It cannot be based
on counting the numbers of training examples covered by each attribute-value and the
frequency of decision classes in the training examples as is done in learning decision trees from
examples because the training examples are assumed to be unavailable
Another problem in learning decision trees from decision rules stems from the fact that decision
rules constitute a more powerful knowledge representation than decision trees They can directly
represent a description in an arbitrary disjunctive normal form while decision trees can represent
directly only descriptions in the disjoint disjunctive normal form In such descriptions all
conjunctions are mutually logically disjoint Therefore when transforming a set of arbitrary
decision rules into a decision tree one faces an additional problem of handling logically
intersecting rules
The solution to both problems (attribute selection and logically intersected rules) in the AQDT-2
system is based on the earlier work by Michalski (1978) which introduced a general method for
generating decision trees from decision rules The method aimed at producing decision trees
with the minimum number of nodes or the minimum cost (where the cost was defined as the
total cost of classifying unknown examples given the cost of measuring individual attributes and
the expected probability distribution of examples of different decision classes) More
explanations are provided in the following section
3.3.1 The AQDT-2 Attribute Selection Method
This section describes the AQDT-2 method for building a decision structure from decision rules.
The method for building a single-parent decision structure is similar to that used in standard
methods of building a decision tree from examples The major difference is that it assigns tests
(attributes) to the nodes using criteria based on the properties of the decision rules (this includes
statistics about the examples covered by each rule in the case of learning rules from examples)
rather than statistics characterizing the frequency of training examples per decision class, per
attribute-value, or per conjunction of both. Other differences are that the branches may be
assigned an internal disjunction of values (not only a single value as in a typical decision tree)
and leaves may be assigned a set of alternative decisions with probabilities Also the tests can be
attributes or names standing for logical or mathematical expressions that involve several
attributes or variables In the following we use the terms test and attribute interchangeably
(to distinguish between an attribute and a name standing for an expression the latter is called a
constructed attribute)
At each step the method chooses the test from an available set of tests that has the highest utility
(see below) for the given set of decision rules This test is assigned to the node The branches
stemming from this node are assigned test values, or disjoint groups of values (in the form of a
logical disjunction, if such occur in the rules; subsumed groups of values are removed). Each
branch is associated with a reduced set of rules determined by removing conditions in which the
selected attribute assumes value(s) assigned to this branch If all rules in the reduced ruleset
indicate the same decision class a leaf node is created and assigned this decision class The
process continues until all nodes are leaf nodes. If it is not possible to reduce the ruleset further,
because some attribute is declared as unavailable (infinite cost), then the leaf is assigned a set of
candidate decisions with associated probabilities (see Sec. 4.2).
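The node-building loop just described can be sketched as follows (a hedged illustration, not the AQDT-2 implementation; `utility` stands in for the LEF-combined criteria defined below, and the rule representation is deliberately simplified):

```python
# A rule is a (conditions, class) pair, where conditions maps an attribute
# to the set of values its condition allows (internal disjunction).
def build(rules, utility):
    classes = {cls for _, cls in rules}
    if len(classes) == 1:
        return ("leaf", classes.pop())
    if not any(conds for conds, _ in rules):
        return ("leaf", sorted(classes))          # ambiguous: candidate decisions
    attr = utility(rules)                         # highest-utility test
    values = set().union(*(c.get(attr, set()) for c, _ in rules))
    branches = {}
    for v in values:
        # keep rules consistent with this branch, dropping the satisfied condition
        reduced = [({a: s for a, s in conds.items() if a != attr}, cls)
                   for conds, cls in rules
                   if attr not in conds or v in conds[attr]]
        branches[v] = build(reduced, utility)
    return ("node", attr, branches)

rules = [({"x1": {1}}, "A"), ({"x1": {2}, "x2": {1}}, "B")]
tree = build(rules, lambda rs: "x1")              # always pick x1, for the demo
print(tree)  # ('node', 'x1', {1: ('leaf', 'A'), 2: ('leaf', 'B')})
```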
The test (attribute) utility is a combination of one or more of the following elementary criteria: 1)
cost, which indicates the cost of using the attribute for making a decision; 2) disjointness, which
captures the effectiveness of the test in discriminating among decision rules for different decision
classes; 3) importance, which determines the importance of a test in the rules; 4) value
distribution, which characterizes the distribution of the test importance over its values; and 5)
dominance, which measures the test's presence in the rules. These criteria are defined below.
Cost: The cost of a test expresses the effort or cost needed to measure or apply the test.
Disjointness: The disjointness of a test is defined as the sum of the class disjointness values, i.e., the
disjointness of the test for each decision class. Suppose the decision classes are C1, C2, ..., Cm, and
decision rulesets for these classes have been determined. Given test A, let V1, V2, ..., Vm denote
the sets of values (outcomes) of A that are present in the rulesets for classes C1, C2, ..., Cm,
respectively. If the ruleset for some class, say Ct, contains a rule that does not involve test A, then
Vt is the set of all possible values of A (the domain of A).
Definition 3-1: The degree of class disjointness, D(A, Ci), of test A for the ruleset of class Ci is
the sum of the degrees of disjointness, D(A, Ci, Cj), between the ruleset for Ci and the rulesets for
Cj, j = 1, 2, ..., m, j ≠ i. The degree of disjointness between the ruleset for Ci and the ruleset for Cj is
defined by:

D(A, Ci, Cj) =  0,  if Vi = Vj
                1,  if Vi ⊂ Vj or Vi ⊃ Vj                                (3-1)
                2,  if Vi ∩ Vj ≠ ∅, Vi ∩ Vj ≠ Vi, and Vi ∩ Vj ≠ Vj
                3,  if Vi ∩ Vj = ∅

where ∅ denotes the empty set. Note that exchanging the second and third conditions of equation
(3-1) may seem to yield an improved criterion; however, it would not clearly distinguish between the two
cases (i.e., both situations would receive the same disjointness). The current equation is better
because it gives higher scores to attributes that discriminate different subsets of the two decision
classes than to attributes that discriminate only a subset of one decision class.
Definition 3-2: The disjointness of the test A for evaluating a given set of decision rules is the
sum of the degrees of class disjointness over all decision classes:

Disjointness(A) = Σ(i=1..m) D(A, Ci),  where  D(A, Ci) = Σ(j=1..m, j≠i) D(A, Ci, Cj)    (3-2)

The disjointness of a test ranges from 0, when the test values in the rulesets of different classes are
all the same, to 3m(m-1), when every ruleset of a given class contains a different set of the test
values. If two tests have the same disjointness value, the attribute selected is the one with the
smaller number of values.
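The disjointness criterion can be sketched directly from equations (3-1) and (3-2) (a hedged illustration under the reconstruction above; the value sets and class names are hypothetical):

```python
# value_sets[c] is the set of values of test A appearing in the ruleset for
# class c (by convention, the full domain if some rule for c omits A).
def pair_disjointness(vi, vj):
    if vi == vj:
        return 0
    if vi < vj or vi > vj:   # proper subset / superset
        return 1
    if vi & vj:              # overlap, but neither set contains the other
        return 2
    return 3                 # completely disjoint value sets

def disjointness(value_sets):
    classes = list(value_sets)
    return sum(pair_disjointness(value_sets[ci], value_sets[cj])
               for ci in classes for cj in classes if ci != cj)

# With two classes, disjoint value sets reach the maximum 3m(m-1) = 6.
print(disjointness({"C1": {1}, "C2": {2, 3}}))     # 6
print(disjointness({"C1": {1, 2}, "C2": {2, 3}}))  # 4
```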
Definition: The Average Number of Tests (ANT) required to make a decision from a decision tree
is defined as the average number of tests (attributes) to be examined, from the root of the tree to
any leaf node, in order to reach a decision.

Definition: A decision structure is a one-node-per-level decision structure if at each level there is
only one node and zero or more leaves.

Such a decision structure can be generated by combining into one branch all branches whose
associated sets of decision rules belong to more than one decision class.
Theorem 1: Consider learning a one-node-per-level decision structure for a database with two
decision classes. The disjointness criterion ranks first the attributes that add the minimum number of
tests to the decision tree.
Proof: Consider all possible distributions of an attribute's values within two decision classes, Ci
and Cj. There are three cases: 1) subset (same as superset); 2) non-empty intersection, but not a
subset; 3) no intersection. Figure 3-3 shows all possible distributions (the case of having the
same set of values in both classes is trivial). Assume that branches leading to subsets
with the same decision class are combined into one branch. In the first case there will be only two
branches: the first leads to a leaf node, and the other leads to an intermediate node where
another attribute is to be selected. The minimum ANT in this case is 5/3. In the second case, three
branches are created. Two branches lead to leaf nodes, where all values at each branch
belong to only one (and a different) decision class. The third branch leads to an intermediate node
where another attribute should be selected to further classify the decision classes. The minimum
ANT in this case is 6/4. In the third case, only two branches are generated, each leading
to a leaf node with a different decision class. In this case the minimum ANT is 1.
Figure 3-4 shows the decision trees equivalent to selecting the corresponding attribute. Note that
when more than one attribute-value leads to leaves belonging to one decision
class, the corresponding branches are combined into one branch in the decision structure. The symbol "?" means
that another attribute is needed to classify the two decision classes. In such cases there will be at least
two additional paths.
[Figure 3-3 panels, labeled with the disjointness values for each case: D(A, Ci) = 1, D(A, Cj) = 1; D(A, Ci) = 2, D(A, Cj) = 2; D(A, Ci) = 3, D(A, Cj) = 3.]

Figure 3-3: Venn diagrams of the possible combinations of attribute values in two decision classes
The average number of tests required for making a decision in each possible case is determined
in Figure 3-4. It is clear that, in the case of two decision classes, the disjointness criterion ranks
highly the attributes that reduce the average number of tests required for decision-making. The
theorem can be proved similarly in the general case.
[Figure 3-4 panels, with the minimum average number of tests for each case: ANT = 5/3, ANT = 6/4, ANT = 1. A "?" marks a subtree where at least one more attribute is needed to complete the decision tree.]

Figure 3-4: Decision trees corresponding to the Venn diagrams in Figure 3-3
Theorem 2: The attribute with the highest non-zero disjointness is the best attribute to classify
the decision classes.

Proof: Suppose that the number of decision classes is n. Assume also that there are two attributes,
A and B, where D(A) < D(B). Since the disjointness criterion considers the mutual relationship
between any two decision classes, this means that there are more decision classes where D(A,
Ci) < D(B, Ci) than those where D(A, Ci) > D(B, Ci) (ignoring classes with equal disjointness
for both attributes). Hence, there are more decision classes where D(A, Ci, Cj) < D(B, Ci, Cj)
than those where D(A, Ci, Cj) > D(B, Ci, Cj). Let us first prove that if D(A, Ci, Cj) < D(B, Ci,
Cj), then B classifies the decision classes better than A.

For each pair of decision classes, Ci and Cj, the possible values of the disjointness of any attribute
are 0, 2, 4, or 6. For all positive values of D(B) = 2, 4, or 6, it is clear that attribute B should have a
smaller ANT than attribute A, which has lower disjointness. This means that if D(A, Ci, Cj) <
D(B, Ci, Cj), then attribute B can classify both decision classes better than attribute A.
Similarly, B is better for classifying more pairs of decision classes than A. This implies that B is
a better classifier than A.
Importance: The second elementary criterion, the importance of a test, is based on the
importance score (IS) introduced in (Imam, Michalski & Kerschberg, 1993). In the obtained
rules, each test is assigned a score that represents the total number of training examples that
are covered by the rules involving this test. Decision rules learned by an AQ learning program
are accompanied by information on their strength. Rule strength is characterized by the t-weight
and the u-weight. The t-weight (total-weight) of a rule for some class is the number of examples of
that class covered by the rule. The importance score of a test is the aggregation of the total-weights
of all rules that contain that test in their condition part. Given a set of decision rules for
m decision classes, C1, ..., Cm, involving n tests, A1, ..., An, and with the number of rules associated
with class Ci denoted by ri, the importance score is defined as follows.
Definition 3-3: The importance score, IS(Aj), of the test Aj is determined by:

IS(Aj) = Σ(i=1..m) IS(Aj, Ci)                                    (3-3.1)

where

IS(Aj, Ci) = Σ(k=1..ri) Rik(Aj)                                  (3-3.2)

and Rik(Aj), the weight of the test Aj in the rule Rik of class Ci, is given by:

Rik(Aj) = t-weight of Rik, if Aj belongs to rule Rik; 0 otherwise    (3-4)

where i = 1, ..., m; k = 1, ..., ri; j = 1, ..., n.
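Definition 3-3 reduces to a simple aggregation, sketched here as a hedged illustration (the rule representation and the example rules are hypothetical, not from the dissertation):

```python
# Each rule is represented as (attributes mentioned in its conditions,
# t-weight); a test's importance score sums the t-weights of all rules,
# over all classes, that mention it.
def importance_scores(rulesets):
    """rulesets: {class_name: [(set_of_condition_attributes, t_weight), ...]}"""
    scores = {}
    for rules in rulesets.values():
        for attrs, t_weight in rules:
            for a in attrs:
                scores[a] = scores.get(a, 0) + t_weight
    return scores

rules = {
    "C1": [({"x1", "x2"}, 5), ({"x3"}, 2)],   # t-weights 5 and 2
    "C2": [({"x1"}, 4)],
}
print(sorted(importance_scores(rules).items()))
# [('x1', 9), ('x2', 5), ('x3', 2)]
```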
The importance score method has been separately compared, as a feature selection method, with
a genetic algorithm-based method (Imam & Vafaie, 1994). The importance score method
produced equal or higher accuracy on three real-world problems than that reported for the
GA method, while selecting fewer attributes. In addition, the IS method was significantly faster
than the GA.
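The aggregation in Definition 3-3 can be sketched as follows. This is an illustrative sketch only, not AQDT-2's actual code; the data layout (each rule as a tuple of its class, the attributes in its condition part, and its t-weight) is an assumption of this illustration.

```python
# Sketch of the importance score (IS) computation of Definition 3-3:
# IS(Aj) sums the t-weights of all rules whose condition part mentions Aj.

def importance_scores(rules):
    """rules: list of (decision_class, attributes_in_conditions, t_weight)."""
    scores = {}
    for _cls, attrs, t_weight in rules:
        for attr in attrs:
            # Each rule contributes its t-weight to every attribute it tests.
            scores[attr] = scores.get(attr, 0) + t_weight
    return scores

# Hypothetical rules: class, attributes used, t-weight.
rules = [
    ("T1", {"x1", "x2"}, 5),
    ("T1", {"x1", "x3", "x4"}, 3),
    ("T2", {"x1", "x2"}, 4),
]
print(importance_scores(rules))  # {'x1': 12, 'x2': 9, 'x3': 3, 'x4': 3}
```

Here x1 appears in all three rules, so its score is the sum of all three t-weights.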
Value distribution: The third elementary criterion, value distribution, concerns the number of
legal values of tests. Given two tests with equal importance scores, this criterion prefers the test
with the smaller number of legal values. Experiments have shown that this criterion is especially
useful when using discriminant decision rules.

Definition 3-4: The value distribution VD(Aj) of a test Aj is defined by

    VD(Aj) = IS(Aj) / vj                                      (3-5)

where vj is the number of legal values of Aj.
Dominance: The fourth elementary criterion, dominance, prefers tests that appear in large
numbers of rules, as this indicates their high relevance for discriminating among the rulesets of
the given decision classes. Since some conditions in the rules have values linked by internal
disjunction, counting such rules directly would not properly reflect their relevance. Therefore,
for computing the dominance, the rules are counted as if they were converted to rules that do not
have internal disjunction. Such a conversion is done by multiplying out the condition parts of
the rules containing internal disjunction. For example, the condition part [x3=1 v 3] & [x4=1] is
multiplied out to two rules with condition parts [x3=1] & [x4=1] and [x3=3] & [x4=1].
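The multiplying-out step is a Cartesian product over the value sets of the condition part. The sketch below illustrates this; the representation of a condition part as a dict mapping each attribute to its set of disjunctively linked values is an assumption of this illustration.

```python
from itertools import product

# Sketch of "multiplying out" a condition part with internal disjunction,
# as used when computing the dominance criterion.

def multiply_out(condition_part):
    """condition_part: {attribute: set_of_values_linked_by_internal_disjunction}."""
    attrs = list(condition_part)
    # One expanded (disjunction-free) rule per combination of single values.
    return [dict(zip(attrs, combo))
            for combo in product(*(sorted(condition_part[a]) for a in attrs))]

# [x3=1 v 3] & [x4=1] expands to [x3=1]&[x4=1] and [x3=3]&[x4=1]:
print(multiply_out({"x3": {1, 3}, "x4": {1}}))
# [{'x3': 1, 'x4': 1}, {'x3': 3, 'x4': 1}]
```

The dominance of an attribute then counts the expanded rules in which it occurs.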
The above criteria are combined into one general test measure using the lexicographic
evaluation functional with tolerances (LEF) (Michalski, 1973). LEF is a list of some or all of
the above elementary criteria, each associated with a "tolerance threshold" in percentage. The
criteria are applied to tests in the order defined by the LEF. A test passes to the next criterion only if
it scores on the previous criterion within the range defined by the tolerance (from the top value).
The default LEF is

    <Cost, t1; Disjointness, t2; Importance, t3; Value distr., t4; Dominance, t5>    (3-6)

where t1, t2, t3, t4 and t5 are tolerance thresholds (in percentages); their default values are 0.
The default value of the cost of each test is 1.
The above LEF ranks attributes this way. First, attributes are evaluated on the basis of their cost.
If two or more attributes have the same cost, they are ranked by their degree of disjointness. If two
or more attributes still share the same top score, or their scores differ by less than the assumed
tolerance threshold t2, the method evaluates these attributes using the next (importance)
criterion. If again two or more attributes share the same top score, or their scores differ by less than
the tolerance threshold t3, then the third criterion (value distribution, the normalized IS) is used, and
then similarly the fourth criterion (dominance). If there is still a tie, the method selects the best
attribute randomly.
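The ranking procedure above can be sketched as a generic lexicographic filter. This is an illustrative sketch, not the LEF implementation; the score functions and tolerance handling (a fraction of the top value) are assumptions of this illustration, and cost would be negated so that all criteria are maximized.

```python
# Sketch of lexicographic evaluation with tolerances (LEF): apply criteria
# in order, at each step keeping only the attributes that score within the
# tolerance of the top value, until one attribute survives.

def lef_rank(attributes, criteria):
    """criteria: list of (score_function, tolerance_fraction), scores maximized."""
    candidates = list(attributes)
    for score, tol in criteria:
        if len(candidates) == 1:
            break
        best = max(score(a) for a in candidates)
        # Keep attributes scoring within the tolerance of the top value.
        candidates = [a for a in candidates
                      if score(a) >= best - abs(best) * tol]
    return candidates[0]  # ties after the last criterion: pick arbitrarily

# Hypothetical scores for three attributes.
disjointness = {"x1": 11, "x2": 10, "x3": 6}.__getitem__
importance   = {"x1": 20, "x2": 25, "x3": 15}.__getitem__

# A 10% tolerance on disjointness keeps x1 and x2; importance then picks x2.
print(lef_rank(["x1", "x2", "x3"], [(disjointness, 0.10), (importance, 0.0)]))  # x2
```

With a 0% tolerance on disjointness, x1 would win outright and the importance criterion would never be consulted.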
If there is a non-uniform frequency distribution of examples of the different classes, then the
selection criterion uses a modified definition of the disjointness. Namely, the previously defined
disjointness for each class is multiplied by the frequency of the class occurrence. The class
occurrence is the expected number of future examples that are to be classified to a given class:

    Disjointness(A) = Σ (i=1..m) D(A, Ci) * Frq(Ci)           (3-7)

where Frq(Ci) is the expected frequency of examples of class Ci, and it is assumed to be given
by the user. The attribute ranking criterion in this case is defined by the LEF

    <Cost, t1; Disjointness, t2; Importance, t3; Normalized-IS, t4; Dominance, t5>    (3-8)

where Cost denotes the evaluation cost of an attribute and is to be minimized, while the other
elementary criteria are treated the same way as in the default version.
332 The AQDT-2 algorithm
The AQDT-2 algorithm constructs a decision structure from decision rules by recursively
selecting, at each step, the best test according to the ranking criteria described above and
assigning it to a new node. The process stops when the algorithm creates terminal branches that
are assigned decision classes. To facilitate this process, the system creates a special data
structure for each concept description (ruleset). This structure has fields such as the number of
rules, the number of decision classes, and the number of attributes present in the rules. A set of
pointers connects this data structure to a set of data structures, each representing one decision class.
A decision class structure contains fields with information on the number of rules belonging to
that class, the frequency of the decision class, etc. It is also connected to a set of data structures
representing the decision rules within each decision class. The system independently creates a set
of data structures, each corresponding to one attribute. Each attribute description contains the
attribute's name, domain, type, the number of legal values, a list of the values, the number of
rules that contain that attribute, and the values of that attribute for each rule. The attributes are
arranged in an array in lexicographic order: first in descending order of the number of rules
that contain the attribute, and second in ascending order of the number of the attribute's
legal values.
The system can work in two modes. In the standard mode, the system generates standard
decision trees, in which each branch has a specific attribute-value assigned. In the compact
mode, the system builds a decision structure that may contain:

A) "or" branches, i.e., branches assigned an internal disjunction of attribute-values, whenever this
leads to simpler structures. For example, if a node assigned attribute A has a branch marked by
values 1 v 2, then control passes along this branch whenever A takes value 1 or 2. The
program creates "or" branches on the basis of the analysis of the value sets Vi while computing
the degree of attribute disjointness.

B) nodes that are assigned derived attributes, that is, attributes that are certain logical or
mathematical combinations of the original attributes. To produce decision structures with
derived attributes, the input decision rules are generated by AQ17 (rather than AQ15). The
AQ17 rules may contain conditions involving attributes constructed by the program rather than
those originally given.
To generate decision structures from rules, the AQDT-2 method prefers characteristic or
discriminant disjoint rule descriptions (given by an expert or learned by a system); disjoint rules
are more suitable for building decision structures. Assume that the description of each class is in
the form of a ruleset, and that this set is the initial ruleset context. The AQDT algorithm is:
The AQDT-2 Algorithm
Step 1: Evaluate each attribute occurring in the ruleset context using the LEF attribute ranking
measure. Select the highest-ranked attribute; let A represent this attribute.

Step 2: Create a node of the tree (initially the root, afterwards a node attached to a branch), and
assign to it the attribute A. In standard mode, create as many branches from the node as
there are legal values of the attribute A, and assign these values to the branches. In
compact mode (decision structures), create as many branches as there are disjoint value
sets of this attribute in the decision rules, and assign these sets to the branches.

Step 3: For each branch, associate with it a group of rules from the ruleset context that contain a
condition satisfied by the value(s) assigned to this branch. For example, if a branch is
assigned value i of attribute A, then associate with it all rules containing a condition [A=
i v ...]. If a branch is assigned values i v j, then associate with it all rules containing a
condition [A= i v j v ...]. Remove these conditions from the rules. If there are rules in
the ruleset context that do not contain attribute A, add these rules to all rule groups
associated with the branches stemming from the node assigned attribute A. (This step is
justified by the consensus law: [x=1] ≡ [x=1] & [y=a] v [x=1] & [y=b], assuming
that a and b are the only legal values of y.) All rules associated with a given branch
constitute the ruleset context for this branch.

Step 4: If all the rules in the ruleset context for some branch belong to the same class, create a leaf
node and assign to it that class. If all branches of the tree have leaf nodes, stop;
otherwise, repeat Steps 1 to 4 for each branch that has no leaf.
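As a rough illustration, Steps 1 to 4 can be condensed into a recursive sketch for standard mode. This is not AQDT-2's implementation: the full LEF measure is abstracted into a caller-supplied `rank` function (here, a stand-in that prefers the attribute occurring in the most rules), and the data layout is an assumption of this illustration.

```python
# Condensed sketch of Steps 1-4 in standard mode. Rules are represented as
# (class, {attribute: set_of_allowed_values}) pairs.

def build_tree(rules, domains, rank):
    if not rules:
        return None                          # no rules reach this branch
    classes = {cls for cls, _ in rules}
    if len(classes) == 1:                    # Step 4: create a leaf
        return classes.pop()
    used = {a for _, conds in rules for a in conds}
    attr = rank(used, rules)                 # Step 1: pick the best attribute
    node = {}
    for value in domains[attr]:              # Step 2: one branch per value
        # Step 3: keep the rules satisfied by this value (rules that do not
        # mention attr go to every branch), dropping the consumed condition.
        subset = [(cls, {a: v for a, v in conds.items() if a != attr})
                  for cls, conds in rules
                  if attr not in conds or value in conds[attr]]
        node[(attr, value)] = build_tree(subset, domains, rank)
    return node

# Stand-in ranking: most frequently used attribute, ties broken by name.
rank = lambda used, rules: min(used, key=lambda a: (-sum(a in c for _, c in rules), a))

# The "P if x1=x2, N otherwise" rules discussed later in this chapter:
rules = [("P", {"x1": {1}, "x2": {1}}), ("P", {"x1": {2}, "x2": {2}}),
         ("N", {"x1": {1}, "x2": {2}}), ("N", {"x1": {2}, "x2": {1}})]
print(build_tree(rules, {"x1": [1, 2], "x2": [1, 2]}, rank))
```

The sketch reproduces the key property of Step 3: a rule that does not mention the selected attribute is duplicated into every branch, as justified by the consensus law.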
To select an attribute to become a node of the decision tree (Steps 1 and 2 of the algorithm), the
algorithm performs two independent iterations. In the first iteration, it parses all the decision
rules and determines information about each attribute. This information includes the importance
score of each attribute, the number of rules containing a given attribute, the disjoint value sets of
each attribute, and the attribute values used in describing each decision class. The second
iteration is performed only if the disjointness criterion is ranked first in the LEF. The
second iteration evaluates each attribute's disjointness for each decision class against the other
decision classes.
To determine the complexity of this process, suppose that the maximum number of conditions
formed by a single attribute value in one rule is s (where s ≤ n, the number of attributes), and r is
the total number of decision rules (in all decision classes):

    r = Σ (i=1..m) Ri    (m is the number of decision classes)

where Ri is the number of rules in decision class Ci. The complexity of the first iteration can be
determined as

    Cmpx(Iter1) = O(r * s)

In the second iteration, the disjointness is calculated between the decision classes for all
attributes. The complexity of the second iteration can be given by

    Cmpx(Iter2) = O(n * m)
Assume that at each node, l is the maximum of the number of decision classes to be classified at
this node and the number of rules associated with this node:

    l = max{m, r}    (3-9)

Since these two iterations are independent of each other, the complexity of the AQDT algorithm
for building one node of the decision tree, say the node complexity NC(AQDT), is given by

    NC(AQDT) = O(l * n)

Usually, l equals the number of rules associated with the given node. Thus, the AQDT
complexity for building one node is a function of the number of attributes multiplied by the
number of rules associated with this node.
At each level of the decision tree, these two iterations are repeated for each non-leaf node. The
level complexity of the AQDT algorithm, LC(AQDT), can be given by

    LC(AQDT) < O(l * n)

which is less than the complexity of generating the root of the decision tree. To explain this,
consider that the maximum possible number of non-leaf nodes at one level is half the number of the
initial decision rules (the integer part of r/2). Figure 3-5-a shows an example of such a situation, where at
the lowest level each node classifies only two rules, each belonging to a different decision class. This
decision tree could be a multi-valued tree; however, the number of non-leaf nodes at each level
will be twice or more the number of non-leaf nodes at the previous level. Consider the level
complexity of the AQDT algorithm to be (l * s * o), where o is the number of non-leaf
nodes at the given level. In such cases, either (l * o ≤ r) or (l * s < r). In Figure 3-5-a, the
complexity of the AQDT algorithm at any lower level is given by

    LC(AQDT) = O(2 * s * r/2) < O(n * l) = NC(AQDT)
Figure 3-5: Decision trees showing the maximum number of non-leaf nodes: a) per one level; b) per one path.
Note also that after an attribute is selected for the root of the decision structure, this attribute
and all conditions containing it are removed from the data structures of the algorithm.
Also, if a leaf node is generated, all rules belonging to the corresponding branch are not
tested again.

Since the disjointness criterion selects the attribute that minimizes the average number of tests
(ANT), the AQDT algorithm generates decision trees with the least number of levels.
The number of levels per decision tree is supposed to be less than or equal to the minimum of
the number of attributes and the number of rules. Consider k as the number of levels in a
given decision tree:

    k ≤ min{n, r}    (3-10)

There are two cases representing the most complex situations, Figures 3-5-a and 3-5-b. In the first
case, where the decision rules are divided evenly, the number of levels is a function of the
logarithm of the number of rules. In this case, the complexity of the AQDT algorithm for
generating a decision tree from a set of decision rules is given by

    Complexity(AQDT) = O(l * n * log r)    (3-11)

The other situation arises when the generated decision tree has the maximum number of levels. The
maximum possible number of levels per decision tree equals one less than the number of
decision rules, Figure 3-5-b. Using the disjointness criterion, such a decision tree is unlikely to be
obtained, because it has the maximum average number of tests (ANT) that can be determined from the
same set of nodes and leaves. However, such a decision tree can be generated if the number of
decision classes is one less than the number of attributes. In this case, any disjoint decision rules
should have a maximum length that is less than or equal to the floor of the logarithm of the number of
attributes. Thus, the level complexity of this decision tree is estimated as

    LC(AQDT) = O(l * log n)

The maximum number of levels in such a decision tree is k-1. Thus, the complexity of the AQDT
algorithm in such cases is given by

    Complexity(AQDT) = O(l * k * log n)    (3-12)

Since k is the minimum of n and r, n is the maximum of k and n. Also, since in (3-12)
r = n+1, log r is greater than log n. From (3-10), (3-11) and (3-12), the complexity of the AQDT
algorithm is determined by

    Cmplx(AQDT) = O(r * k * log l)    (3-13)
333 An example illustrating the algorithm
The following simple example illustrates how AQDT is used in selecting an optimal set of
testing resources for testing software. Suppose there are three tools for testing software: 1)
modeling (T1), 2) checklist (T2), and 3) par_simul (T3). Also assume that there are four
different factors that affect the selection of any tool: 1) the cost of using the tool (x1), 2) the
metrics that support the tool (x2), 3) the best phase for applying the tool (x3), and 4) the type of the tool
(automated, semi-automated, or manual) (x4). Table 3-1 shows these attributes and their possible
values.

Suppose that the domain expert provided a set of rules to be used in testing software. Figure
3-6 shows a sample of these rules in AQ15c format.
Table 3-1: The available tools and the factors that affect the process
T1 <= [x1=2] & [x2=2] v [x1=3] & [x3=1 v 3] & [x4=1]
T2 <= [x1=1 v 2] & [x2=3 v 4] v [x1=3] & [x3=1 v 2] & [x4=2]
T3 <= [x1=1] & [x2=1] v [x1=4] & [x3=2 v 3] & [x4=3]

Figure 3-6: Decision rules for selecting the best tool for testing software
These rules can be interpreted as:

Rule 1: Use the first tool for testing if you need average cost and the tool is
supported by the requirement metric.
Rule 2: Use the first tool for testing if you can afford high cost for testing, either
in the requirement or the analysis phase, and you need an automated tool.
Rule 3: Use the second tool for testing if the cost limit is low or average and the
tool is supported by the system usage metric or the intractability metric.
Rule 4: Use the second tool for testing if you can afford high cost for testing,
either in the requirement or the design phase, and you need a manual tool.
Rule 5: Use the third tool for testing if you are limited to low cost and the
tool is supported by the error rate metric.
Rule 6: Use the third tool for testing if you can afford very high cost for testing,
either in the requirement or the system usage phase, and you need a semi-automated
tool.
Table 3-2 presents information on these rules and the disjointness values for all attributes. For
each class, the row marked "Values" lists the values occurring in the ruleset for this class. For
evaluating the disjointness of an attribute, say A, each rule in the ruleset above that does not
contain attribute A is treated as having an additional condition [A= a v b v ...], where a, b, ...
are all the legal values of A.

The row "Class disjointness" specifies the class disjointness for each attribute. The attribute x1
has the highest disjointness (11) and is assigned to the root of the tree. For simplicity, assume
the tolerances for each elementary criterion equal 0.
From the rules in Figure 3-6, we can also determine the disjoint groupings of attribute-values used
in the compact mode of the algorithm. This is done as follows: 1) determine, for each attribute, the
sets of values that the attribute takes in the individual decision rules, and 2) remove those value sets
that subsume other value sets. The remaining value sets are assigned to branches stemming from
the node marked by the given attribute. For example, x1 has the following value sets in the
individual decision rules: {2}, {3}, {1, 2}, {1}, and {4} (Figure 3-6). The value set {1, 2} is
removed, as it subsumes {2} and {1}. In this case, branches are assigned individual values of the
domain of x1. For attribute x2, the value sets are {1}, {2}, {3, 4}, and {1, 2, 3, 4}. In this case,
branches are assigned the value sets {1}, {2}, and {3, 4}.
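The subsumption filtering just described can be sketched compactly. This is an illustrative sketch of the idea, not AQDT-2's code; the inputs below are the value sets that x1 and x2 take in the rules of Figure 3-6 (rules not mentioning x2 contribute the full domain {1, 2, 3, 4}).

```python
# Sketch of deriving the disjoint value groupings used in compact mode:
# collect the value sets an attribute takes in the individual rules and
# drop any set that subsumes (is a strict superset of) another set.

def value_groupings(value_sets):
    sets = {frozenset(s) for s in value_sets}          # deduplicate
    kept = [set(s) for s in sets
            if not any(other < s for other in sets)]   # drop supersets
    return sorted(kept, key=sorted)

# x1 takes {2}, {3}, {1,2}, {1}, {4} in the rules of Figure 3-6;
# {1,2} is removed because it subsumes {1} and {2}:
print(value_groupings([{2}, {3}, {1, 2}, {1}, {4}]))   # [{1}, {2}, {3}, {4}]

# x2 takes {2}, {3,4}, {1}, {1,2,3,4}; the full set is removed:
print(value_groupings([{2}, {3, 4}, {1}, {1, 2, 3, 4}]))  # [{1}, {2}, {3, 4}]
```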
Attribute x1 ranks highest (as it has the highest disjointness) and is assigned to the root of the
tree. Four branches are created, each corresponding to one of x1's possible values. Since all
rules containing [x1=4] belong to class T3, the branch marked by 4 ends in a leaf T3. Rules
containing other values of x1 belong to more than one class. This process is repeated for each
subset of rules until the decision tree is completed. Figure 3-7 shows a decision structure learned
by AQDT-2 from the rules in Figure 3-6. The decision structure in Figure 3-7 can be used in
making decisions on which tools to use for testing given software.
Figure 3-7: A decision structure learned for classifying software testing tools (complexity: 4 nodes, 7 leaves)
Figure 3-8-a shows the diagrammatic visualization of the decision rules, and Figure 3-8-b shows
the visualization of the derived decision tree. Each diagram in Figure 3-8 consists of cells, each
representing one combination of attribute-values. Attributes and their legal values are shown on
scales surrounding the diagram (e.g., the horizontal scale for x1 shows values 1, 2, 3, and 4).
Rules are represented by collections of cells at the intersections of the rows and columns
corresponding to the conditions in the rules.

The shaded areas correspond to decision rules; rules of the same class have the same type of
shading. Empty cells correspond to combinations of attribute-values not assigned to any class. For
illustration, the collections of cells corresponding to some of the initial rules are marked R11, R21,
R31, and R32. R11 denotes the first rule of class T1, i.e., [x1=2] & [x2=2]; R21 denotes the
first rule of class T2, i.e., [x1=1 v 2] & [x2=3 v 4]; R31 denotes the first rule of class T3, i.e., [x1=1]
& [x2=1]; and R32 denotes the second rule of class T3, i.e., [x1=4] & [x3=2 v 3] &
[x4=3].
Figure 3-8: a) Decision rules; b) Derived decision tree
Comparing the diagrams in Figures 3-8-a and 3-8-b, the AQDT-2 decision tree represents a slightly more
general description of the concepts T1, T2, and T3 than the original rules. Let us assume that it is
very costly to determine which metrics support the required tools; in other words, suppose that we
would like to select the best tools independently of the metrics they support (this is indicated to
AQDT-2 by assigning a very high cost to attribute x2). The algorithm again assigns attribute x1 to
the root of the tree. When x1 takes value 1 or 2, it is impossible to assign a specific decision
without measuring attribute x2. For the value 1 of x1, the recommended tool can be either T2 or
T3 (see the diagram in Figure 3-9-a), and for the value 2 of x1, the recommended tool can be
either T1 or T2. However, for the value 3 of x1, one can make a specific decision after
measuring attribute x4: for the value 1 of x4 the recommended tool is T1, and for the value 2 of x4 the
recommended tool is T2. Figure 3-9-b shows another decision tree, in which the type of the tool
was ignored in the data. Such decision trees are called indeterminate, because some of their
leaves are assigned a disjunction of two or more class names.
Figure 3-9: Decision trees learned ignoring the support metric and the type of the testing tool: a) ignoring the supporting metric; b) ignoring the type of the tool
It is clear that the given set of rules depends highly on the amount of money that can be spent. Let us now
suppose that the cost attribute x1, which was determined to be the highest-ranked attribute, cannot
be measured. The algorithm then selects x4 as the root of the new decision tree. After continuing
the algorithm, the tree in Figure 3-10 is obtained. The nodes that are assigned one class indicate
situations in which it is possible to make a specific decision without knowing the value of
attribute x1.

Figure 3-10: A decision tree learned without the cost attribute
34 Tailoring Decision Structures to the Decision-making situation
Decision structures are among the simplest structures for organizing a decision-making process.
A decision structure specifies explicitly the order in which attributes of an object or situation
need to be evaluated in the process of determining a decision. A standard way to generate a
decision structure is to learn it from examples of decisions. Such a process usually aims at
obtaining a decision structure that has the highest prediction accuracy, that is, the highest rate of
assigning correct decisions to given situations. There can usually be a large number of logically
equivalent decision structures (Michalski, 1990); as such, they may have the same predictive
accuracy but differ in the way they organize the decision process, and thus may differ in the cost
of arriving at a decision. To minimize the average decision cost, one needs to take into
consideration the distribution of the costs of attribute evaluation and the frequency of different
decisions. This chapter presents an approach to building such task-oriented decision structures
which advocates that they be built not from examples but rather from decision rules. Decision
rules are learned from examples using the AQ15 or AQ17 inductive learning program, or are
specified by an expert. An efficient algorithm, implemented in the new system AQDT-2, transforms
decision rules into task-oriented decision structures. The system is illustrated by applying it to the
problem of learning decision structures in the area of construction engineering (for determining
the best wind bracing for tall buildings). In the experiments, AQDT-2 outperformed all other
programs applied to the same data.
Decision-making situations can vary in several respects. In some situations, complete
information about a data item is available (i.e., values of all attributes are specified); in others, the
information may be incomplete. To reflect such differences, the user specifies a set of parameters
that control the process of creating a decision structure from decision rules. AQDT-2 provides
several features for handling different decision-making problems: 1) generating a decision
structure that tries to avoid unavailable or costly attributes (tests); 2) generating "unknown"
leaves in situations where there is insufficient information for generating a complete decision
structure; 3) providing the user with the most likely decision when performing a required test is
impossible; 4) providing alternative decisions, with an estimate of the likelihood of their
correctness, when the needed information cannot be provided.
341 Learning Cost-Dependent Decision Structures

As described in Section 2, the LEF criterion enables the system to take into consideration the cost of
measuring tests (attributes) when developing a decision structure. In the default setting of the LEF, the
test cost is the first criterion and its tolerance is 0%. This means that only the least expensive
attributes pass to the next step of evaluation involving the other elementary criteria. If an attribute
has a high cost, or is impossible to measure (has an infinite cost), the LEF chooses another,
cheaper attribute if possible.
342 Assigning Decisions Under Insufficient Information

In decision-making situations in which one or more attributes cannot be measured, the system
may not be able to assign a definite decision for some cases. If no more information can be
obtained but a decision has to be made, it is useful to know the probability distribution for the
different candidate decisions (Smyth, Goodman & Higgins, 1990); the most probable decision is
then chosen. The probability distribution can be estimated from the class frequencies at the given
node. Let us consider a node in the structure connected to the root by the sequence b1, b2, ..., bk of
attribute-values. Let P(Ci | b1, b2, ..., bk) denote the conditional probability of class Ci, i = 1, 2, ..., m,
at that node, given that an example to be classified has the attribute-values assigned to branches b1,
b2, ..., bk in the decision structure. Using the Bayesian formula, we have

    P(Ci | b1, ..., bk) = P(Ci) * P(b1, ..., bk | Ci) / P(b1, ..., bk)    (3-9)

where P(Ci) is the a priori probability of class Ci, and P(b1, ..., bk | Ci) is the probability that the
example has attribute-values b1, ..., bk given that it represents decision class Ci. To approximate
these probabilities, let us suppose that wi is the number of training examples of class Ci that
passed the tests leading to this node, and twi is the total number of training examples of class
Ci. Let us use these numbers to estimate the probabilities in (3-9). Assuming that the a priori class
probabilities correspond to the frequencies of training examples from the different classes, we have
    P(Ci) = twi / Σ (j=1..m) twj                              (3-10)

    P(b1, ..., bk | Ci) = wi / twi                            (3-11)

    P(b1, ..., bk) = Σ (j=1..m) wj / Σ (j=1..m) twj           (3-12)

By substituting (3-10), (3-11) and (3-12) in (3-9), we obtain

    P(Ci | b1, ..., bk) = wi / Σ (j=1..m) wj                  (3-13)
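Formula (3-13) says the estimate reduces to the fraction of examples reaching the node that belong to each class. A minimal sketch (the counts below are illustrative, not taken from any experiment in this chapter):

```python
# Sketch of formula (3-13): the probability of class Ci at a node is
# wi / sum(wj), where wi counts the training examples of Ci that passed
# the tests leading to the node.

def class_distribution(w):
    """w: {class_name: number_of_examples_of_that_class_reaching_the_node}."""
    total = sum(w.values())
    return {cls: wi / total for cls, wi in w.items()}

# Hypothetical counts at a node reached by some branch sequence b1...bk:
print(class_distribution({"T1": 6, "T2": 3, "T3": 1}))
# {'T1': 0.6, 'T2': 0.3, 'T3': 0.1}
```

The most probable decision at the node is then the class with the largest estimated probability (here T1).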
A related method for handling the problem of the unavailability of an attribute is described by
Quinlan (1990). Quinlan's method, however, assigns probabilities to the most probable decision
at the node associated with such an attribute. The method does not restructure the decision tree
appropriately to fit the given decision-making situation (in this case, to avoid measuring x1).
AQDT-2 assigns probabilities only after it has first created a decision structure that suits the given
decision-making situation. An example is presented in Section 42.
343 Coping with Noise in Training Data

The proposed methodology can easily be extended to handle the problem of learning from noisy
training data. This is done based on the ideas of rule truncation described in (Niblett & Bratko, 1986;
Bergadano et al., 1992; Cestnik & Karalic, 1991). When noise is expected in the training data,
the decision rules with a t-weight below a certain threshold (reflecting the expected noise level) are
removed. The rule truncation method seems to result in more accurate decision structures than the
decision tree pruning method, because truncation decisions are based solely on the importance of
the given rule or condition for the decision making, regardless of their evaluation order (unlike
decision tree pruning, which can only prune attributes within a subtree and thus cannot
freely choose attributes to prune). Examples are presented in Section 4.
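The truncation step itself is a simple filter on the t-weights. A minimal sketch (the rule tuples and the threshold value are illustrative assumptions, not AQDT-2's code or recommended settings):

```python
# Sketch of rule truncation for noisy training data: drop rules whose
# t-weight falls below a threshold reflecting the expected noise level.

def truncate_rules(rules, min_t_weight):
    """rules: list of (decision_class, condition_part, t_weight)."""
    return [r for r in rules if r[2] >= min_t_weight]

# A rule covering only 2 examples is likely noise and is removed:
rules = [("P", "[x1=1]", 40), ("P", "[x3=2]", 2), ("N", "[x2=1]", 35)]
print(truncate_rules(rules, 5))
# [('P', '[x1=1]', 40), ('N', '[x2=1]', 35)]
```

Because the filter runs before the decision structure is built, any attribute can effectively be "pruned", not only those inside a particular subtree.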
35 Analysis of the AQDT-2 Attribute Selection Criteria

This subsection introduces an analysis of the attribute selection criteria used in AQDT-2. The
analysis is done using two different problems. Both problems assume that the best attribute to be
selected is known, and they test whether or not a given attribute selection criterion will rank that
attribute first. The first problem was introduced by Quinlan in 1993. The problem has four
attributes (see Table 2-2) and two decision classes; the best attribute to be selected is
"Outlook". The second problem was introduced by Mingers in 1989. Mingers' problem has two
attributes and two decision classes, and contains ambiguous examples (i.e., examples
belonging to more than one decision class).
In the first problem, the following are the disjoint rules learned by AQ15c from the given data:

Play <= [outlook = overcast]
Play <= [outlook = sunny] & [humidity ≤ 75]
Play <= [outlook = rain] & [windy = false]
For any combination of the LEF function, AQDT-2 learns a decision tree similar to the one in
Figure 2-3. Table 3-3 shows the values of the AQDT-2 criteria for the different attributes of the given
data. It is clear that all of the AQDT-2 selection criteria preferred the attribute "Outlook" over
the other attributes.

Table 3-4 shows the set of examples used in the second experiment. Note that the representation
of the data here is different from the representation used by Mingers. The attribute X is better for
building a decision tree than the attribute Y. The ambiguity in the data increases the complexity of
selecting the correct attribute for the root of the tree. The goal of this experiment is first to
select the correct attribute, and then to test how the given criterion evaluates the two
attributes. In Mingers' experiment, all the criteria used preferred the first attribute (X) over the
second (Y); however, the gain ratio criterion gave X a higher score than all the other criteria.

Table 3-5 shows the performance of the AQDT-2 criteria for selecting attributes on Mingers'
first problem. The criteria were tested when applied to both the examples and the rules learned by
AQ15c. The ambiguity parameter in AQ15c was set to "Pos"; that is, when learning decision
rules for a class C, all ambiguous examples belonging to C are considered examples that belong
to C only.

The table shows that the disjointness criterion outperformed all the other criteria in Table 2-6,
including the gain ratio criterion, when applied to both the original examples and the rules
learned from these examples. It was clear that neither the importance score nor the value
distribution criterion would perform better in the case of evaluating the training examples. This is
because these two criteria depend on the relationship between the attributes and the decision rules,
which cannot be measured from examples. When these criteria are applied to the learned rules, they
provide very good results.

The disjointness criterion selects the attribute that best discriminates between the decision
classes. The importance criterion gives the highest score to the attribute that appears in rules
covering the largest number of examples. The value distribution criterion ranks first the attribute
which has the maximum balanced appearance of its values in different rules. The dominance
criterion prefers attributes that occur in large numbers of elementary rules. Table 3-6 shows the
possible LEF ranks for each criterion, the domain of each criterion, and when each criterion is best
ranked first.

In Table 3-6, n is the number of attributes, t is the t-weight of a rule, v is the number of values of
an attribute, and R is the total number of elementary rules.
36 Decision Structures vs Decision Trees

This subsection introduces a comparison between the decision structures proposed in this thesis
and traditional decision trees. Even though systems for learning decision structures may be more
complex and may take more time to generate decision structures from examples, they have
many other advantages. Table 3-7 shows a comparison between the two approaches.
Table 3-7: A comparison between decision structures and decision trees
Another important issue is that simpler decision trees are not necessarily equivalent to simpler
decision structures. For example, consider the two decision structures in Figure 3-11; both
structures are equivalent to one another. The decision structure in Figure 3-11-a has 12 nodes;
this decision structure is equivalent to a decision tree with 41 nodes. On the other hand, the
decision structure in Figure 3-11-b has four more nodes, a total of 16 nodes, but it is equivalent
to a decision tree with 37 nodes.
Figure 3-11: Decision structures learned by AQDT-2 using different criteria: a) using the disjointness criterion (P = Positive, N = Negative; 5 nodes); b) using the importance score criterion (P = Positive, N = Negative; 7 nodes, 9 leaves)
To show another advantage of learning decision structures (trees) from decision rules rather than from examples, I created an example, called the Imam's example, that represents a class of problems in which the information gain criterion of decision tree learning programs does not work properly. The basic idea behind this example is that information-based criteria depend on the frequency of the training examples per decision class and the frequency of the training examples over the different values of a given attribute.
The concept to learn is: P if x1 = x2, and N otherwise.
The known training examples are shown in Figure 3-12-a, where "+" means the example belongs to class "P" and "-" means the example belongs to class "N". Figure 3-12-b shows the correct decision tree to be learned. As the reader can see, the number of examples per class is 12, and the frequency of the training examples per value of x1 and x2 (the most important attributes) is 6. However, the frequencies of the training examples over the values of x3 and x4 are different. This combination makes it difficult for criteria using information gain to select
either x1 or x2. When C4.5 was run with different window settings, it was not able to select either x1 or x2 as the root of the decision tree.
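The failure mode can be reproduced with a minimal balanced dataset (this is not the thesis's 24-example set, which is not reproduced here): for a concept of the form "P iff x1 = x2" with balanced value frequencies, the information gain of both relevant attributes is exactly zero at the root.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def info_gain(examples, attr_index):
    """Information gain of splitting on one attribute position."""
    labels = [e[-1] for e in examples]
    gain = entropy(labels)
    for v in {e[attr_index] for e in examples}:
        subset = [e[-1] for e in examples if e[attr_index] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# Balanced XOR-style data: class is P iff x1 = x2.
examples = [(x1, x2, "P" if x1 == x2 else "N")
            for x1 in (1, 2) for x2 in (1, 2)]

print(info_gain(examples, 0))  # 0.0 -- x1 looks worthless to the criterion
print(info_gain(examples, 1))  # 0.0 -- so does x2
```

Each branch of either attribute still contains a 50/50 class mix, so the expected entropy after the split equals the entropy before it, even though x1 and x2 jointly determine the class.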
Figure 3-12: The Imam's example: a) training examples; b) the optimal decision tree. An example where learning decision structures (trees) from rules is better than learning them from examples.
AQ15c learned the following rules from this data:
P <= [x1=1][x2=1] v [x1=2][x2=2]
N <= [x1=1][x2=2] v [x1=2][x2=1]
From these rules, AQDT-2 learns the decision tree in Figure 3-12-b.
An example of a problem for which decision trees may not be an efficient knowledge representation is shown in Figure 3-13-a. Figure 3-13-b shows the correct decision tree for the given data. The decision rules learned from this data are:
P <= [x1=2][x2=2]
N <= [x1=1][x2=1 v 3] v [x1=3][x2=1 v 3]
Using different measures of complexity, it is clear that for such a problem learning decision trees directly from examples is not an efficient method. Examples of these measures are: 1) comparing the number of nodes in the decision tree to the number of examples; 2) comparing the average number of tests required to make a decision by the decision tree and by the decision rules; 3) comparing the number of nodes to the number of conditions (10 nodes and 10 conditions).
When learning a decision tree from rules learned with constructive induction, a decision tree with three nodes can be determined using the new attribute "x1=x2=2", with values "0" for no and "1" for yes.
Figure 3-13: An example where decision rules are simpler than decision trees: a) the training data; b) the correct decision tree.
CHAPTER 4 Empirical Analysis and Comparative Study
This section presents empirical results from extensive testing of the method on six different problems, using different sizes of training data and different settings of the systems' parameters. For comparison, it also presents results from applying a well-known decision tree learning system (C4.5) to the same problems. This section also includes some analysis and visualization of the concepts learned by AQ15c and AQDT-2.
The experiments are applied to the following problems: MONK-1, MONK-2, MONK-3, Engineering Design of Wind Bracing, Classification of Mushrooms, Diagnosing Breast Cancer, Congressional Voting Records of 1984, and East-West Trains. The MONK's problems are concerned with learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. MONK-2 requires learning a non-DNF-type description (one that cannot be easily described in DNF form using the original attributes). MONK-3 involves learning a DNF rule from noisy data. The Engineering Design dataset involves learning conditions for applying different types of wind bracings in tall buildings. Mushrooms is concerned with learning classification rules for distinguishing between poisonous and non-poisonous mushrooms. Breast Cancer involves learning concept descriptions for recognizing breast cancer. Congressional Voting Records describes the voting records of Republican and Democratic US senators in 1984. East-West Trains characterizes eastbound and westbound trains using a structural representation.
In order to determine the learning curve, the system was run with different relative sizes of the training data: 10%, 20%, ..., 90%. Specifically, from the set of available examples for each problem, 100 randomly chosen samples of 10% of the data, then 20%, etc., were used for training, that is, for learning a concept description. The remaining examples in each case were used for testing the obtained descriptions, to determine their predictive accuracy.
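This protocol amounts to the following evaluation loop, sketched here with a hypothetical stand-in learner (the actual experiments used AQ15c/AQDT-2 and C4.5):

```python
import random

def learning_curve(examples, learn, classify, fractions, n_samples=100, seed=0):
    """For each training fraction, draw `n_samples` random training samples,
    train on each, test on its complement, and average the accuracy."""
    rng = random.Random(seed)
    results = {}
    for frac in fractions:
        k = max(1, int(len(examples) * frac))
        accs = []
        for _ in range(n_samples):
            train = rng.sample(examples, k)
            test = [e for e in examples if e not in train]
            model = learn(train)
            correct = sum(classify(model, x) == y for x, y in test)
            accs.append(correct / len(test) if test else 1.0)
        results[frac] = sum(accs) / len(accs)
    return results

# Hypothetical learner: memorize exact inputs, guess the majority class otherwise.
def learn(train):
    memo = dict(train)
    labels = [y for _, y in train]
    return memo, max(set(labels), key=labels.count)

def classify(model, x):
    memo, default = model
    return memo.get(x, default)

data = [(i, "P" if i % 2 == 0 else "N") for i in range(20)]
curve = learning_curve(data, learn, classify, [0.1, 0.5, 0.9], n_samples=20)
print(curve)
```

Averaging over 100 random samples per size smooths out the variance of any single split, which is why each point on the curves reported below is a mean over 100 runs.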
4.1 Description of the Experimental Analysis
This section describes the complete experimental analysis. The problems (datasets) were divided into two subsets. The first set of problems (MONK-1, MONK-2, MONK-3, and the Wind Bracing problem) was used to test and analyze the approach. The second set of problems (Mushroom, Breast Cancer, Congressional Voting, and East-West Trains) was used for additional testing and comparison with other systems.
Figure 4-1 shows a description of the planned experiments performed on the first set of problems. The best settings (the best path from top to bottom) in terms of accuracy, time, and complexity were used as default settings for the experiments on the second set of problems. Each path from the top of the graph to the bottom represents a single experiment. For each path, the experiment was repeated over 900 times with different sets of different sizes of training examples.
Figure 4-1: Design of a complete experiment.
For each of these experiments, the testing examples were selected as the complement of the training examples. Other experiments were performed in which the learning system AQ17 was used instead of AQ15c. Analysis of some experiments included visualization of the training examples and the concept learned by each learning program (AQ15c, AQDT-2, and C4.5). Also, the different decision structures learned for different decision-making situations were visualized, as were different but equivalent decision structures learned from a given set of training examples.
For each problem (i.e., database):
- 9 different relative sample sizes of training examples are selected (10%, ..., 90%);
- 100 random samples of each size are drawn from the original data for training;
- the 100 sets remaining from the original data after drawing the training data are used for testing (900 samples for training and their 900 complementary samples for testing);
- 162 different parametrical experiments per training dataset (18 x 9);
- 16,200 experiments per sample size (9 samples);
- 145,800 experiments for the first portion of a problem (AQ15c followed by AQDT-2);
- 199,800 experiments per problem (first portion + C4.5 + constructive induction);
- 999,000 intermediate processes (e.g., changing format, refining data, storing results, etc.);
- 73 days (estimated running time).
The following subsection includes a complete experimental analysis of the wind bracing problem. Each subsection after it describes a partial or full experimental analysis of one of the other problems.
4.2 Experiments With Average-Size, Complex, and Noise-Free Problems: Wind Bracing
This section illustrates the method by applying it to the problem of learning a decision structure for determining the structural quality of a tall building design. The quality of the design is partitioned into four classes: high (c1), medium (c2), low (c3), and infeasible (c4). Each example is characterized by seven attributes: number of stories (x1), bay length (x2), wind intensity (x3), number of joints (x4), number of bays (x5), number of vertical trusses (x6), and number of horizontal trusses (x7). The data consisted of 335 examples, of which 220 (66%) were randomly selected to serve as training examples and 115 (34%) were used for testing the obtained decision
structure. In the first phase, the training examples were used to determine a set of decision rules. This was done by the program AQ15c (Michalski et al., 1986). Figure 4-2 shows the decision rules obtained by AQ15c.
These rules were then used by AQDT-2 to determine a decision structure. Table 4-1 presents the values of the four elementary criteria for each attribute occurring in the rules, for the step of determining the root of the decision structure. For each class, the row marked "values" lists the values occurring in the ruleset for this class. For evaluating the disjointness of an attribute, say A, each rule in the ruleset that does not contain attribute A is assumed to contain an additional condition [A = a v b v ...], where a, b, ... are all the legal values of A.
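This completion step can be sketched as follows; the domains and the rule here are hypothetical:

```python
# Hypothetical attribute domains (all legal values of each attribute).
DOMAINS = {"x1": {1, 2, 3}, "x2": {1, 2}, "x5": {1, 2, 3, 4}}

def complete_rule(conds, domains):
    """Return the rule's conditions with every attribute it does not mention
    filled in with the attribute's full value set (the implicit condition)."""
    return {a: set(conds.get(a, dom)) for a, dom in domains.items()}

rule = {"x1": {1}, "x5": {2, 3, 4}}       # rule that never mentions x2
full = complete_rule(rule, DOMAINS)
print(full["x2"])  # {1, 2} -- the implicit condition [x2 = 1 v 2]
```

With every rule completed this way, the disjointness of an attribute can be evaluated uniformly by comparing its value sets across the classes.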
Decision class C1:
1. [x1=1][x6=1][x2=1,2][x3=1,2][x4=1,3][x5=1,2][x7=1,3] (t:18, u:18)
2. [x1=3][x2=1][x3=1][x5=1][x6=1][x4=1,3][x7=1,3,4] (t:3, u:3)
3. [x1=5][x2=2][x3=2][x5=2][x4=3][x6=1][x7=2,3] (t:2, u:2)
4. [x1=1][x6=1][x2=2][x3=1,2][x4=3][x5=1,2][x7=4] (t:2, u:2)
5. [x1=3][x2=1][x4=1][x6=1][x7=1][x3=2][x5=1,2] (t:2, u:2)
6. [x1=1][x3=1][x6=1][x2=2][x4=1,3][x7=1,3][x5=3] (t:2, u:2)
7. [x1=2][x5=2][x2=1][x6=1][x3=1,2][x4=3][x7=4] (t:2, u:2)

Decision class C2:
1. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=2,3] (t:28, u:19)
2. [x1=2..4][x2=2][x3=1,2][x4=3][x5=1,2][x6=1][x7=3,4] (t:17, u:6)
3. [x1=2..4][x2=1,2][x3=1,2][x4=3][x5=1][x6=1][x7=3,4] (t:10, u:4)
4. [x1=1,3,5][x2=1,2][x3=1,2][x4=3][x5=3][x6=1][x7=2,4] (t:10, u:2)
5. [x1=3,5][x2=1,2][x3=1,2][x4=3][x5=2,3][x6=1][x7=1,4] (t:9, u:4)
6. [x1=2][x2=1,2][x3=1,2][x5=1,2,3][x4=1][x6=1][x7=1] (t:7, u:6)
7. [x1=3,4][x2=2][x3=2][x4=1,3][x5=1,3][x6=1][x7=1,2] (t:6, u:4)
8. [x1=3,5][x2=2][x3=1][x7=1][x4=1,2][x5=1,2,3][x6=1,3] (t:5, u:5)
9. [x1=1][x2=1][x6=1][x3=1,2][x4=3][x5=1,2][x7=4] (t:4, u:4)
10. [x1=1][x5=1][x2=2][x4=2][x6=2][x3=1,2][x7=1,3] (t:4, u:4)
11. [x1=1,2][x2=1][x6=1][x3=1,2][x4=1,3][x5=3][x7=1,4] (t:4, u:2)

Decision class C3:
1. [x1=2..5][x2=1,2][x3=1,2][x7=1,4][x4=1,2][x5=1,3][x6=2,4] (t:41, u:32)
2. [x1=1..4][x2=1,2][x3=1,2][x4=2][x5=2][x6=2,3][x7=2,4] (t:27, u:20)
3. [x1=1,3][x2=1][x3=1,2][x7=1,4][x4=2][x5=1,2][x6=2,3] (t:19, u:6)
4. [x1=1,2,4][x2=1,2][x3=1,2][x4=2][x5=2,3][x6=3,4][x7=1] (t:13, u:8)
5. [x1=5][x2=2][x4=2][x5=2][x3=1,2][x6=3][x7=2,4] (t:5, u:5)

Decision class C4:
1. [x1=5][x2=2][x3=2][x4=1,3][x5=1][x6=1][x7=1,4] (t:4, u:4)
2. [x1=5][x2=2][x3=1][x5=1][x6=1][x4=3][x7=3] (t:1, u:1)
Figure 4-2: Decision rules determined by AQ15c from the wind bracing data.
Assuming the default LEF, attribute x6 was chosen for the root (since its disjointness is the single highest and all other attributes are beyond the tolerance threshold, no other attributes are considered). Branches stemming from the root are marked by values of x6 (in general, they could be groups of values) according to the way they occur in the decision rules; groups subsumed by other groups are removed (Imam & Michalski, 1993b). The branches are assigned the subsets of the rules containing these values. The process repeats for a branch until all rules assigned to the branch are of the same class. That class is then assigned to the leaf.
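A much-simplified sketch of this top-down process (ignoring AQDT-2's LEF, value grouping, and pruning) might look like the following, using the rules from the Imam's example; the attribute selector here is a crude stand-in, not AQDT-2's criteria:

```python
def all_same_class(rules):
    return len({cls for cls, _ in rules}) == 1

def build_structure(rules, domains, pick_attr):
    """Grow a decision structure top-down from rules: pick an attribute,
    route each rule down every branch whose value it admits (an absent
    attribute admits all values), and stop when one class remains."""
    if all_same_class(rules) or not domains:
        classes = [c for c, _ in rules]
        return max(set(classes), key=classes.count)   # leaf label
    attr = pick_attr(rules, domains)
    rest = {a: d for a, d in domains.items() if a != attr}
    node = {"attr": attr, "branches": {}}
    for v in domains[attr]:
        subset = [(c, {a: vs for a, vs in conds.items() if a != attr})
                  for c, conds in rules
                  if v in conds.get(attr, domains[attr])]
        if subset:
            node["branches"][v] = build_structure(subset, rest, pick_attr)
    return node

def classify(node, example):
    while isinstance(node, dict):
        node = node["branches"][example[node["attr"]]]
    return node

# The rules AQ15c learned for the "P iff x1 = x2" concept (from the text).
domains = {"x1": {1, 2}, "x2": {1, 2}}
rules = [("P", {"x1": {1}, "x2": {1}}), ("P", {"x1": {2}, "x2": {2}}),
         ("N", {"x1": {1}, "x2": {2}}), ("N", {"x1": {2}, "x2": {1}})]

# Stand-in selector: the most frequently mentioned attribute.
most_common = lambda rs, doms: max(doms, key=lambda a: sum(a in c for _, c in rs))

tree = build_structure(rules, domains, most_common)
print(classify(tree, {"x1": 1, "x2": 1}))  # P
print(classify(tree, {"x1": 2, "x2": 1}))  # N
```

Because the expansion operates on rules rather than raw examples, each recursive step only partitions a small ruleset, which is one source of AQDT-2's speed.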
Figure 4-4 presents a decision structure determined by AQDT-2 from the decision rules in Figure 4-2 (using the default LEF). The structure was evaluated on the testing examples. The prediction accuracy was 88.7% (102 testing examples were classified correctly and 13 were misclassified).
Since all rules containing [x6=4] belong to class C3, the branch marked by 4 ends in a leaf C3. Rules containing [x6=1] belong to more than one class. In this case, the first three criteria are recalculated only for those rules which contain [x6=1] as one of their conditions. In this example, x1 has the highest importance score, so it was selected as a node in the structure. This process is repeated for each subset of rules until the decision structure is completed.
For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem (Quinlan, 1990). The experiment was done with C4.5 using the default window setting (the maximum of 20% of the number of examples and twice the square root of the number of examples), and with the number of trials set to one. C4.5 was chosen for the comparative studies because it is one of the most accurate and efficient systems for learning decision trees from examples, and because it is widely available.
The C4.5 program has the capability of generating a decision tree over a window of examples (a randomly selected subset of the training examples). It starts with a randomly selected window of examples, generates a trial tree, tests this tree against the remaining examples, adds some unclassified examples to the original ones, and continues until either all training examples are classified correctly or it cannot produce a better tree. Figure 4-3 shows the decision tree that was learned by C4.5. When this decision tree was tested against the 115 testing examples, only 97 examples were classified correctly and 18 were mismatched.
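The windowing loop just described can be sketched as follows; the learner here is a hypothetical lookup table standing in for C4.5's tree inducer:

```python
import random

def window_train(examples, learn, classify, init_size, seed=0):
    """C4.5-style windowing (sketch): train on a random window, then keep
    adding misclassified outside examples until the model fits all data."""
    rng = random.Random(seed)
    window = rng.sample(examples, min(init_size, len(examples)))
    while True:
        model = learn(window)
        missed = [(x, y) for x, y in examples
                  if (x, y) not in window and classify(model, x) != y]
        if not missed:
            return model          # everything classified correctly
        window += missed          # grow the window and retrain

# Hypothetical stand-in learner: a lookup table with a majority-class default.
def learn(train):
    labels = [y for _, y in train]
    return dict(train), max(set(labels), key=labels.count)

def classify(model, x):
    memo, default = model
    return memo.get(x, default)

data = [(i, "P" if i % 3 == 0 else "N") for i in range(30)]
model = window_train(data, learn, classify, init_size=5)
print(all(classify(model, x) == y for x, y in data))  # True
```

The loop terminates because the window grows monotonically with each round of misclassified examples, although with a real inducer C4.5 also stops when no better tree can be produced.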
Figure 4-3: A decision tree learned by C4.5 for the wind bracing data (complexity: 17 nodes, 43 leaves).
Figure 4-4 shows a decision structure learned, under the default settings of the AQDT-2 parameters, from the AQ15c rules. It has 5 nodes and 9 leaves. Testing this decision structure against the 115 testing examples resulted in 102 examples matched correctly and 13 examples mismatched.
Figure 4-5 shows a decision structure obtained from the rules in Figure 4-2 under the condition that x1 cannot be measured. Leaves marked "?" represent situations in which a definite decision cannot be made without knowing x1. This incomplete decision structure was tested on the 115 testing examples, from which the value of x1 was removed. The decision structure classified 71 examples correctly, 14 incorrectly, and 30 were assigned the indefinite ("?") decision. The "?" leaves can be replaced by sets of candidate decisions with their corresponding probability distributions.
Figure 4-4: A decision structure learned from the AQ15c wind bracing rules (complexity: 5 nodes, 9 leaves).
Figure 4-5: A decision structure that does not contain attribute x1 (complexity: 6 nodes, 8 leaves).
Figure 4-6 presents the decision structure from Figure 4-5 in which leaves were assigned candidate decisions with decision class probability estimates. Let us consider node x2. The example frequencies were w1=31, tw1=45, w2=11, tw2=139, w3=0, tw3=169, and w4=5, tw4=5. Using equation (11), the probability estimates for classes C1, C2, C3, and C4 under node x2 can be approximated as P(C1)=.66, P(C2)=.23, P(C3)=0, and P(C4)=.11.
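The quoted estimates are consistent with simply normalizing the per-class example weights at the node (31/47 ≈ .66, 11/47 ≈ .23, 5/47 ≈ .11); assuming equation (11) is such a normalization, the computation is:

```python
def class_probabilities(weights):
    """Estimate class probabilities at a node by normalizing the number of
    training examples (weight) each class's rules cover there."""
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Weights quoted for node x2 in the text: w1=31, w2=11, w3=0, w4=5.
probs = class_probabilities({"C1": 31, "C2": 11, "C3": 0, "C4": 5})
print({c: round(p, 2) for c, p in probs.items()})
# {'C1': 0.66, 'C2': 0.23, 'C3': 0.0, 'C4': 0.11}
```

This recovers exactly the four probability estimates given in the text, which supports the normalization reading.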
Figure 4-7 shows the decision structure resulting from the decision rules in Figure 4-2 after they were truncated under the assumption of a 10% noise level (this means that rules whose combined t-weight represented 10% or less of the coverage of the training examples in a given class were removed). The predictive accuracy of this decision structure on the testing data was 88% (in contrast to 89% for the decision structure in Figure 4-4).
Figure 4-6: A decision structure without x1, with candidate decisions assigned to leaves (complexity: 5 nodes, 7 leaves).
Figure 4-7: A decision structure learned from the wind bracing rules after pruning all rules covering less than 10% of the training examples per decision class (complexity: 3 nodes, 5 leaves).
To demonstrate the changes in the concept description learned by AQDT-2 under different decision-making situations, four attributes were selected for visualizing the change in the learned concept after changing the costs of different attributes. Starting with the decision structure in Figure 4-4, in the first situation x5 was given a high cost. AQDT-2 generated a decision structure with four nodes and six leaves. The predictive accuracy of this decision structure was 86.1%. In the second decision-making situation, x1 was given a high cost. AQDT-2 learned a decision structure with five nodes and seven leaves. The predictive accuracy of this decision structure was 79.1%.
Figure 4-8 shows a diagrammatic visualization of the decision trees learned by AQDT-2 in the normal situation, when x5 was unavailable, and when x1 was unavailable. The diagram is simplified, using only the four attributes which were used in building the initial decision trees. The visualization diagram indicates different decision classes with different shades. Another shade is used to illustrate cells that require another attribute to correctly classify the testing examples (e.g., x3, x4, or x7). Also, white cells indicate that an accurate decision cannot be derived from the rules without knowing the value of the removed attribute. In such cases, multiple decisions can be provided with their appropriate probabilities.
Figure 4-8: Diagrammatic visualization of the decision trees learned for different decision-making situations for the wind bracing data ("?" means the system cannot produce a decision without the missing attribute).
Experiments with Subsystem I: The initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings for AQ15c (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three beam search widths: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Chr, Dij, 10> and <Chr, Int, 1>; see Table 4-2) were selected for the experiments with Subsystem II.
These experiments were performed on four learning problems (the three MONK's problems [Thrun, Mitchell & Cheng, 1991] and the Wind Bracing problem [Arciszewski et al., 1992]). The best two parameter settings of AQ15c were selected for testing different parameter settings of AQDT-2. Table 4-2 shows the predictive accuracy of the rules learned by AQ15c from examples and the predictive accuracy of the decision structures learned by AQDT-2 from the decision rules. Each value in this table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with testing examples that represent the complement of the training example set.
Figure 4-9 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 under different parameter settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, <Disc> means discriminant rules, and the number is the width of the beam search.
Experiments with Subsystem II: In these experiments, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the
threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). The latter parameter is the ratio of the number of examples covered by rules belonging to different decision classes at a given node of the decision structure/tree.
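One plausible reading of the generalization degree as a stopping rule is that a node becomes a majority-class leaf when the examples covered by the non-majority classes fall at or below the chosen percentage. The sketch below is an assumption about the mechanism, not AQDT-2's actual implementation:

```python
def should_generalize(class_weights, degree=0.10):
    """Stop expanding a node (emit a majority-class leaf) when the examples
    covered by all non-majority classes are at most `degree` of the node's
    total coverage -- a sketch of a generalization-degree cutoff."""
    total = sum(class_weights.values())
    minority = total - max(class_weights.values())
    return minority <= degree * total

print(should_generalize({"C1": 95, "C2": 5}, degree=0.10))   # True
print(should_generalize({"C1": 80, "C2": 20}, degree=0.10))  # False
```

Raising the degree trades a simpler structure for the risk of misclassifying the minority-class examples absorbed into the leaf, which matches the accuracy trade-offs reported below.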
Figure 4-9: The accuracy of AQDT-2 and AQ15c with different parameter settings for the wind bracing problem.
Figure 4-10 shows the changes in the predictive accuracy of the decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained under the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the wind bracing data it is better to reduce the generalization degree to 3%. However, changing the pre-pruning degree did not improve the predictive accuracy.
Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and by the C4.5 program for learning decision trees (Quinlan, 1990). Both systems were set to their default parameters. All the results reported here are averages of 100 runs. For
each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-11 shows a simple summary of these experiments.
Figure 4-10: Analyzing different parameter settings of AQDT-2 using the wind bracing data.
Figure 4-11: A comparison between AQ15c and AQDT-2 against C4.5 on the wind bracing data.
4.3 Experiments With Small-Size, Simple, and Noise-Free Problems: MONK-1
This subsection describes an experimental analysis of the AQDT approach on the MONK-1 problem. The MONK's problems (Thrun, Mitchell & Cheng, 1991) involve learning classification rules for robot-like figures. MONK-1 requires learning a DNF-type description. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (octagonal, square, or round); x2, body-shape (octagonal, square, or round); x3, is-smiling (yes or no); x4, holding (sword, flag, or balloon); x5, jacket-color (red, yellow, green, or blue); and x6, has-tie (yes or no).
The original problem was to learn a concept from 124 training examples (62 positive and 62 negative). These training examples constitute 29% of all possible examples (432); thus the density of the training examples is relatively high. Figure 4-12 shows a visualization diagram, obtained with the DIAV program (Wnek & Michalski, 1994), of the training examples (positive and negative) and the concept to be learned. Figure 4-13 shows the decision rules learned by AQ15c for the MONK-1 problem. Table 4-3 shows a comparison of the evaluations of the AQDT-2 criteria on the MONK-1 problem. Figure 3-10 shows two different decision structures learned when using different criteria.
Figure 4-12: A visualization diagram of the MONK-1 problem.
The AQDT-2 program, running in its default mode and with the optimality criterion set to minimize the number of nodes (i.e., the disjointness criterion ranked first), produced a decision tree with 41 nodes. For comparison, the program C4.5 for learning decision trees from examples was also applied to this same problem.
Positive rules:
1. [x5=1]
2. [x1=3][x2=3]
3. [x1=2][x2=2]
4. [x1=1][x2=1]
Negative rules:
1. [x1=1][x2=2,3][x5=2..4]
2. [x1=2][x2=1,3][x5=2..4]
3. [x1=3][x2=1,2][x5=2..4]
Figure 4-13: Decision rules learned by AQ15c for the MONK-1 problem.
The C4.5 program did not produce a consistent and complete decision tree when run with its default window size (the maximum of 20% and twice the square root of the number of examples), nor with a 100% window size. After 10 trials with different window sizes, we succeeded in making C4.5 produce the same optimal decision tree as AQDT-2 (using a window size of 72.5%). This tree is presented in Figure 4-14. Also, in the same experiment, AQ17-DCI (Bloedorn et al., 1993) was used to derive decision rules using constructive induction. AQ17-DCI generates a new attribute that takes the value "T" when the value of x1 equals the value of x2, and takes the value "F" otherwise. These rules were:
Pos <= [x5=1] v [x1=x2] and Neg <= [x5≠1] & [x1≠x2]
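The effect of the constructively induced attribute is easy to reproduce: deriving "x1 = x2" turns the MONK-1 concept into a two-test classification. A small sketch:

```python
# MONK-1 attributes: x1 head-shape, x2 body-shape, x5 jacket-color (1 = red).
# The constructively induced attribute is "T" when x1 equals x2.
def derived(example):
    return "T" if example["x1"] == example["x2"] else "F"

def classify(example):
    # Pos <= [x5=1] v [x1=x2]; Neg otherwise.
    return "Pos" if example["x5"] == 1 or derived(example) == "T" else "Neg"

print(classify({"x1": 1, "x2": 1, "x5": 3}))  # Pos (head-shape = body-shape)
print(classify({"x1": 1, "x2": 2, "x5": 1}))  # Pos (jacket is red)
print(classify({"x1": 1, "x2": 2, "x5": 3}))  # Neg
```

With the derived attribute available, the root test on it plus one test on x5 suffice, which is why the resulting decision structure shrinks to 2 nodes and 3 leaves.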
Table 4-3: Evaluation of the attribute selection criteria for the MONK-1 problem.
From these rules, the system produced the compact decision structure presented in Figure 4-15-b. It should be noted that the decision structures in Figures 4-14, 4-15-a, and 4-15-b are all logically equivalent, and they all have 100% prediction accuracy on the testing examples (which means that they represent exactly the target concept). By running AQDT-2 to learn a decision structure, a simpler decision structure was produced (Figure 4-15-a).
Figure 4-14: The decision tree for the MONK-1 problem (complexity: 13 nodes, 28 leaves; P = Positive, N = Negative).
Figure 4-15: Compact decision structures generated by AQDT-2 for the MONK-1 problem: a) a compact decision structure for the AQ15 rules (5 nodes, 7 leaves); b) a compact decision structure for the AQ17 rules (2 nodes, 3 leaves). P = Positive, N = Negative.
Experiments with Subsystem I: As was mentioned earlier, the initial part of the experiments involved running AQ15c on a set of learning problems with 18 different parameter settings for AQ15c (two types of decision rules: characteristic or discriminant; three coverage modes: intersecting, disjoint, or ordered, i.e., decision lists; and three widths of the beam search: 1, 5, and 10). The two settings that gave the best results in terms of predictive accuracy (<Ch, Dij, 10> and <Ch, Int, 1>) were selected for the experiments with Subsystem II. These experiments were performed on the MONK-1 problem. Table 4-4 shows the predictive accuracy of the rules learned by
AQ15c from examples and the predictive accuracy of the decision structures learned by AQDT-2 from the decision rules.
Each value in that table is an average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each of these runs was tested with a testing example set that represented the complement of the training example set. Figure 4-16 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> means disjoint covers, <Char> indicates characteristic rules, and <Disc> means discriminant rules.
Figure 4-16: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-1 problem.
Experiments with Subsystem II: The same experiments were performed on the MONK-1 problem. The parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. The experiments were performed on characteristic decision rules that were learned in intersecting or disjoint modes. For each data set, the results reported from each experiment were calculated as the average over 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the threshold of pre-pruning of the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-17 shows the changes in the predictive accuracy of the decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy obtained under the default settings of AQDT-2. The default pre-pruning threshold is 3% and the default generalization degree
is 10%. The results show that with the MONK-1 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
Figure 4-17: Analyzing different parameter settings of AQDT-2 with the MONK-1 data.
Comparative Study: This sub-section presents a comparison between the decision trees obtained by AQDT-2 and by the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are averages of 100 runs. For each data set, we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-18 shows a simple summary of these experiments.
Figure 4-18: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-1 problem.
4.4 Experiments With Small-Size, Complex, and Noise-Free Problems: MONK-2
The MONK-2 problem requires a system to learn a non-DNF-type description (one that cannot be easily described as a DNF expression using the original attributes). The problem is described in a similar way to the MONK-1 problem. The data consists of two decision classes, Positive and Negative, and six attributes: x1, head-shape (octagonal, square, or round); x2, body-shape (octagonal, square, or round); x3, is-smiling (yes or no); x4, holding (sword, flag, or balloon); x5, jacket-color (red, yellow, green, or blue); and x6, has-tie (yes or no). The original problem was to learn a concept from 169 training examples. These training examples constitute 40% of all possible examples (432). Figure 4-19 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned.
Figure 4-19: A visualization diagram of the MONK-2 problem.
Experiments with Subsystem I The two settings that gave best results in tenns of predictive
accuracy were the same as the other problems laquoCh Dij 10gt and laquoCh Int 1raquo They were
selected for experiments with Subsystem II Table 4-5 shows the predictive accuracy of rules
learned by AQ15c from examples and the predictive accuracy of decision structures learned by
AQDT-2 from these decision rules
Each value in that table is the average predictive accuracy over 100 runs of either program on 100 distinct, randomly selected training sets of the given size. Each run is tested against a testing set consisting of the complement of the training examples.
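The evaluation protocol above (averaging over repeated random splits, with the test set always the complement of the training set) can be sketched as follows; the majority-class stand-in learner is purely illustrative and is not AQ15c or AQDT-2:

```python
import random

def evaluate(learner, data, train_frac, runs=100, seed=0):
    """Average accuracy over `runs` random splits; the test set is always
    the complement of the training set, as in the experiments above."""
    rng = random.Random(seed)
    accs = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        k = int(len(shuffled) * train_frac)
        train, test = shuffled[:k], shuffled[k:]
        model = learner(train)
        correct = sum(1 for x, y in test if model(x) == y)
        accs.append(correct / len(test))
    return sum(accs) / len(accs)

def majority_learner(train):
    """Illustrative stand-in learner: always predicts the majority class."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

# Example: 30% of a hypothetical 100-example data set used for training.
acc = evaluate(majority_learner, [((i,), i % 2) for i in range(100)], 0.3)
```

Because the test set is the complement of the training set, larger training fractions imply smaller (and noisier) test sets, a point the dissertation returns to when explaining accuracy dips at large sample sizes.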
Figure 4-20 shows a diagram illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using these settings of AQ15c. The term <Intr> denotes intersecting covers, <Disj> denotes disjoint covers, <Char> indicates characteristic rules, <Disc> indicates discriminant rules, and the number is the complexity of the beam search.
Figure 4-20: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-2 problem (panels: <Disj, Disc, 1>, <Intr, Char, 1>, <Intr, Disc, 1>; x-axis: the relative sample sizes (%) of the training data).
Experiments with Subsystem II: Again, the parameters of Subsystem I were fixed and selected parameters of Subsystem II were modified. For each data set, the results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. The parameters changed in this experiment were the pre-pruning threshold for the decision rules and the degree of generalization of the AQDT-2 algorithm (Michalski & Imam, 1994). Figure 4-21 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The "default" curve shows the predictive accuracy under the default settings of AQDT-2: a pre-pruning threshold of 3% and a generalization degree of 10%. The results show that with the MONK-2 data it is slightly better to reduce the generalization degree to 3%. However, increasing the pre-pruning degree did not improve the predictive accuracy.
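One plausible reading of rule pre-pruning, sketched under the assumption that rules covering fewer than a threshold percentage of their class's examples are dropped (the actual AQDT-2 criterion may differ in detail):

```python
def prune_rules(rules, threshold=3.0):
    """Drop rules whose coverage falls below `threshold` percent of the
    examples of their class. The rule format here is hypothetical: each
    rule records how many class examples it covers and the class size."""
    kept = []
    for rule in rules:
        pct = 100.0 * rule["covered"] / rule["class_total"]
        if pct >= threshold:
            kept.append(rule)
    return kept

rules = [{"covered": 40, "class_total": 100},   # 40% coverage -> kept
         {"covered": 2,  "class_total": 100}]   # 2% coverage  -> pruned at 3% default
print(len(prune_rules(rules)))  # 1
```

The intuition is that low-coverage rules often encode noise or rare exceptions, so removing them before building the decision structure can simplify the tree at little cost in accuracy.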
Figure 4-21: Analyzing different parameter settings of AQDT-2 with the MONK-2 data (panels: MONK-2 <Disj, Char> and MONK-2 <Intr, Char>; x-axis: the relative sample sizes (%) of the training data).
Comparative Study: This subsection presents a comparison between the decision trees obtained by AQDT-2 and by the C4.5 program for learning decision trees. Both systems were set to their default parameters. The experiments were divided into two parts. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-22 shows a simple summary of these experiments.
Figure 4-22: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-2 problem (predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%)).
4.5 Experiments With Small-Size, Simple, and Noisy Problems: MONK-3
MONK-3 requires learning a DNF-type description from noisy data. The problem is described in a similar way to the MONK-1 and MONK-2 problems. The data shares the same attributes, the same domains, and the same decision classes as the first two MONK problems. Figure 4-23 shows a visualization diagram of the training examples (positive and negative) and the concept to be learned. The minus signs in the shaded area and the plus signs in the unshaded area are considered noisy examples. Noisy examples are examples that are assigned the wrong decision class.
Figure 4-23: A visualization diagram of the MONK-3 problem.
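The noise-injection idea behind MONK-3 can be sketched as a label-flipping routine; the flip probability and the label names are illustrative assumptions, not the actual MONK-3 noise process:

```python
import random

def add_class_noise(examples, p, seed=0):
    """Return a copy of (x, y) examples in which each label is flipped with
    probability p. The flipped examples are exactly the 'noisy examples'
    described above: examples assigned the wrong decision class."""
    rng = random.Random(seed)
    noisy = []
    for x, y in examples:
        if rng.random() < p:
            y = "Negative" if y == "Positive" else "Positive"
        noisy.append((x, y))
    return noisy
```

A learner that fits every training example exactly will memorize these flipped labels, which is why pruning and limited generalization matter more on MONK-3 than on the noise-free MONK problems.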
Experiments with Subsystem I: Table 4-6 shows the predictive accuracy of rules learned by AQ15c from examples and the predictive accuracy of decision structures learned by AQDT-2 from these decision rules. Each value in that table is the average predictive accuracy over 100 runs of both programs on 100 distinct, randomly selected training sets of the given size. Each run was tested against a testing set consisting of the complement of the training set. Figure 4-24 shows diagrams illustrating the difference in predictive accuracy between AQ15c and AQDT-2 using the most important settings of AQ15c.
Experiments with Subsystem II: In these experiments the parameters of Subsystem I, the learning process, were fixed, and selected parameters of Subsystem II, the decision-making process, were changed. The results reported from each experiment were calculated as the average of 100 runs on different training data for 9 different sample sizes. Figure 4-25 shows the changes in the predictive accuracy of decision structures learned by AQDT-2 with different parameter settings. The default pre-pruning threshold is 3% and the default generalization degree is 10%. The results show that with the MONK-3 data it is usually better to reduce the generalization degree. Also, increasing the pre-pruning threshold does not improve the predictive accuracy.
Figure 4-24: The accuracy of AQDT-2 and AQ15c with different parameter settings for the MONK-3 problem (panels: <Disj, Char, 1>, <Intr, Char, 1>, <Disj, Disc, 1>, <Intr, Disc, 1>; x-axis: the relative sample sizes (%) of the training data).
Figure 4-25: Analyzing different parameter settings of AQDT-2 with the MONK-3 data (x-axis: the relative sample sizes (%) of the training data).
Comparative Study: Figure 4-26 presents a comparison between the decision trees obtained by AQDT-2 and C4.5. Both systems were set to their default parameters. Figure 4-26 shows a summary of the predictive accuracy, the complexity of the learned decision trees, and the learning time. The drop in the predictive accuracy at some sample sizes is due to the fact that the testing data is not fixed for each sample. In other words, one error may represent a 1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
Figure 4-26: Comparing AQ15c and AQDT-2 against C4.5 using the MONK-3 problem (predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%)).
4.6 Experiments With Large, Complex, and Noise-Free Problems: Diagnosing Breast Cancer
The breast cancer database is concerned with recognizing breast cancer. The data used here are based on real cases collected and grouped by William Wolberg (Mangasarian & Wolberg, 1990). The data has 699 examples, represented using ten attributes and grouped into two decision classes (Benign and Malignant). The ten attributes are: 1) Sample Code Number; 2) Clump Thickness; 3) Uniformity of Cell Size; 4) Uniformity of Cell Shape; 5) Marginal Adhesion; 6) Single Epithelial Cell Size; 7) Bare Nuclei; 8) Bland Chromatin; 9) Normal Nucleoli; and 10) Mitoses. All attributes except the sample code number had a domain of ten values (they were scaled).
In this experiment the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here were based on the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-27 summarizes these experiments.
The drop in the predictive accuracy at some sample sizes is again due to the fact that the testing data is not fixed for each sample: one error may represent a 1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
Figure 4-27: Comparing AQ15c and AQDT-2 against C4.5 using the breast cancer problem (predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%)).
4.7 Experiments With Large, Complex, and Noisy Problems: Mushroom Classification
Learning from the Mushroom database involves classifying mushrooms into edible or poisonous classes. The data was drawn from the Audubon Society Field Guide to North American Mushrooms. The data consists of 8124 examples; a random sample of 810 examples was selected to perform the experiment. Each example was described by 22 attributes: 1) cap-shape; 2) cap-surface; 3) cap-color; 4) bruises; 5) odor; 6) gill-attachment; 7) gill-spacing; 8) gill-size; 9) gill-color; 10) stalk-shape; 11) stalk-root; 12) stalk-surface-above-ring; 13) stalk-surface-below-ring; 14) stalk-color-above-ring; 15) stalk-color-below-ring; 16) veil-type; 17) veil-color; 18) ring-number; 19) ring-type; 20) spore-print-color; 21) population; and 22) habitat.
To perform the experiment, the parameter settings for AQ15 and AQDT-2 were set to their defaults, and the experiment was performed to compare decision trees learned by both AQDT-2 and C4.5. All the results reported here are the average of 100 runs. For each data set we report the predictive accuracy, the complexity of the learned decision trees, and the time taken for learning. Figure 4-28 shows a simple summary of these experiments.

In this problem, C4.5 produces better accuracy with more complex decision trees (almost twice the size of the decision trees generated by AQDT-2), while taking slightly more time to produce such trees. The average difference in accuracy is less than 2%, the average difference in tree complexity is greater than 10 nodes, and the average learning times are about the same.

The drop in the predictive accuracy at some sample sizes is again due to the fact that the testing data is not fixed for each sample: one error may represent a 1% error rate when testing against 90% of the data, while the same error may represent 10% when testing against 10% of the data. These curves do not represent the learning curve.
Figure 4-28: Comparing AQ15c and AQDT-2 against C4.5 using the mushroom problem (predictive accuracy, tree complexity, and learning time vs. relative size of training examples (%)).
4.8 Experiments With Small, Structured, and Noise-Free Problems: East-West Trains
Learning task-oriented decision structures from structural data: This subsection briefly illustrates the capabilities of the AQDT-2 system for learning task-oriented decision structures. The experiment involved the East-West trains problem (Michie et al., 1994), whose goal was to classify a set of trains into two classes: eastbound and westbound. The data was structured such that each train consisted of two to four cars. Each car was described in terms of two main features: the body of the car and the content of the car. The body of the car was described by six different attributes and the load of the car was described by two attributes.
The original description of the trains was given in Prolog clauses. The AQDT-2 program accepts rules or examples in the form of an array of attribute-value assignments. It can also accept examples with different numbers of attribute-value pairs (i.e., examples of different length). To describe the train problem in a format suitable for AQDT-2, a set of eight attributes was generated such that they could completely describe any car in the train (see Table 4-7). Each train was described by one example of varying length. To identify the number (position) of a given car in the train, each of the eight attributes was associated with a two-digit code (ij): the first digit identifies the location of the car and the second identifies the number of the attribute itself. For example, the number 3 in the attribute name x32 refers to the third car and the number 2 refers to the second attribute (the car shape). In other words, attribute x32 is the label of the attribute describing the shape of the third car.
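The flattening scheme described above can be sketched as follows; the attribute values in the sample train are hypothetical, but the x<i><j> naming mirrors the two-digit code described in the text:

```python
def flatten_train(cars):
    """Map a train (a list of per-car attribute lists) to AQDT-2 style
    attribute-value pairs named x<i><j>, where i is the car position (1-4)
    and j is the attribute number (1-8). Trains with different numbers of
    cars yield examples of different lengths."""
    example = {}
    for i, car in enumerate(cars, start=1):
        for j, value in enumerate(car, start=1):
            example["x%d%d" % (i, j)] = value
    return example

# A hypothetical two-car train, each car described by 8 attribute values:
train = [[2, 1, 3, 1, 2, 1, 4, 1],
         [1, 2, 2, 2, 1, 1, 1, 2]]
ex = flatten_train(train)
print(ex["x22"])  # attribute 2 (car shape) of the second car -> 2
```

This encoding turns a variable-length structural description into a flat attribute vector, at the cost of tying each attribute name to a specific car position.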
Table 4-7: The set of attributes and their values used in the trains problem (i stands for the car number, 1-4).
Decision-making situations: In the first decision-making situation, a decision structure that classifies any given train as either eastbound or westbound was learned using only attributes describing the first car (Figure 4-29-a). This decision structure classified 19 trains (out of 20) correctly. The error occurred when the value of x17 equaled 1 (i.e., the load shape of the first car was hexagonal). Figure 4-29-b shows the decision structure learned for the decision-making situation where only attributes describing the second car are used in classifying the trains. It correctly classified 18 of the trains.

Both decision structures have leaves with multiple decisions, which means there are identical first or second cars in the two decision classes. Figure 4-29-c shows a decision structure learned using attributes describing the third car only. It correctly classifies each of the 14 trains with three or more cars. In Figure 4-29-d, a similar decision-making situation was given, but x37 and x34 were given lower cost than x31. Both decision structures classified the 14 trains with three or more cars correctly. These last two decision structures classified any train with three or more cars correctly, and classified the other 6 trains correctly using a flexible matching method (Michalski et al., 1986).
Figure 4-29: Decision structures learned by AQDT-2 for different decision-making situations: a) using only descriptions of Car 1; b) using only descriptions of Car 2; c) using only descriptions of Car 3.
4.9 Experiments With Small-Size, Simple, and Noisy Problems: Congressional Voting Records (1984)
In the Congressional Voting problem, each example was described in terms of 16 attributes. There were two decision classes and a total of 216 examples. The experiments tested the change in the number of nodes and in the predictive accuracy when varying the number of training examples used for generating a decision tree by AQDT-2 and C4.5. This experiment was done with C4.5 using two window options: the default option (the maximum of 20% of the number of examples and twice the square root of the number of examples) and a 100% window size (one trial per setting). In the Congressional Voting-1984 problem, the sizes of the sets of training examples were 8%, 16%, 24%, 31%, 39%, and 52% of the total number of training examples (216 examples in total; half of the examples were in one class and the second half in the other class).
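The C4.5 default window formula quoted above can be computed directly; a small sketch:

```python
import math

def default_window_size(n_examples):
    """C4.5's default initial window: the maximum of 20% of the number of
    examples and twice the square root of the number of examples."""
    return int(max(0.2 * n_examples, 2 * math.sqrt(n_examples)))

print(default_window_size(216))  # 43, i.e., max(43.2, 29.4) truncated
```

For the 216-example voting data, the 20% term dominates; for smaller data sets the square-root term takes over.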
Table 4-8 and Figures 4-30a and 4-30b show the results graphically for the Congressional Voting-1984 problem. The results indicate that the AQDT-2-generated decision trees had higher predictive accuracy and were simpler than the decision trees produced by C4.5. Also, the variation of the size of the AQDT-2 trees with the size of the training example set was smaller.
Table 4-8: A tabular summary of the predictive accuracy of decision trees obtained by AQDT-2 and C4.5 for the Congressional Voting data.
Figure 4-30: Comparing decision trees for the Congressional Voting-84 data learned by C4.5 and AQDT-2: a) accuracy of the decision tree as a function of the size of the set of training examples; b) size of the decision tree as a function of the size of the set of training examples.
4.10 Analysis of the Results
This section includes an analysis of the results presented in Sections 4.2 to 4.9. The analysis covers the relationship between different characteristics of the input data and the learning parameters for both subfunctions of the approach. A set of visualization diagrams is used to illustrate the relationship between the concepts represented by decision rules and the concepts represented by the decision trees learned from these rules. This section also includes some examples of describing different decision-making situations and the task-oriented decision structures learned for each situation.
Table 4-9 shows the best parameter settings for learning decision rules with different databases. The information in this table is based on the predictive accuracy of decision trees learned by AQDT-2 from decision rules learned by AQ15c with different parameter settings (see Tables 4-2, 4-4, 4-5 and 4-6). Some heuristics were used in deriving this information. One heuristic was: if the difference in predictive accuracy between two widths of the beam search is less than 2%, then the smaller width is better. Another was: if the predictive accuracy of different types of covers varies (i.e., for one type of cover it is higher with some widths of the beam search or with a certain rule type, and lower with others), then the best cover is determined according to the best width of the beam search and the best rule type.
Table 4-9: Summary of the best parameter settings for the first subfunction of the approach with different data characteristics.
It was clear that AQDT-2 works better with characteristic rules than with discriminant ones. In most problems, when changing the width of the beam search of the AQ15c system, the changes in the predictive accuracy of decision trees learned by AQDT-2 were within 2%. Disjoint rules were better than intersecting rules for learning decision trees. Generally, decision trees learned from intersecting rules were slightly bigger than those learned from disjoint rules.
To analyze the comparative study between AQDT-2 and C4.5 for learning decision trees, a set of heuristics was used to summarize the results. These heuristics are: 1) if the difference between the average predictive accuracy of the two systems is within ±2%, the predictive accuracy is considered to be the same; otherwise, the predictive accuracy is considered higher or lower; 2) if the average learning time is within ±0.1 seconds, the learning time is considered the same.
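These two heuristics can be written down directly; the function names are illustrative:

```python
def compare_accuracy(acc_aqdt, acc_c45):
    """Heuristic 1: average accuracies (in %) within +/-2% count as the same;
    otherwise report the system with the higher accuracy."""
    if abs(acc_aqdt - acc_c45) <= 2.0:
        return "Same"
    return "AQDT-2" if acc_aqdt > acc_c45 else "C4.5"

def compare_time(t_aqdt, t_c45):
    """Heuristic 2: average learning times within +/-0.1 s count as the same;
    otherwise report the faster system."""
    if abs(t_aqdt - t_c45) <= 0.1:
        return "Same"
    return "AQDT-2" if t_aqdt < t_c45 else "C4.5"

print(compare_accuracy(91.5, 90.0))  # Same
print(compare_time(0.5, 1.2))        # AQDT-2
```

These are the rules used to fill the cells of Table 4-10, with a letter appended to "Same" to record which system had the slight edge.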
Table 4-10 shows a summary of the comparison between AQDT-2 and C4.5. The summary includes the predictive accuracy, the size of the learned decision trees, and the learning time. The value in each cell refers to the system which performed better (possible values are AQDT-2, C4.5, and Same). When the two systems produced similar or close results, a letter is associated with the value Same to indicate which one has the advantage (e.g., C for C4.5 and A for AQDT-2).

Table 4-10: Summary of the performance of AQDT-2 and C4.5 on different problems. The white cells show the system which performed better; Same(X) means similar performance of both systems, where AQDT-2 is slightly better if X=A and C4.5 is slightly better if X=C.
Some conclusions can be drawn from these comparisons. When the training data represents a small portion of the representation space, AQDT-2 produces bigger but more accurate decision trees, while C4.5 produces smaller but less accurate ones. When the training data represents a very large portion of the representation space, AQDT-2 usually produces smaller decision trees with better accuracy, except with noisy data. The size of the decision trees learned by C4.5 grows relatively larger as the training data increases. Also, C4.5 works better than AQDT-2 with noisy data. The reasons are that AQDT-2 over-generalizes the decision rules and C4.5 uses a window for learning decision trees. The learning time of AQDT-2 is expected to be much less than that of C4.5. However, on some data sets it takes more time because, in situations where there is not enough information to reach a decision, the program goes into a loop of testing all attributes. The probabilistic approach for handling this problem is not yet implemented.
To explain the relationship between the input to and the output from AQDT-2, and to explain some of the comparisons between AQDT-2 and C4.5, the rest of this subsection presents a set of diagrammatic visualizations (Wnek & Michalski, 1994) illustrating these issues.
Consider the MONK-2 problem; the original problem is used here for analyzing the AQDT-2 system. The experiment contains 169 training examples covering both the positive and negative decision classes. Figure 4-31 shows a visualization diagram of the decision rules learned by AQ15c. The shaded areas represent decision rules of the positive decision class; the white areas represent non-positive coverage.
Figure 4-31: A visualization diagram of decision rules learned by AQ15c for the MONK-2 problem.
Figure 4-32 shows the testing results of the decision rules learned by AQ15c. Shaded cells marked with an error symbol indicate false positive errors (AQ15c classifies the cell as positive while it should be negative); non-shaded cells marked with an error symbol indicate false negative errors (AQ15c classifies the cell as negative while it should be positive).
Figure 4-32: A visualization diagram showing the testing errors for the rules in Figure 4-31.
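The false positive / false negative bookkeeping behind these diagrams can be sketched generically; `predict` and `truth` stand for any rule-based classifier and target concept:

```python
def diagram_errors(cells, predict, truth):
    """Partition the errors over a representation space into false positives
    (predicted positive, actually negative) and false negatives
    (predicted negative, actually positive)."""
    fp = [c for c in cells if predict(c) and not truth(c)]
    fn = [c for c in cells if not predict(c) and truth(c)]
    return fp, fn

# Toy space: predict 'even number', true concept 'less than 5'.
fp, fn = diagram_errors(range(10), lambda c: c % 2 == 0, lambda c: c < 5)
print(fp)  # [6, 8]
print(fn)  # [1, 3]
```

Each marked cell in Figures 4-32 and 4-34 corresponds to one element of these two lists for the respective classifier.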
Figure 4-33 shows a visualization diagram of the decision tree learned by AQDT-2. In this diagram, cells with one shading indicate portions of the representation space that were classified as positive by both AQ15c and AQDT-2. Cells with a second shading are portions of the representation space that were classified as positive by AQ15c but as negative by AQDT-2. Cells with a third shading represent portions of the representation space where AQDT-2 over-generalized decision rules belonging to the positive decision class. The decision tree shown in this diagram was learned with the default settings (i.e., with a 10% generalization threshold).
This diagram shows the relationship between the input to and the output from AQDT-2. Unlike in the MONK-1 problem, over-generalizing the concept of the MONK-2 problem reduces the accuracy. This can be seen in Figure 4-34, which shows the errors made by the AQDT-2 decision tree.
Figure 4-33: A visualization diagram of the decision tree learned by AQDT-2 for the MONK-2 problem.
Figure 4-34 is similar to Figure 4-33, but illustrates the false positive and false negative errors. Cells of one marking indicate portions of the representation space with false positive errors; cells of the other marking represent portions of the representation space with false negative errors. Comparing Figures 4-34 and 4-32 shows that more errors occurred because of the over-generalization.
Figure 4-34: A visualization diagram showing the testing errors of the AQDT-2 decision tree.
Figure 4-35: A visualization diagram of a decision tree learned by AQDT-2 after reducing the generalization degree to 1%.
CHAPTER 5 CONCLUSION
5.1 Summary
This thesis introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A decision structure is an acyclic graph that defines a conditional order of tests for arriving at a classification decision for a given object or situation. Having higher expressive power than the familiar decision tree, a decision structure is able to represent some decision processes in a much simpler way than a decision tree.
The proposed methodology advocates storing the decision knowledge in the declarative form of decision rules, which are determined by induction from examples or by an expert. A decision structure is generated on line, when it is needed, and in the form most suitable for the given decision-making situation (i.e., a class of cases of interest). A criticism may be leveled against this methodology: in order to determine a decision structure from examples it is necessary to go through two levels of processing, while there exist methods that produce decision trees efficiently and directly from examples. Putting aside the issue that decision structures are more general than decision trees, it is argued here that this methodology has many advantages that fully justify it. The main advantages include: 1) the decision structures produced by the method in the experiments conducted had higher predictive accuracy and were simpler (sometimes significantly so) than decision trees produced from the same data; 2) decision structures produced from rules can be easily tailored to a given decision-making situation, i.e., they can avoid measuring expensive attributes or can put them in the lowest parts of the structure; 3) by storing decision knowledge in the declarative form of modular decision rules, the methodology makes it easy to modify decision knowledge to account for new facts or changing conditions; 4) the process of deriving a decision structure from a set of rules is very fast and efficient, because the number of rules per class is usually much smaller than the number of examples per class; and 5) the presented method produces decision structures whose nodes can be original attributes or constructed attributes that extend the original knowledge representation (this is due to the application of the constructive induction programs AQ17-DCI and AQ17-HCI). The price for these advantages is that the system has to generate decision rules first, and then create decision structures from them. In the AQDT-2 method this first phase is done by an AQ-algorithm-based rule learning method. While past implementations of AQ-based methods were computationally complex, the most recent implementation is very fast (Bloedorn et al., 1993); thus the decision rule generation phase can be done quite efficiently.
The current method has a number of limitations, and several issues need to be investigated further. First of all, there is a need for further testing of the method. Although the experiments conducted so far have produced more accurate and simpler decision structures than the decision trees obtained in a standard way from the same input data, more experiments are necessary to arrive at conclusive results. A mathematical analysis of the method has not been performed and is highly desirable. The current method generates only single-parent decision structures (every node has only one parent, as in a decision tree). Extending the method to generate full-fledged decision structures (in which a node can have several parents) will make it more powerful. It will enable the method to represent much more simply those decision processes that are difficult to represent by a decision tree (e.g., a symmetric logical function). The decision structures produced by the method are usually more general than the decision rules from which they were created (they may assign decisions to cases that the rules could not classify). Further research is needed to determine the relationship between the certainty of decision rules and the certainty of decision structures derived from them. The AQ-based program allows a user to generate both characteristic and discriminant decision rules (Michalski, 1983). There is a need to investigate the advantages and disadvantages of generating decision structures from different types of rules.
5.2 Contributions
The dissertation introduces the concept of a decision structure and describes a methodology for efficiently determining a single-parent structure from a set of decision rules. A major advantage of the proposed method is that it allows one to efficiently determine a decision structure that is optimized for any given decision-making situation. For example, when some attribute is difficult to measure, the method creates a decision structure that shows the situations in which measuring this attribute can be avoided. The method is quite efficient, and the time for determining a decision structure from decision rules in the cases we investigated was negligible. Therefore, it is easy to experiment with different criteria for structure generation in order to obtain the most desirable structure.

Another advantage of the AQDT-2 method is that decision structures obtained this way tend to be simpler and to have higher predictive accuracy than those obtained in the conventional way, i.e., directly from examples. In the experiments involving artificial problems and real-world problems, AQDT-2-generated decision structures outperformed those generated by the well-known C4.5 decision tree learning program on most problems, both in terms of average predictive accuracy and average simplicity of the generated decision trees.

The AQDT-2 method uses the AQ15 or AQ17-DCI learning programs for this purpose. Since the method is independent, it could potentially be applied also with other decision rule learning systems, or with decision rules acquired from an expert.
REFERENCES
Arciszewski, T., Bloedorn, E., Michalski, R., Mustafa, M. and Wnek, J. (1992), Constructive Induction in Structural Design, Report of the Machine Learning and Inference Laboratory, MLI-92-7, Center for AI, George Mason University.
Bergadano, F., Giordana, A., Saitta, L., De Marchi, D. and Brancadori, F. (1990), Integrated Learning in Real Domain, Proceedings of the 7th International Conference on Machine Learning (pp. 322-329), Austin, TX.
Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J. (1992), Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System, Machine Learning, Vol. 8, No. 1, pp. 5-43.
Bloedorn, E., Wnek, J., Michalski, R.S. and Kaufman, K. (1993), AQ17: A Multistrategy Learning System, The Method and User's Guide, Report of the Machine Learning and Inference Laboratory, MLI-93-12, Center for AI, George Mason University.
Bohanec, M. and Bratko, I. (1994), Trading Accuracy for Simplicity in Decision Trees, Machine Learning Journal, Vol. 15, No. 3, Kluwer Academic Publishers.
Bratko, I. and Lavrac, N. (Eds.) (1987), Progress in Machine Learning, Sigma Press, Wilmslow, England.
Bratko, I. and Kononenko, I. (1986), Learning Diagnostic Rules from Incomplete and Noisy Data, in B. Phelps (Ed.), Interactions in AI and Statistical Methods, Gower Technical Press.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Wadsworth Int. Group, Belmont, California.
Clark, P. and Niblett, T. (1987), Induction in Noisy Domains, in I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning, Sigma Press, Wilmslow.
Cestnik, B. and Bratko, I. (1991), On Estimating Probabilities in Tree Pruning, Proceedings of EWSL-91 (pp. 138-150), Porto, Portugal, March 6-8.
Cestnik, B. and Karalic, A. (1991), The Estimation of Probabilities in Attribute Selection Measures for Decision Tree Induction, Proceedings of the European Summer School on Machine Learning, July 22-31, Priory Corsendonk, Belgium.
Gaines, B. (1994), Exception DAGs as Knowledge Structures, Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases (pp. 13-24), Seattle, WA.
Hart, A. (1984), Experience in the Use of an Inductive System in Knowledge Engineering, in M. Bramer (Ed.), Research and Developments in Expert Systems, Cambridge University Press, Cambridge.
Hunt, E., Marin, J. and Stone, P. (1966), Experiments in Induction, Academic Press, New York.
Imam, I.F. and Michalski, R.S. (1993a), Should Decision Trees be Learned from Examples or from Decision Rules?, in J. Komorowski and Z.W. Ras (Eds.), Lecture Notes in Artificial Intelligence (689), pp. 395-404, Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Trondheim, Norway, June 15-18, Springer-Verlag.
Imam, I.F. and Michalski, R.S. (1993b), Learning Decision Trees from Decision Rules: A Method and Initial Results from a Comparative Study, Journal of Intelligent Information Systems (JIIS), Vol. 2, No. 3, pp. 279-304, L. Kerschberg, Z. Ras and M. Zemankova (Eds.), Kluwer Academic Pub., MA.
Imam, I.F., Michalski, R.S. and Kerschberg, L. (1993), Discovering Attribute Dependence in Databases by Integrating Symbolic Learning and Statistical Analysis Techniques, Proceedings of the AAAI International Workshop on Knowledge Discovery in Databases, Washington, DC, July 11-12.
Imam, I.F. and Vafaie, H. (1994), An Empirical Comparison Between Global and Greedy-like Search for Feature Selection, Proceedings of the Seventh Florida Artificial Intelligence Research Symposium (FLAIRS-94), pp. 66-70, Pensacola Beach, Florida, May.
Imam, I.F. and Michalski, R.S. (1994), From Facts to Rules to Decisions: An Overview of the FRD-1 System, Proceedings of the Fourth AAAI-94 Workshop on Knowledge Discovery in Databases, Seattle, Washington, July.
Kohavi, R. (1994), Bottom-Up Induction of Oblivious Read-Once Decision Graphs: Strengths and Limitations, Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), Vol. 1, pp. 613-618, AAAI Press / MIT Press.
Kohavi, R. and Li, C. (1995), Oblivious Decision Trees, Graphs, and Top-Down Pruning, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1071-1077, Montreal, Canada, August 20-25.
Mangasarian, O.L. and Wolberg, W.H. (1990), Cancer Diagnosis via Linear Programming, SIAM News, Vol. 23, No. 5, September, pp. 1-18.
Michie, D., Muggleton, S., Page, D. and Srinivasan, A. (1994), International East-West Challenge, Oxford University, UK.
Michalski, R.S. (1973), AQVAL/1: Computer Implementation of a Variable-Valued Logic System VL1 and Examples of its Application to Pattern Recognition, Proceedings of the First International Joint Conference on Pattern Recognition (pp. 3-17), Washington, DC, October 30-November 1.
Michalski, R.S. (1978), Designing Extended Entry Decision Tables and Optimal Decision Trees Using Decision Diagrams, Technical Report No. 898, University of Illinois, Urbana, March.
Michalski, R.S. (1983), A Theory and Methodology of Inductive Learning, Artificial Intelligence, Vol. 20 (pp. 111-116).
Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N. (1986), The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains, Proceedings of AAAI-86 (pp. 1041-1045), Philadelphia, PA.
Michalski, R.S. (1990), Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, Morgan Kaufmann Publishers, San Mateo, CA (pp. 63-111), June.
Michalski, R.S. and Imam, I.F. (1994), Learning Problem-Optimized Decision Trees from Decision Rules: The AQDT-2 System, Lecture Notes in Artificial Intelligence, Springer-Verlag, Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems (ISMIS), Charlotte, North Carolina, October 16-19.
Mingers, J. (1989a), An Empirical Comparison of Selection Measures for Decision-Tree Induction, Machine Learning, Vol. 3, No. 3 (pp. 319-342), Kluwer Academic Publishers.
Mingers, J. (1989b), An Empirical Comparison of Pruning Methods for Decision-Tree Induction, Machine Learning, Vol. 3, No. 4 (pp. 227-243), Kluwer Academic Publishers.
Niblett, T. and Bratko, I. (1986), Learning Decision Rules in Noisy Domains, Proceedings of Expert Systems '86, Brighton, Cambridge University Press, Cambridge.
Quinlan, J.R. (1979), Discovering Rules by Induction from Large Collections of Examples, in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press.
Quinlan, J.R. (1983), Learning Efficient Classification Procedures and Their Application to Chess End Games, in R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, Los Altos.
Quinlan, J.R. (1986), Induction of Decision Trees, Machine Learning, Vol. 1, No. 1 (pp. 81-106), Kluwer Academic Publishers.
Quinlan, J.R. (1987), Simplifying Decision Trees, International Journal of Man-Machine Studies, 27 (pp. 221-234).
Quinlan, J.R. (1990), Probabilistic Decision Trees, in Y. Kodratoff and R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III, Morgan Kaufmann Publishers, San Mateo, CA (pp. 63-111), June.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.
Smyth, P., Goodman, R.M. and Higgins, C. (1990), A Hybrid Rule-based/Bayesian Classifier, Proceedings of ECAI-90, Stockholm, August.
Sokal, R. and Rohlf, F. (1981), Biometry, Freeman Pub., San Francisco.
Thrun, S.B., Mitchell, T. and Cheng, J. (Eds.) (1991), The MONK's Problems: A Performance Comparison of Different Learning Algorithms, Technical Report, Carnegie Mellon University, October.
Wnek, J. and Michalski, R.S. (1994), Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments, Machine Learning, Vol. 14, No. 2, pp. 139-168, Kluwer Academic Publishers.
VITA
Ibrahim M. Fahmi Imam was born on March 5, 1965, in Cairo, Egypt. He received his B.Sc. in Mathematical Statistics from the Department of Mathematics, Cairo University, Egypt, in 1986. He received a Graduate Diploma in Computer Science and Information from the Department of Computer Science, Cairo University, Egypt, in 1989. He received his M.S. in Computer Science from the Department of Computer Science, George Mason University, Fairfax, Virginia, in 1992. Ibrahim worked as a knowledge engineer for a United Nations (UN) project for developing expert systems for improving crop management from 1989 to 1990. In 1990, he visited George Mason University for six months to conduct research in Machine Learning and Knowledge Acquisition. In the Fall of 1991, he joined the graduate program at GMU.
Over the past three years (1993-95), Ibrahim published over 11 papers in refereed journals, books, conference or workshop proceedings, and 4 technical reports. His system AQDT-2 was ranked highly in a worldwide competition (organized by Oxford University) on Machine and Human Intelligence. Two solutions obtained by that program ranked second and third in one competition. Two other solutions ranked sixth and seventh among 65 entries from around the world in another competition. Ibrahim is the founder and co-chair of the first international workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beach, Florida, 1995. He served on the organizing committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-95) and on the program committee of the Florida Artificial Intelligence Research Symposium (FLAIRS-96). He was also involved in organizing many other conferences and workshops, including the AAAI-93 and AAAI-94 conferences, the first and second Workshops on Multistrategy Learning (MSL-91 & MSL-93), and the IENAIE-94 conference. He is a member of the American Association for Artificial Intelligence (AAAI).
Ibrahim's Ph.D. dissertation, titled "Deriving Task-oriented Decision Structures from Decision Rules," was supervised by Professor Ryszard S. Michalski and supported by the Center for Machine Learning and Inference, GMU. The Center is supported in part by grants from ARPA, ONR, NSF, the Air Force, and other organizations. Ibrahim's research interests focus on the areas of machine learning, intelligent agents and adaptive systems, knowledge discovery in databases, hybrid classification, and knowledge-based systems.