
STUDY OF THE RELATIONSHIP OF TRAINING SET SIZE TO

ERROR RATE IN YET ANOTHER DECISION TREE

AND RANDOM FOREST ALGORITHMS

by

RATHEESH RAGHAVAN, B.E.

A THESIS

IN

COMPUTER SCIENCE

Submitted to the Graduate Faculty

of Texas Tech University in

Partial Fulfillment of

the Requirements for

the Degree of

MASTER OF SCIENCE

Approved

Susan Mengel

Chairperson of the Committee

Yu Zhuang

Accepted

John Borrelli

Dean of the Graduate School

May, 2006


ACKNOWLEDGMENTS

I thank Almighty God for providing the strength and knowledge to pursue this research work.

I would like to express my sincere gratitude to Dr. Susan Mengel, the chairperson of my committee, for guiding me through the research. In spite of her busy schedule, she was very helpful in solving problems and offered new insights. It was a pleasure to work alongside her, which has helped shape my academic objectives.

I am glad to have Dr. Yu Zhuang as my committee member. He provided me immense moral support during the entire course of my research. I would like to thank him for his full cooperation and support in the successful completion of my thesis research.

I would like to thank Ms. Colette Solpietro and the other members of the Office of Research Services for their support and confidence in me.

Finally, I would like to thank the most important people in my life, my parents and my brother. I am here because of their immeasurable confidence in me and the sacrifices they have made for me. I hope this work stands up to the high standards you have always expected of me.


ABSTRACT

Classification algorithms are among the most widely used data mining techniques for prediction. Among their different types, the decision tree is a classification predictive model with significant advantages over other techniques: it is easy to interpret, quick to construct, highly accurate, and uses fewer resources. The decision tree model can be developed by algorithms such as C4.5, CART, YaDT, and Random Forest, whose performance is measured by error rate. This thesis research studies the relationship of training data size to error rate for the YaDT and Random Forest algorithms, and also compares the performance of both with the results of C4.5 and CART.

This thesis research supports several conclusions. For example, the well-accepted 66.7:33.3 splitting ratio in the literature can be increased to 80:20 for large data sets with more than 1000 samples to generate more accurate decision tree models. The stability of all algorithms in the research is weak beyond a 90:10 ratio because very little testing data remains. This thesis research reveals that while YaDT performs similarly to C4.5 and CART, Random Forest performs significantly better than the other three. The performance of the models can be determined most reliably with large data sets.


TABLE OF CONTENTS

ACKNOWLEDGMENTS...................................................................................................ii

ABSTRACT.......................................................................................................................iii

LIST OF FIGURES...........................................................................................................vii

LIST OF TABLES ...........................................................................................................viii

CHAPTER........................................................................................................................... 1

I. INTRODUCTION .......................................................................................................... 1

1.1 Motivation of Thesis ......................................................................................... 3

1.2 Goal of Thesis ................................................................................................... 3

1.3 Thesis Organization........................................................................................... 3

II. LITERATURE REVIEW............................................................................................... 5

2.1 Classification in data mining............................................................................. 5

2.1.1 Decision Tree Model...................................................................................... 7

2.2 YaDT algorithm ................................................................................................ 8

2.2.1 Introduction .................................................................................................... 8

2.2.2 Entropy........................................................................................................... 8

2.2.3 Information Gain .......................................................................................... 10

2.2.4 Techniques to evaluate information gain for continuous attributes ............. 13

2.2.4.1 Technique One Quick Sort method............................................... 15

2.2.4.2 Technique Two Counting Sort method ......................................... 16

2.2.4.3 Technique Three Rain Forest algorithm........................................ 18

2.2.4.4 Selection of Technique.................................................................. 19

2.2.5 Pruning Decision Trees ................................................................................ 21

2.2.5.1 Example of EBP............................................................................ 23

2.2.6 Optimal Parameters of YaDT algorithm ...................................................... 25

2.2.7 Summary ...................................................................................................... 25

2.3 Random Forest algorithm ........................................................................................ 26

2.3.1 Introduction ....................................................................................................... 27


2.3.2 Gini Index .......................................................................................................... 27

2.3.3 Operation of Random Forest ............................................................................. 28

2.3.4 Optimal Parameters of Random Forest ............................................................. 34

2.3.5 Summary............................................................................................................ 35

2.4 Error Rate Estimation...................................................................................... 35

2.4.1 Holdout Method ........................................................................................... 36

2.4.2 Repeated Holdout Method ........................................................................... 36

2.4.3 Cross-Validation Method ............................................................................. 36

2.5 Evaluation of Algorithm Performance ............................................................ 37

2.5.1 Two-tailed Test Comparison........................................................................ 37

2.6 Related Studies................................................................................................ 39

2.7 Summary ......................................................................................................... 41

III. RESEARCH METHODOLOGY................................................................................ 45

3.1 Research Methodology ........................................................................................... 45

3.2 Software Selection .................................................................................................. 46

3.3 Data Collection ....................................................................................................... 47

3.3.1 Data Preparation........................................................................................... 48

3.4 Model Construction......................................................................................... 49

3.4.1 Model Training............................................................................................. 51

3.4.2 Model Testing .............................................................................................. 51

3.5 Relationship of Training data to Error rate ............................................................. 51

3.6 Comparison of Four algorithms .............................................................................. 52

3.7 Summary ......................................................................................................... 53

IV. RESEARCH RESULTS ............................................................................................. 55

4.1 Data Preparation.............................................................................................. 55

4.1.1 Optimal Parameter values ............................................................................ 56

4.1.1.1 YaDT algorithm ............................................................................ 56

4.1.1.2 Random Forest .............................................................................. 58

4.2 Error Rate Generation ..................................................................................... 58

4.3 Error Rate Variation ............................................................................................. 63


4.4 Relationship of Training Data Size to Error Rate.................................................... 67

4.5 Statistical Performance Comparison of Four Algorithms...................................... 71

4.6 Summary.................................................................................................................. 74

V. CONCLUSIONS AND FUTURE WORK .................................................................. 76

5.1 Conclusions ..................................................................................................... 76

5.2 Future Research............................................................................................... 78

BIBLIOGRAPHY............................................................................................................. 79

APPENDICES................................................................................................................... 81

A. DATA SET INFORMATION.......................................................................... 82

B. TABLES ......................................................................................................... 120

C. FIGURES........................................................................................................ 137


LIST OF FIGURES

1.1 Data mining as a step in the process of knowledge discovery...................................... 2

2.1 Classification Model ..................................................................................................... 6

2.2 Decision Tree Model..................................................................................................... 7

2.3 ACT split ..................................................................................................................... 12

2.4 Final Decision Tree ..................................................................................................... 13

2.5 First step of Binary Search .......................................................................................... 16

2.6 Second step of Binary Search...................................................................................... 16

2.7 Different types of fitting in model construction.......................................................... 22

2.8 Decision tree and when node Imp is decision node .................................................... 24

2.9 Decision tree when node Imp is selected as a leaf node.............................................. 25

2.10 Ensemble Learning.................................................................................................... 27

2.11 Partially Constructed Decision Tree by the CART Algorithm ................................. 33

2.12 Fully constructed decision tree by the CART/Random Forest ................................. 33

2.13 Holdout Method ........................................................................................................ 37

3.1 Research Methodology................................................................................................ 44

4.1 Error Rate Variation from the Connect-4 Data Set by YaDT..................................... 61

4.2 Error Rate Variation from the Connect-4 Data Set by Random Forest ...................... 61

4.3 Relationship of Training Data Size to Error Rate – Pen Digits Data Set.................... 64

4.4 Relationship of Training Data Size to Error Rate – Letter Recognition Data Set ...... 65

4.5 Relationship of Training Data Size to Error Rate – Chess Data Set ........................... 66


LIST OF TABLES

2.1 Balloons Dataset.......................................................................................................... 11

2.2 Information Gain of all attributes in the Balloon Dataset ........................................... 13

2.3 Weather dataset ........................................................................................................... 14

2.4 TEMP attributes with corresponding class values....................................................... 14

2.5 TEMP attribute array without repeated values............................................................ 15

2.6 TEMP attribute with corresponding entropies values ................................................. 15

2.7 Initialization of position in the array........................................................................... 17

2.8 Occurrence count of a value in the array..................................................................... 17

2.9 Count the number of lesser or equal values ................................................................ 18

2.10 Final Sorted array...................................................................................................... 18

2.11 Attribute TEMP with class values............................................................................. 19

2.12 Array AVC[c][i]........................................................................................................ 19

2.13 Techniques and their required number of steps to calculate information gain ......... 21

2.14 Results of the three techniques for the Weather dataset in Table 2.3 ........................ 22

2.15 Decision Node with the count of right and wrong predictions ................................. 23

2.16 Error count of two pruning methods for the decision tree in Fig. 2.8....................... 25

2.17 Insurance Dataset ...................................................................................................... 30

2.18 Partitions after the Binary Split on HOME_TYPE ≤ 6 by the Random Forest ......... 31

2.19 Partitions after the Binary Split on HOME_TYPE ≤ 10 by the Random Forest ...... 32

2.20 Gini Split Calculations for HOME_TYPE Attribute ................................................. 32

2.21 Random Tree values for a Single Record.................................................................. 35

3.1 Machine Specifications ............................................................................................... 45

3.2 Characteristics of the selected data sets ...................................................................... 46

3.3 Pruning parameters of YaDT algorithm for Chess data set ........................................ 48

4.1 Splitting Results of the Optical Digits Data Set.......................................................... 52

4.2 Pruning Results of OptDigits Dataset in YaDT algorithm.......................................... 53

4.3 Average Error Rate from all Data Sets by YaDT and Random Forest algorithm....... 55

4.4 Results from the Chess Data Set by Random Forest with Boosting ........................... 57


4.5 Results from the Chess Data Set by YaDT with Boosting.......................................... 57

4.6 Complete Results of OptDigits dataset in YaDT ........................................................ 59

4.7 Complete Results of OptDigits dataset in Random Forest.......................................... 60

4.8 Average error rate of all four algorithms in Pen Recognition dataset......................... 63

4.9 Abbreviated table YaDT vs. the other three algorithms.............................................. 68


CHAPTER I

INTRODUCTION

Every organization collects a large amount of data regarding customer profiles,

business transactions, market interests and other valuable information. However, this data

is seldom used in discovering important decision-making information. A pressing need

exists for turning such data into useful knowledge which can be used for prediction of

future trends.

Useful knowledge can be generated automatically from large data collections by data

mining methods. “Data Mining can be defined as the technology of employing one or

more computer learning techniques to automatically analyze and extract knowledge from

data contained within a database” [RIC 2000, pg 4]. This innovative, interdisciplinary

technology involves the integration of techniques from multiple disciplines, such as

database technology, statistics, machine learning and information retrieval. Specifically,

data mining searches large stores of data for patterns and delivers results that can be

utilized either in an automated decision support system or assessed by human analysts.

Data mining, however, is only a part of the extensive automatic knowledge extraction

procedure called KDD (Knowledge Discovery and Data Mining). A KDD process is an

iterative sequence of operations consisting of data cleaning and integration, data selection

and transformation, data mining, and pattern evaluation leading to knowledge

presentation. Figure 1.1 illustrates the classic KDD process.


Fig 1.1: Data mining as a step in the process of knowledge discovery [HAN 2000, Pg. 6]

An important technique in data mining is classification. For classification algorithms,

a set of training examples is used to find a model for the value of a user-specified target

attribute (the class) as a function of the values of other attributes in the data set. The

resulting model is applied on a set of testing examples and used to predict the class of

examples whose class is unknown. For example, credit card applicants can be classified

as good or bad credit risks based on the factors affecting their finances.

Model performance is measured by error rate. Error rate is defined as the ratio of the

number of test set errors to the number of test set instances. This measure is very

important in comparing different classification models.
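As a minimal illustration of this definition (hypothetical values, not from the thesis), the error rate of a model on a test set can be computed as:

```python
def error_rate(predicted, actual):
    """Ratio of the number of test set errors to the number of test set instances."""
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual)

# Hypothetical example: 2 wrong predictions out of 10 test instances.
print(error_rate([1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
                 [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]))   # 0.2
```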

Classification models may be in the form of if-then statements or decision trees which

may be used in decision making. Two well known algorithms for constructing decision


trees are C4.5 (Quinlan 1993) and CART (Breiman 1984). More recent algorithms are

YaDT (Ruggieri 2004) and Random Forest (Breiman 2001), which seem to offer better

scalability and accuracy.

1.1 Motivation of Thesis

As mentioned, an important parameter to determine the efficiency and future

performance of a decision tree model is error rate computation. However, error rate can

be misleading because studies in the literature do not always show the relationship

between error rate and the size of training data. Therefore, a research study on the

relationship of training data size to error rate would be helpful in improving and

understanding the performance of decision tree algorithms.

1.2 Goal of Thesis

The primary goal of this thesis research is to study the relationship of training data

size to error rate for the two decision tree algorithms: YaDT (acronym for Yet another

Decision Tree Builder) and Random Forest. Specifically, this research continues the work of [ZHENG2004], who investigated C4.5 and CART with regard to the relationship between error rate and training data size. The two algorithms, YaDT and Random Forest, are chosen due to their similarity to the C4.5 and CART algorithms and to their recent

publication. The results of all four decision tree algorithms are assessed by the statistical

comparison method, the two tailed test.


1.3 Thesis Organization

This thesis consists of five chapters. Chapter I presents an overview of this

research. Chapter II provides a detailed introduction of the literature significant to this

research including general issues and examples of the classification algorithms. In

Chapter III, the methodology of this research is discussed and presented. The

comprehensive research description and results of the classification models are discussed

in Chapter IV. Finally, Chapter V presents the conclusions of this research and the

potential future research directions.


CHAPTER II

LITERATURE REVIEW

This chapter details the literature relevant to this thesis research. Section 2.1 introduces the basic concepts of classification and decision tree modeling. Sections 2.2 and 2.3 discuss the two algorithms, YaDT and Random Forest. Section 2.4 details the error rate estimation methods, and Section 2.5 the statistical evaluation of algorithm performance. Section 2.6 reviews related studies, and Section 2.7 summarizes the important issues and topics covered in this chapter.

2.1 Classification in data mining

Classification modeling is used to predict a target or dependent variable in a data set which can be partitioned into n classes or categories. When a classification

modeling algorithm has no prior information about the target variable, however, it forms

groups of classes of similar objects or clusters. When target variables are present,

classification models are built using supervised methods; otherwise, unsupervised

methods are used.

With supervised methods, a data set is supplied to the classification algorithm to

construct the model. The data set is split into two parts, the training data set and the

testing data set. The training data set is constructed by randomly selecting records from

the data set while preserving the original class distribution. The remaining records of the

data set form the testing data set.


A classification model is constructed by using classification algorithms, such as

C4.5 [QUIN], CART [BRE] or YaDT [YaDT2004]. C4.5 is a linear-search based

algorithm which uses entropy or average amount of uncertainty in the dataset to build

decision trees. CART is a decision tree algorithm developed by Dr. Leo Breiman which

uses a gini index or measure of inequality in the dataset for tree construction. YaDT is an

improved form of C4.5 by replacing the linear search with binary search and better tree

pruning methods.

Classification algorithms build models which are represented in the form of rules,

decision trees, or mathematical formulas. The testing data set is used on the model to

determine its accuracy through the error rate. The error rate is the ratio of incorrectly predicted samples to the total number of samples in the testing data set. If the error rate of

the model is acceptable, the model can be used to classify future data records; otherwise,

the classification model is rebuilt until an adequate accuracy rate is achieved. Figure 2.1

illustrates the classification process.

(Diagram: the training set is used to generate an interim classification model; the interim model is applied to the testing set and adjusted to reduce the error rate, yielding the final classification model.)

Fig 2.1: Classification Model


In addition to decision trees, neural networks, case based reasoning, genetic

algorithms, and the rough set approaches are some of the classification techniques used to

construct predictive models. However, no single method has been found to be superior to all others for all data sets [STATLOG].

2.1.1 Decision tree model

A decision tree is a flowchart-like tree structure where each internal node denotes

a test on an attribute, each branch represents an outcome of the test, and leaf nodes

represent classes or class distribution. The topmost node is the root node [JIAWEI2003].

The construction of a decision tree is accomplished by means of a split where the

records in a node are divided into sub-nodes. The same procedure is applied on each of

the child nodes recursively until no more nodes can be split. Those nodes which cannot

be split any further to develop better prediction are called leaf nodes.

Figure 2.2 is an illustration of the decision tree model depicting the internal

nodes, branches, and leaf nodes.

(Diagram: the root node OUTLOOK splits on Sunny and Rain; the Sunny branch tests HUMIDITY (High/Normal) and the Rain branch tests WINDY (True/False), leading to IN and OUT leaf nodes.)

Fig 2.2: Decision Tree Model [RIC2000, pg. 22]


The traversal of any given path from the root to any leaf leads to the construction

of a decision rule. In any decision tree, the collections of all the possible decision rules

make up the whole decision tree itself.

The general form of a decision rule is “If antecedent, then consequent”. The

antecedent represents the attribute values along the branches of the particular path

through the tree. The consequent corresponds to the classification value for the target

variable at the particular leaf node. The decision rules for the decision tree illustrated in

the Figure 2.2 are

• If Outlook is Sunny and Humidity is High, then stay In

• If Outlook is Sunny and Humidity is Normal, then move Out

• If Outlook is Rain and Windy is True, then stay In

• If Outlook is Rain and Windy is False, then move Out

2.2 YaDT algorithm

2.2.1 Introduction

Dr. Salvatore Ruggieri of the University of Pisa, Italy, developed the EC4.5

algorithm [EC4.5] by improving the tree construction phase of the C4.5 algorithm

[QUIN]. Significant improvements in the time/memory utilization were found in EC4.5

algorithm against C4.5. However, the tree pruning process in EC4.5 was not efficient

and, therefore, the YaDT algorithm was developed by incorporating a new error based

pruning process. YaDT is an acronym for Yet another Decision Tree Builder and is


considered among the best classification algorithms in terms of time, space and accuracy.

The three algorithms C4.5/EC4.5/YaDT develop models relying on information measures

like entropy and information gain. They are discussed in detail in the following sections.

2.2.2 Entropy

Entropy is the measure which provides the average amount of uncertainty

associated with a set of probabilities [DtWEB]. It is used in ID3/C4.5/YaDT algorithms

to determine how disordered the attributes are in the dataset. Entropy is related to

information, in the sense that the higher the entropy, or uncertainty, of some data, then

the more information is required in order to completely describe that data. To understand

entropy, consider a class probability distribution $P = (p_1, p_2, ..., p_k)$ in a data set S; then the entropy of S can be defined as:

$Ent(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$   (2.1)

where $p_i$ is the probability of class i in S, determined by dividing the number of samples of class i by the total number of samples in S. Consider two class distributions $C_1 = \left(\frac{7}{10}, \frac{3}{10}\right)$ and $C_2 = \left(\frac{5}{10}, \frac{5}{10}\right)$ whose entropy values are E1 and E2 respectively. Replacing these values in formula (2.1), the following is obtained:

$E_1 = E\left(\frac{7}{10}, \frac{3}{10}\right) = -\left(\frac{7}{10}\log_2\frac{7}{10} + \frac{3}{10}\log_2\frac{3}{10}\right) = 0.2652$   (2.2)

$E_2 = E\left(\frac{5}{10}, \frac{5}{10}\right) = -\left(\frac{5}{10}\log_2\frac{5}{10} + \frac{5}{10}\log_2\frac{5}{10}\right) = 0.3010$   (2.3)

From equations (2.2) and (2.3), entropy E2 is higher than E1, which implies that class distribution C2 is more uncertain than class distribution C1.

In building a decision tree, the aim is to decrease the entropy of the dataset until

the leaf nodes are reached with zero entropy representing instances all of one class (all

instances have the same value for the target attribute).
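As an illustrative sketch (not part of the thesis), equation (2.1) can be computed directly; using natural logarithms reproduces the value 0.6931 that appears later in equation (2.5), while math.log2 gives the entropy in bits.

```python
import math

def entropy(probabilities, log=math.log):
    """Entropy of a class probability distribution, as in equation (2.1).

    `log` defaults to the natural logarithm; pass math.log2 for base 2.
    """
    return -sum(p * log(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 0.6931..., the value used in equation (2.5)
print(entropy([0.7, 0.3]))   # 0.6109... (less uncertain than the 50/50 split)
print(entropy([1.0]))        # 0.0 -> a pure leaf node, all one class
```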

2.2.3 Information Gain

Information gain is utilized by the ID3/C4.5/YaDT family of algorithms as a

measure of the effectiveness of an attribute in classifying the training data. Information

gain is computed by measuring the difference between the entropy of the dataset before

the split and the overall entropy of the dataset after the split. The attribute with the

highest information gain is assumed to be the best splitting attribute and is the first

attribute to split the dataset. Discrete attributes normally appear once in a decision tree,

however, continuous attributes may appear more than once along any path through the

tree.

If the dataset D is split into n subsets with attribute X, the information gain of X in

a dataset D is calculated with the equation below:

$Gain(X, D) = E(D) - \sum_{i=1}^{n} P(D_i)\, E(D_i)$   (2.4)

where $Gain(X, D) \in [0, \infty)$ and

$E(D)$ is the entropy of the data set before the split on X;

$E(D_i)$ is the entropy of subset i after the split on X;

$P(D_i)$ is the probability of subset i after the split on X.

As an example, consider Table 2.1.

Table 2.1: Balloons Dataset [UCI2000]

ATTRIBUTES CLASS

Color Size Act Age Inflated

1 Yellow Small Stretch Adult True

2 Purple Large Stretch Adult True

3 Yellow Small Dip Child False

4 Yellow Small Stretch Child False

5 Purple Small Dip Child False

6 Purple Small Dip Adult True

7 Yellow Large Dip Adult True

8 Purple Large Stretch Child False

9 Yellow Large Dip Child True

10 Purple Small Dip Child False


With 10 different records, four attributes and one class, the possible class values are True/False and are based on the values of the attributes. Five records have the class value True while the remaining five are False; so, the class probability distribution of the dataset is:

$P(p_1, p_2) = \left(\frac{5}{10}, \frac{5}{10}\right)$

The calculation of the entropy of the data set before any attribute split, based on equation (2.1), is:

$E\left(\frac{5}{10}, \frac{5}{10}\right) = -\left(\frac{5}{10}\log_2\frac{5}{10} + \frac{5}{10}\log_2\frac{5}{10}\right) = 0.6931$   (2.5)

Consider the attribute ACT in the Balloons dataset as the first attribute used to split the data. Since the attribute ACT has only two values, two distinct subsets B1 and B2 are formed. The decision tree after the first split is shown in Figure 2.3.

(Figure: the ACT split produces subset B1 on the Stretch branch with class values True, True, True, False, and subset B2 on the Dip branch with class values False, False, True, True, False, False.)

Figure 2.3: ACT split

The individual probability of each subset is $P(B_1) = \frac{4}{10}$ and $P(B_2) = \frac{6}{10}$, and the class distribution in each subset is $P(b_1) = \left(\frac{3}{4}, \frac{1}{4}\right)$, $P(b_2) = \left(\frac{4}{6}, \frac{2}{6}\right)$. The overall entropy of all the subsets after the split in the Balloon data set is

$OE[\mathrm{Balloon}, Act] = \frac{4}{10} E\left(\frac{3}{4}, \frac{1}{4}\right) + \frac{6}{10} E\left(\frac{4}{6}, \frac{2}{6}\right)$

$= -\frac{4}{10}\left(\frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4}\right) - \frac{6}{10}\left(\frac{4}{6}\log_2\frac{4}{6} + \frac{2}{6}\log_2\frac{2}{6}\right) = 0.6067$   (2.6)

The information gain of ACT is

$InformationGain[\mathrm{Balloon}, Act] = E[\mathrm{Balloon}] - OE[\mathrm{Balloon}, Act] = 0.6931 - 0.6067 = 0.0864$   (2.7)

Table 2.2 contains the information gain for the remaining attributes.

Table 2.2: Information Gain of all attributes in the Balloon Dataset

ATTRIBUTE COLOR SIZE AGE ACT

Information gain 0.0523 0.0672 0.0724 0.0864

Because the attribute ACT has the highest information gain, it is chosen to be the splitting

attribute. The complete decision tree is shown in Figure 2.4


Figure 2.4 Final Decision Tree
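The split-evaluation steps of equations (2.4)-(2.7) can be sketched as follows (illustrative code, not from the thesis; natural logarithms are assumed, which reproduces the 0.6931 and 0.6067 values, and the subset class counts are those of Figure 2.3).

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    """Equation (2.4): entropy before the split minus the weighted
    entropy of the subsets produced by the split."""
    n = sum(parent_counts)
    overall = sum(sum(s) / n * entropy(s) for s in subsets)
    return entropy(parent_counts) - overall

# Balloons data set split on ACT: B1 (Stretch) = 3 True / 1 False,
# B2 (Dip) = 4 True / 2 False; the full data set has 5 True / 5 False.
print(information_gain([5, 5], [[3, 1], [4, 2]]))   # ~0.086, equation (2.7)
```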

2.2.4 Techniques to evaluate information gain for continuous attributes

Decision tree algorithms developed in the past have implemented a linear-based

search technique to compute information gain for continuous attributes. It is also used to

find the local threshold, the value at which gain is maximal.

Table 2.3: Weather dataset

OUTLOOK   TEMP  HUMIDITY  WINDY  PLAY
Sunny     85    85        False  No
Sunny     80    90        True   No
Overcast  83    86        False  Yes
Rainy     70    96        False  Yes
Rainy     68    80        False  Yes
Rainy     65    70        True   No
Overcast  64    65        True   Yes
Sunny     72    95        False  No
Sunny     69    70        False  Yes
Rainy     75    80        False  Yes
Sunny     75    70        True   Yes
Overcast  72    90        True   Yes
Overcast  81    75        False  Yes
Rainy     71    91        True   No


Consider the dataset in Table 2.3 and the continuous attribute TEMP, on which the node is to be split at a threshold value t. Initially, the values of the attribute are taken along with their corresponding class values.

Table 2.4: TEMP attributes with corresponding class values

TEMP 85 80 83 70 68 65 64 72 69 75 75 72 81 71

Play N N Y Y Y N Y N Y Y Y Y Y N

Each distinct value is taken only once; repeated values are not placed independently. This arrangement is then sorted using the quick sort algorithm and displayed in Table 2.5.

Table 2.5: TEMP attribute array without repeated values

TEMP 64 65 68 69 70 71 72 75 80 81 83 85

Play Y N Y Y Y N N/Y Y/Y N Y Y N

In the next step, a new value v is derived by calculating the average of two

adjacent attribute values. Using the new values, their corresponding entropies are

calculated and tabulated in Table 2.6

Table 2.6: TEMP attribute with corresponding entropies values

TEMP 63.5 64.5 66.5 68.5 69.5 70.5 71.5 73.5 77.5 80.5 82 84

Play Y N Y Y Y N N/Y Y/Y N Y Y N

Entropy .283 .268 .279 .282 .278 .269 .282 .282 .275 .282 .279 .248

The lowest entropy value implies the best split value, and in this case the best split is at TEMP = 84. However, the algorithm does not use this value in the final result. The values in Table 2.5 are linearly searched to find the actual lower TEMP value which is closest to 84. At the end of the search, the TEMP value 83 is chosen to be the threshold or splitting value of the continuous attribute.

To improve upon the performance of handling continuous data, YaDT chooses the

best of the three techniques to evaluate the information gain and threshold of continuous

attributes. The functioning of the three techniques is discussed in detail in the following

sections.

2.2.4.1 Technique One Quick sort Method

The first technique of YaDT is similar to that employed by the C4.5 algorithm for

the tasks of sorting and finding the local threshold. The difference lies in the usage of a

binary search instead of linear search for finding the threshold from the local attribute

values. The example illustrates the working of the binary search in determining the

closest value to 84 which was mentioned in the previous section.

TEMP  64  65  68  69  70  71  72   75   80  81  83  85
Play  Y   N   Y   Y   Y   N   N/Y  Y/Y  N   Y   Y   N
(pivot = 71)

Fig 2.5: First step of Binary Search

The midpoint value, 71, is chosen as the pivot and is compared to the search parameter. Since the pivot is smaller, the desired value lies to the right of the pivot. This halves the search space and improves performance.

TEMP  72   75   80  81  83  85
Play  N/Y  Y/Y  N   Y   Y   N
(pivot = 81)

Fig 2.6: Second step of Binary Search

As illustrated in Figure 2.6, a new pivot, in this case 81, is chosen in the next step

and compared to the search parameter. This halving procedure is continued until the

correct value is obtained.
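A small sketch (not the thesis's implementation) of the binary search used by Technique One to locate the largest attribute value not exceeding the candidate threshold of 84; the pivots chosen may differ slightly from Figures 2.5 and 2.6 depending on the midpoint convention.

```python
def largest_not_exceeding(sorted_values, threshold):
    """Binary search for the largest element <= threshold."""
    lo, hi, answer = 0, len(sorted_values) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_values[mid] <= threshold:
            answer = sorted_values[mid]   # candidate; keep searching right
            lo = mid + 1
        else:
            hi = mid - 1                  # too large; search the left half
    return answer

temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
print(largest_not_exceeding(temps, 84))   # 83, the chosen threshold
```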

2.2.4.2 Technique Two Counting Sort Method

The second technique of YaDT calculates the threshold value using the on-the-fly counting sort algorithm for sorting the threshold values. Counting sort is a linear time sorting algorithm used to sort items when they belong to a fixed and finite set. It has an order of O(n), but since it runs an extra scan of the data the order is in reality O(n + k).

An example of the technique is given below. The size of the initialized array with

attribute TEMP is equal to the number of values or 14 in this example. Each value is

assigned to its array position value as displayed in Table 2.7.

Table 2.7: Initialization of position in the array.

Position 0 1 2 3 4 5 6 7 8 9 10 11 12 13

TEMP 85 80 83 70 68 65 64 72 69 75 75 72 81 71

Auxiliary 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Since there are repeated values, the array size is reduced by two and this step is

followed by setting the number of occurrences of each value in the corresponding

auxiliary array. The result is tabulated in Table 2.8.

Table 2.8: Occurrence count of a value in the array.

Position   0   1   2   3   4   5   6   7   8   9   10  11
TEMP       85  80  83  70  68  65  64  72  69  75  81  71
Auxiliary  1   1   1   1   1   1   1   2   1   2   1   1

The next step involves counting the number of occurrences of values less than or equal to each value of the TEMP attribute.

Table 2.9: Count the number of lesser or equal values.

Position 0 1 2 3 4 5 6 7 8 9 10 11

TEMP 85 80 83 70 68 65 64 72 69 75 81 71

Auxiliary 11 8 10 4 2 1 0 6 3 7 9 5

Based on the values in the auxiliary array, the TEMP attribute is rearranged to

form the sorted array. The final sorted array is shown in Table 2.10.

Table 2.10: Final Sorted array.

TEMP 64 65 68 69 70 71 72 75 80 81 83 85

Auxiliary 0 1 2 3 4 5 6 7 8 9 10 11

A counting sort is a sort algorithm which is efficient when the range of keys is

small and there are many duplicate keys. The first pass counts the occurrences of keys in

an auxiliary array, and then makes a running total so that each auxiliary entry is the number of keys less than or equal to the corresponding value in the array. The second pass calculates the

information gain at each value of the attribute. This technique works very well when

there are many occurrences of a single value in the dataset.
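The following is a rough sketch (assumed code, not the YaDT implementation) of a counting-sort pass over the TEMP values: one scan counts occurrences, and a running total over the key range yields the distinct values in sorted order, as in Tables 2.7-2.10.

```python
from collections import Counter

def counting_sort_distinct(values):
    """Counting sort for keys drawn from a small finite range.

    Counts occurrences in one pass, then scans the key range in order,
    accumulating a running total of how many values are less than or
    equal to each distinct key.
    """
    counts = Counter(values)
    sorted_keys, less_or_equal, running = [], {}, 0
    for key in range(min(values), max(values) + 1):
        if counts[key]:
            running += counts[key]
            sorted_keys.append(key)
            less_or_equal[key] = running
    return sorted_keys, less_or_equal

temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
order, cumulative = counting_sort_distinct(temps)
print(order)            # [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
print(cumulative[75])   # 10 of the 14 values are <= 75
```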

2.2.4.3 Technique Three Rain Forest Algorithm

The third technique used in YaDT to calculate information gain is the main-

memory implementation of the Rain Forest algorithm [RAINFOREST]. It is to be noted


that this technique does not require sorting at all in the calculation of information gain.

The functioning of the algorithm can be elaborated by considering a continuous attribute

a. The range of indexes of attribute a for cases at a node is [low, high]. Table 2.11 shows

the initial table with the continuous attribute and class value.

Table 2.11: Attribute TEMP with class values.

TEMP 85 80 83 70 68 65 64 72 69 75 75 72 81 71

CLASS 0 0 1 1 1 0 1 0 1 1 1 1 1 0

An array AVC[c][i] is filled in with the weighted sum of distinct cases with class c and attribute value index i, with i ∈ [low, high]. The array contains all the required information to calculate information gain and is shown in Table 2.12.

Table 2.12: Array AVC[c][i]

TEMP      85  80  83  70  68  65  64  72  69  75  81  71
CLASS 0   1   1   0   0   0   1   0   1   0   0   0   1
CLASS 1   0   0   1   1   1   0   1   1   1   2   1   0

The next step is to calculate information gain. Information gain is computed by considering splits at the different possible values using the AVC array. The value v' at which information gain is maximal is used as the splitting value.

This method requires (high − low + 1)·NClass steps to initialize AVC, |T| steps to fill in the AVC array, and (high − low + 1)·NClass steps to compute the information gain. The Rainforest algorithm is a generic decision tree schema. Although it is not explained clearly in [YaDT2003], in most cases this technique proves to be better than its peers.
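A minimal sketch (assumed code, not the implementation described in [YaDT2003]) of filling the AVC array of Table 2.12 in a single unsorted scan:

```python
from collections import defaultdict

def build_avc(values, classes):
    """AVC[c][v]: count of cases having class c and attribute value v.

    No sorting is required; one scan of the data fills the array, and the
    information gain can then be evaluated at every distinct value.
    """
    avc = defaultdict(lambda: defaultdict(int))
    for v, c in zip(values, classes):
        avc[c][v] += 1
    return avc

temp  = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
label = [ 0,  0,  1,  1,  1,  0,  1,  0,  1,  1,  1,  1,  1,  0]
avc = build_avc(temp, label)
print(avc[1][75])   # 2 -> both records with TEMP 75 belong to class 1
print(avc[0][75])   # 0
```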


2.2.4.4 Selection of Technique

This section is based on selection of technique as mentioned in [YaDT2003].

Consider the range of indexes [high – low] of attribute values for which information gain

has to be calculated. The value of integer d is equated to (high-low+1). T is the number of

cases in the dataset.

The working of techniques one and two is similar except for the choice of the sorting algorithm. The counting sort algorithm requires 2·d + 2·|T| steps in contrast to quick sort's c·|T|·log(c'·|T|) steps. Counting sort outperforms quick sort on most occasions, namely whenever

$2 \cdot d + 2 \cdot |T| \le c \cdot |T| \cdot \log(c' \cdot |T|)$   (1)

where the functions c and c' model multiplicative constants due to the actual implementations of the algorithms. Assuming α = d/|T|, equation (1) can be rephrased as

$\frac{2 \cdot d}{|T|} + 2 \le c \cdot \log(c' \cdot |T|)$

$2\alpha + 2 \le c \cdot \log(c' \cdot |T|)$

$4^{(\alpha + 1)/c} / c' \le |T|$   (2)

From the experiments, it was found that when α = 16, $4^{(\alpha + 1)/c}/c'$ is approximately 1. Therefore statement (2) can be simplified down to

$4^{(\alpha - 16)/c} \le |T|$   (3)

On experimental approximation, the value of c is determined to be 3/4 [EC4.5].


Unlike the previous techniques, Technique 3 does not sort and, therefore, requires |T| + 2·d·NClass steps to calculate the information gain of an attribute, where NClass is the number of classes present in the dataset.

Table 2.13: Techniques and their required number of steps to calculate information gain

TECHNIQUE        No. of steps required
Technique One    2·|T|
Technique Two    4^((α − 16)/c) + |T|
Technique Three  |T| + 2·d·NClass

By calculating the steps required for determining the information gain of the

dataset in Table 2.3, the operations of the techniques can be explored. Since there are 14

records, the value of |T| = 14. For the dataset in Table 2.3, the value of d is calculated to be 16, so α is equal to 1.14. The value of NClass is 2 because there are two

distinct classes in the dataset.

Steps for Technique One = Steps to sort the array + Steps to calculate information gain
= |T| + |T| = 14 + 14 = 28 steps

Steps for Technique Two = Steps to sort the array + Steps to calculate information gain
= 4^((α − 16)/c) + |T| = 13.18 + 14 = 27.18 ≈ 27 steps

Steps for Technique Three = Steps to calculate information gain
= |T| + 2·d·NClass = 14 + 2·16·2 = 78 steps

Table 2.14: Results of the three techniques for the Weather dataset in Table 2.3

TECHNIQUE        No. of steps to calculate information gain
Technique One    28
Technique Two    27
Technique Three  78

The technique that requires the least number of steps to calculate information gain and the split value is chosen by YaDT; therefore, it is evident that technique two is used to determine the information gain and split value in the dataset. The general selection of technique is based on the criteria given below:

• If α ≤ 1/(NClass − 1), technique (3) is chosen.

• Otherwise, if α ≤ 16 or 4^((α − 16)·4/3) ≤ |T|, technique (2) is chosen.

• In the remaining cases, technique (1) is selected.
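Read literally, the selection rules above can be sketched as follows (an illustrative transcription with c = 3/4 as stated in the text; the actual YaDT code may differ):

```python
def choose_technique(d, t, n_class, c=0.75):
    """Select the information-gain technique for a continuous attribute.

    d: number of distinct attribute values (high - low + 1)
    t: number of cases |T| at the node
    n_class: number of classes in the data set
    """
    alpha = d / t
    if n_class > 1 and alpha <= 1 / (n_class - 1):
        return 3                                  # AVC array, no sorting
    if alpha <= 16 or 4 ** ((alpha - 16) / c) <= t:
        return 2                                  # counting sort
    return 1                                      # quicksort + binary search

print(choose_technique(d=16, t=14, n_class=2))    # 2, as in the worked example
```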

2.2.5 Pruning decision Trees

Theoretically, a decision tree can be grown so as to have a zero error rate on the training set. However, such prediction models suffer from the phenomenon of 'overfitting'. Overfitting occurs when decision trees characterize too much detail or noise in the training data. Figure 2.7 illustrates the different types of fitting during model construction.

Fig 2.7: Different types of fitting in model construction [PRIN04, pg. 7].

In the underfit, the model develops without considering most of the points and has

lower accuracy. In an ideal fit, an appropriate number of the points are measured to build

the model. In an overfit scenario, all the points including noise and unnecessary outliers

are considered which flaws the model accuracy.

To avoid overfitting, C4.5 implements error-based pruning (EBP). It uses the training set errors at each node, and the decision tree is pruned in a single bottom-up traversal. At each decision node, three corresponding estimated error rates are calculated and the lowest is chosen to prune the node:

a) The error in case the node is turned into a leaf.

b) The sum of errors of child nodes in case the node is left as a decision

node.

c) The error of grafting a child sub-tree in place of the node.

In YaDT, only methods (a) and (b) are conducted. It was found that as the dataset

size increases, the time and memory requirements of (c) were large and did not merit

consideration. Methods (a) and (b) are sufficient in most cases and consistently reduce

the decision tree size.


2.2.5.1 Example of EBP

Consider a decision tree in which each node shows the number of correct predictions on the left and the number of incorrect predictions on the right.

Table 2.15 Decision Node with the count of right and wrong predictions.

A1

Correct (5) Wrong (5)

Figure 2.8 illustrates a completed decision tree with the prediction results at each

decision node. There are four levels of nodes in the decision tree. The attribute A3 is used

as the node to explain the example.

(Figure 2.8 shows a four-level decision tree: the root A1 (12 correct, 10 wrong); internal nodes A2 (9, 9) and A2 (3, 1); leaves FALSE (1, 0), TRUE (3, 0) and FALSE (4, 5); and the node Imp = A3 (5, 4), whose two Level 4 leaf children are TRUE (3, 4) and FALSE (2, 0).)

Figure 2.8: Decision tree and when node Imp is decision node

In the first method, the node Imp is left as a decision node. At that point the sum

of errors in its child nodes is four. This is calculated by summation of the errors (4 + 0) of

the two nodes at Level 4.


In the second method, the node Imp is left as a leaf node. The two nodes at the

Level 4 are removed. On calculation, the error of the leaf node Imp is found to be four.

Figure 2.9 illustrates the resulting decision tree.


Figure 2.9 Decision tree when node Imp is selected as a leaf node.

After determining the error count through the two methods, the next step is to

choose one of the methods. The final decision tree is constructed based on this choice.

Table 2.16 shows the two methods and their respective error counts for attribute A3 of the decision tree in Figure 2.8.

Table 2.16: Error count of two pruning methods for the decision tree in Fig. 2.8

Method   Type of Pruning               Error Count
(a)      Leaf Based Pruning            4
(b)      Decision Node Based Pruning   4

The error count of method (a) is equal to that of method (b). Since turning the node into a leaf does not increase the error count, the node for attribute A3 is turned into a leaf node instead of remaining a decision node. This pruning has helped in reducing the size of the tree by one level. The pruning procedure is conducted at every


node of the tree to form a pruned decision tree without compromising the model accuracy.

The final tree is illustrated in Figure 2.9.
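The comparison in this example can be sketched as follows (a simplified illustration using the raw error counts of the example; full EBP in C4.5/YaDT works with estimated error rates at a given confidence level):

```python
def prune_to_leaf(leaf_errors, child_errors):
    """Decide between method (a), turning the node into a leaf, and
    method (b), keeping it as a decision node, by comparing error counts.

    The node is pruned to a leaf whenever doing so does not increase the
    number of errors.
    """
    return leaf_errors <= sum(child_errors)

# Node Imp from Figure 2.8: 4 errors as a leaf vs. 4 + 0 errors in its children.
print(prune_to_leaf(leaf_errors=4, child_errors=[4, 0]))   # True -> prune
```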

2.2.6 Optimal Parameters of YaDT algorithm

In the YaDT algorithm, the parameters whose values need to be optimal are

1. The pruning confidence level (P).

2. The minimum number of cases to split (M)

Pruning is the method of removing the least reliable branches in the tree and is measured

in confidence level from no pruning (0%) to 100%. The second parameter is the

minimum number of cases to be present in each sub-tree to initiate splitting. Together the

two parameters determine the size of the tree structure and the model error rate. In order

to find the ideal values for both the parameters, the error rate/tree size ratio is calculated.

The best parametric values are based on the highest error rate/tree size value.

2.2.7 Summary

The following is a summary of the YaDT algorithm discussed in the sections above:

• The YaDT algorithm was developed as a successor to the ID3/C4.5/EC4.5 family of decision tree algorithms.

• Divide and Conquer technique is used to split the trees into sub-trees which are

simplified through pruning.

• Entropy is used to find the uncertainty of attributes and information gain to

measure the effectiveness of the attributes in classification.


• Three different techniques are used to calculate information gain for continuous

attributes.

• Error-based pruning (EBP) is used to prune the trees. It is helpful in reducing

overfitting.

• YaDT algorithm performs with low construction time and memory usage.

2.3 Random Forest Algorithm

A data mining technique called "ensemble learning" has been developed, consisting of methods that generate many classifiers, such as decision trees, and aggregate the results by taking a weighted vote of their predictions. Ensemble learning provides a more reliable mapping, obtained by combining the outputs of multiple classifiers. Figure 2.10 illustrates ensemble learning.

(Diagram: an input is passed to Classifier 1, Classifier 2, and Classifier 3; their outputs are merged by a combiner to produce the final classifier C*.)

Fig 2.10: Ensemble Learning

Ensemble learning strategies demonstrate impressive abilities to improve the

prediction accuracy of base learning algorithms [Breiman 1993; Oza & Tumer 1999;


Tumer & Oza 1999; Wolpert 1992]. Boosting and Bagging are some of the most studied

ensemble learning algorithms.

2.3.1 Introduction

In 1984, Dr. Leo Breiman developed the CART algorithm [CART] based on the

classification and regression methodology. Techniques like pruning to reduce

overgrowing trees and binary-split search to avoid fragmented trees are incorporated in

the CART algorithm. It uses an information measure, the gini index, to determine the best split at each level.

In 2001, Dr. Leo Breiman developed the Random Forest Algorithm [BRE2001]

which is a collection of many CART trees that are individually developed. The

predictions of all trees are subjected to a voting procedure which aggregates the results.

The voting determines the prediction of the final class of the algorithm. This voting is

responsible for classifying Random Forest as a type of ensemble learning.

A Random Forest is a classifier consisting of a collection of tree-structured

classifiers $\{h(x, \Theta_k), k = 1, ...\}$ where the $\Theta_k$ are independently, identically distributed random trees and each tree casts a unit vote for the final classification of input x. Like

CART, Random Forest uses the gini index for determining the final class in each tree.

The final class of each tree is aggregated and voted by weighted values to construct the

final classifier.


2.3.2 Gini Index

Random Forest uses the gini index taken from the CART learning system to

construct decision trees. The gini index of node impurity is the measure most commonly

chosen for classification-type problems. If a dataset T contains examples from n classes, the gini index Gini(T) is defined as

$Gini(T) = 1 - \sum_{j=1}^{n} p_j^2$

where $p_j$ is the relative frequency of class j in T.

If a dataset T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, then the gini index of the split data, $Gini_{split}(T)$, is defined as

$Gini_{split}(T) = \frac{N_1}{N} Gini(T_1) + \frac{N_2}{N} Gini(T_2)$

The attribute value that provides the smallest $Gini_{split}(T)$ is chosen to split the node.
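A short sketch (not from the thesis) of the two gini formulas above; it reproduces the Gini_split value of 0.4 that is worked out for the HOME_TYPE ≤ 6 split in the next section (Table 2.18).

```python
def gini(class_counts):
    """Gini index of a node: 1 minus the sum of squared class frequencies."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(left_counts, right_counts):
    """Weighted gini index of a binary split, Gini_split(T)."""
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return n1 / n * gini(left_counts) + n2 / n * gini(right_counts)

# HOME_TYPE <= 6 (Table 2.18): left node = 1 zero / 0 ones, right = 2 / 2.
print(gini_split([1, 0], [2, 2]))   # 0.4
```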

2.3.3 Operation of Random Forest

The working of random forest algorithm is as follows.

1. A random seed is chosen which pulls out at random a collection of

samples from the training dataset while maintaining the class distribution.

2. With this selected data set, a random set of attributes from the original

data set is chosen based on user-defined values. Not all the input variables are considered, because of the enormous computation involved and the high chance of overfitting.


3. In a dataset where M is the total number of input attributes, only R attributes are chosen at random for each tree, where R < M.

4. The attributes from this set create the best possible split using the gini index to develop a decision tree model. The process repeats for each of the branches until the termination condition is met: the leaves are nodes that are too small to split.

Multiple trees constituting a forest are developed with each tree randomly

selecting the attributes in the dataset. When the forest is employed in classification, each

individual tree votes for one class and the forest predicts the class that has the plurality of

votes. At the end of polling results, the class that received the most votes is elected the

most important predictor.
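As a rough sketch of steps 1-4 and the voting (illustrative code only; it uses scikit-learn's gini-based decision tree as a stand-in for a single CART tree and plain random sampling rather than the class-distribution-preserving sampling described in step 1):

```python
import random
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_random_forest(X, y, n_trees=6, n_features=2, seed=5):
    """Grow n_trees gini-based trees, each on a random sample of the records
    and a random subset of R < M attributes (steps 1-4 above)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in range(len(X))]   # random sample
        cols = rng.sample(range(X.shape[1]), n_features)        # R attributes
        tree = DecisionTreeClassifier(criterion="gini")
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def forest_predict(forest, x):
    """Each tree casts one vote; the class with the most votes is predicted."""
    votes = [int(tree.predict(x[cols].reshape(1, -1))[0]) for tree, cols in forest]
    return max(set(votes), key=votes.count)

# Hypothetical usage with the Insurance data of Table 2.17:
X = np.array([[31, 3], [30, 1], [6, 2], [15, 4], [10, 4]])
y = np.array([1, 0, 0, 1, 0])
forest = build_random_forest(X, y)
print(forest_predict(forest, np.array([15, 4])))
```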

Example:

Each random forest tree is constructed using the CART methodology which uses

the gini index. However, in each tree only some of the total number of attributes in the dataset are utilized. Therefore this tree-building procedure is used repeatedly to construct the remaining trees in the random forest using different sets of attributes. The example below shows the construction of a single tree using the abridged Insurance dataset in Table 2.17.

Only two of the original four attributes are chosen for this tree construction.


Table 2.17: Insurance Dataset [UCI2000]

RECORD   HOME_TYPE   SALARY   CLASS
1        31          3        1
2        30          1        0
3        6           2        0
4        15          4        1
5        10          4        0

CART splits the nodes by using a binary method where each decision node has

only two distinct child nodes. In this example, the prediction class is to determine the

occurrence of 0 or 1. There are five distinct records with values for two attributes,

HOME_TYPE and SALARY.

Assume that the first attribute to be split is the HOME_TYPE attribute. The possible splits for the HOME_TYPE attribute in the left node range over 6 ≤ x ≤ 31, where x is the

split value. All the other values at each split form the right child node. The possible splits

for the HOME_TYPE attributes in the dataset are HOME_TYPE ≤ 6, HOME_TYPE ≤

10, HOME_TYPE ≤ 15, HOME_TYPE ≤ 30, and HOME_TYPE ≤ 31. Taking the first

split, the gini index is calculated as follows.

Table 2.18: Partitions after the Binary Split on HOME_TYPE ≤ 6 by the Random Forest

                 Number of records
Attribute        Zero (0)   One (1)   Total (N = 5)
HOME_TYPE ≤ 6    1          0         n1 = 1
HOME_TYPE > 6    2          2         n2 = 4

Then $Gini(D_1)$, $Gini(D_2)$, and $Gini_{split}$ are calculated as follows:

$Gini(HOME\_TYPE \le 6) = 1 - (1^2 + 0^2) = 0$

$Gini(HOME\_TYPE > 6) = 1 - \left(\left(\frac{2}{4}\right)^2 + \left(\frac{2}{4}\right)^2\right) = 0.5$

$Gini_{split} = \frac{1}{5} \times 0 + \frac{4}{5} \times 0.5 = 0.4$

In the next step, the data set at HOME_TYPE ≤ 10 is split and tabulated in Table 2.19.

Table 2.19: Partitions after the Binary Split on HOME_TYPE ≤ 10 by the Random Forest

                 Number of records
Attribute        Zero (0)   One (1)   Total (N = 5)
HOME_TYPE ≤ 10   2          0         n1 = 2
HOME_TYPE > 10   1          2         n2 = 3

Then $Gini(D_1)$, $Gini(D_2)$, and $Gini_{split}$ are calculated as follows:

$Gini(HOME\_TYPE \le 10) = 1 - (1^2 + 0^2) = 0$

$Gini(HOME\_TYPE > 10) = 1 - \left(\left(\frac{1}{3}\right)^2 + \left(\frac{2}{3}\right)^2\right) = 0.4452$

$Gini_{split} = \frac{2}{5} \times 0 + \frac{3}{5} \times 0.4452 = 0.2671$

Table 2.20 tabulates the gini index value for the HOME_TYPE attribute at all

possible splits.

Table 2.20: Gini_split Calculations for the HOME_TYPE Attribute

Gini_split                      Value
Gini_split(HOME_TYPE ≤ 6)       0.4000
Gini_split(HOME_TYPE ≤ 10)      0.2671
Gini_split(HOME_TYPE ≤ 15)      0.4671
Gini_split(HOME_TYPE ≤ 30)      0.3000
Gini_split(HOME_TYPE ≤ 31)      0.4800

In Random Forest, the split at which the gini index is lowest is chosen as the split value. However, since the values of the HOME_TYPE attribute are continuous in nature, the midpoint of every pair of consecutive values is chosen as the best split point. In this example, the split HOME_TYPE ≤ 10 has the lowest value. The best split in our example, therefore, is at

$HOME\_TYPE = \frac{10 + 15}{2} = 12.5$

instead of at HOME_TYPE ≤ 10. The decision tree after the first split is shown in Figure 2.11.

Figure 2.11 Partially Constructed Decision Tree by the CART Algorithm

This procedure is repeated for the remaining attributes in the dataset. In this

example, the gini index values of the second attribute SALARY are calculated. The lowest

value of the gini index is chosen as the best split for the attribute. The final decision tree

is shown in Figure 2.12.

[Figure 2.11 shows the partially constructed tree: the root node HOME_TYPE splits at 12.5, with the ≤ 12.5 branch leading to CLASS 0 and the > 12.5 branch leading to CLASS 1.]


Figure 2.12 Fully constructed decision tree by the CART/Random Forest.

The decision rules for the decision tree illustrated in Figure 2.12 are

• If HOME_TYPE ≤ 12.5, then the Class value is 0.

• If HOME_TYPE > 12.5 and SALARY is 3, then the Class value is 1.

• If HOME_TYPE > 12.5 and SALARY is 1, 2, or 4, then the Class value is 0.

This is a single tree construction using the CART algorithm. Random forest follows the same methodology and constructs multiple trees for the forest using different sets of attributes. Random forest uses a part of the training set to calculate the model error rate through an inbuilt error estimate, the out-of-bag (OOB) error estimate. The training set is split into a pair of training and testing sets at a 66.7:33.3 ratio. While the new training set is used to construct the trees, the testing set is used to calculate the OOB error estimate. This value is necessary to choose the best class. The OOB error estimate affirms the model stability and supplies a weight for each tree. When the forest is employed in classification, each individual tree votes for one class and the forest predicts the class that has the plurality of votes. On polling the results, the class that receives the most votes is elected the most important predictor.

[Figure 2.12 shows the fully constructed tree: the root node HOME_TYPE splits at 12.5; the ≤ 12.5 branch leads to CLASS 0, while the > 12.5 branch leads to a SALARY node whose {1, 2, 4} branch leads to CLASS 0 and whose {3} branch leads to CLASS 1.]


Table 2.23 Random Tree Values for a Single Record (HOME_TYPE = 15, SALARY = 4)

TREE     CLASS   WEIGHT
Tree 1   1       5
Tree 2   0       5
Tree 3   1       4
Tree 4   1       2
Tree 5   0       4
Tree 6   0       4

Value of Class 1 = 5 + 4 + 2 = 11
Value of Class 0 = 5 + 4 + 4 = 13

Consider a single record from the Insurance Dataset in Table 2.17. The record is applied to the random forest of trees to determine the record's class. The class chosen by each of the six random trees is listed in Table 2.23. The table also contains the weight of a class in each tree, which is determined by the OOB error value. The weighted voting is conducted by summing the weight values for the respective class, and the class with the highest value is chosen as the best class. In this example, class 0 is chosen over class 1. This voting procedure is continued for all the cases over the six different random trees. Finally, the winner is the class which has the most votes among all the cases. This


method is repeated for all of the important attributes to form the final random forest

classifier.
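The weighted vote in Table 2.23 can be expressed in a few lines of Python. The sketch below is not from the thesis; it simply sums the OOB-derived weights per class for the six trees and picks the class with the highest total.

```python
# Minimal sketch (not from the thesis): weighted voting over the six random trees
# of Table 2.23 for the single record HOME_TYPE = 15, SALARY = 4.
from collections import defaultdict

# (predicted class, weight derived from the OOB error estimate) for Trees 1..6
tree_votes = [(1, 5), (0, 5), (1, 4), (1, 2), (0, 4), (0, 4)]

totals = defaultdict(int)
for predicted_class, weight in tree_votes:
    totals[predicted_class] += weight

print(dict(totals))                           # {1: 11, 0: 13}
winner = max(totals, key=totals.get)
print("class chosen by the forest:", winner)  # class 0, as in the thesis example
```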

2.3.7 Random Forest

To obtain optimum performance from the Random Forest algorithm, three

parameters have to be set at appropriate values in consideration of time and system

resources. The three parameters are

1. Number of trees

2. Random number seed

3. Number of selected features.

The value of the random seed is selected to be 5. It provides enough randomness in the selection of records in the datasets. In random forest, a larger number of trees means a better classification rate. However, since large data sets are used in this research, the number of trees developed is limited by the available system resources. Any larger value does not improve the accuracy immensely, but utilizes enormous system resources. Therefore, the maximum allowable number of 50 trees is used. Theoretically, after a certain number of trees, the error rate steadies and varies minimally.

The next parameter is the number of features, which is the size of the subset of variables selected from all the variables in the data set. The number was varied from 1 to the maximum number. [RF2002] suggests that log2(M) + 1 provides the best overall performance. Therefore, log2(M) + 1 features of every data set are used for constructing each decision tree.
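As an illustration of this rule, the sketch below (not from the thesis) computes log2(M) + 1 for a few attribute counts taken from Table 3.2; it assumes the value is truncated to an integer, which is a common convention but not stated explicitly in the thesis.

```python
# Minimal sketch (not from the thesis): number of randomly selected features
# per tree using the log2(M) + 1 rule, for attribute counts from Table 3.2.
import math

attribute_counts = {
    "Chess": 6,
    "Connect-4 Opening": 42,
    "Letter Recognition": 16,
    "Opt Digits": 64,
    "Spambase": 57,
}

for name, m in attribute_counts.items():
    # assumption: the result is truncated to an integer
    features = int(math.log2(m)) + 1
    print(f"{name}: M = {m}, features per tree = {features}")
```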


2.3.8 Summary

Below is the summary of the Random Forest algorithm discussed in this section.

• Random Forest algorithm is an ensemble method of predictive modeling.

• Gini index is used as the splitting criterion among the attributes.

• A large number of trees, or predictors, is generated to create a forest where individual votes are aggregated to predict the classes.

• A random number of attributes R is chosen as a parameter, where R < M and M is the number of input variables. The default value of R is log2(M) + 1.

• The accuracy and stability of the model is determined by OOB (Out-Of-Bag)

estimates.

• Random Forest algorithm performs very well even with data sets with badly

distributed attributes.

• Random Forest algorithm does not overfit and is robust to noise.

2.4 Error Rate Estimation

To determine a model's performance, it is important to have a common measure for all the data mining models. The most common measure is to calculate the error rate for each algorithm and compare the error rates to determine the best model. The error rate is calculated by applying the testing data to the model developed using the training set. The following sections discuss the three main error estimation techniques.


2.4.1 Holdout Method

The holdout method is the simplest technique for estimating error rates. To calculate the error rate, the data set is split into training data and testing data in a given ratio. The training set is used to train the classifier, and the testing set is used to estimate the error rate of the trained classifier. The estimate is not very convincing because only one trial is conducted.

Figure 2.13: Holdout Method (the total number of examples is split into a training set and a testing set)

2.4.2 Repeated Hold-out Method

The holdout technique can be made more reliable by repeating the process multiple times, generating different pairs of training and testing partitions from the data set. This method is known as the repeated hold-out method. At each step, a certain proportion is randomly selected for the model to evaluate the error rate. The error rates generated in the different iterations are averaged to yield an overall error rate. It is easy to implement and consumes moderate computational resources.
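A minimal Python sketch of the repeated hold-out procedure is shown below. It is not the code used in the thesis: scikit-learn's decision tree stands in for YaDT and Random Forest, and the dataset, split ratio, and number of repetitions are only illustrative.

```python
# Minimal sketch (not the thesis code): repeated hold-out error estimation.
# scikit-learn's DecisionTreeClassifier stands in for the classifiers studied here.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)      # illustrative stand-in dataset
train_ratio = 0.667                      # the widely used 66.7:33.3 split
repetitions = 5                          # the thesis averages five pairs per ratio

error_rates = []
for seed in range(repetitions):
    # a different random train/test partition on every repetition
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_ratio, random_state=seed)
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    error_rates.append(1.0 - model.score(X_te, y_te))

overall = sum(error_rates) / len(error_rates)
print(f"average error rate over {repetitions} repetitions: {overall:.4f}")
```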

2.4.3 Cross-Validation Method

In K-fold cross-validation, the original sample is partitioned into K sub-samples.

Of the K sub-samples, a single sub-sample is retained as the validation data for testing the

model, and the remaining K − 1 sub-samples are used as training data. The cross-


validation process is then repeated K times, with each of the K sub-samples used exactly once as the validation data. Then, the error rates are averaged to yield an overall error rate at the end of the K executions. Usually a ten-fold cross-validation is implemented, where the average error rate is determined at the completion of 10 executions.
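For comparison, a minimal sketch of 10-fold cross-validation is given below; again scikit-learn is used as a stand-in, not the software used in the thesis, and the dataset is illustrative.

```python
# Minimal sketch (not the thesis code): 10-fold cross-validation error estimate.
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)          # illustrative stand-in dataset
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kfold.split(X):
    # each of the K sub-samples is used exactly once as the validation data
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    fold_errors.append(1.0 - model.score(X[test_idx], y[test_idx]))

print(f"10-fold cross-validation error rate: {sum(fold_errors) / len(fold_errors):.4f}")
```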

2.5 Evaluation of Algorithm Performance

The error rates of different algorithms are used as a benchmark for comparing their performance. However, it is not appropriate to compare the raw error rates of different algorithms directly. A conclusion derived from such comparisons could be deceptive; therefore, it is essential to conduct a statistical test between the algorithms to determine the significance of their differences. In this research, the statistical two-tailed test method is used to determine whether significant differences exist between the error rates of different algorithms.

2.5.1 Two-Tailed Test Comparison

The two-tailed test is the test of a given hypothesis when two samples being

compared are related in some way. Therefore, error rates of two classification algorithms

which use the same dataset can be subjected to the two-tailed test to determine their

statistical significance. The evaluation of the two classification models can be articulated

in the form of a hypothesis [ZHENG2004] as below:

There is no significant difference in the error rate of two classifiers, M1 and M2,

built with the same training data.

The above hypothesis is specifically developed for the classification algorithm

comparison. It is stated to determine the statistical significance of two classifiers under a


certain confidence limit. The value T is used to determine the statistical significance

between the two classifiers. This value determines if the hypothesis should be accepted or

not.

\[ T = \frac{E_1 - E_2}{\sqrt{q(1-q)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (2.12) \]

where:

E1 is the error rate for classifier M1;

E2 is the error rate for classifier M2;

q = (E1 + E2) / 2;

n1 is the number of samples in test set A;

n2 is the number of samples in test set B.

Confidence limits are expressed in terms of a confidence coefficient. Although the choice of confidence coefficient is somewhat arbitrary, in practice 90%, 95%, and 99% intervals are often used, with 95% being the most common. The 95% level is also used in [ZHENG2004] and is therefore used for comparison in this research.

In order to be 95% confident that the null hypothesis is correctly rejected, the absolute value of T has to be greater than or equal to 2 when the sample size is greater than 120. If |T| < 2, then 95% confidence exists that accepting the null hypothesis is the right decision and there is no significant difference between the two experiments. If |T| ≥ 2, then 95% confidence exists that there is a significant difference between the two classifiers.


To elucidate the working of this method, consider two classification models C1 and C2 with misclassification rates of 15% and 20%, respectively. They have been tested on test data which contains 250 instances.

In order to determine the performance difference of the two models, substitute the values into formula (2.12):

E1 = 0.15, E2 = 0.20, n = 250

q = (0.15 + 0.20) / 2 = 0.175

1 − q = 0.825

\[ |T| = \frac{|0.15 - 0.20|}{\sqrt{0.175 \times 0.825 \times \dfrac{2}{250}}} = 1.47 \]

Since |T| < 2, the difference between the model performances is not considered significant. This conclusion can be made with 95% confidence even though the absolute difference in error rate is as much as 5%.
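The calculation above can be checked with a short script. The sketch below (not from the thesis) implements the T statistic of equation (2.12) for two classifiers tested on equally sized test sets.

```python
# Minimal sketch (not from the thesis): two-tailed test statistic of equation (2.12)
# for two classifiers evaluated on test sets of equal size n.
import math

def t_statistic(e1, e2, n):
    q = (e1 + e2) / 2.0
    return (e1 - e2) / math.sqrt(q * (1.0 - q) * (2.0 / n))

# the worked example: error rates 15% and 20% on 250 test instances
t = t_statistic(0.15, 0.20, 250)
print(f"|T| = {abs(t):.2f}")   # 1.47 -> below 2, so not significant at the 95% level
```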

2.6 Related Studies

Among the different studies, the StatLog project [STATLOG] was the first and

most comprehensive research to compare classification algorithms. In the study, 23

classification algorithms were compared on 22 real-world datasets with the 10-fold cross-

validation technique. The results concluded that no algorithm can consistently outperform

the other algorithms on all datasets. However, it also found that while neural networks

and statistical algorithms tended to work well on the same kind of datasets, the machine

learning algorithms tended to form a different group performing better in other datasets.


The results and detailed analysis provided by the StatLog project are helpful for future

algorithm comparison research.

Dr. Peter Eklund improved upon the StatLog project by adding more datasets and restricting the algorithm selection to the public domain. Eklund [Eklu02] compared 12 algorithms in various ways, such as average classification accuracy, standard deviation, and the two-tailed t-test, on 29 data sets. The algorithms were divided into three groups based on their performance and accuracy. Group 1 (Decision tree, Neural networks) produces higher classification accuracy on more data sets than the algorithms in Group 2 (C4.5, C4.5 rules, CN2, backpropagation) and Group 3 (k-nearest neighbor, oblique classifier, Q*, and Radial basis). The algorithms of Group 2 produce higher classification accuracy on more data sets than the algorithms in Group 3, and, as in StatLog, no algorithm was significantly different from the other algorithms within its own group. Another important

conclusion is that the learning datasets contain many irrelevant attributes which

undermine the efficiency of all algorithms. The research suggests that an algorithm which

can effectively ignore the irrelevant attributes can perform better.

[SEGAL2004] discusses the benchmarking of the Random Forest algorithm. In this study, 18 datasets from the UCI repository are used to study the Random Forest algorithm. It supports the evidence that random forests do not overfit and in fact prune their trees to achieve superior performance. The algorithm performed well due to its ability to disregard unnecessary attributes and the use of the recommended default settings for the tuning parameters.

All the previous studies researched the performance of each algorithm over various datasets without varying the sizes of the training and testing datasets.


[ZHENG2004] discusses algorithm performance based on training dataset size. Each of seven datasets is split at eight different ratios, from 50:50 to 97.5:2.5, where the former is the training set size and the latter is the testing set size. These split datasets are then

used by two algorithms, C4.5 and CART. The results showed that the type of the dataset

is the most important factor affecting decision tree algorithm’s performance. In addition,

for large data sets with more than 1000 samples, better results are achieved at the 80:20

splitting ratio rather than the accepted 66.7:33.3 splitting ratio. Any more increase in the

size leads to unstable model accuracy. Although the two algorithms have similar

accuracy in small sets, they differ in large datasets; therefore, large data sets (>1000) are

more suitable for comparing different algorithms. Since the algorithms used in Zheng’s

research are old, further study on similar but contemporary algorithms would be helpful.

2.7 Summary

This chapter presents the literature review on the background and relevant

research of Classification mining and the two algorithms, YaDT and Random Forest. The

following is the summary of the literature review:

• Classification modeling is used to predict a target categorical variable

which can be partitioned into n number of classes or categories.

• Decision tree is a widely used classification method due to its ease of

development and straightforward interpretability.

• YaDT is a from-scratch main-memory implementation of the C4.5-like

decision tree algorithm, which yields entropy-based decision trees.


• Random Forest algorithm is an ensemble method of predictive modeling

which develops CART-based trees.

• Repeated Hold-out method is used to calculate the algorithm’s error rate.

• Two-tailed test is the statistical method to compare two algorithms and

determine their statistical significance.


CHAPTER III

RESEARCH METHODOLOGY

This chapter details the work carried out in this thesis research and is organized into

seven basic sections. Section 3.1 introduces the basic methodology of the research.

Section 3.2 discusses the selection criteria of the algorithms. Section 3.3 details the

collection and preparation of the datasets. Section 3.4 explains the construction of the

different models implemented in the research. Section 3.5 emphasizes the relationship of

training data size to error rate. Section 3.6 discusses the statistical comparison of the

algorithms. Section 3.7 summarizes the important topics covered in this chapter.

3.1 Research Methodology

The knowledge discovery process in data mining consists primarily of the

following processes: Data collection, data preparation, model construction and result

analysis. Data collection is the step of determining the right type of data set according to size, usage, and relevance, and collecting the data from reliable sources. Data preparation is the process of formatting the data to the algorithm's specifications and cleaning it to remove outliers. Model construction consists of training the model on the datasets and testing the

model to obtain the accuracy rate and other statistical results. Result analysis involves the

interpretation of the model results which is used to derive conclusions for the end user.

For the most part, this research adheres to this methodology; however, minor changes

like splitting of datasets and statistical analysis have been made to suit the research work.


Figure 3.1 Research Methodology [Adapted from ZHENG2004, Pg. 52]

3.2 Software Selection

The datasets provided are in a continuous format where the data can have any value in an interval of real numbers. Decision tree models like YaDT and Random Forest, which can employ data in continuous format, are used for this thesis research. Individual software packages which implement the two algorithms have been selected for this research. YaDT is designed and implemented in C++ with a strong emphasis on efficiency (time/space) and portability (Windows/Linux). It is made freely available by the architect of the algorithm, Dr. Salvatore Ruggieri of the University of Pisa [YaDT2004]. The software for the Random Forest algorithm is based on the RandomForest class available in the WEKA data mining suite [WEKA].

The system specifications for running the two algorithms are specified in Table

3.1.

[Figure 3.1 depicts the methodology as a flowchart: data collection (1. data collection), data preparation (2. data cleaning, 3. re-formatting the data set, 4. splitting the data set), and model construction (5. model training, 6. model testing), whose results feed two analyses: the effect of training data size on model accuracy and the statistical comparison of the algorithms.]


Table 3.1: Machine Specifications

Specification      YaDT Algorithm               Random Forest
Machine Name       Home System                  pleione.hpcc.ttu.edu
Type               PC                           Workstation Cluster
Operating System   Microsoft Windows XP Home    IRIX
RAM                512 MB                       256 MB
CPU Type           Intel Celeron 2.6 GHz        SGI Origin 2000

3.3 Data Collection

In the majority of research studies, large datasets are preferred over the small

sized data sets (<1000 examples in each data set). This is because the models constructed

with small sets can be flawed and cannot be applied to real-world problems. Therefore

relatively large data sets (>1000 samples in each data set) are required to develop a good

classification model which generates high-quality decision rules.

In order to compare the results of this thesis with the previous study, the data sets of [ZHENG2004] are used in this thesis research. In Table 3.2, the major characteristics are listed, while the detailed description, including the distribution of classes, is given in APPENDIX B.

Table 3.2: Characteristics of the selected data sets.


Data set                                                   Number of Samples   Number of Attributes   Number of Classes   Application Area
Chess                                                      28056               6                      17                  Strategy games
Connect-4 Opening                                          67557               42                     3                   Strategy games
Letter Recognition                                         20000               16                     17                  Statistical
Optical Recognition of Handwritten Digits (Opt Digits)     5620                64                     10                  Multimedia
Pen-Based Recognition of Handwritten Digits (Pen Digits)   10992               16                     10                  Multimedia
Landsat Satellite Image                                    6435                36                     6                   Spatial
Spambase                                                   4601                57                     2                   Statistical

3.3.1 Data Preparation

The data sets are available at the desired split ratios of 50:50, 60:40, 66.7:33.3, 70:30, 80:20, 90:10, 95:5, and 97.5:2.5, where the first number is the training set proportion and the latter is the testing set proportion. A 50:50 ratio means that 50% of the original dataset is used as a training dataset while the remaining 50% is employed as a testing dataset. The minimum ratio selected is 50% because training data at a lesser ratio would not be sufficient to develop an accurate model and would diminish its effectiveness. The widely applied ratio of 66.7:33.3 is also chosen, maintaining the common research practice of constructing a model and


calculating its accuracy. The final ratio chosen is 97.5:2.5 so as to still provide a reasonable number of samples in the testing data set.

In [ZHENG2004], a C# program is run five times on each original dataset to generate five pairs of training and testing data for each ratio. Each pair generates an error rate for both algorithms, YaDT and Random Forest, and the five rates are averaged and used as the actual error rate at each specific splitting ratio. For example, five pairs at the 90:10 ratio are generated for the SpamBase dataset and are applied to the Random Forest algorithm to compute the error rates. The five error rates are averaged to form an overall error rate of Random Forest for the SpamBase dataset at the 90:10 splitting ratio.
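The splitting procedure can be sketched as follows; the thesis relied on a C# program from [ZHENG2004], so the Python below is only an illustration of the idea, with scikit-learn and its digits dataset as stand-ins.

```python
# Minimal sketch (not the thesis C# program): generate five train/test pairs
# of a dataset at one splitting ratio; each pair later yields one error rate,
# and the five error rates are averaged.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)       # illustrative stand-in dataset
ratio = 0.90                              # e.g. the 90:10 splitting ratio

pairs = []
for seed in range(5):                     # five pairs per ratio, as in the thesis
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=ratio, random_state=seed)
    pairs.append((X_tr, y_tr, X_te, y_te))
    print(f"pair {seed + 1}: {len(y_tr)} training samples, {len(y_te)} testing samples")
```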

The two algorithms require data sets in specific formats that suit their operations. YaDT requires the data sets to be in the plain text format available in the UCI data repository, while the RandomForest class in the WEKA suite requires the data sets to be in ARFF format [WEKA].

3.4 Model Construction

Model construction based on the two algorithms, YaDT and Random Forest, consists of model training and model testing. Although the functioning of the two models is different, the general procedure of decision tree development is the same and, therefore, only YaDT is discussed in the following section.

An important part of the development involved determining the ideal parameter values of each algorithm to obtain maximum performance. Experiments are conducted in YaDT to determine the pruning confidence value and the minimum number of cases to split.

Pruning is the method of removing the least reliable branches in the tree and is


measured in confidence level from no pruning (0%) to 100%. The second parameter is

the minimum number of cases to be present in each sub-tree. This value specifies the

number of cases required at each node to initiate further splitting. Together the two

parameters determine the size of the tree structure and the model error rate.

While the pruning confidence ranged from no pruning to 100%, the minimum number of cases was tried from 2 to 20. Since the error rate is similar across all settings, the setting with the smallest tree size is chosen. Smaller trees are more stable, and therefore 20% pruning with a minimum of 2 cases is chosen. The results for the Letter Recognition data set are shown in Table 3.3, while the remaining results are in APPENDIX A.

Table 3.3 Pruning Parameters of the YaDT Algorithm for the Letter Recognition Data Set

Number   Pruning Confidence   Minimum Number of Cases   Tree Size   Error %
1        No pruning           -                         3549        3.36
2        20%                  2                         2395        3.80
3        40%                  2                         2509        3.51
4        60%                  2                         2573        3.40
5        80%                  2                         2597        3.36
6        100%                 2                         2597        3.36

In the Random Forest algorithm, three parameters are calibrated: the number of trees, the number of features, and the random seed. During experimentation, the number of trees was increased up to 50, after which no more trees could be built with the available system resources. Although the selection of the random seed is a minor choice, it is set to 5 instead of the default value of 1. As suggested in [BRE], the number of attributes for each data set is calculated as log2(M) + 1, where M is


the number of input attributes of the respective data set.

3.4.1 Model Training

The data set is split at the given ratios to form the training and testing data sets. The process of constructing a model from the provided training set is called model training. The training set is supplied to YaDT (and to the Random Forest algorithm) to construct each model independently. This course of action is followed by model testing.

3.4.2 Model Testing

The process of obtaining the error rate of a model depending upon the provided

testing set is called model testing. This is very important in determining the effectiveness

of the constructed model. The values provided can help in improving the development

process by adjusting the algorithm’s parameters to get the best possible results.

3.5 Relationship of Training data to Error rate

In the previous research work, data sets were commonly split into training data

and testing data with a fixed ratio (66.67:33.33). While the training data is used to

construct the model, the testing data is used to evaluate the model’s accuracy. Most

studies use this ratio (66.67:33.33) citing common practice, but do not provide valuable insight into why this particular ratio is preferred. Few studies have examined the relationship of training data size to error rate.

This thesis research concentrates on the effect of training data size on model error rate. The splitting ratios and averaged model error rates are plotted in xy graphs with


the splitting ratios on the x-axis and the error rates on the y-axis for visualization purposes. This analysis is applied to all four algorithms to draw conclusions about the models. The data and plots used in this study are shown in Chapter IV and the appendices.

3.6 Comparison of four algorithms

The difference in error rates of different algorithms might be large, but it is

important to determine if they are statistically significant. In Chapter II, a statistical

method, the two-tailed test, is introduced which is used to compare the error rate

generated from C4.5, CART, YaDT and RandomForest algorithms. The mathematical

equation of the two-tailed test for comparing two algorithms [ZHENG2004] is

\[ T = \frac{E_1 - E_2}{\sqrt{q(1-q)\,\dfrac{2}{n}}} \]

This two-tailed test is conducted at the 95% confidence level. If |T| < 2, then 95% confidence exists that there is no significant difference between the two algorithms. If |T| ≥ 2, then 95% confidence exists that there is a significant difference between the two algorithms.

In the previous studies, investigations were carried out to determine the

relationship of training data size to error rate for C4.5 and CART at different splitting

ratios. Following in the same direction, this thesis research includes two additional

algorithms with operations performed on the same datasets to generate quantitative

outcomes. They are thoroughly analyzed in this study to obtain valuable and informative

results in algorithm research as well as in the decision tree model induction.


3.7 Summary

The processes consisting of data collection, data preparation, model construction,

and results analysis are the major phases in the data mining process. Essential changes have been made to these procedures to make them suitable for this research study. Below is a list of topics introduced in this research methodology:

1. Software selection. Two contemporary algorithms YaDT and RandomForest

are selected bearing in mind their accuracy, accessibility, and usability.

2. Data collection. The large datasets are collected from the Machine Learning Repository at UCI [UCI2000]. They exhibit essential data qualities, being large in size, consistent in quality, and diverse in nature.

3. Data preparation. All data are re-formatted according to the software

requirements. Then the repeated split technique is applied to split each data set

into five pairs of training and testing data sets at each of the eight suggested splitting ratios, which generates a large amount of data.

4. Model construction. It consists of the model training and the model testing.

The model testing process generates the error rate that is used for the next two

studies below.

5. Relationship of training data size to error rate. The relationship of training

data size to error rate is shown by plotting the error rates against the eight splitting ratios.

6. Comparison of the four algorithms. The results of C4.5 and CART, along with those of the two selected software packages, are subjected to a statistical test called the two-tailed test. This


technique is used to measure if the differences between two error rates from

the four algorithms are statistically significant.


CHAPTER IV

RESEARCH RESULTS

This chapter presents the results of the experiments described in the previous

chapter. Section 4.1 discusses the data preparation procedures along with splitting results.

Section 4.2 features the results of finding optimal parameter values. Section 4.3

summarizes the error rate results of the two algorithms. Section 4.4 discusses the error

rate variation among the data sets. Section 4.5 details the relationship of the training data

size to error rate in the algorithms. Section 4.6 reviews the results of two tailed test on the

four algorithms to determine their statistical significance.

4.1 Data Preparation

The seven data sets are split into five pairs of training and testing sets at eight

different splitting ratios in [ZHENG2004]. The same data set pairs are used in this

research to compare the results to the previous study. Table 4.1 illustrates the results of

splitting the OptDigits data set. The results of the remaining six data sets are shown in

Appendix D.

Table 4.1 Splitting Results of the Optical Digits Data Set


Size of Original Data Set   Suggested Splitting Ratio   Size of Five Training Sets   Size of Five Testing Sets   Avg. Splitting Ratio
5620                        50.0%                       2810                         2810                        50.00%
5620                        60.0%                       3372                         2248                        60.00%
5620                        66.7%                       3751                         1869                        66.74%
5620                        70.0%                       3935                         1685                        70.02%
5620                        80.0%                       4496                         1124                        80.00%
5620                        90.0%                       5058                         562                         90.00%
5620                        95.0%                       5338                         282                         94.98%
5620                        97.5%                       5480                         140                         97.51%

Each pair is reformatted to suit the specifications of the algorithms used in this thesis research. The data sets used by the YaDT algorithm are represented in the format described in [YaDT2004]. Since the Random Forest algorithm used in this study is part of the WEKA suite, its datasets are arranged in the ARFF format as explained in [WEKA].

4.1.1 Optimal Parameter values

The datasets have been modified to suit the requirements of the two algorithms.

The next step is to determine the optimal parameter values for the most effective use of

the algorithms. The results of parameter optimization are discussed in the following

sections.

4.1.1.1 YaDT algorithm

In YaDT algorithm, the parameters whose values need to be optimal are

1. The pruning confidence level (P).

2. The minimum number of cases to split (M)


Together the two parameters determine the size of the tree structure and the model

error rate. Since the error rates are similar, the tree with smallest size is selected. Table

4.2 shows the best result of the YaDT algorithm at each pruning confidence level in

OptDigits dataset. The detailed results of other datasets are shown in Appendix D.

Table 4.2 Pruning Results of the OptDigits Dataset in the YaDT Algorithm

OptDigits Data Set

Number   Pruning Confidence (P)   Minimum Number of Cases (M)   Tree Size   Error Rate (%)
1        No pruning               -                             437         1.98797
2        20%                      2                             315         2.17107
3        40%                      2                             325         2.04028
4        60%                      2                             329         2.01413
5        80%                      2                             333         1.98797
6        100%                     2                             333         1.98797

The second and third columns are the varying values of the parameters in the YaDT algorithm. For example, a minimum of two cases and a pruning confidence level of 20% are used to obtain a tree size of 315 and an error rate of 2.17%. The total tree size of each model developed is noted in the fourth column, and the misclassification rate is denoted in the next column. Since the error difference is negligible in all the cases, the option with the smallest tree size is chosen because of its greater stability. It is evident from the results that the ideal value for the minimum number of cases to split is two and the ideal pruning confidence level is 20%; these values are subsequently used in this research study.


4.1.1.2 Random Forest

To obtain optimum performance from Random Forest algorithm, three parameters

have to be set at right values considering time and system resources. The three

parameters are

1. Number of trees

2. Random number seed

3. Number of selected attributes.

The value of the random seed is selected to be 5. It provides enough randomness in the selection of records in the datasets. In random forest, a larger number of trees means a better classification rate. However, since large data sets are used in this research, the number of trees developed is limited by the available system resources. Therefore, the maximum allowable number of 50 trees is used; any larger value does not improve the accuracy immensely, but utilizes enormous system resources. Theoretically, after a certain number of trees, the error rate steadies and varies minimally [BRE2001].

The next parameter is the number of selected attributes, that is, the number of variables selected from all the variables in the data set. The number was varied from 1 to the maximum number. As suggested in [BRE2001], the number of attributes for each data set is calculated as log2(M) + 1, where M is the number of input attributes of the respective data set.
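A configuration equivalent to these settings can be sketched with scikit-learn's RandomForestClassifier. The thesis actually used the RandomForest class of the WEKA suite, so the block below is only an approximate illustration on a stand-in dataset, and the integer truncation of log2(M) + 1 is an assumption.

```python
# Minimal sketch (not the WEKA setup used in the thesis): a Random Forest
# configured with the three parameters discussed above.
import math
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)       # illustrative stand-in dataset
M = X.shape[1]                            # number of input attributes

forest = RandomForestClassifier(
    n_estimators=50,                      # 50 trees, the maximum used in the thesis
    max_features=int(math.log2(M)) + 1,   # log2(M) + 1 selected attributes per split
    random_state=5,                       # random seed of 5
    oob_score=True,                       # scikit-learn's built-in OOB estimate
)
forest.fit(X, y)
print(f"OOB error estimate: {1.0 - forest.oob_score_:.4f}")
```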

4.2 Error Rate Generation

With the optimal parameter values and proper formatting, the data sets are used

with the YaDT and Random Forest software. Each pair of training and testing data sets is


used to construct and test the model, generating five error rates for each data set at each splitting ratio. The five error rates are averaged in order to reduce research bias. Table 4.3 summarizes the average error rates from all seven data sets for the YaDT and Random Forest algorithms.

Table 4.3 Average Error Rate from all Data Sets by the YaDT and Random Forest Algorithms

Chess Data Set
Avg. Splitting Ratio   Error Rate by YaDT   Error Rate by Random Forest
50.00%                 32.72%               37.69%
60.00%                 33.94%               35.35%
66.70%                 36.22%               33.86%
70.00%                 38.19%               33.71%
80.00%                 41.16%               31.87%
90.00%                 36.62%               30.65%
94.99%                 32.08%               30.50%
97.49%                 29.90%               28.59%

Connect-4 Data Set
Avg. Splitting Ratio   Error Rate by YaDT   Error Rate by Random Forest
50.00%                 21.53%               19.18%
60.00%                 20.95%               18.51%
66.70%                 20.19%               18.41%
70.00%                 20.43%               18.13%
80.00%                 19.81%               18.05%
90.00%                 18.96%               17.61%
94.99%                 18.72%               17.73%
97.49%                 18.57%               16.91%

Pen Recognition Data Set
Avg. Splitting Ratio   Error Rate by YaDT   Error Rate by Random Forest
50.00%                 4.84%                1.19%
60.00%                 4.15%                1.04%
66.70%                 4.27%                1.03%
70.00%                 3.90%                1.12%
80.00%                 3.72%                0.97%
90.00%                 3.27%                1.09%
94.99%                 2.90%                0.76%
97.49%                 3.63%                0.87%

Letter Recognition Data Set
Avg. Splitting Ratio   Error Rate by YaDT   Error Rate by Random Forest
50.00%                 15.26%               4.86%
60.00%                 13.97%               4.31%
66.70%                 13.45%               4.23%
70.00%                 13.37%               3.93%
80.00%                 12.85%               3.79%
90.00%                 12.05%               2.91%
94.99%                 11.42%               3.04%
97.49%                 10.92%               3.56%

Optical Digits Data Set
Avg. Splitting Ratio   Error Rate by YaDT   Error Rate by Random Forest
50.00%                 11.49%               1.98%
60.00%                 11.21%               1.80%
66.70%                 10.76%               1.97%
70.00%                 10.97%               2.10%
80.00%                 9.64%                1.65%
90.00%                 9.35%                1.28%
94.99%                 8.93%                1.41%
97.49%                 8.28%                1.42%

Land Satellite Data Set
Avg. Splitting Ratio   Error Rate by YaDT   Error Rate by Random Forest
50.00%                 14.47%               8.91%
60.00%                 13.46%               8.59%
66.70%                 14.00%               8.70%
70.00%                 14.16%               8.48%
80.00%                 14.63%               8.40%
90.00%                 14.38%               8.19%
94.99%                 13.58%               8.78%
97.49%                 11.85%               6.41%

SpamBase Data Set
Avg. Splitting Ratio   Error Rate by YaDT   Error Rate by Random Forest
50.01%                 8.61%                4.93%
60.01%                 7.95%                4.43%
66.70%                 8.01%                4.72%
70.01%                 8.01%                5.20%
79.98%                 7.25%                4.88%
90.00%                 7.34%                4.30%
95.00%                 8.00%                5.56%
97.50%                 8.69%                4.69%


The average error rates of the two algorithms across the different splitting ratios are tabulated in Table 4.3. Although the error rates of the two algorithms follow the same trend, the absolute difference in error rate is significant in most datasets. However, as with the previously studied algorithms, the results adhere to the StatLog conclusion [STATLOG] that no algorithm can perform uniformly well on all datasets.

In [ZHENG2004], the performance difference is attributed to three factors concerning the characteristics of the datasets. The most important factor is the type of the dataset. The explanation provided is that logic datasets have higher error rates because decision tree algorithms are known not to perform well on logic datasets [Bain94]. The solutions provided to reduce the problem are boosting and increasing the number of samples. The commonly used technique to improve the performance of decision tree algorithms on logic data sets is the boosting technique. However, boosting produced a negligible change in the error rate when used with the random forest algorithm. Table 4.4 shows the performance of Random Forest on the Chess data set with the boosting technique.

Table 4.4 Results from the Chess Data Set by Random Forest with Boosting

Actual Splitting Ratio   1st Set Error Rate   2nd Set Error Rate   3rd Set Error Rate   4th Set Error Rate   5th Set Error Rate   Avg. Error Rate
50.00% (no boosting)     37.1640              37.8983              37.7201              37.6346              38.0409              37.69%
50.00% (w/ boosting)     37.1728              37.7682              37.6223              37.6346              38.0396              37.65%

The boosting technique is not available in the YaDT algorithm; therefore, modified datasets are constructed to mimic the boosting technique. The weaker samples are removed from the


original training set, and the resulting stronger dataset is applied to the YaDT algorithm, with the results tabulated in Table 4.5. While the original Chess dataset at the 50% ratio contains 14027 samples, the boosted Chess dataset contains 11370 samples.

Table 4.5 Results from the Chess Data Set by the YaDT Algorithm with Boosting

Actual Splitting Ratio   1st Set Error Rate   2nd Set Error Rate   3rd Set Error Rate   4th Set Error Rate   5th Set Error Rate   Avg. Error Rate
50.00% (no boosting)     33.4046              32.0499              30.9091              34.1533              33.1194              32.72%
50.00% (w/ boosting)     25.6535              25.0931              24.1784              25.9967              24.8361              25.15%

The other two factors are the number of pre-defined classes and the number of attributes. With large values of these factors, the size and depth of the resulting decision tree model are large, and more errors may be made during the pruning process.

All of the mentioned factors are validated by the results of this research. The two logic datasets, Chess and Connect-4, have higher error rates than the other datasets. The Pen Recognition dataset has a small number of pre-defined classes and attributes, which could be responsible for its better accuracy compared to the other object recognition datasets. However, in spite of a large number of attributes, the statistical dataset provides good prediction rates. The reason for this performance could be the class distribution, which is explained in the following sections.

The performance difference can also be explained by investigating the decision tree algorithms themselves. Decision tree algorithms are greedy algorithms,


finding the best result at one node at a time. However, such a locally optimal solution might not be ideal for the overall problem. This approach affects the construction of a decision tree, which in turn affects the error rate across different datasets. The performance can be improved by using an ensemble method of classification. Random forest, which is a type of ensemble method, provides better results than the single tree algorithm, YaDT. Ensemble methods combine different trees and reduce the bias of each decision tree to provide better predictions.

The second factor may be the class distribution of samples in the dataset. The Optical Digits and Pen Recognition datasets contain an equal distribution of samples within their respective classes. This allocation helps in tree construction, and therefore these two datasets have the lowest error rates among all datasets. In contrast, the Chess dataset has the most imbalanced class distribution and is the dataset with the highest error rates in both algorithms. Improving the class distribution in the dataset can enhance the accuracy of decision tree algorithms.

While the type of dataset remains the most important factor, two other factors have been found to affect the performance of the decision tree algorithms. The number of trees constructed by the algorithm and the class distribution are the important factors determined from this thesis research. Algorithms which improve on these factors could develop more robust and accurate decision trees.


4.3 Error Rate Variation

The general error rate generation was discussed in the previous section. The detailed results of the five error rates for the OptDigits dataset by YaDT are illustrated in Table 4.6, which shows that there is error rate variation across the different splitting ratios. To minimize this variation, each dataset is tested five times at each splitting ratio and the averaged value is used as the overall error rate.

Table 4.6 Complete Results of the OptDigits Dataset in YaDT

Optical Digits Data Set

Actual Splitting Ratio   1st Set Error Rate   2nd Set Error Rate   3rd Set Error Rate   4th Set Error Rate   5th Set Error Rate   Avg. Error Rate
50.00%                   10.7829              11.7438              12.2420              11.1388              11.5658              11.49%
60.00%                   11.1210              9.83096              11.3879              11.8772              11.8772              11.21%
66.70%                   10.7009              11.1289              9.95185              11.1289              10.9149              10.76%
70.00%                   10.8012              10.0290              10.2077              12.0475              11.8101              10.97%
80.00%                   9.07473              9.78648              8.98577              10.2313              10.1423              9.64%
90.00%                   9.60854              11.0320              8.36299              10.3203              7.47331              9.35%
94.99%                   8.86525              8.86525              10.6383              7.44681              8.86525              8.93%
97.49%                   5.71429              7.85714              9.28571              10.0000              8.57143              8.28%
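The averaging step can be checked against the first row of Table 4.6; the sketch below (not from the thesis) reproduces the 11.49% overall error rate at the 50:50 ratio.

```python
# Minimal sketch (not from the thesis): average the five error rates of the
# 50:50 row of Table 4.6 (OptDigits, YaDT) to obtain the overall error rate.
rates_50_50 = [10.7829, 11.7438, 12.2420, 11.1388, 11.5658]
average = sum(rates_50_50) / len(rates_50_50)
print(f"overall error rate at 50:50 = {average:.2f}%")   # 11.49%
```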

Since the error rate determines the model performance, this variation affects the validity of the model prediction. The variation is present in the results of both algorithms across the seven datasets; however, the variation is quite limited in Random Forest. Table 4.7 shows the error rates of OptDigits in Random Forest.


Table 4.7 Complete Results of the OptDigits Dataset in Random Forest

Optical Digits Data Set

Actual Splitting Ratio   1st Set Error Rate   2nd Set Error Rate   3rd Set Error Rate   4th Set Error Rate   5th Set Error Rate   Avg. Error Rate
50.00%                   1.4947               2.0996               1.9217               2.2776               2.0996               1.98%
60.00%                   1.5125               1.8683               1.7794               2.0018               1.8683               1.80%
66.70%                   1.8192               2.1402               2.0332               2.3007               1.6051               1.97%
70.00%                   2.3145               2.3145               1.9585               2.1365               1.7804               2.10%
80.00%                   1.2456               1.5125               1.6904               1.7794               2.0463               1.65%
90.00%                   1.9573               1.2456               1.4235               1.0676               0.7117               1.28%
94.99%                   1.4184               2.1277               1.0638               1.0638               1.4184               1.41%
97.49%                   0.0000               0.0000               1.4286               1.4286               4.2857               1.42%

Tables 4.6 and 4.7 also reveal that as the splitting ratio increases, the error rate variation also increases. The increasing variation implies that the testing dataset gradually destabilizes and the accuracy of the model is not constant. The error variation becomes especially large when the splitting ratio is greater than 70%. This trend was also observed with the C4.5 and CART algorithms. Figure 4.1 shows the error rate variation at each splitting ratio for the Connect-4 dataset by the YaDT algorithm.


Figure 4.1: Error Rate Variation from the Connect-4 Data Set by YaDT (error rate plotted against the training set number, with one series per splitting ratio from 50% to 97.5%)

Although error variation is also present in Random Forest, it is considerably lower than in YaDT. In all the datasets, the variation in Random Forest is consistently lower and the error rates are more stable compared to YaDT. Figure 4.2 illustrates the error rate variation at each splitting ratio for the Connect-4 dataset by Random Forest.

Figure 4.2: Error Rate Variation from the Connect-4 Data Set by Random Forest (error rate plotted against the training set number, with one series per splitting ratio from 50% to 97.5%)

In the previous studies, the reason for these patterns is attributed to the use of different sets of training and testing data, even though the total number of samples and the class distribution are the same. Models built on the five different training data sets can differ, leading to different error rates. Another reason provided is the decreasing size of the testing dataset at the higher splitting ratios. A small testing set may not contain enough samples to generate consistent error rates; a single error at the 97.5:2.5 splitting ratio affects the error rate more than an error at the 50:50 ratio because of the smaller testing set.

The variation in error rate could also be occurring for other reasons. One reason could be the lack of data preprocessing. Datasets generally contain noise and outliers which can distort model analysis. Their presence can skew the model accuracy as well as create variation in the error rates. Since the datasets are not pre-processed, these unnecessary values could be responsible for the error rate variation. Preprocessing the datasets by removing outliers and smoothing noisy data before model construction could significantly improve the stability of the results. The second reason could be the instability of the decision tree algorithms used in the research. Decision tree algorithms are structurally unstable classifiers and produce diversity in classifier decision boundaries. Small changes in the training dataset can result in very different models and splits. This is especially true for small datasets, which is validated by the error rate results of the Spambase dataset. To improve stability and accuracy, methods like bagging and boosting are included when developing prediction models [LAST].

Although the figures and tables in this section show that error rate variation


increases with the splitting ratio, the error rate itself decreases. Notwithstanding the error rate variation, it is evident that a model with the maximum number of training samples improves the performance of the algorithm. This leads to an important observation about the relationship of training set size to error rate, which is discussed in the next section.

4.4 Relationship of Training Data Size to Error Rate

Eight different models are developed for each dataset, with ratios varying from

50:50 to 97.5:2.5 where the former is the training set size and the latter is the testing set

size. While the training set is used to construct the model, the testing set is used to

compute its accuracy. At each splitting ratio of a dataset, the five different error rates are

averaged to form an overall error rate. This is used to reduce the variance of the error

rates. All the calculations and results are tabulated in Appendix A. The figures showing

the relationship of training data size to error rate are provided in Appendix A. Table 4.8

shows the average error rate of all four algorithms in Pen Recognition dataset.

Table 4.8 Average Error Rate of All Four Algorithms in the Pen Recognition Dataset

Pen Recognition Data Set

Avg. Splitting Ratio   C4.5    CART    YaDT    Random Forest
49.97%                 4.74%   5.32%   4.84%   1.19%
60.00%                 3.92%   4.76%   4.15%   1.04%
66.70%                 4.14%   4.50%   4.27%   1.03%
70.01%                 3.74%   4.34%   3.90%   1.12%
79.99%                 3.68%   4.08%   3.72%   0.97%
89.99%                 2.94%   3.76%   3.27%   1.09%
95.00%                 2.96%   4.22%   2.90%   0.76%
97.50%                 2.98%   3.62%   3.63%   0.87%


Figure 4.3 Relationship of Training Data Size to Error Rate – Pen Digits Data Set (error rate plotted against training data size for C4.5, CART, YaDT, and Random Forest)

Figure 4.3 plots the error rates of C4.5, CART, YaDT, and Random Forest for the Pen Recognition dataset. While the error rate of Random Forest is quite stable, the error rates of the other three algorithms consistently decrease with increasing training dataset size until the 90:10 ratio. After this ratio, the error rates vary immensely. Although training set size is important to the error computation, [VASEVAND] concludes that improving the quality of the training set is more important. Future research should work on this notion.


Figure 4.4 Relationship of Training Data Size to Error Rate – Letter Recognition Data Set (error rate plotted against training data size for C4.5, CART, YaDT, and Random Forest)

Figure 4.4 shows that while C4.5, CART, and YaDT exhibit similar behavior on most datasets, Random Forest produces better accuracy. The graph shows that the relationship of training data size to error rate follows a similar trend for the three single-tree algorithms, with the error rate fluctuating at the same splitting ratios. The reason is that each of the three algorithms develops a single decision tree, whereas Random Forest uses multiple trees to develop a more accurate and stable model. However, an unusual pattern is formed by the algorithms for the Chess data, which is illustrated in Figure 4.5.


Figure 4.5 Relationship of Training Data Size to Error Rate – Chess Data Set (error rate plotted against training data size for C4.5, CART, YaDT, and Random Forest)

Initially, the error rate of YaDT rises with increasing training set size until the 80:20 ratio. At the remaining ratios, it matches the rates of the other three algorithms. While decision trees generally perform badly on logic datasets, the possible reason for this unusual behavior in YaDT could lie in its tree pruning procedure; the extensive pruning could be responsible for underfitting in this example.

Although the figures show that the error rates of C4.5, CART, and YaDT are almost similar in all datasets and widely different from Random Forest, it is important to evaluate

the values statistically. The statistical difference between the algorithms is determined

with the two-tailed test. The test discussed in the next section is the best indicator of

performance comparison.


4.5 Statistical Performance Comparison of Four Algorithms

The absolute error rates are not the best indicator of difference in model

performance. Therefore it is important to determine the statistical significance of the

different models used in the research. The two-tailed test introduced in Chapter Two is a

commonly used statistical test for model evaluation. The mathematical equation of the

two-tailed test customized for this research [ZHENG2004] is:

\[ T = \frac{E_1 - E_2}{\sqrt{q(1-q)\,\dfrac{2}{n}}} \qquad (2.12) \]

where:

E1 is the error rate for classifier M1;

E2 is the error rate for classifier M2;

q = (E1 + E2) / 2;

n is the number of records in the testing data set.

The value of T determines the statistical significance between the two algorithms

M1 and M2. The results of C4.5 and CART used for comparison were obtained from

[ZHENG2004]. The statistical computation is conducted once for YaDT and Random

Forest against other algorithms as listed below.

1. YaDT vs. C4.5, CART, Random Forest

2. Random Forest vs. C4.5, CART, YaDT

The statistical test is conducted at the 95% confidence level. At this level, if |T| < 2, then 95% confidence exists that there is no significant difference between the two algorithms. However, if |T| ≥ 2, then 95% confidence exists that there is a significant difference between the two algorithms.


Consider the error results between the Random Forest and YaDT algorithms at the 95:5 splitting ratio of the Connect-4 data set.

1. The number of records in the testing data set: n = 14027

2. The error rate for YaDT: E1 = 38.96% = 0.3896

3. The error rate for Random Forest: E2 = 35.68% = 0.3568

4. q = (E1 + E2) / 2 = 0.3732

On substituting the values into equation (2.12):

\[ T = \frac{E_1 - E_2}{\sqrt{q(1-q)\,\dfrac{2}{n}}} = 5.68 \]

Since |T| ≥ 2, 95% confidence exists that there is a significant difference between the two algorithms at this splitting ratio.

Table 4.9 Abbreviated table summarizing the results of all data sets by YaDT vs. the

other three algorithms.


Split Ratio   Test Samples   YaDT (E1)   C4.5 (E2)   CART (E3)   Random Forest (E4)   T (E1 vs. E2)   T (E1 vs. E3)   T (E1 vs. E4)

Chess Data Set
90.00         2805           36.62       29.18       29.06       30.65                5.93            6.02            4.732
94.99         1405           32.08       28.80       29.20       30.50                1.88            1.65            0.903

Connect-4 Data Set
90.00         6756           18.96       19.08       30.08       17.61                0.17            15.02           2.0298
95.00         3378           18.72       18.50       31.70       17.73                0.23            12.28           1.0539

Letter Recognition Data Set
50.00         10001          15.26       15.16       17.42       5.20                 0.19            4.13            23.474
97.50         500            10.92       11.32       13.28       3.60                 0.201           1.14            4.460

Opt Digits Data Set
90.00         562            9.35        8.72        10.08       1.42                 0.368           0.41            5.889

Pen Digits Data Set
49.97         5499           4.84        4.74        5.32        1.01                 0.245           1.14            11.918
97.50         275            3.63        2.98        3.62        1.01                 0.426           0.006           2.040

Landsat Satellite Image Data Set
89.99         644            14.38       14.00       13.80       8.35                 0.195           0.299           3.409
97.48         162            11.85       11.36       10.84       6.54                 0.137           0.286           1.653

Spambase Data Set
79.98         921            7.25        7.34        7.00        4.84                 0.07            0.208           2.170
90.00         460            7.34        7.06        6.84        4.56                 0.16            0.295           1.782

(Error rates are in percent; each T value is computed from equation (2.12) for YaDT against the named algorithm.)
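The T values in Table 4.9 can be reproduced from the tabulated error rates and test set sizes. The sketch below (not from the thesis) recomputes the Chess 90:10 row; the small deviation in the second value comes from rounding.

```python
# Minimal sketch (not from the thesis): reproduce the T values of the
# Chess 90:10 row of Table 4.9 from the error rates and the test set size.
import math

def t_statistic(e1, e2, n):
    # equation (2.12) with two test sets of the same size n
    q = (e1 + e2) / 2.0
    return abs(e1 - e2) / math.sqrt(q * (1.0 - q) * (2.0 / n))

n = 2805                                   # test samples at the 90:10 ratio
e_yadt, e_c45, e_cart, e_rf = 0.3662, 0.2918, 0.2906, 0.3065

for name, e in [("C4.5", e_c45), ("CART", e_cart), ("Random Forest", e_rf)]:
    print(f"YaDT vs. {name}: T = {t_statistic(e_yadt, e, n):.2f}")
# prints approximately 5.93, 6.03, 4.73 -- matching Table 4.9 up to rounding
```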

Table 4.9 illustrates that YaDT does not have a statistically significant difference from the C4.5 and CART algorithms except at the lower splitting ratios of the large datasets. In the previous studies, a similar phenomenon was noticed, where the single decision tree building algorithms had little difference between them. The reason for this trend was attributed to the low E1 − E2 differences and the limited number of samples in the datasets.

However, Random Forest shows a noteworthy disparity in statistical performance compared with the other three algorithms. The error rates of Random Forest are significantly lower than those of the other algorithms. Although a low E1 − E2 difference and the dataset size could be factors contributing to the difference, the most important reason is the method of decision


tree construction in each algorithm. While only a single decision tree is built by each of the three algorithms C4.5, CART, and YaDT, Random Forest constructs multiple decision trees and decides the predicted class by voting. This method helps in reducing the error rate by obtaining a larger consensus of the decision trees.

The tree construction method is the single most important factor in the statistical difference in error rates between the implemented algorithms on the same data sets and partition ratios. The statistical results also indicate that the ensemble learning algorithm (Random Forest) performs better than the single tree building algorithms (C4.5, CART, and YaDT).

4.6 Summary

The relationship of training data size to error rate for four decision tree

algorithms, YaDT, Random Forest, C4.5 and the CART algorithm are studied with seven

data sets. Results show that:

1. For all the four algorithms, the error rate decreases as the training data size

increases. More training data, therefore, can help to improve the model

accuracy.

2. The performance of the Random Forest algorithm is significantly different from that of the other three algorithms in all the data sets except the Chess logic data set.

3. YaDT closely follows the performance of its predecessor C4.5 and outperforms CART in most tests.


4. The performance of the four algorithms varies from data set to data set. Although they have performed well on the object recognition and statistical data sets, their performance on the logic data sets is poor and unstable.

5. The single most important factor in the superior performance of Random Forest over the other three algorithms is its use of multiple decision trees to determine the final classifier.

6. The class distribution of datasets affects the performance of the decision tree

algorithms.

7. After the 70:30 splitting ratio, all four algorithms undergo noticeable variation in the error rate. The algorithms become especially unstable after the 90:10 ratio.

8. Lack of data preprocessing and the general instability of decision trees can result in increasing error rate variation.

9. Statistically, there is no significant difference between YaDT and among the

other two algorithms C4.5 and CART in most scenarios. However there is an

overwhelming difference between Random Forest algorithm and other three

algorithms.

The results lead to several conclusions and also bring up some future research

directions, which will be discussed in Chapter V.


CHAPTER V

CONCLUSIONS AND FUTURE RESEARCH

The decision tree is a classification predictive model with significant advantages

over other techniques by being easy to interpret, having quick construction, having high

accuracy and using fewer resources. The decision tree model can be developed by

algorithms like C4.5, CART, YaDT, and Random Forest where their performance is

determined by error rates. This thesis research studies the relationship of training data

size to error rate for the YaDT and Random Forest algorithms, and also compares these algorithms to the results of C4.5 and CART [ZHENG2004].

5.1 Conclusions

This thesis research has helped in drawing several conclusions as follows:

The error rates consistently decrease as the size of the training set increases, up to the 80:20 ratio. As the size increases, the training set becomes more representative of the data set, which improves model prediction. The 80:20 ratio strikes a fine balance between stability and accuracy. However, at higher ratios, the small number of samples in the testing set can cause the error rate to vary greatly.
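A minimal sketch of the splitting procedure, assuming the data set is held as a list of records and that five random training/testing partitions are generated per ratio, as in Tables B.1 through B.7 (the function and variable names are illustrative):

```python
import random

def split_at_ratio(records, ratio, seed):
    """Randomly partition records into a training set holding ratio*N
    records and a testing set holding the remainder."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Five training/testing partitions at the 80:20 ratio (e.g. SpamBase, 4601 instances)
records = list(range(4601))
partitions = [split_at_ratio(records, 0.80, seed) for seed in range(5)]
```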

The results of the research show that it is very important to set optimal parameter values for the algorithms in order to construct the best possible models. For YaDT, the ideal value of the minimum number of cases for splitting is two, and the pruning confidence level is 20%. Similarly, Random Forest utilizes ⌊log2(M) + 1⌋ selected attributes and constructs 50 trees within the available system resources in this research.
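The experiments themselves were run with the YaDT and Random Forest implementations; purely as a hedged illustration, roughly comparable settings in a modern library (scikit-learn, which was not used in this research and has no direct equivalent of a pruning confidence level) could look like the following sketch, with M standing for the number of attributes.

```python
from math import floor, log2
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

M = 16  # number of attributes, e.g. the Pen Digits data set

# Rough analogue of the YaDT setting: a minimum of two cases for splitting.
single_tree = DecisionTreeClassifier(min_samples_split=2)

# Rough analogue of the Random Forest settings: 50 trees, and
# floor(log2(M)) + 1 attributes considered at each split.
forest = RandomForestClassifier(n_estimators=50,
                                max_features=floor(log2(M)) + 1)
```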


The performance difference between the algorithms can be attributed to the greedy nature of decision tree algorithms: a decision tree algorithm determines a solution at the local level rather than from a global perspective. However, the results of Random Forest are stable, which implies that constructing multiple trees can reduce the error rates.

The results of this thesis research show that although the type of data set is the most important factor, other factors such as the number of trees and the class distribution also affect the performance of decision trees. For example, Random Forest constructs multiple trees, as opposed to the single decision tree built by C4.5, CART, and YaDT, to generate better prediction accuracy. Also, data sets with a badly skewed class distribution, such as the Chess data set, generate the highest error rates for all the algorithms.

The results of error rate generation demonstrate that the error rate of the ensemble

classifier Random Forest is less than that of the single tree classifiers, YaDT, C4.5 and

CART. For example, at a splitting ratio of 90:10 for the Letter Recognition data set, the average error rate of Random Forest is 2.91%, against 12.04% for C4.5, the best performing of the three single tree algorithms. The overall results show that the ensemble classifier performs better than the single decision tree classifiers on almost all data sets.

Although pruning is an essential tool for tree construction, extensive pruning can also skew model accuracy. This predicament is responsible for the underfitting reflected in the error rates of YaDT on the Chess data set.

The results of the thesis research show that, in most cases, there is a significant statistical difference between Random Forest and the other algorithms, YaDT, C4.5, and CART. The tree construction method could be the most important factor in this difference.

5.2 Future Research

The possible future research directions include:

Methods such as data preprocessing and boosting improve the performance of the algorithms by reducing the error rate. The effect of these methods on the relationship of training data size to error rate should be investigated further.

The effect of training data size on error rate for other data mining techniques, such as association rule mining, web mining, and text mining, needs to be determined by further research.

More studies, covering the removal of noisy values, decision cost/benefit analysis, and pruning methods, are needed to explain the significant performance difference between Random Forest and the other algorithms.


BIBLIOGRAPHY

Bain, Mike. PhD Dissertation, 1994. http://www.cse.unsw.edu.au/~mike/

Breiman, Leo. Wald I, Machine Learning, July 2002.

Breiman, Leo, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

Decision Tree Web. http://www.mindtools.com/dectree.html

Ruggieri, S. Efficient C4.5. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, Issue 2, 438-444, March-April 2002.

Eklund, Peter. A Performance Survey of Public Domain Supervised Machine Learning Algorithms. Technical report, http://www.kvocentral.com/kvopapers/7paper.pdf

Han, Jiawei. Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann Publishers, 2004.

Bryll, Robert. Attribute Bagging: Improving Accuracy of Classifier Ensembles by Using Random Feature Subsets, 2002.

Tumer, Kagan. Decimated Input Ensembles for Improved Generalization, 1999.

Schapire, Rob. http://www.cs.princeton.edu/courses/archive/spring03/cs511/, 2003.

Quinlan, Ross. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

Gehrke, Johannes. RainForest – A Framework for Fast Decision Tree Construction of Large Datasets, 1998.

Roiger, Richard, and Michael Geatz. Data Mining: A Tutorial-Based Primer. Addison-Wesley, 2003.

Michie, D., D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. London, 1994.

Segal, Mark. Machine Learning Benchmarks and Random Forest Regression, April 2004.

University of California at Irvine Data Repository. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/

Weka Tutorial. http://www.cs.waikato.ac.nz/ml/weka/, 2003.

Witten, Ian, and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.

Wolpert, David. Stacked Generalization, 1992.

Ruggieri, S. YaDT: Yet another Decision Tree builder. Proceedings of the 16th International Conference on Tools with Artificial Intelligence (ICTAI 2004), 260-265. IEEE Press, November 2004.

Zheng, Jeffrey. Study of the Relationship of Two Decision Tree Algorithms. Texas Tech, July 2004.


APPENDIX A

PERMISSIONS TO CITE


DATA SET INFORMATION – CHESS DATA SET


1. Title:

Chess Endgame Database for White King and Rook against Black King (KRK) -

Black-to-move Positions Drawn or Lost in N Moves.

2. Source Information:

-- Creators: Database generated by Michael Bain and Arthur van Hoff at the Turing

Institute, Glasgow, UK.

-- Donor: Michael Bain ([email protected]), AI Lab, Computer Science,

University of New South Wales, Sydney 2052, Australia.

(tel) +61 2 385 3939

(fax) +61 2 663 4576

-- Date: June, 1994.

3. Past Usage:

Chess endgames are complex domains which are enumerable. Endgame databases

are tables of stored game-theoretic values for the enumerated elements (legal positions)

of the domain. The game-theoretic values stored denote whether or not positions are won

for either side, or include also the depth of win (number of moves) assuming minimax-

optimal play. From the point of view of experiments on computer induction such

databases provide not only a source of examples but also an oracle (Roycroft, 1986) for

testing induced rules. However a chess endgame database differs from, say, a relational

database containing details of parts and suppliers in the following important respect. The

combinatorics of computing the required game-theoretic values for individual position


entries independently would be prohibitive. Therefore all the database entries are

generated in a single iterative process using the ``standard backup'' algorithm (Thompson,

1986).

A KRK database was described by Clarke (1977). The current database was

described and used for machine learning experiments in Bain (1992; 1994). It should be

noted that our database is not guaranteed correct, but the class distribution is the same as

Clarke's database. In (Bain 1992; 1994) the task was classification of positions in the

database as won for white in a fixed number of moves, assuming optimal play by both

sides. The problem was structured into separate sub-problems by depth-of-win ordered

draw, zero, one, ..., sixteen. When learning depth d all examples at depths > d are used as

negatives. Quinlan (1994) applied Foil to learn a complete and correct solution for this

task.

The typical complexity of induced classifiers in this domain suggests that the task

is demanding when background knowledge is restricted.

4. Relevant Information:

An Inductive Logic Programming (ILP) or relational learning framework is

assumed (Muggleton, 1992). The learning system is provided with examples of chess

positions described only by the coordinates of the pieces on the board. Background

knowledge in the form of row and column differences is also supplied. The relations

necessary to form a correct and concise classifier for the target concept must be

discovered by the learning system (the examples already provide a complete extensional

definition). The task is closely related to Quinlan's (1983) application of ID3 to classify


White King and Rook against Black King and Knight (KRKN) positions as lost 2-ply or

lost 3-ply. The framework is similar in that the example positions supply only low-grade

data. An important difference is that additional background predicates of the kind

supplied in the KRKN study via hand-crafted attributes are not provided for this KRK

domain.

5. Number of Instances: 28056

6. Number of Attributes:

There are six attribute variables and one class variable.

7. Attribute Information:

1. White King file (column)

2. White King rank (row)

3. White Rook file

4. White Rook rank

5. Black King file

6. Black King rank

7. optimal depth-of-win for White in 0 to 16 moves, otherwise drawn {draw, zero,

one, two, ..., sixteen}.

8. Missing Attribute Values: None


9. Class Distribution:

draw 2796

zero 27

one 78

two 246

three 81

four 198

five 471

six 592

seven 683

eight 1433

nine 1712

ten 1985

eleven 2854

twelve 3597

thirteen 4194

fourteen 4553

fifteen 2166

sixteen 390

Total 28056

References: (BibTeX format)

@incollection{bain_1992,


AUTHOR = "M. Bain",

TITLE = "Learning optimal chess strategies",

BOOKTITLE = "{ILP 92}: {P}roc. {I}ntl. {W}orkshop on

{I}nductive {L}ogic {P}rogramming",

YEAR = 1992,

VOLUME = "ICOT TM-1182",

EDITOR = "S. Muggleton",

PUBLISHER = "Institute for New Generation Computer Technology",

ADDRESS = "Tokyo, Japan"}

@phdthesis{bain_1994,

TITLE = "Learning {L}ogical {E}xceptions in {C}hess",

AUTHOR = "M. Bain",

SCHOOL = "University of Strathclyde",

YEAR = "1994"}

@incollection{clarke_1977,

AUTHOR = "M. R. B. Clarke",

TITLE = "A {Q}uantitative {S}tudy of {K}ing and {P}awn

{A}gainst {K}ing",

BOOKTITLE = "Advances in Computer Chess",

VOLUME = 1,

PAGES = "108--118",


EDITOR = "M. R. B. Clarke",

PUBLISHER = "Edinburgh University Press",

ADDRESS = "Edinburgh",

YEAR = "1977"}

@incollection{muggleton_1992,

AUTHOR = "S. Muggleton",

TITLE = "Inductive {L}ogic {P}rogramming",

BOOKTITLE = "Inductive {L}ogic {P}rogramming",

PAGES = "3--27",

EDITOR = "S. Muggleton",

PUBLISHER = "Academic Press",

ADDRESS = "London",

YEAR = "1992"}

@incollection{quinlan_1983,

AUTHOR = "J. R. Quinlan",

TITLE = "Learning {E}fficient {C}lassification {P}rocedures and their

{A}pplication to {C}hess {E}nd {G}ames",

YEAR = 1983,

PAGES = "464--482",

BOOKTITLE = "Machine Learning: An Artificial Intelligence

Approach",


EDITOR = "R. Michalski and J. Carbonnel and T. Mitchell",

PUBLISHER = "Tioga",

ADDRESS = "Palo Alto, CA"}

@misc{quinlan_1994,

AUTHOR = "J. R. Quinlan",

YEAR = 1994,

NOTE = "Personal Communication"}

@article{roycroft_1986,

AUTHOR = "A. J. Roycroft",

TITLE = "Database ``{O}racles'': {N}ecessary and desirable features",

JOURNAL = "International Computer Chess Association Journal",

YEAR = "1986",

VOLUME = 8,

NUMBER = 2,

PAGES = "100--104"}

@article{thompson_1986,

AUTHOR = "K. Thompson",

TITLE = "Retrograde {A}nalysis of {C}ertain {E}ndgames",

JOURNAL = "International Computer Chess Association Journal",

YEAR = "1986",


VOLUME = "8",

NUMBER = "3",

PAGES = "131--139"}


DATA SET INFORMATION – CONNECT-4 DATA SET


1. Title:

Connect-4 opening database

2. Source Information

a) Original owners of database: John Tromp ([email protected])

b) Donor of database: John Tromp ([email protected])

c) Date received: February 4, 1995

3. Past Usage: not available.

4. Relevant Information:

This database contains all legal 8-ply positions in the game of connect-4 in which

neither player has won yet, and in which the next move is not forced.

5. Number of Instances: 67557

6. Number of Attributes: 42, each corresponding to one connect-4 square

7. Attribute Information: (x=player x has taken, o=player o has taken, b=blank)

The board is numbered like:

6  * * * * * * *
5  * * * * * * *
4  * * * * * * *
3  * * * * * * *
2  * * * * * * *
1  * * * * * * *
   a b c d e f g


1. a1: {x,o,b}

2. a2: {x,o,b}

3. a3: {x,o,b}

4. a4: {x,o,b}

5. a5: {x,o,b}

6. a6: {x,o,b}

7. b1: {x,o,b}

8. b2: {x,o,b}

9. b3: {x,o,b}

10. b4: {x,o,b}

11. b5: {x,o,b}

12. b6: {x,o,b}

13. c1: {x,o,b}

14. c2: {x,o,b}

15. c3: {x,o,b}

16. c4: {x,o,b}

17. c5: {x,o,b}

18. c6: {x,o,b}

19. d1: {x,o,b}

20. d2: {x,o,b}

21. d3: {x,o,b}

22. d4: {x,o,b}

23. d5: {x,o,b}

24. d6: {x,o,b}

25. e1: {x,o,b}

26. e2: {x,o,b}

27. e3: {x,o,b}

28. e4: {x,o,b}

29. e5: {x,o,b}

30. e6: {x,o,b}

31. f1: {x,o,b}


32. f2: {x,o,b}

33. f3: {x,o,b}

34. f4: {x,o,b}

35. f5: {x,o,b}

36. f6: {x,o,b}

37. g1: {x,o,b}

38. g2: {x,o,b}

39. g3: {x,o,b}

40. g4: {x,o,b}

41. g5: {x,o,b}

42. g6: {x,o,b}

43. Class: {win,loss,draw}

8. Missing Attribute Values: None

9. Class Distribution: 44473 win(65.83%), 16635 loss(24.62%), 6449 draw(9.55%)


DATA SET INFORMATION – LETTER IMAGE

RECOGNITION DATA SET


1. Title:

Letter Image Recognition Data

2. Source Information

-- Creator: David J. Slate

-- Odesta Corporation; 1890 Maple Ave; Suite 115; Evanston, IL 60201

-- Donor: David J. Slate ([email protected]) (708) 491-3867

-- Date: January, 1991

3. Past Usage:

-- P. W. Frey and D. J. Slate (Machine Learning Vol 6 #2 March 91):

"Letter Recognition Using Holland-style Adaptive Classifiers".

The research for this article investigated the ability of several variations of

Holland-style adaptive classifier systems to learn to correctly guess the letter categories

associated with vectors of 16 integer attributes extracted from raster scan images of the

letters. The best accuracy obtained was a little over 80%. It would be interesting to see

how well other methods do with the same data.

4. Relevant Information:

The objective is to identify each of a large number of black-and-white rectangular

pixel displays as one of the 26 capital letters in the English alphabet. The character

images were based on 20 different fonts and each letter within these 20 fonts was


randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was

converted into 16 primitive numerical attributes (statistical moments and edge counts)

which were then scaled to fit into a range of integer values from 0 through 15. We

typically train on the first 16000 items and then use the resulting model to predict the

letter category for the remaining 4000. See the article cited above for more details.

5. Number of Instances: 20000

6. Number of Attributes: 17 (Letter category and 16 numeric features)

7. Attribute Information:

1. lettr capital letter (26 values from A to Z)

2. x-box horizontal position of box (integer)

3. y-box vertical position of box (integer)

4. width width of box (integer)

5. high height of box (integer)

6. onpix total # on pixels (integer)

7. x-bar mean x of on pixels in box (integer)

8. y-bar mean y of on pixels in box (integer)

9. x2bar mean x variance (integer)

10. y2bar mean y variance (integer)

11. xybar mean x y correlation (integer)

12. x2ybr mean of x * x * y (integer)


13. xy2br mean of x * y * y (integer)

14. x-ege mean edge count left to right (integer)

15. xegvy correlation of x-ege with y (integer)

16. y-ege mean edge count bottom to top (integer)

17. yegvx correlation of y-ege with x (integer)

8. Missing Attribute Values: None

9. Class Distribution:

789 A 766 B 736 C 805 D 768 E 775 F 773 G

734 H 755 I 747 J 739 K 761 L 792 M 783 N

753 O 803 P 783 Q 758 R 748 S 796 T 813 U

764 V 752 W 787 X 786 Y 734 Z


DATA SET INFORMATION – OPTICAL RECOGNITION

OF HANDWRITTEN DIGITS


1. Title of Database:

Optical Recognition of Handwritten Digits

2. Source:

E. Alpaydin, C. Kaynak

Department of Computer Engineering

Bogazici University, 80815 Istanbul Turkey

[email protected]

July 1998

3. Past Usage:

C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their Applications to Handwritten Digit Recognition, MSc Thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University.

E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika, to appear. ftp://ftp.icsi.berkeley.edu/pub/ai/ethem/kyb.ps.Z

4. Relevant Information:

We used preprocessing programs made available by NIST to extract normalized

bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and a different 13 to the test set. The 32x32 bitmaps are divided into nonoverlapping blocks of 4x4, and the number of on pixels is counted in each block.


This generates an input matrix of 8x8 where each element is an integer in the range 0..16.

This reduces dimensionality and gives invariance to small distortions.

For info on NIST preprocessing routines, see:

M. D. Garris, J. L. Blue, G. T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A.

Janet, and C. L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR

5469, 1994.
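A minimal sketch of the block-counting step described above, assuming the 32x32 bitmap is already available as a NumPy array of 0/1 values (NumPy is used here only for illustration, not as part of the original NIST preprocessing):

```python
import numpy as np

def block_counts(bitmap):
    """Reduce a 32x32 binary bitmap to an 8x8 matrix by counting the on
    pixels in each nonoverlapping 4x4 block (each entry is 0..16)."""
    return bitmap.reshape(8, 4, 8, 4).sum(axis=(1, 3))

bitmap = (np.random.rand(32, 32) > 0.5).astype(int)  # stand-in bitmap
features = block_counts(bitmap).flatten()             # 64 input attributes
```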

5. Number of Instances

optdigits.tra Training 3823

optdigits.tes Testing 1797

The way we used the dataset was to use half of training for actual training, one-

fourth for validation and one-fourth for writer-dependent testing. The test set was used

for writer-independent testing and is the actual quality measure.

6. Number of Attributes

64 input+1 class attribute

7. For Each Attribute:

All input attributes are integers in the range 0..16.

The last attribute is the class code 0..9

8. Missing Attribute Values: None


9. Class Distribution

Class: No. of examples in training set

0: 376

1: 389

2: 380

3: 389

4: 387

5: 376

6: 377

7: 387

8: 380

9: 382

Class: No. of examples in testing set

0: 178

1: 182

2: 177

3: 183

4: 181

5: 182

6: 181

7: 179


8: 174

9: 180

Accuracy on the testing set with k-nn using Euclidean distance as the metric

k = 1 : 98.00

k = 2 : 97.38

k = 3 : 97.83

k = 4 : 97.61

k = 5 : 97.89

k = 6 : 97.77

k = 7 : 97.66

k = 8 : 97.66

k = 9 : 97.72

k = 10 : 97.55

k = 11 : 97.89


DATA SET INFORMATION – PEN BASED

RECOGNITION OF HANDWRITTEN DIGITS


1. Title of Database:

Pen-Based Recognition of Handwritten Digits

2. Source:

E. Alpaydin, Fevzi. Alimoglu

Department of Computer Engineering

Bogazici University, 80815 Istanbul Turkey

[email protected]

July 1998

3. Past Usage:

F. Alimoglu (1996) Combining Multiple Classifiers for Pen-Based Handwritten

Digit Recognition, MSc Thesis, Institute of Graduate Studies in Science and Engineering,

Bogazici University. http://www.cmpe.boun.edu.tr/~alimoglu/alimoglu.ps.gz

F. Alimoglu, E. Alpaydin, "Methods of Combining Multiple Classifiers Based on

Different Representations for Pen-based Handwriting Recognition," Proceedings of the

Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (TAINN

96), June 1996, Istanbul, Turkey. http://www.cmpe.boun.edu.tr/~alimoglu/tainn96.ps.gz

4. Relevant Information:

We create a digit database by collecting 250 samples from 44 writers. The

samples written by 30 writers are used for training, cross-validation and writer dependent


testing, and the digits written by the other 14 are used for writer independent testing. This

database is also available in the UNIPEN format.

We use a WACOM PL-100V pressure sensitive tablet with an integrated LCD

display and a cordless stylus. The input and display areas are located in the same place.

Attached to the serial port of an Intel 486 based PC, it allows us to collect handwriting

samples. The tablet sends $x$ and $y$ tablet coordinates and pressure level values of the

pen at fixed time intervals (sampling rate) of 100 milliseconds.

These writers are asked to write 250 digits in random order inside boxes of 500 by

500 tablet pixel resolution. Subjects are monitored only during the first entry screens.

Each screen contains five boxes with the digits to be written displayed above. Subjects

are told to write only inside these boxes. If they make a mistake or are unhappy with

their writing, they are instructed to clear the content of a box by using an on-screen

button. The first ten digits are ignored because most writers are not familiar with this type of input device, but subjects are not aware of this.

In our study, we use only ($x, y$) coordinate information. The stylus pressure

level values are ignored. First we apply normalization to make our representation

invariant to translations and scale distortions. The raw data that we capture from the

tablet consist of integer values between 0 and 500 (tablet input box resolution). The new

coordinates are such that the coordinate which has the maximum range varies between 0

and 100. Usually $x$ stays in this range, since most characters are taller than they are

wide.

In order to train and test our classifiers, we need to represent digits as constant

length feature vectors. A commonly used technique leading to good results is resampling


the ( x_t, y_t) points. Temporal resampling (points regularly spaced in time) or spatial

resampling (points regularly spaced in arc length) can be used here.

Raw point data are already regularly spaced in time but the distance between them

is variable. Previous research showed that spatial resampling to obtain a constant number

of regularly spaced points on the trajectory yields much better performance, because it

provides a better alignment between points. Our resampling algorithm uses simple linear

interpolation between pairs of points. The resampled digits are represented as a sequence

of T points ( x_t, y_t )_{t=1}^T, regularly spaced in arc length, as opposed to the input

sequence, which is regularly spaced in time.

So, the input vector size is 2*T, two times the number of points resampled. We

considered spatial resampling to T=8,12,16 points in our experiments and found that T=8

gave the best trade-off between accuracy and complexity.
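A minimal sketch of spatial resampling by linear interpolation, assuming the raw trajectory is given as NumPy arrays of x and y coordinates; this is illustrative code, not the preprocessing program used by the data set's authors.

```python
import numpy as np

def resample_by_arc_length(x, y, T=8):
    """Resample a pen trajectory to T points regularly spaced in arc
    length, using linear interpolation; returns a vector of length 2*T."""
    arc = np.concatenate(([0.0], np.cumsum(np.hypot(np.diff(x), np.diff(y)))))
    targets = np.linspace(0.0, arc[-1], T)     # equally spaced arc lengths
    xr = np.interp(targets, arc, x)
    yr = np.interp(targets, arc, y)
    return np.column_stack([xr, yr]).ravel()   # (x_1, y_1, ..., x_T, y_T)
```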

5. Number of Instances

pendigits.tra Training 7494

pendigits.tes Testing 3498

The way we used the dataset was to use first half of training for actual training,

one-fourth for validation and one-fourth for writer-dependent testing. The test set was

used for writer-independent testing and is the actual quality measure.

6. Number of Attributes

16 inputs + 1 class attribute


7. For Each Attribute:

All input attributes are integers in the range 0..100.

The last attribute is the class code 0..9

8. Missing Attribute Values: None

9. Class Distribution

Class: No. of examples in training set

0: 780

1: 779

2: 780

3: 719

4: 780

5: 720

6: 720

7: 778

8: 719

9: 719

Class: No. of examples in testing set

0: 363

1: 364

2: 364

3: 336


4: 364

5: 335

6: 336

7: 364

8: 336

9: 336

Accuracy on the testing set with k-nn using Euclidean distance as the metric

k = 1 : 97.74

k = 2 : 97.37

k = 3 : 97.80

k = 4 : 97.66

k = 5 : 97.60

k = 6 : 97.57

k = 7 : 97.54

k = 8 : 97.54

k = 9 : 97.46

k = 10 : 97.48

k = 11 : 97.34


DATA SET INFORMATION – LANDSAT SATELLITE

IMAGE DATA SET


1. FILE NAMES

sat.trn - training set

sat.tst - test set

!!! NB. DO NOT USE CROSS-VALIDATION WITH THIS DATA SET !!!

Just train and test only once with the above training and test sets.

2. PURPOSE

The database consists of the multi-spectral values of pixels in 3x3

neighbourhoods in a satellite image, and the classification associated with the central

pixel in each neighbourhood. The aim is to predict this classification, given the multi-

spectral values. In the sample database, the class of a pixel is coded as a number.

3. PROBLEM TYPE

Classification

4. AVAILABLE

This database was generated from Landsat Multi-Spectral Scanner image data.

These and other forms of remotely sensed imagery can be purchased at a price from

relevant governmental authorities. The data is usually in binary form, and distributed on

magnetic tape(s).

5. SOURCE

The small sample database was provided by:

Ashwin Srinivasan


Department of Statistics and Modelling Science

University of Strathclyde

Glasgow

Scotland, UK

6. ORIGIN

The original Landsat data for this database was generated from data purchased

from NASA by the Australian Centre for Remote Sensing, and used for research at:

The Centre for Remote Sensing

University of New South Wales

Kensington, PO Box 1

NSW 2033, Australia.

The sample database was generated taking a small section (82 rows and 100

columns) from the original data. The binary values were converted to their present ASCII

form by Ashwin Srinivasan.

The classification for each pixel was performed on the basis of an actual site visit

by Ms. Karen Hall, when working for Professor John A. Richards, at the Centre for

Remote Sensing at the University of New South Wales, Australia. Conversion to 3x3

neighborhoods and splitting into test and training sets was done by Alistair Sutherland.

7. HISTORY

The Landsat satellite data is one of the many sources of information available for

a scene. The interpretation of a scene by integrating spatial data of diverse types and


resolutions including multispectral and radar data, maps indicating topography, land use

etc. is expected to assume significant importance with the onset of an era characterized

by integrative approaches to remote sensing (for example, NASA's Earth Observing

System commencing this decade). Existing statistical methods are ill-equipped for

handling such diverse data types. Note that this is not true for Landsat MSS data

considered in isolation (as in this sample database). This data satisfies the important

requirements of being numerical and at a single resolution, and standard maximum-

likelihood classification performs very well. Consequently, for this data, it should be

interesting to compare the performance of other methods against the statistical approach.

8. DESCRIPTION

One frame of Landsat MSS imagery consists of four digital images of the same

scene in different spectral bands. Two of these are in the visible region (corresponding

approximately to green and red regions of the visible spectrum) and two are in the (near)

infra-red. Each pixel is a 8-bit binary word, with 0 corresponding to black and 255 to

white. The spatial resolution of a pixel is about 80m x 80m. Each image contains 2340 x

3380 such pixels.

The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each

line of data corresponds to a 3x3 square neighborhood of pixels completely contained

within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands

(converted to ASCII) of each of the 9 pixels in the 3x3 neighborhood and a number

indicating the classification label of the central pixel.


The number is a code for the following classes:

Number Class

1 red soil

2 cotton crop

3 grey soil

4 damp grey soil

5 soil with vegetation stubble

6 mixture class (all types present)

7 very damp grey soil

NB. There are no examples with class 6 in this dataset.

The data is given in random order and certain lines of data have been removed so

you cannot reconstruct the original image from this dataset.

In each line of data the four spectral values for the top-left pixel are given first

followed by the four spectral values for the top-middle pixel and then those for the top-

right pixel, and so on with the pixels read out in sequence left-to-right and top-to-bottom.

Thus, the four spectral values for the central pixel are given by attributes 17,18,19 and

20. If you like you can use only these four attributes, while ignoring the others. This

avoids the problem which arises when a 3x3 neighborhood straddles a boundary.
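A minimal sketch of selecting only the central-pixel attributes, assuming one line of the data set has already been parsed into a list of its 36 spectral values followed by the class label (the example values are hypothetical):

```python
# One parsed line: 36 spectral values (4 bands x 9 pixels) plus the class label.
values = [92, 115, 120, 94] * 9 + [3]      # hypothetical example line
spectral, label = values[:36], values[36]

# Attributes 17, 18, 19, and 20 (1-based) hold the four bands of the central pixel.
central_pixel = spectral[16:20]
```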


9. NUMBER OF EXAMPLES

training set 4435

test set 2000

10. NUMBER OF ATTRIBUTES

36 (= 4 spectral bands x 9 pixels in neighborhood )

11. ATTRIBUTES

The attributes are numerical, in the range 0 to 255.

12. CLASS

There are 6 decision classes: 1,2,3,4,5 and 7.

NB. There are no examples with class 6 in this dataset-they have all been

removed because of doubts about the validity of this class.

13. AUTHOR

Ashwin Srinivasan

Department of Statistics and Data Modeling

University of Strathclyde

Glasgow

Scotland, UK

[email protected]


DATA SET INFORMATION – SPAM EMAIL DATA SET


1. Title:

SPAM E-mail Database

2. Sources:

(a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt

Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304

(b) Donor: George Forman ([email protected] ) 650-857-7835

(c) Generated: June-July 1999

3. Past Usage:

(a) Hewlett-Packard Internal-only Technical Report. External forthcoming.

(b) Determine whether a given email is spam or not.

(c) ~7% misclassification error.

False positives (marking good mail as spam) are very undesirable. If we insist on

zero false positives in the training/testing set, 20-25% of the spam passed through the

filter.

4. Relevant Information:

The "spam" concept is diverse: advertisements for products/web sites, make

money fast schemes, chain letters, pornography, etc. Our collection of spam e-mails came

from our postmaster and individuals who had filed spam. Our collection of non-spam e-

mails came from filed work and personal e-mails, and hence the word 'george' and the

area code '650' are indicators of non-spam. These are useful when constructing a


personalized spam filter. One would either have to blind such non-spam indicators or get

a very wide collection of non-spam to generate a general purpose spam filter. For

background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of

the ACM, 41(8):74-83, 1998.

5. Number of Instances: 4601 (1813 Spam = 39.4%)

6. Number of Attributes: 58 (57 continuous, 1 nominal class label)

7. Attribute Information:

The last column of 'spambase.data' denotes whether the e-mail was considered

spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate

whether a particular word or character was frequently occurring in the e-mail. The run-

length attributes (55-57) measure the length of sequences of consecutive capital letters.

For the statistical measures of each attribute, see the end of this file. Here are the

definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD = percentage of

words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in

the e-mail) / total number of words in e-mail. A "word" in this case is any string of

alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail.

1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters.

1 continuous integer [1,...] attribute of type capital_run_length_longest = length of the longest uninterrupted sequence of capital letters.

1 continuous integer [1,...] attribute of type capital_run_length_total = sum of the lengths of uninterrupted sequences of capital letters = total number of capital letters in the e-mail.

1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was

considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
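A minimal sketch of the word_freq_WORD computation as defined above, tokenizing on runs of alphanumeric characters (illustrative code, not the scripts used to build the data set):

```python
import re

def word_freq(text: str, word: str) -> float:
    """Percentage of words in the e-mail that match WORD, where a word is
    any run of alphanumeric characters bounded by non-alphanumeric ones."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)

print(word_freq("Make money fast! Money back guarantee.", "money"))  # 33.33...
```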

8. Missing Attribute Values: None

9. Class Distribution:

Spam 1813 (39.4%)

Non-Spam 2788 (60.6%)


TABLES


Table B.1: Splitting Results of the Connect-4 Data Set

Size of Original Data Set | Suggested Splitting Ratio | Size of Five Training Sets | Five Actual Splitting Ratios | Avg. Splitting Ratio

67557 50.0% 33779 50.00% 50.00%

67557 60.0% 40534 60.00% 60.00%

67557 66.7% 45060 66.70% 66.70%

67557 70.0% 47290 70.00% 70.00%

67557 80.0% 54045 80.00% 80.00%

67557 90.0% 60801 90.00% 90.00%

67557 95.0% 64179 95.00% 95.00%

67557 97.5% 65868 97.50% 97.50%

Table B.2: Splitting Results of the Letter Recognition Data Set

Size of Original Data Set | Suggested Splitting Ratio | Size of Five Training Sets | Five Actual Splitting Ratios | Avg. Splitting Ratio

20000 50.0% 9999 50.00% 50.00%

20000 60.0% 12001 60.01% 60.01%

20000 66.7% 13342 66.71% 66.71%

20000 70.0% 14000 70.00% 70.00%

20000 80.0% 15998 79.99% 79.99%

20000 90.0% 18000 90.00% 90.00%

20000 95.0% 19000 95.00% 95.00%

20000 97.5% 19500 97.50% 97.50%


Table B.3: Splitting Results of the Optical Digits Data Set

Size of Original Data Set | Suggested Splitting Ratio | Size of Five Training Sets | Five Actual Splitting Ratios | Avg. Splitting Ratio

5620 50.0% 2810 50.00% 50.00%

5620 60.0% 3372 60.00% 60.00%

5620 66.7% 3751 66.74% 66.74%

5620 70.0% 3935 70.02% 70.02%

5620 80.0% 4496 80.00% 80.00%

5620 90.0% 5058 90.00% 90.00%

5620 95.0% 5338 94.98% 94.98%

5620 97.5% 5480 97.51% 97.51%

Table B.4: Splitting Results of the Pen Based Digits Data Set

Size of Original Data Set | Suggested Splitting Ratio | Size of Five Training Sets | Five Actual Splitting Ratios | Avg. Splitting Ratio

10992 50.0% 5493 49.97% 49.97%

10992 60.0% 6595 60.00% 60.00%

10992 66.7% 7332 66.70% 66.70%

10992 70.0% 7696 70.01% 70.01%

10992 80.0% 8793 79.99% 79.99%

10992 90.0% 9892 89.99% 89.99%

10992 95.0% 10442 95.00% 95.00%

10992 97.5% 10717 97.50% 97.50%


Table B.5: Splitting Results of the Land Satellite Image Data Set

Size of Original Data Set | Suggested Splitting Ratio | Size of Five Training Sets | Five Actual Splitting Ratios | Avg. Splitting Ratio

6435 50.0% 3217 49.99% 49.99%

6435 60.0% 3862 60.02% 60.02%

6435 66.7% 4294 66.73% 66.73%

6435 70.0% 4505 70.01% 70.01%

6435 80.0% 5147 79.98% 79.98%

6435 90.0% 5791 89.99% 89.99%

6435 95.0% 6114 95.01% 95.01%

6435 97.5% 6273 97.48% 97.48%

Table B.6: Splitting Results of the SpamBase Data Set

Size of Original Data Set | Suggested Splitting Ratio | Size of Five Training Sets | Five Actual Splitting Ratios | Avg. Splitting Ratio

4601 50.0% 2301 50.01% 50.01%

4601 60.0% 2761 60.01% 60.01%

4601 66.7% 3069 66.70% 66.70%

4601 70.0% 3221 70.01% 70.01%

4601 80.0% 3680 79.98% 79.98%

4601 90.0% 4141 90.00% 90.00%

4601 95.0% 4371 95.00% 95.00%

4601 97.5% 4486 97.50% 97.50%


Table B.7: Splitting Results of the Chess Data Set

Size of Original Data Set | Suggested Splitting Ratio | Size of Five Training Sets | Five Actual Splitting Ratios | Avg. Splitting Ratio

28056 50.0% 14029 50.00% 50.00%

28056 60.0% 16835 60.00% 60.00%

28056 66.7% 18714 66.70% 66.70%

28056 70.0% 19639 70.00% 70.00%

28056 80.0% 22445 80.00% 80.00%

28056 90.0% 25251 90.00% 90.00%

28056 95.0% 26651 94.99% 94.99%

28056 97.5% 27353 97.49% 97.49%

Table B.8: Pruning Results of YaDT algorithm

Number | Pruning Confidence | Minimum Number of Cases | Tree Size | Error % | Error Rate/Tree Size

Chess Data Set

1 No pruning - 14327 12.0473 0.0008

2 20% 2 8383 15.2053 0.0018

3 40% 2 9551 13.0632 0.0013

4 60% 2 10141 12.4679 0.0012

5 80% 2 10455 12.2006 0.0011

6 100% 2 10533 12.1471 0.0011

Connect-4 Data Set

1 No pruning - 23845 8.09835 0.0003

2 20% 2 5464 13.3621 0.002

3 40% 2 8575 10.8249 0.001

4 60% 2 11566 9.31806 0.0008

5 80% 2 14011 8.40031 0.0005

6 100% 2 14653 8.23897 0.0005

Pen Recognition Data Set

1 No pruning - 425 0.787297 0.001

2 20% 2 287 1.08086 0.0037

3 40% 2 301 0.947425 0.0031

4 60% 2 331 0.827329 0.002

5 80% 2 343 0.787297 0.002

6 100% 2 343 0.787297 0.002

Letter Recognition Data Set

1 No pruning - 3549 3.36 0.0009


2 20% 2 2395 3.8 0.0015

3 40% 2 2509 3.51 0.0013

4 60% 2 2573 3.4 0.0013

5 80% 2 2597 3.365 0.0012

6 100% 2 2597 3.365 0.0012

Optical Digits Data Set

1 No pruning - 437 1.98797 0.004

2 20% 2 315 2.17107 0.0067

3 40% 2 325 2.04028 0.0062

4 60% 2 329 2.01413 0.0061

5 80% 2 333 1.98797 0.005

6 100% 2 333 1.98797 0.005

Land Satellite Data Set

1 No pruning - 621 2.02931 0.003

2 20% 2 399 2.88613 0.007

3 40% 2 445 2.36753 0.005

4 60% 2 487 2.07441 0.004

5 80% 2 487 2.07441 0.004

6 100% 2 493 2.05186 0.004

SpamBase Data Set

1 No pruning - 499 1.67355 0.003

2 20% 2 203 2.86894 0.014

3 40% 2 269 2.19517 0.008

4 60% 2 343 1.67355 0.004

5 80% 2 343 1.67355 0.004

6 100% 2 343 1.67355 0.004


Table B.9: Average Error Rates from all datasets in YaDT algorithm

Actual Splitting Ratio | Average Error Rate | Actual Splitting Ratio | Average Error Rate | Actual Splitting Ratio | Average Error Rate

Chess Data Set Letter Recognition Data Set Optical Digits Data Set

50.00% 32.72% 50.00% 15.26% 50.00% 11.49%

60.00% 33.94% 60.01% 13.97% 60.00% 11.21%

66.70% 36.22% 66.71% 13.45% 66.74% 10.76%

70.00% 38.19% 70.00% 13.37% 70.02% 10.97%

80.00% 41.16% 79.99% 12.85% 80.00% 9.64%

90.00% 36.62% 90.00% 12.05% 90.00% 9.35%

94.99% 32.08% 95.00% 11.42% 94.98% 8.93%

97.49% 29.90% 97.50% 10.92% 97.51% 8.28%

Connect-4 Data Set Pen-based Digits Data Set Satellite Image Data Set

50.00% 21.53% 49.97% 4.84% 49.99% 14.47%

60.00% 20.95% 60.00% 4.15% 60.02% 13.46%

66.70% 20.19% 66.70% 4.27% 66.73% 14.00%

70.00% 20.43% 70.01% 3.90% 70.01% 14.16%

80.00% 19.81% 79.99% 3.72% 79.98% 14.63%

90.00% 18.96% 89.99% 3.27% 89.99% 14.38%

95.00% 18.72% 95.00% 2.90% 95.01% 13.58%

97.50% 18.57% 97.50% 3.63% 97.48% 11.85%

Spambase Data Set

50.01% 8.61%

60.01% 7.95%

66.70% 8.01%

70.01% 8.01%

79.98% 7.25%

90.00% 7.34%

95.00% 8.00%

97.50% 8.69%


Table B.10: Average Error Rate from all Data Sets by Random Forest

Actual Splitting Ratio | Average Error Rate | Actual Splitting Ratio | Average Error Rate | Actual Splitting Ratio | Average Error Rate

Chess Data Set Letter Recognition Data Set Opt Digits Data Set

50.00% 37.6915 50.00% 4.86% 50.00% 1.98%

60.00% 35.3462 60.01% 4.31% 60.00% 1.80%

66.70% 33.8621 66.71% 4.23% 66.74% 1.97%

70.00% 33.7151 70.00% 3.93% 70.02% 2.10%

80.00% 31.8766 79.99% 3.79% 80.00% 1.65%

90.00% 30.6595 90.00% 2.91% 90.00% 1.28%

94.99% 30.5053 95.00% 3.04% 94.98% 1.41%

97.49% 28.5917 97.50% 3.56% 97.51% 1.42%

Connect-4 Data Set Pen Digits Data Set Satellite Image Data Set

50.00% 19.1752 49.97% 1.19% 49.99% 8.91%

60.00% 18.5131 60.00% 1.04% 60.02% 8.59%

66.70% 18.4069 66.70% 1.03% 66.73% 8.70%

70.00% 18.1349 70.01% 1.12% 70.01% 8.48%

80.00% 18.0491 79.99% 0.97% 79.98% 8.40%

90.00% 17.6139 89.99% 1.09% 89.99% 8.19%

95.00% 17.7323 95.00% 0.76% 95.01% 8.78%

97.50% 16.9094 97.50% 0.87% 97.48% 6.41%

Spambase Data Set

50.01% 4.93%

60.01% 4.43%

66.70% 4.72%

70.01% 5.20%

79.98% 4.88%

90.00% 4.30%

95.00% 5.56%

97.50% 4.69%


Table B.11: Complete error rate results of YaDT algorithm

Actual Splitting Ratio | Error Rate (1st Training Set) | Error Rate (2nd Training Set) | Error Rate (3rd Training Set) | Error Rate (4th Training Set) | Error Rate (5th Training Set) | Avg. Error Rate

Chess Data Set

50.00% 33.4046 32.0499 30.9091 34.1533 33.1194 32.72%

60.00% 33.8264 34.8066 34.3254 33.2917 33.4521 33.94%

66.70% 36.3431 36.058 35.2026 36.3788 37.151 36.22%

70.00% 38.722 37.7685 37.6526 39.2389 37.5813 38.19%

80.00% 40.586 41.249 42.5109 41.2918 40.1797 41.16%

90.00% 36.6731 36.9407 36.5982 36.3091 36.566 36.62%

94.99% 30.0356 31.5302 31.8861 33.452 33.5231 32.08%

97.49% 30.5832 30.7255 29.872 30.441 27.8805 29.90%

Connect-4 Data Set

50.00% 22.2956 21.1528 21.0314 22.0854 21.0758 21.53%

60.00% 21.4336 20.4566 20.7416 20.823 21.3004 20.95%

66.70% 20.0827 20.4472 20.0871 20.0382 20.2783 20.19%

70.00% 20.6987 20.2102 21.2316 19.8845 20.1362 20.43%

80.00% 19.405 19.5456 19.849 20.7001 19.5604 19.81%

90.00% 18.5761 20.3375 18.3393 18.7833 18.7389 18.96%

94.99% 19.8342 17.762 19.0941 18.1172 18.7981 18.72%

97.49% 17.8804 19.1237 18.6501 18.8869 18.2948 18.57%

Letter Recognition Data Set

50.00% 15.1785 14.8085 15.0085 16.0684 15.2385 15.26%

60.00% 13.3017 13.8392 14.5018 14.6018 13.6267 13.97%

66.70% 13.3073 13.7128 13.6398 13.6978 12.9318 13.45%

70.00% 12.85 13.9167 13.1333 14.0833 12.8833 13.37%

80.00% 12.8936 12.8436 12.2689 12.8686 13.4216 12.85%

90.00% 12.5 11.65 12.4 11.35 12.35 12.05%

94.99% 11.1 11.5 11.9 10.9 11.7 11.42%

97.49% 11.0 11.8 11.8 10.2 9.8 10.92%

Optical Digits Data Set

50.00% 10.7829 11.7438 12.242 11.1388 11.5658 11.49%

60.00% 11.121 9.83096 11.3879 11.8772 11.8772 11.21%

66.70% 10.7009 11.1289 9.95185 11.1289 10.9149 10.76%

70.00% 10.8012 10.029 10.2077 12.0475 11.8101 10.97%

80.00% 9.07473 9.78648 8.98577 10.2313 10.1423 9.64%

90.00% 9.60854 11.032 8.36299 10.3203 7.47331 9.35%

94.99% 8.86525 8.86525 10.6383 7.44681 8.86525 8.93%

97.49% 5.71429 7.85714 9.28571 10.00 8.57143 8.28%


Pen-based Digits Data Set

50.00% 5.12821 4.23713 4.8918 4.8918 5.09184 4.84%

60.00% 4.13919 4.00273 4.48033 4.11644 4.02547 4.15%

66.70% 3.63388 3.96175 4.72678 4.89071 4.18033 4.27%

70.00% 4.33859 4.03519 3.91384 3.54976 3.70146 3.90%

80.00% 3.00136 4.00182 3.91087 3.95634 3.77444 3.72%

90.00% 2.90909 3.09091 3.36364 3.36364 3.63636 3.27%

94.99% 3.09091 3.45455 4.36364 2.72727 0.872886 2.90%

97.49% 2.54545 3.27273 4.36364 5.45455 2.54545 3.63%

Land Satellite Image Data Set

50.00% 15.9938 13.1988 14.1304 14.5963 14.441 14.47%

60.00% 13.9752 13.1988 13.7422 12.8106 13.587 13.46%

66.70% 14.3005 13.4197 13.7306 14.5596 13.9896 14.00%

70.00% 13.7971 15.8181 13.1364 13.6028 14.4578 14.16%

80.00% 14.7607 13.8595 14.5121 15.6619 14.3567 14.63%

90.00% 14.6193 14.2924 14.6193 14.2457 14.1523 14.38%

94.99% 14.0187 13.3956 13.7072 13.7072 13.0841 13.58%

97.49% 13.5802 9.87654 10.4938 12.3457 12.963 11.85%

SpamBase Data Set

50.00% 9.34783 8.00 8.00 8.08696 9.65217 8.61%

60.00% 7.82609 9.29348 7.11957 7.22826 8.31522 7.95%

66.70% 8.28982 7.8329 7.63708 7.44125 8.87729 8.01%

70.00% 8.98551 8.47826 7.6087 8.18841 6.81159 8.01%

80.00% 7.16612 6.84039 7.60043 6.73181 7.92617 7.25%

90.00% 7.6087 6.52174 7.17391 8.69565 6.73913 7.34%

94.99% 7.3913 10.00 6.52174 8.26087 7.82609 8.00%

97.49% 6.95652 9.56522 10.4348 6.95652 9.56522 8.69%

Table B.12: Complete error rate results of Random Forest algorithm

Actual Splitting Ratio | Error Rate (1st Training Set) | Error Rate (2nd Training Set) | Error Rate (3rd Training Set) | Error Rate (4th Training Set) | Error Rate (5th Training Set) | Avg. Error Rate

Chess Data Set

50.00% 37.1640 37.8983 37.7201 37.6346 38.0409 37.6915

60.00% 36.0574 35.3712 35.0236 35.2375 35.0414 35.3462

66.70% 34.1040 34.3824 33.7936 33.5153 33.5153 33.8621


70.00% 33.4323 33.7294 33.5393 34.1095 33.7650 33.7151

80.00% 31.6521 33.0244 31.7234 31.9373 31.0462 31.8766

90.00% 31.3725 30.3387 29.1979 31.9786 30.4100 30.6595

94.99% 28.6121 30.0356 31.3167 31.3167 31.2456 30.5053

97.49% 30.4410 27.027 30.0142 26.7425 28.734 28.5917

Connect-4 Data Set

50.00% 18.9976 19.3528 19.4002 18.8229 19.3025 19.1752

60.00% 18.562 18.3325 18.4028 18.6323 18.636 18.5131

66.70% 18.5803 18.7892 18.2113 17.8824 18.5714 18.4069

70.00% 18.3895 17.9898 18.2119 18.0096 18.0737 18.1349

80.00% 17.9174 18.0876 17.8138 18.2356 18.1912 18.0491

90.00% 17.54 17.7768 17.2143 18.1024 17.4364 17.6139

94.99% 18.8869 17.2883 17.3475 17.3475 17.7916 17.7323

97.49% 16.6371 16.8739 16.6963 17.1107 17.2291 16.9094

Letter Recognition Data Set

50.00% 5.1195 4.7995 4.7395 4.8095 4.8395 4.86%

60.00% 4.1505 4.1505 4.5631 4.3005 4.4131 4.31%

66.70% 4.7912 3.9952 4.3106 4.2655 3.7999 4.23%

70.00% 4.1167 3.4333 3.65 4.6333 3.8333 3.93%

80.00% 4.1229 3.5482 3.1984 3.8481 3.973 3.79%

90.00% 3.45 2.85 2.5 2.7 3.05 2.91%

94.99% 3 2.7 2.8 3.3 3.4 3.04%

97.49% 3.4 2.4 4.4 3.8 3.8 3.56%

Optical Digits Data Set

50.00% 1.4947 2.0996 1.9217 2.2776 2.0996 1.98%

60.00% 1.5125 1.8683 1.7794 2.0018 1.8683 1.80%

66.70% 1.8192 2.1402 2.0332 2.3007 1.6051 1.97%

70.00% 2.3145 2.3145 1.9585 2.1365 1.7804 2.10%

80.00% 1.2456 1.5125 1.6904 1.7794 2.0463 1.65%

90.00% 1.9573 1.2456 1.4235 1.0676 0.7117 1.28%

94.99% 1.4184 2.1277 1.0638 1.0638 1.4184 1.41%

97.49% 0 0 1.4286 1.4286 4.2857 1.42%

Pen-based Digits Data Set

50.00% 1.2911 0.8001 1.4184 1.3093 1.1275 1.19%

60.00% 0.9097 0.9552 1.0917 1.0689 1.1826 1.04%

66.70% 0.929 0.929 1.0383 1.3388 0.929 1.03%

70.00% 0.8799 1.2743 1.3956 0.8799 1.1833 1.12%

80.00% 1.0914 0.955 1.0005 0.9095 0.9095 0.97%

90.00% 1.2727 1.0000 0.9091 1.4545 0.8182 1.09%

94.99% 0.9091 0.7273 1.2727 0.3636 0.5455 0.76%

97.49% 0.7273 0.7273 1.4545 1.0909 0.3636 0.87%

Land Satellite Image Data Set


50.00% 9.6644 9.074 8.0796 9.0429 8.7011 8.91%

60.00% 8.7058 8.7058 8.0451 8.6281 8.9001 8.59%

66.70% 9.3414 8.5474 8.0336 8.7342 8.8744 8.70%

70.00% 8.1347 9.2746 8.0829 8.4456 8.4974 8.48%

80.00% 8.6957 8.4627 7.5311 9.2391 8.0745 8.40%

90.00% 8.6957 7.6087 7.764 8.3851 8.5404 8.19%

94.99% 9.3458 7.4766 9.0343 8.4112 9.6573 8.78%

97.49% 8.0247 7.4074 4.9383 4.9383 6.7901 6.41%

SpamBase Data Set

50.00% 5.1304 4.4783 5.0435 5.1304 4.913 4.93%

60.00% 3.8043 4.6196 5.0543 3.913 4.7826 4.43%

66.70% 4.9608 4.5692 4.3734 4.047 5.6789 4.72%

70.00% 5.0725 6.3768 4.7826 5.0725 4.7101 5.20%

80.00% 5.5375 4.2345 4.886 3.9088 5.8632 4.88%

90.00% 4.1304 4.3478 4.5652 4.7826 3.6957 4.30%

94.99% 5.6522 7.3913 4.7826 6.087 3.913 5.56%

97.49% 2.6087 6.087 4.3478 3.4783 6.9565 4.69%

Table B.13: Combined Results of Error Rate by C4.5, CART, YaDT & Random Forest

Avg. Splitting Ratio | Error Rate by C4.5 | Error Rate by CART | Error Rate by YaDT | Error Rate by Random Forest

Chess Data Set

50.00% 38.96% 35.68% 32.72% 37.69%

60.00% 36.18% 34.44% 33.94% 35.35%

66.70% 33.96% 32.72% 36.22% 33.86%

70.00% 33.70% 32.62% 38.19% 33.71%

80.00% 31.52% 30.72% 41.16% 31.87%

90.00% 29.18% 29.06% 36.62% 30.65%

94.99% 28.80% 29.20% 32.08% 30.50%

97.49% 26.78% 27.12% 29.90% 28.59%

Pen Recognition Data Set

49.97% 4.74% 5.32% 4.84% 1.19%

60.00% 3.92% 4.76% 4.15% 1.04%

66.70% 4.14% 4.50% 4.27% 1.03%

70.01% 3.74% 4.34% 3.90% 1.12%

79.99% 3.68% 4.08% 3.72% 0.97%

89.99% 2.94% 3.76% 3.27% 1.09%


95.00% 2.96% 4.22% 2.90% 0.76%

97.50% 2.98% 3.62% 3.63% 0.87%

Optical Digits Data Set

50.00% 11.20% 11.34% 11.49% 1.98%

60.00% 10.78% 11.28% 11.21% 1.80%

66.74% 10.58% 10.48% 10.76% 1.97%

70.02% 10.74% 11.36% 10.97% 2.10%

80.00% 9.52% 9.28% 9.64% 1.65%

90.00% 8.72% 10.08% 9.35% 1.28%

94.98% 8.60% 8.30% 8.93% 1.41%

97.51% 8.30% 7.12% 8.28% 1.42%

SpamBase Data Set

50.01% 8.70% 9.00% 8.61% 4.93%

60.01% 7.98% 8.32% 7.95% 4.43%

66.70% 7.96% 7.86% 8.01% 4.72%

70.01% 8.36% 8.56% 8.01% 5.20%

79.98% 7.34% 7.00% 7.25% 4.88%

90.00% 7.06% 6.84% 7.34% 4.30%

95.00% 8.08% 8.56% 8.00% 5.56%

97.50% 7.14% 6.60% 8.69% 4.69%

Connect-4 Data Set

50.00% 21.12% 32.72% 21.53% 19.18%

60.00% 20.64% 32.34% 20.95% 18.51%

66.70% 20.08% 31.38% 20.19% 18.41%

70.00% 20.32% 30.96% 20.43% 18.13%

80.00% 19.62% 31.42% 19.81% 18.05%

90.00% 19.08% 30.08% 18.96% 17.61%

95.00% 18.50% 31.70% 18.72% 17.73%

97.50% 18.12% 29.58% 18.57% 16.91%

Letter Recognition Data Set

50.00% 15.16% 17.42% 15.26% 4.86%

60.01% 13.92% 15.80% 13.97% 4.31%

66.71% 13.38% 15.36% 13.45% 4.23%

70.00% 13.30% 14.88% 13.37% 3.93%

79.99% 12.56% 14.26% 12.85% 3.79%

90.00% 12.04% 13.34% 12.05% 2.91%

95.00% 11.38% 12.94% 11.42% 3.04%

97.50% 11.32% 13.28% 10.92% 3.56%

Land Satellite Data Set

49.99% 14.62% 14.84% 14.47% 8.91%

60.02% 13.74% 14.56% 13.46% 8.59%

66.73% 14.16% 14.44% 14.00% 8.70%

70.01% 13.86% 14.48% 14.16% 8.48%

79.98% 13.90% 13.98% 14.63% 8.40%

89.99% 14.00% 13.80% 14.38% 8.19%


95.01% 13.52% 13.20% 13.58% 8.78%

97.48% 11.36% 10.84% 11.85% 6.41%

Table B.14: Two Tailed Test Results of YaDT vs. CART, C4.5

Split Ratio | Number of Test Samples | Error Rate with YaDT (E1) | Error Rate with C4.5 (E2) | Error Rate with CART (E3) | Error Rate with Random Forest (E4) | T1 | T2 | T3

Chess Data Set

50.0% 14027 32.72 38.96 35.68 37.69 10.89 5.22 8.71

60.0% 11221 33.94 36.18 34.44 35.35 3.51 0.79 2.21

66.7% 9342 36.22 33.96 32.72 33.86 3.23 5.03 3.38

70.0% 8417 38.19 33.70 32.62 33.71 6.07 7.55 6.05

80.0% 5611 41.16 31.52 30.72 31.87 10.61 11.52 10.21

90.0% 2805 36.62 29.18 29.06 30.65 5.93 6.02 4.73

94.9% 1405 32.08 28.80 29.20 30.50 1.88 1.65 0.90

97.4% 703 29.90 26.78 27.12 28.59 1.29 1.15 0.54

Connect-4 Data Set

50.0% 33778 21.53 21.12 32.72 19.18 1.30 32.10 7.58

60.0% 27023 20.95 20.64 32.34 18.51 0.88 29.94 7.12

66.7% 22497 20.19 20.08 31.38 18.41 0.29 27.12 4.78

70.0% 20267 20.43 20.32 30.96 18.13 0.27 24.25 5.86

80.0% 13512 19.81 19.62 31.42 18.05 0.39 21.86 3.69

90.0% 6756 18.96 19.08 30.08 17.61 0.17 15.02 2.02

95.0% 3378 18.72 18.50 31.70 17.73 0.23 12.28 1.05

97.5% 1689 18.57 18.12 29.58 16.91 0.33 7.48 1.26

Letter Recognition Data Set

50.0% 10001 15.26 15.16 17.42 5.20 0.19 4.13 23.47

60.0% 7999 13.97 13.92 15.80 4.68 0.09 3.25 20.20

66.7% 6658 13.45 13.38 15.36 4.50 0.11 3.13 18.06

70.0% 6000 13.37 13.30 14.88 4.24 0.11 2.37 17.64

79.9% 4002 12.85 12.56 14.26 3.92 0.38 1.84 14.41

90.0% 2000 12.05 12.04 13.34 3.12 0.009 1.25 10.66

95.0% 1000 11.42 11.38 12.94 3.26 0.028 1.03 6.99

97.5% 500 10.92 11.32 13.28 3.60 0.201 1.14 4.46

Opt Digits Data Set

50.0% 2810 11.49 11.20 11.34 2.11 0.34 0.17 13.96

60.0% 2248 11.21 10.78 11.28 2.03 0.46 0.07 12.37

67.7% 1869 10.76 10.58 10.48 2.16 0.178 0.27 10.69

70.0% 1685 10.97 10.74 11.36 2.21 0.21 0.35 10.24

80.0% 1124 9.64 9.52 9.28 1.81 0.096 0.29 7.98

90.0% 562 9.35 8.72 10.08 1.42 0.368 0.41 5.88


94.9% 282 8.93 8.60 8.30 1.49 0.138 0.26 3.97

97.5% 140 8.28 8.30 7.12 1.42 0.006 0.36 2.67

Pen Digits Data Set

49.9% 5499 4.84 4.74 5.32 1.01 0.245 1.14 11.91

60.0% 4397 4.15 3.92 4.76 0.92 0.548 1.38 9.63

66.7% 3660 4.27 4.14 4.50 0.86 0.277 0.48 9.22

70.0% 3296 3.90 3.74 4.34 0.95 0.338 0.898 7.78

79.9% 2199 3.72 3.68 4.08 0.89 0.07 0.616 6.25

89.9% 1100 3.27 2.94 3.76 0.87 0.446 0.624 3.95

95.0% 550 2.90 2.96 4.22 0.89 0.058 1.18 2.44

97.5% 275 3.63 2.98 3.62 1.01 0.426 0.006 2.04

Landsat Satellite Image Data Set

49.9% 3218 14.47 14.62 14.84 8.98 0.17 0.419 6.84

60.0% 2573 13.46 13.74 14.56 8.95 0.29 1.136 5.12

66.7% 2141 14.00 14.16 14.44 8.81 0.15 0.412 5.34

70.0% 1930 14.16 13.86 14.48 8.48 0.26 0.283 5.56

79.9% 1288 14.63 13.90 13.98 8.83 0.529 0.47 4.57

89.9% 644 14.38 14.00 13.80 8.35 0.195 0.299 3.40

95.0% 321 13.58 13.52 13.20 8.72 0.022 0.14 1.95

97.4% 162 11.85 11.36 10.84 6.54 0.137 0.286 1.65

Spambase Data Set

50.0% 2300 8.61 8.70 9.00 4.97 0.108 0.466 4.90

60.0% 1840 7.95 7.98 8.32 4.47 0.033 0.410 4.37

66.7% 1532 8.01 7.96 7.86 4.67 0.051 0.1535 3.79

70.0% 1380 8.01 8.36 8.56 5.36 0.335 0.524 2.78

79.9% 921 7.25 7.34 7.00 4.84 0.07 0.208 2.17

90.0% 460 7.34 7.06 6.84 4.56 0.16 0.295 1.78

95.0% 230 8.00 8.08 8.56 5.39 0.031 0.217 1.11

97.5% 115 8.69 7.14 6.60 4.69 0.435 0.596 1.21

Table B.15: Two Tailed Test Results of Random Forest vs. Three algorithms

Split Ratio | Number of Test Samples | Error Rate with Random Forest (E1) | Error Rate with C4.5 (E2) | Error Rate with CART (E3) | Error Rate with YaDT (E4) | T1 | T2 | T3

Chess Data Set

50.0% 14027 37.69 38.96 35.68 32.72 2.18 3.49 8.71

60.0% 11221 35.35 36.18 34.44 33.94 1.29 1.43 2.21

66.7% 9342 33.86 33.96 32.72 36.22 0.14 1.65 3.38

70.0% 8417 33.71 33.70 32.62 38.19 0.01 1.50 6.05

80.0% 5611 31.87 31.52 30.72 41.16 0.39 1.31 10.2


90.0% 2805 30.65 29.18 29.06 36.62 1.20 1.30 4.73

94.9% 1405 30.50 28.80 29.20 32.08 0.98 0.75 0.90

97.4% 703 28.59 26.78 27.12 29.90 0.75 0.61 0.53

Connect-4 Data Set

50.0% 33778 19.18 21.12 32.72 21.53 6.28 40.14 7.58

60.0% 27023 18.51 20.64 32.34 20.95 6.24 36.91 7.12

66.7% 22497 18.41 20.08 31.38 20.19 4.49 31.91 4.78

70.0% 20267 18.13 20.32 30.96 20.43 5.59 30.01 5.86

80.0% 13512 18.05 19.62 31.42 19.81 3.30 25.46 3.69

90.0% 6756 17.61 19.08 30.08 18.96 2.20 17.00 2.02

95.0% 3378 17.73 18.50 31.70 18.72 0.82 13.30 1.05

97.5% 1689 16.91 18.12 29.58 18.57 0.92 8.71 1.26

Letter Recognition Data Set

50.0% 10001 5.20 15.16 17.42 15.26 23.29 27.28 23.47

60.0% 7999 4.68 13.92 15.80 13.97 20.12 23.19 20.20

66.7% 6658 4.50 13.38 15.36 13.45 17.95 20.95 18.06

70.0% 6000 4.24 13.30 14.88 13.37 17.54 19.81 17.64

79.9% 4002 3.92 12.56 14.26 12.85 14.05 16.09 14.41

90.0% 2000 3.12 12.04 13.34 12.05 10.65 11.75 10.66

95.0% 1000 3.26 11.38 12.94 11.42 6.97 7.93 6.99

97.5% 500 3.60 11.32 13.28 10.92 4.64 5.50 4.46

Opt Digits Data Set

50.0% 2810 2.11 11.20 11.34 11.49 13.67 13.81 13.96

60.0% 2248 2.03 10.78 11.28 11.21 11.98 12.44 12.37

67.7% 1869 2.16 10.58 10.48 10.76 10.53 10.45 10.69

70.0% 1685 2.21 10.74 11.36 10.97 10.06 10.56 10.24

80.0% 1124 1.81 9.52 9.28 9.64 7.90 7.73 7.98

90.0% 562 1.42 8.72 10.08 9.35 5.57 6.23 5.88

94.9% 282 1.49 8.60 8.30 8.93 3.85 3.74 3.97

97.5% 140 1.42 8.30 7.12 8.28 2.67 2.35 2.67

Pen Digits Data Set

49.9% 5499 1.01 4.74 5.32 4.84 11.70 12.90 11.91

60.0% 4397 0.92 3.92 4.76 4.15 9.15 10.83 9.63

66.7% 3660 0.86 4.14 4.50 4.27 8.98 9.64 9.22


70.0% 3296 0.95 3.74 4.34 3.90 7.48 8.57 7.78

79.9% 2199 0.89 3.68 4.08 3.72 6.19 6.79 6.25

89.9% 1100 0.87 2.94 3.76 3.27 3.55 4.50 3.95

95.0% 550 0.89 2.96 4.22 2.90 2.49 3.49 2.44

97.5% 275 1.01 2.98 3.62 3.63 1.65 2.03 2.04

Landsat Satellite Image Data Set

49.9% 3218 8.98 14.62 14.84 14.47 7.01 7.25 6.84

60.0% 2573 8.95 13.74 14.56 13.46 5.41 6.24 5.12

66.7% 2141 8.81 14.16 14.44 14.00 5.49 5.74 5.34

70.0% 1930 8.48 13.86 14.48 14.16 5.30 5.84 5.56

79.9% 1288 8.83 13.90 13.98 14.63 4.05 4.11 4.57

89.9% 644 8.35 14.00 13.80 14.38 3.21 3.11 3.40

95.0% 321 8.72 13.52 13.20 13.58 1.93 1.81 1.95

97.4% 162 6.54 11.36 10.84 11.85 1.51 1.37 1.65

Spambase Data Set

50.0% 2300 4.97 8.70 9.00 8.61 5.01 5.36 4.90

60.0% 1840 4.47 7.98 8.32 7.95 4.40 4.77 4.37

66.7% 1532 4.67 7.96 7.86 8.01 3.74 3.64 3.79

70.0% 1380 5.36 8.36 8.56 8.01 3.11 3.30 2.78

79.9% 921 4.84 7.34 7.00 7.25 2.24 1.96 2.17

90.0% 460 4.56 7.06 6.84 7.34 1.62 1.49 1.78

95.0% 230 5.39 8.08 8.56 8.00 1.15 1.33 1.11

97.5% 115 4.69 7.14 6.60 8.69 0.78 0.62 1.21
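
The T1, T2, and T3 values in the tables above are two-tailed test statistics comparing the first algorithm's error rate with each of the other three on the same set of test samples. As a reading aid, the sketch below computes such a statistic under the assumption that it is the usual pooled two-proportion z statistic obtained from two error rates and the common number of test samples; the helper name two_tailed_stat is illustrative and not taken from the thesis, which defines the test in the body of the text and may use a slightly different formulation.

import math

def two_tailed_stat(error1_pct, error2_pct, n_test):
    # Pooled two-proportion z statistic for two error rates (in percent)
    # measured on the same n_test held-out samples. Illustrative only;
    # the thesis may define T1-T3 with a slightly different formula.
    p1, p2 = error1_pct / 100.0, error2_pct / 100.0
    p_pooled = (p1 + p2) / 2.0                  # equal test-set sizes
    se = math.sqrt(p_pooled * (1.0 - p_pooled) * 2.0 / n_test)
    return abs(p1 - p2) / se

# Chess data set, 50.0% split: Random Forest (37.69%) vs. C4.5 (38.96%)
# on 14027 test samples gives about 2.19, matching the tabulated T1 of 2.18.
print(round(two_tailed_stat(37.69, 38.96, 14027), 2))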


FIGURES


Figure C.1: Pruning Confidence of the Data Sets for the YaDT algorithm. [Figure: Error Rate/Tree Size vs. Pruning Confidence (0% to 100%) for the Chess, Connect-4, SpamBase, PenRecog, Letter, OptDigits, and LandSat data sets.]

Figure C.2: Error Rate Variation from the Chess Data Set by YaDT. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]


Figure C.3: Error Rate Variation from the Connect-4 Data Set by YaDT. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]

Figure C.4: Error Rate Variation from the SpamBase Data Set by YaDT. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]


Figure C.5: Error Rate Variation from the Letter Data Set by YaDT. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]

Figure C.6: Error Rate Variation from the PenRecog Data Set by YaDT. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]


Figure C.7: Error Rate Variation from the Optical Digits Data Set by YaDT. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]

Figure C.8: Error Rate Variation from the Land Satellite Data Set by YaDT. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]


Figure C.9: Error Rate Variation from the Pen Recognition Data Set by Random Forest. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]

Figure C.10: Error Rate Variation from the Land Satellite Data Set by Random Forest. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]


Figure C.11: Error Rate Variation from the Optical Digits Data Set by Random Forest. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]

Figure C.12: Error Rate Variation from the Letter Recognition Data Set by Random Forest. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]


Figure C.13: Error Rate Variation from the SpamBase Data Set by Random Forest. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]

Figure C.14: Error Rate Variation from the Connect-4 Data Set by Random Forest. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]


Figure C.15: Error Rate Variation from the Chess Data Set by Random Forest. [Figure: Error Rate vs. Training Set Number (1 to 6), one curve per split ratio from 50% to 97.5%.]

Figure C.16: Relationship of Training Data Size to Error Rate – Land Satellite Data Set. [Figure: Error Rate vs. Training Data Set size for C4.5, CART, YaDT, and RForest.]


Figure C.17: Relationship of Training Data Size to Error Rate – SpamBase Data Set. [Figure: Error Rate vs. Training Data Set size for C4.5, CART, YaDT, and RForest.]

Figure C.18: Relationship of Training Data Size to Error Rate – PenRecog Data Set. [Figure: Error Rate vs. Training Data Set size for C4.5, CART, YaDT, and RForest.]


Figure C.19: Relationship of Training Data Size to Error Rate – Opt Digits Data Set. [Figure: Error Rate vs. Training Data Set size for C4.5, CART, YaDT, and RForest.]

Figure C.20: Relationship of Training Data Size to Error Rate – Letter Recognition Data Set. [Figure: Error Rate vs. Training Data Set size for C4.5, CART, YaDT, and RForest.]


Figure C.21: Relationship of Training Data Size to Error Rate – Chess Data Set. [Figure: Error Rate vs. Training Data Set size for C4.5, CART, YaDT, and RForest.]

Figure C.22: Relationship of Training Data Size to Error Rate – Connect-4 Data Set. [Figure: Error Rate vs. Training Data Set size for C4.5, CART, YaDT, and RForest.]


PERMISSION TO COPY

In presenting this thesis in partial fulfillment of the requirements for a master’s degree at Texas Tech University or Texas Tech University Health Sciences Center, I agree that the Library and my major department shall make it freely available for research purposes. Permission to copy this thesis for scholarly purposes may be granted by the Director of the Library or my major professor. It is understood that any copying or publication of this thesis for financial gain shall not be allowed without my further written permission and that any user may be liable for copyright infringement.

Agree (Permission is granted.)

_____Ratheesh Raghavan____________________________ ____04/13/2006____

Student Signature Date

Disagree (Permission is not granted.)

_______________________________________________ _________________

Student Signature Date