
DISSERTATION

Classifier Diversity in Combined Pattern Recognition Systems

A Thesis Presented to the School of Information and Communication Technologies
University of Paisley
In Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy

By
Dymitr Ruta, MSc Eng

Applied Computational Intelligence Research Unit
University of Paisley, Scotland

September 2003

- To Ola and Robert, my Mum and Dad -

Abstract

This work covers explorative investigations of diversity in relation to multiple classifier systems (MCS). The notion of diversity emerged as an attempt to explain the sources of the considerable performance improvement that can be observed when classifiers are combined. At this early stage of development of the young and promising discipline of classifier fusion, the decision as to whether to choose the best model or to combine, and if so which models, is unclear. With respect to these problems, the role of diversity as an explanatory and diagnostic tool guiding the optimal design of a multiple classifier system is addressed and thoroughly examined in three different contexts: majority voting performance and its limits; the relation between diversity measures and combined performance; and classifier selection guided by various criteria.

In the case of majority voting (MV), the behaviour of the combined performance is investigated and traced back to the specific distributions of classifier outputs, in an attempt to extract classifier characteristics that could explain the variability of combined performance. An in-depth parametric analysis of the impact of the classifier output distribution and various parameters of the MCS on combined performance is conducted. The results provide clear and comprehensive explanations of what makes majority voting work, facilitated by a number of novel findings related to MV error limits, the extendibility of MCS and optimal patterns of output distribution.

Given a clear picture of the mechanisms driving performance gain in combined systems, various models of diversity are evaluated in terms of their ability to explain the variability of combined performance and/or its improvement over individual classifiers. The complex co-involvement of individual performances and various relationships among classifier outputs in their relation with MV performance revealed a dissonance between traditionally perceived diversity and the performance of majority voting. The constructive conclusions from that analysis laid the grounds for the development of a new strategy for constructing diversity measures that are optimised with respect to the combiner. To that end, two novel diversity measures have been proposed using systematic and set-based analysis, and their advantages over existing diversity measures have been demonstrated experimentally. These promising results, together with the concept of ambiguity adopted from regression problems, provided an inspiration for extending the strategy of modelling the improvement of combiner performance, up to using the combined performance directly in order to satisfy the requirements set for diversity measures. It is demonstrated and experimentally justified that such a combiner-specific perception of diversity is more suitable for applications to the diagnostics and design of MCS.

Classifier selection represents the ultimate test of the usefulness of diversity in practical applications of multiple classifier systems. Complex though precise performance-driven classifier selection methods are confronted with simple diversity-guided selection techniques. Extensive experimental work with a number of novel search algorithms is carried out and its results are used for the development of an original multistage organisation system employing both classifier fusion and selection on many layers of its structure. A new mechanism of processing a number of the best classifier combinations at each layer is finally proposed, and its positive effects on the generalisation ability of the whole system are demonstrated over a number of standard datasets.

Declaration

The work contained in this thesis is the result of my own investigations and has not been accepted nor concurrently submitted in candidature for any other award.

Copyright © 2003 Dymitr Ruta

The copyright of the thesis belongs to the author under the terms of the United Kingdom Copyright Acts as qualified by the University of Paisley. Due acknowledgement must be made of the use of any material contained in, or derived from, this thesis. Power of discretion is granted to the depository libraries to allow the thesis to be copied in whole or in part without further reference to the author. This permission covers only single copies made for study purposes, subject to the normal conditions of acknowledgement.

Acknowledgments

I am deeply indebted to my supervisor Dr Bogdan Gabrys for his courage in taking me on as his first PhD student and attacking the very young and uncertain discipline of combined pattern recognition systems. With his passion for intelligent systems, combined with the emerging potential of the novel area of information fusion, he encouraged me to join the battle for an alternative route to improving pattern recognition systems - classifier fusion. His invaluable gift for filtering out and proposing good ideas, and our efficient brainstorming sessions, were important factors stimulating the successful accomplishment of this thesis. Full credit also goes to him for establishing the financial support for the whole project.

The stimulating discussions with my second supervisor Prof. Colin Fyfe, and also his great generosity in supporting my participation in a number of research conferences, are gratefully acknowledged.

It is a pleasure to express my gratitude to all the members of our Applied Computational Intelligence Research Unit, in particular to Lina Petrakieva, for their everlasting willingness to dispute computational, mathematical and philosophical issues, and for the excellent ambience in which doing research was a real pleasure.

Additionally, the input and interest of the Pattern Recognition Group of the Delft University of Technology, led by Robert Duin, who developed the Matlab Pattern Recognition Toolbox (PRTools), was of great help.

I have profited from numerous exchanges of views and e-mails with several experienced colleagues actively participating in the series of International Workshops on Multiple Classifier Systems.

Finally, I wish to send hugs and kisses to my wife Aleksandra for several private reasons, but particularly for her constant engagement with my son Robert, which was a necessary condition for this dissertation being completed.

Contents

Contents
List of Figures
List of Tables
Abbreviations

1 Introduction
1.1 Background
1.2 Project description
1.3 Original contributions
1.4 Organisation of the thesis

2 Overview of pattern recognition and classifier fusion
2.1 Introduction
2.2 Pattern classification
2.2.1 Classifier design cycle
2.2.2 Classification error
2.3 Information fusion
2.3.1 Data fusion
2.3.2 Feature fusion
2.3.3 Decision fusion
2.3.4 Classifier outputs
2.4 Classifier fusion systems
2.4.1 Combining based on classifier outputs
2.4.2 Combining based on training style
2.4.3 Coverage vs decision optimisation
2.4.4 Decomposition approaches
2.4.5 Properties of classifier fusion

3 Combining classifiers by majority voting
3.1 Theoretical background
3.2 Combining independent classifiers
3.2.1 Bernoulli model
3.2.2 Relaxation of the equal performance assumption
3.2.3 Parametric performance analysis
3.2.4 Beneficial system extendibility
3.3 Error limits for dependent classifiers
3.3.1 Patterns of boundary error distribution
3.3.2 Stable boundary error distributions
3.3.3 The limits of majority voting error
3.4 Multistage organisations
3.4.1 Optimal distribution of outputs for MOMV
3.4.2 Optimal permutation
3.4.3 Optimal structure
3.4.4 Error limits for MOMV
3.5 Performance stability of majority voting - experimental insight
3.6 Concluding remarks

4 The notion of diversity
4.1 Introduction
4.1.1 Software diversity
4.1.2 Classifier diversity
4.1.3 Perception of diversity
4.2 Measuring diversity
4.2.1 Pairwise diversity measures
4.2.2 Non-pairwise diversity measures
4.2.3 Diversity measure properties
4.2.4 Experiments
4.3 Analysis of error coincidences for majority voting
4.3.1 Error distributions
4.3.2 Set representation of coincident errors
4.3.3 Relations with majority voting
4.3.4 Experiments
4.3.5 Discussion
4.4 Combiner specific diversity
4.4.1 Usefulness of diversity
4.4.2 Relative error measure
4.4.3 Complexity reduction
4.4.4 Experiments
4.4.5 Concluding remarks

5 Classifier selection
5.1 Selection model
5.1.1 Static vs dynamic selection
5.1.2 Representation
5.1.3 Selection criterion
5.2 Search algorithms
5.2.1 Heuristic techniques
5.2.2 Greedy approaches
5.2.3 Evolutionary algorithms
5.2.4 Experimental investigations
5.3 Multistage selection-fusion model (MSF)
5.3.1 Network of outputs
5.3.2 Analysis of generalisation ability
5.4 Discussion

6 Conclusions
6.1 Justification for the line of research
6.2 Major findings and contributions
6.3 The role of diversity
6.4 Further research

A Datasets and classifiers used in experiments
A.1 Description of datasets
A.2 Description of classifiers

B Generation of classification outputs
B.1 The training methodology
B.2 Testing individual classifiers

Bibliography

List of Figures

2.1 Pattern recognition and classification design cycles.

2.2 Two examples of two-dimensional datasets.

2.3 Visualisation of the training process for 3 common classifiers. Plots b, c, d show the superposition of discriminative functions within a 2-dimensional feature space.

2.4 Operational scope of fusion in combining classifiers.

2.5 Classifier outputs. Transferability of one type into another (top). Different soft measures and their associations (bottom).

2.6 Training ability of the fusion operator.

2.7 Different variations of optimisation relations among data (D), classifiers (C) and fusion operator (F). Greyed examples represent optimisation models not yet designed.

2.8 Combining architectures. Different models of decision processing (top). Decision aggregation models - comparison between organisation and network (bottom).

3.1 Discrete error distribution with normal distribution approximation. 15 independent classifiers have been used with 40% error each. Shaded bars refer to errors in the majority voting sense. The majority voting error rate corresponds to the sum of all shaded bars.

3.2 Normalised continuous error distribution for 15 independent classifiers with 40% error each. The shaded area refers to the majority voting error rate.

3.3 A family of normalised continuous error distributions for an increasing number of classifiers with the same individual error rates of 40%. The decreasing shaded area corresponds to the reducing majority vote error.

3.4 Variability of the normalised variance and its effect on majority voting error. The continuous line represents the maximum variance limit subject to fixing the mean and the number of classifiers. The surfaces depict random variability of the normalised variance presented as a function of the normalised mean error rate and with correspondence to majority voting error.

3.5 The relationship (3.23) between majority voting error and the error rates e2 and e3 of a pair of classifiers added to a single classifier with error rate e1 = 0.2.

3.6 Extendibility curves for different errors of a single classifier. Dashed lines limit the area corresponding to individual errors of the joining classifiers e2, e3 greater than the error e1 but producing an MV error lower than e1.

3.7 Discrete error distributions for the Iris, Biomed, and Chromo datasets classified by 15 different classifiers (see Appendix A for details of datasets and classifiers). Shaded bars correspond to errors in the majority voting sense.

3.8 Visualisation of the distribution of success (DS) and failure (DF) for the Iris, Biomed, and Chromo datasets classified by 15 different classifiers (see Appendix A for details of datasets and classifiers). Shaded bars correspond to errors in the majority voting sense.

3.9 Visualisation of stable distributions of success and failure for the Iris, Biomed, and Chromo datasets classified by 15 different classifiers (see Appendix A for details of datasets and classifiers). Shaded bars correspond to errors in the majority voting sense.

3.10 Majority voting error limits presented as a function of the number of classifiers (M = 3 : 99) and mean classifier error rate. Dotted lines in the 2-D projection (b) represent the independent MV error and correspond to the internal surface in the 3-D plot (a).

3.11 Multistage organisation with 15 classifiers and structure S15 = (5, 3). The outputs from the classifiers are permutated and passed to layer 1. At each layer majority voting is applied to each group and the outputs are passed on to the next layer until the final output is obtained.

3.12 Multistage organisation with 27 classifiers and structure S27 = (3, 3, 3). The first four rows illustrate examples of optimal permutations of outputs for the given structure. Note that as few as 8 out of 27 1s at the first layer can propagate the correct decision up to the final layer.

3.13 Majority voting error limits for MOMV presented as a function of the number of classifiers (M = 3 : 2187) and mean classifier error rate. Dotted lines on the 2-D projection (b) represent the independent MV error and correspond to the internal surface in the 3-D plot (a).

3.14 Majority vote errors observed for different boundary error distributions expressed as a function of mutation rate and mean classifier error. Plots (a)-(f) correspond to DS, DF, SDS, SDF, DS_MOMV and DF_MOMV respectively.

3.15 Differences among majority voting errors for different boundary error distributions expressed as a function of mutation rate and mean classifier error. Plots (a), (b) correspond to DS-SDS and DF-SDF, plots (c), (d) show the differences DS-DS_MOMV and DF-DF_MOMV, and plots (e), (f) show the differences SDS-DS_MOMV and SDF-DF_MOMV.

4.1 Venn diagrams visualising the concept of diversity among classifiers. Classifiers - the grey thin-lined circles - are trying to estimate the true target classification function T - the empty thick-lined circle.

4.2 Diagrams depicting the relationship between diversity measures and (a) MVE, (b) MVI. The position of each cell determines the corresponding diversity measure (columns) and the dataset (rows) for which the analysis was carried out. The points in each cell depict a dependence between the diversity measure and MVE (a), MVI (b) obtained for all combinations of 3 out of 15 classifiers. Details of datasets and classifiers are provided in Appendix A.

4.3 Diagrams presenting correlation coefficients between diversity measures and (a) MVE, (b) MVI. Fields in a grid correspond to the various measures and datasets as in 4.2. The darker the field, the higher the corresponding correlation coefficient. The bars underneath the diagrams depict the correlation coefficients averaged over all datasets. Details of datasets and classifiers are provided in Appendix A.

4.4 Averaged evolution of the correlation coefficients between diversity measures and (a) MVE, (b) MVI. The graphs show the average correlation coefficient measured for all combinations of 3, 5, ..., 13 classifiers from the ensemble of 15 classifiers. Details of datasets and classifiers are provided in Appendix A.

4.5 Discrete error distributions presented for the ensembles of 15 classifiers on 27 datasets. The shapes of the error distributions (bars with a continuous line joining their tops) are compared with the equivalent distributions for independent classifiers (continuous lines). Details of datasets and classifiers are provided in Appendix A.

4.6 Error distribution (thick line) decomposed into 15 partial error distributions (thin lines) corresponding to 15 classifiers applied to the Chromo dataset. Details of datasets and classifiers are provided in Appendix A.

4.7 Relationship between the fault majority measure (FM) and the majority voting error obtained for the combinations of 3 out of 15 classifiers over 4 typical datasets. For comparison the same plots have been obtained for the MVE, F2 and ME measures analysed in Section 4.2.4. The correlation coefficient c is included for each graph.

4.8 Visualisation of a set representation of coincident errors. (A) Binary outputs from 3 classifiers (0 - correct, 1 - error). (B), (C) Venn diagrams showing all mutually exclusive subsets. (D) Venn diagram with the indices of samples put in the appropriate subset positions.

4.9 Venn diagrams for more than 3 classifiers. (A) 5 congruent ellipses. (B) 6 triangles. (C) 7 symmetrical sets - Grünbaum construction. (D) Bipartite plot of 8 sets - Edwards construction. See [124] for further details.

4.10 Two types of error coincidences for the classifiers D1 and D3 of the ensemble {D1, D2, D3}. (A) An example of error indices distribution. (B) General coincidences CG({D1, D3}) = {3, 5, 6}. (C) Exclusive coincidences CE({D1, D3}) = {6}.

4.11 Collection generation. A: Algorithm. B: Visualisation of the collection generation process.

4.12 Graphs associated with Venn diagrams. A: An ordered graph of exclusive coincidences for 3 classifier sets. B: Unordered graph for the Edwards construction of 5 sets. To order the graph, all vertices have to be directed towards the lower order coincidence.

4.13 Evolution of correlation coefficients along different levels of GC. Correlation coefficients were measured between MVE and GC grouped in series of 3, 5, 7, 9 out of 11 classifiers for the 8 considered datasets.

4.14 Evolution of correlation coefficients along different levels of EC. Correlation coefficients were measured between MVE and EC levels grouped in series of 3, 5, 7, 9 out of 11 classifiers for 2 representative datasets showing typical patterns of the relationship observed.

4.15 Evolution of correlation coefficients between MVE and the type 1 sum (from the 1st to the kth GC level) presented as a function of the number of GC levels taken into the sum (shown in bold lines). For comparison, correlation curves of the individual GC levels are also shown in thin lines. Plots are presented for 4 datasets corresponding to the most representative patterns of the relationship observed.

4.16 Evolution of correlation coefficients between MVE and the type 2 and type 3 sums of coincidence levels shown as a function of the number of levels taken into the sum. A: type 2 sum (from the kth to the Mth level) of EC levels (shown in bold lines). For comparison, correlation curves of the individual EC levels are also shown in thin lines. B: type 3 sum of GC levels (bold lines) with correlation curves of the individual GC levels shown in thin lines. Details of datasets and classifiers are provided in Appendix A.

4.17 Illustration of the importance of correlation coefficients for classifier selection on the example of the relation between majority voting error and general coincidence levels of 3 out of 11 classifiers applied to the Liver dataset. (A) Relation of the first general coincidence levels. (B) Relation of the second general coincidence levels. (C) Relation of the sum of the first and second general coincidence levels with majority voting error.

4.18 Graphical interpretation of the RE in two versions: with E0 as the independent majority voting error 4.18(a), and with E0 denoting the mean classifier error 4.18(b).

4.19 Linear regression of the normalised higher levels of general coincidence calculated as a result of the 11 classifier system applied to some typical real-world datasets. (a) The LGi values for increasing levels in logarithmic scale. (b) Lines matched in the logarithmic scale to the higher levels (6:11) of general coincidence.

4.20 Visualisation of correlations between the improvement of the majority voting error and the measures from Table 4.4. Coordinates of all points represent the measures examined for all 3-element combinations out of the 11 classifiers for which the measures were applied.

4.21 The diversity separation experiment. Majority voting error limits diagrams with the points corresponding to the classification results for increasingly trained teams of 5 classifiers. The suspected constant diversity of the data matches the lines representing the same values of the RE measure with the independent majority voting error as the 0-point (E0) 4.21(a), conversely to the second version of the RE measure with E0 denoting the mean classifier error 4.21(b).

5.1 Visualisation of the majority voting errors presented in Table 5.3. The lighter the field, the lower the majority voting error. Details of datasets and classifiers are provided in Appendix A.

5.2 Comparison of the errors from the 50 best combinations of classifiers found by four population-based searching methods: ES, SS, GS, PS.

5.3 Evolution of the MVE for the MSF model with a network of 5 layers and 15 nodes at each layer. The thick line shows the MVE values for the best combinations found by different search algorithms at each layer (1-5) of the MSF model. For comparison purposes this line starts from the error of the single best classifier (layer 0), the level of which is also marked by the dotted line. The thin line shows the analogous evolution of the mean MVE from all the combinations selected at each layer. Details of datasets and classifiers are provided in Appendix A.

5.4 The network (5 × 15) resulting from the application of the MSF model with M = 15 classifiers, majority voting and exhaustive search on the Phoneme dataset. Layer 0 represents individual classifiers and their individual errors are marked underneath. The best combination at each layer is marked by an enlarged black circle. The validation and testing errors of the best combination at each layer are marked respectively below the layer labels. Details of datasets and classifiers are provided in Appendix A.

List of Tables

4.1 Summary of the measures applied in the experiments.

4.2 Comparison of the time needed to extract the cardinalities of all general coincidences from a binary matrix of outputs and from a collection, for different numbers of classifiers.

4.3 Comparison between the real and approximated values of the majority voting error for all datasets, applying all 11 classifiers. The error rates are shown in percentages.

4.4 Correlations between the improvement of the majority voting error over the mean classifier error (MVE-ME) and both versions of the RE measure, compared against the Q statistic and double fault measures. The correlation coefficients were measured separately for the combinations of 3, 5, 7, and 9 out of 11 classifiers within each dataset.

5.1 Individual best classifier errors for the 27 available datasets. The first 3 columns correspond to majority voting errors obtained for SB applied to the validation matrix, the testing matrix, and the validation matrix tested on the testing matrix. The following two columns show the index of the best classifier evaluated separately in the BV and BT matrices. Details of datasets and classifiers are provided in Appendix A.

5.2 Summary of searching methods, selection criteria and datasets used in the experiments. A description of datasets and classifiers is provided in Appendix A.

5.3 Majority voting errors obtained for the best combinations of classifiers selected by various searching methods (columns) and selection criteria (rows). The results are averaged over 27 datasets. The bottom row and right-most column show the averaged values of MVE for the searching methods and selection criteria respectively. Details of datasets and classifiers are provided in Appendix A.

5.4 Best combination of classifiers found by the exhaustive search from the ensemble of 15 classifiers. Columns 2-4 present the MVE values for the best combination found in the validation matrix, the testing matrix, and the validation best tested on the testing matrix, respectively. Columns 4 and 5 show indices of the classifiers forming the best validation and testing combinations. Details of datasets and classifiers are provided in Appendix A.

5.5 Validation errors (obtained from the validation matrices) of the majority voting combiner obtained for the best combinations and the mean of the 50 best (where possible) combinations of classifiers found by 8 different search algorithms for 27 datasets. Details of datasets and classifiers are provided in Appendix A.

5.6 Generalisation errors (evaluated on the testing matrices) of the majority voting combiner obtained for the best combinations and the mean of the 50 best (where possible) combinations of classifiers found by 8 different search algorithms for 27 datasets. Details of datasets and classifiers are provided in Appendix A.

5.7 Generalisation errors (evaluated on the testing matrices) of the majority voting combiner obtained for the best combinations from the 5-layer selection-fusion model. The columns show the minimum errors obtained and the layer indices at which the minimum errors were observed. Details of datasets and classifiers are provided in Appendix A.

A.1 A list of datasets used in the experiments.

A.2 A list of classifiers used in the experiments.

B.1 Optimal classifier parameters found exhaustively for each dataset. The remaining classifiers (loglc, nmc, pfsvc, knnc, parzenc) have internal optimisation or work well with default parameters.

B.2 Individual classifier errors obtained during the classification of 27 datasets.

Abbreviations

ANN Artificial Neural Network
BKS Behaviour Knowledge Space
BS Backward Search
CC Computational Complexity
CFD Coincident Failure Diversity
DCS Dynamic Classifier Selection
DED Discrete Error Distribution
DF Boundary Distribution of Failure
DFD Distinct Failure Diversity
DI Difficulty Measure
DS Boundary Distribution of Success
ECOC Error Correcting Output Coding
EL Eckhardt and Lee
FM Fault Majority
FS Forward Search
GA Genetic Algorithm
GD Generalised Diversity
IA Interrater Agreement Measure
KW Kohavi-Wolpert
LM Littlewood and Miller
MCS Multiple Classifier System
ME Mean Error
MMI Maximum Mutual Information
MOMV Multistage Organisation with Majority Voting
MV Majority Voting
MVE Majority Voting Error
MVI Majority Voting Performance Improvement
NCED Normalised Continuous Error Distribution
NDM Non-Pairwise Diversity Measure
OWA Ordered Weighted Average
PBIL Population Based Incremental Learning
PCA Principal Component Analysis
PDED Partial Discrete Error Distribution
PDM Pairwise Diversity Measure
PK Partridge and Krzanowski
RSD Random Scatter Diversity
SB Single Best
SCS Static Classifier Selection
SD Specialisation Diversity
SDF Stable Distribution of Failure
SDS Stable Distribution of Success
SS Stochastic Hill-Climbing Search
TS Tabu Search

Chapter 1

Introduction

Endowed with a number of diverse senses, humans effortlessly tackle the astoundingly complex processes that underlie the act of pattern recognition. The astonishing ease with which we can recognise faces, understand spoken words, eliminate rotten eggs by smell, select the right coin from a pocket by touch or distinguish beer from champagne by taste is apparently in conflict with the overwhelming complexity of computer-based pattern recognition systems. The explanation of this superior performance seems to be related to highly specialised and complementary sensing models that work simultaneously and are combined by a decision mechanism in the human brain. Recent advances in combining pattern recognition systems seem to support this conjecture, although it is still not clear what exactly drives the improvement in their performance. Is it complementarity among individual diverse classification models, or are there some specific strengths of a particular combiner that cause the compensation for individual errors observed in classifier fusion systems? The unresolved co-involvement between diversity and classifier performances, and their joint impact on combined performance, remains another challenge. Multi-facet diversity is believed to be the key to the explanation of performance variability in combining classifiers. However, due to the multitude of perceptions and interpretations, and hence measuring methodologies, diversity still has no clear bonds with combined performance and is therefore not used in applications. These and many other related problems prevent a full explanation of the mechanisms ruling classifier fusion, and hence limit our ability to predict and control the behaviour of the combined performance so much appreciated in commercial applications.

One of the research project goals is the establishment of the relationship between the performance of the combined system and various properties of the multiple classifier system (MCS). Diversity, identified as a promising descriptive tool, is thoroughly investigated and the role it plays in classifier fusion examined, in an attempt to provide diagnostic tools invaluable during the complex process of designing an MCS. All these questions, doubts and challenges are to be addressed in this thesis within a general framework of diversity analysis for combined pattern recognition.

1.1 Background

Research efforts dedicated to supervised pattern recognition, invariably focussed on further improvement of the recognition rate, have recently been undergoing a significant change. The traditional continuous development of more and more sophisticated classification models turns out to provide some benefits only in specific problem domains where some prior background knowledge or new evidence can be exploited to further improve classification performance. In general, however, related research proves that no individual method can be shown to deal well with all kinds of classification tasks [148], [28], [7], [137]. Realisation of the inevitable imperfections of individual classifiers catalysed the emergence of a new model design strategy that assumes the combining of different classifiers to be a main source of performance improvement [137], [7], [158]. Classifier fusion methodology has recently exploded into a wide variety of models, some of which have been shown to be very successful [148], [28], [7], [137], [80], [81], [15], [165], [60], [158], [71], [53], [55], [65], [58], [27].

Although spectacular improvement of the recognition rate in combined pattern classification systems has been demonstrated on a number of problem domains, the explanation of that phenomenon remains vague and very general. On the one hand, the process of classifier fusion is explicit and definable. The complexity of individual classification models, however, limits the interpretability of the combined performance behaviour in terms of the various individual and relational characteristics exhibited among classifiers. Transparency of pattern recognition systems becomes a crucial property in commercial or industrial applications, where due to security or revenue maximisation the risk associated with employing a highly complex composite classification system is high and has to be minimised. To this end, various attempts at controlling or diagnosing the behaviour of combined performance have shown only partially positive and still confusing results [138], [164], [88], [122]. A reflection of that fact can be found in safety-critical pattern recognition systems, where simple yet well explained and easily controllable techniques, commonly based on trying all models and choosing the best, are preferred [139].

Research efforts towards explanations in combined classification systems focus on two approaches. One way is to analyse the specific combining method and use its characteristics, backpropagated into relations among classifiers, to model or directly measure the combining performance or its improvement [131], [128], [75], [164]. The other method assumes the existence of an underlying diversity among classifiers, which together with the individual classifier performances determines in some implicit way the combined performance. In this interpretation the notion of diversity embodies the concepts of team strength or complementarity among classifiers and is believed to have a key impact on combining performance [126], [89], [140]. There are, though, a number of uncertainties associated with diversity on both the conceptual and practical levels. First, it is not clear whether diversity as a concept is independent of the individual performances and the combining method used. These doubts directly translate into problems of measuring diversity in a consistent manner, independent of a number of variable parameters of the multiple classifier system [128], [126], [87], [140]. Another aspect which complicates the issue even more is the doubt whether diversity should be considered together with the combiner and its properties, or should consistently represent a fixed concept ignoring any bonds with the fusion system. In other words, it is not clear whether diversity should be perceived universally as an independent concept or whether it should be biased by the specific features of the particular combiner. The latter option would be particularly justified by the diagnostic and control requirements, so that diversity, being tuned to the combiner, could be applied during the design process. Both models of diversity pursuing explanations of the performance behaviour in combined classification systems form the main theme investigated in this thesis. Extensive experimental work attempts to justify the practical applicability of diversity during the process of composite classifier design and accordingly verify the usefulness of the diversity concept for combining classifiers.
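To make the measurement problem concrete, the sketch below computes one widely used pairwise measure - Yule's Q statistic, the "Q statistic" later used as a baseline in Table 4.4 - from the binary oracle outputs of two classifiers. It is a minimal Python illustration of the general idea under the 0 = correct, 1 = error coding convention (the same convention as in Figure 4.8), not one of the novel measures proposed in this thesis; the function name and example data are purely illustrative.

```python
import numpy as np

def q_statistic(a, b):
    """Yule's Q statistic for two classifiers, computed from binary
    oracle outputs over the same samples (0 = correct, 1 = error).
    Q near 1 indicates coinciding error patterns, Q near -1
    complementary ones; values around 0 suggest independence."""
    a, b = np.asarray(a), np.asarray(b)
    n11 = np.sum((a == 0) & (b == 0))  # both correct
    n00 = np.sum((a == 1) & (b == 1))  # both wrong (coincident errors)
    n10 = np.sum((a == 0) & (b == 1))  # only the second classifier wrong
    n01 = np.sum((a == 1) & (b == 0))  # only the first classifier wrong
    denom = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / denom if denom else 0.0

# Two classifiers erring on disjoint samples: maximally diverse, Q = -1.
d1 = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
d2 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(q_statistic(d1, d2))
```

Measures of this kind quantify only the pairwise relationship between error patterns; how well such numbers actually track the error of a concrete combiner is precisely the question examined in Chapter 4.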

1.2 Project description

The overall goal of the project is to explore the multi-modal concept of classifier diversity, broken down into various interdependencies among individual models from classifier ensembles, and to investigate its explanatory strength in the context of the performance variability of the combined system. Although the notion of diversity is approached on many distinct platforms, including perception, representation and measuring, particular emphasis is put on the potential applicability of diversity analysis in the process of designing multiple classifier systems. The research intends to exploit diversity as a diagnostic tool capable of guiding, or at least indicating, which classifier ensembles are most likely to show good combined results, as opposed to those classifiers which, if combined, do not show any improvement or even lead to a deterioration of the performance compared with the individually best model.

The initial investigations revealed a number of strategies for tackling diversity in relation to combining classifiers. However, due to the large size and complexity of the problem, the scope of the project is technically narrowed down to the phenomena observed and investigated only for the majority voting (MV) combiner operating on an ensemble of different classification models. Within this setup the notion of classifier diversity is targeted in three different contexts:

• Exploratory investigations of the behaviour of majority voting performance and its limits - looking at the mechanisms responsible for performance improvement in multiple classifier systems.

• Analysis of the relation between combined performance behaviour and various models of diversity - trying to identify the bonds between the two and investigate the possibilities of their enhancement.

• Diversity in classifier selection - an experimental study attempting to apply diversity measures as effective selection criteria capable of extracting optimal ensembles of classifiers.

These three issues consistently build up into a comprehensive evaluation of the role diversity plays in combined pattern recognition systems and directly justify the usefulness of diversity analysis in designing multiple classifier systems.
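For concreteness, the sketch below shows the plain majority voting combiner assumed by this setup: each classifier casts one crisp class label per sample and the ensemble decision is the most frequent label. It is a minimal Python sketch (the thesis's own experiments use Matlab and PRTools), not code from the thesis; the function name is illustrative and the tie-breaking rule (first label encountered wins) is an arbitrary choice.

```python
from collections import Counter

def majority_vote(label_matrix):
    """Combine crisp classifier decisions by simple majority voting.

    label_matrix: list of per-classifier decision lists, one label per
    sample; returns the combined label for each sample. Ties resolve
    to the label encountered first (a Python 3.7+ Counter guarantee).
    """
    combined = []
    for sample_labels in zip(*label_matrix):
        combined.append(Counter(sample_labels).most_common(1)[0][0])
    return combined

# Three classifiers, five samples: the ensemble recovers sensible
# labels even though each member makes one or two mistakes.
decisions = [
    ['A', 'B', 'B', 'A', 'C'],  # classifier 1
    ['A', 'B', 'A', 'A', 'A'],  # classifier 2
    ['B', 'B', 'A', 'A', 'C'],  # classifier 3
]
print(majority_vote(decisions))  # -> ['A', 'B', 'A', 'A', 'C']
```

Even this simplest combiner already exhibits the phenomenon studied in Chapter 3: the ensemble decision can be correct on samples where individual members fail, provided their errors do not coincide.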

1.3 Original contributions

This section provides a brief summary of the major original findings arising from the study. It serves both to give a clearer presentation throughout later chapters and to make the thesis's contributions to the field explicit. The study has been summarised in a number of peer reviewed publications [125], [130], [126], [131], [128], [129], [132], [127] encompassing both the theoretical and experimental material realising the project goals. The contributions concern three problem domains, following the investigative strategy of the project as outlined in the previous section, and are summarised in the following list:

• Proposition of a new systematic order and terminology describing in a uniform manner the wide family of classifier fusion systems, Section 2.4, [125].

• Introduction of the error distribution based analysis of majority voting performance behaviour for a large number of differently performing classifiers, Sections 3.2.2 and 3.2.3, [130].

• A new simple form of the ensemble extendibility condition for independent classifiers, Section 3.2.4, [130].

• Parametric analysis and extensive visualisation of majority voting error limits, Sections 3.3 and 3.3.3, [130].

• Definition of new patterns of the boundary distribution of classifier outputs, defined for the full range of mean classifier error, [0,1], and proposition of their stable alternatives justified by an analysis of classifier margins, Section 3.3, [130].

• Definition of a multistage organisation with majority voting system and presentation of the effect of error limits widening, along with the conditions necessary for its occurrence, Section 3.4, [130].

• Extensive analysis of the correlation between majority voting error and various binary operating diversity measures, Section 4.2, [126].

• Definition of the asymmetry property of diversity measures and a demonstration of its importance in the correlation analysis, Section 4.2, [126].

• Definition of the Fault Majority measure as an example of a measure optimised to the combiner, Section 4.3.1, [126].

• Presentation of the set-based analysis of error coincidences and its use for rapid extraction of error coincidences among classifiers, the definition of new measures of diversity and the decomposition of majority voting error, Section 4.3, [131].

• Definition of a robust Relative Error measure promoting the combiner-specific approach to diversity measures, justified experimentally, Section 4.4, [128].

• Development of a new methodology for pattern classification based on the concept of information fields, inspired by physical potential fields, [129], [132].

• Definition of gravity and electrostatic models of classification, showing their good performance in terms of both recognition rate and diversity, [132].

• Development and evaluation of a number of search algorithms applied to classifier selection with various selection criteria, Section 5.2, [127].

• Evaluation of diversity measures as classifier selection criteria, Section 5.2.

• Proposition of the network-based processing of the population of combinations of classifier outputs, Section 5.3.

• Development of a multilayer selection-fusion model, analysis of its structural optimality, and extensive evaluation showing an improvement of the generalisation performance, Section 5.3.

1.4 Organisation of the thesis

Chapter 2 outlines the context and theoretical background for this work. It provides a general overview of pattern recognition methodology and, on the grounds of advances in information fusion, illustrates the state of the art in multiple classifier systems. The material presented in the next three chapters covers the original contributions summarised in the previous section.

Chapter 3 attempts to uncover the various mechanisms driving performance improvement in majority voting. A parametric analysis of individual error coincidences is formalised and used to explain several aspects of the behaviour of MV error and its limits. In the second part, majority voting is presented in a multistage organisation setup and its interesting effects on the combined performance are discussed.

The next chapter summarises various models and perceptions of diversity and addresses the problem of its representation and measurement. The relation between diversity among classifiers and the performance of majority voting is investigated experimentally and the results compared with exhaustively extracted optimal ensembles of classifiers. The conclusions drawn from these experiments are directly exploited in promoting a new form of diversity, conceptually biased by the definition of the combiner's performance. Supported by a comprehensive analysis of the error coincidences, the combiner-specific diversity is presented and embodied in a series of novel measures, ultimately leading to the convergence between the concept of diversity and combined performance.

Chapter 5 focuses on the application side of diversity measures, presenting extensive experimental results of classifier selection guided by various measures of diversity and performance. Among many different selection algorithms and criteria, the best setup is analysed and expanded into a multilayer network preventing selection overfitting and improving the generalisation properties of the system.

The concluding chapter summarises the main findings of the project and indicates directions for further research.

Chapter 2

Overview of pattern recognition and classifier fusion

2.1 Introduction

In the early development of automated pattern recognition systems, inspiration was invariably found in the biological world, where we humans exhibit a remarkable blend of recognition skills. Humans seem to be more efficient in solving many complex, especially vaguely specified, classification tasks owing to the natural ability to cope with uncertain or ambiguous data coming in a variety of forms from different sources. In some more specific applications, like fingerprint recognition [118] or DNA sequence identification [101], automated pattern recognition systems have immensely outperformed humans, mainly due to the enormous size of the data and the interdependency between the factors to be analysed and processed. It seems, then, that a successful pattern recognition system has to exhibit both the efficiency of a biological cognitive system and the processing power of modern computing systems. Indeed, in cases like vision or speech recognition, understanding biological cognitive mechanisms and adopting them on fast computer systems would open enormous capabilities. However, there are also pattern recognition problems, like DNA identification [101], gas detection [62] or infra-red target tracking [7], which not only remain far beyond our cognitive and processing capabilities but also require specific mathematical models and sophisticated hardware sensing of a type unreachable for humans. In general, there is no single strategy or recipe for successful pattern recognition systems. Instead there is a rich variety of individual, problem-dependent methods dealing well with very specific problems but failing to generalise well to other tasks.

In parallel to the efforts at improving individual pattern recognition models, a completely new trend emerged recently, attracting a lot of scientific attention. Following the advances made in electronics and computer science, pattern recognition has been undergoing a rapid improvement encouraged by gradually relaxing complexity constraints. The pioneering efforts of Dasarathy [22], but also those of many other works reviewed in [22], initiated an entirely new branch of pattern recognition - classifier fusion. The inspiration can be traced back as far as ancient Greece, whose citizens were the first to reach decisions collectively in order to improve their quality and minimise the risk of individual failures [116]. Omnipresent in current societies, group decision making indeed proves to secure well balanced decisions crucial for the stability and prosperity of today's democracies [50], [134], [9]. In a similar fashion, it has been noticed that applying multiple classification models to the same task and combining their results can lead to spectacular performance improvements compared with the individual best model [158], [22], [121], [58]. It turned out that fusion may in fact be successful not only when applied to classifier decisions but also at other stages of the classification cycle, starting from data fusion [49], [7], [54], [32], [36], through feature (processed data) fusion [7], [68], [35], [33], up to the aforementioned classifier fusion [22], [137], [7]. Section 2.3 discusses in detail various issues related to information fusion.
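The scale of the possible gain is easiest to see under the simplifying assumption of M independent voters with a common error rate e - the Bernoulli model revisited in Section 3.2.1. A majority of M (odd) classifiers errs only when more than half of them err, so the combined error is

$$ e_{MV} \;=\; \sum_{k=\lfloor M/2 \rfloor + 1}^{M} \binom{M}{k}\, e^{k} (1-e)^{M-k}. $$

For instance, three independent classifiers that are each wrong 30% of the time give $e_{MV} = 3(0.3)^2(0.7) + (0.3)^3 = 0.216$, already below every individual error, and the gap widens rapidly with M. These worked numbers are a generic textbook illustration of the independence argument, not results quoted from this thesis.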

    These findings triggered development of very complex systems where mixtures

    of fusion, combining and selection of partial evidence applied to input data, features

    or classifier outputs cover uncountable variations and structures of potential pat-

    tern recognition systems. It is therefore not surprising that due to the potentially

    large variety of the combined pattern recognition designs, there is still no consis-

    tent and commonly agreed taxonomy naming and categorising different combining

    techniques. Some recent attempts at a very general classification of fusion methods

    into coverage and decision optimisation techniques [57] assume that either the clas-

    sifiers or the combiner is to be optimised while the other remains fixed. However,

    the state of the art in classifier fusion seems to be much wider and more complex

    with the multiplicity of classifiers used in many different ways beyond these two

    mentioned types of combining. One example could be a modular decomposition

    system where the single best or a number of best classifiers are applied to differ-

    ent classification subtasks controlled by the classifier selection process [137], [122].

Moreover, combining classifiers involves a number of other aspects, including architectures for combining and training abilities of the fusion operator, and may also relate to fusion at different levels of abstraction within the classification cycle [7]. On top of that, all these styles, paradigms and properties of combining may appear at the same time during the design process: there is nothing wrong, for example, with coverage and decision optimisation methods being combined together. Facing this profusion of varieties, rather than contributing to the overall non-specificity in the field, we present classifier fusion as a scheme uniformly described by three distinct properties and show in Section 2.4.5 that this noncompetitive approach covers all the different models and designs of combining.

The high complexity of classifier fusion systems, and hence their computational power demands, are among the reasons they are not yet widely applied. The other major problem is the lack of interpretability of complex systems. Unfortunately, these drawbacks usually eliminate such fusion systems from industrial applications, where suddenly emerging problems require a quick explanation and fix, while system performance should be predictable and stable. Although complexity can increasingly be dealt with and there is a prospect of stability gains, there is very little one can usually do with systems that occasionally do not work, or work beyond one's control. The major issue addressed in this thesis, diversity among classifiers, is believed to provide theoretical and practical answers, accounting for the diagnostic and explanative capabilities of diversity in the context of classifier fusion.

The term diversity, as related to combining evidence, originated from the software engineering domain [29], [73], [99], [112], where the reliability of conventionally coded programs was improved by combining independently written versions of the same algorithm. Appearing under many names in the literature, diversity is believed to be a major source of performance improvement in combined pattern recognition [110], [138], [111], [131], [87]. A large variety of representations, models and data types constitute some of the many faces of diversity related to classifiers. In this thesis the emphasis is put on the practical aspects of diversity: the ways it can be measured and understood, and eventually whether it can explain and possibly diagnose why and when combining classifiers could be an effective alternative to individual classification models. Detailed conceptual and experimental investigations related to diversity are undertaken in Chapter 4 and partially in Chapter 5.

    2.2 Pattern classification

Pattern recognition is a scientific discipline one of whose goals is to classify objects into a number of categories called classes. Objects represent compact data units specific to a particular problem, such as images, spoken words or handwritten characters, and are in general referred to as patterns. The process of pattern recognition normally entails a sequence of well-separated operations [28]. It begins with collecting the evidence acquired from various sensing devices. In the ideal situation the data is low-dimensional, independent and discriminative, so that its values are very similar for patterns in the same class but very different for patterns from different classes. Raw data rarely satisfies these conditions, and therefore a set of procedures called feature generation, extraction and selection is required to provide a relevant input for the classification system. Data sensing and feature extraction are beyond the scope of this thesis. It is noted, however, that the product of these two components of the pattern recognition design is a set of feature vectors representing the input data for classification systems.

Given the feature vectors $x \in X$ provided by a feature extractor, the objective of the supervised classification method, the classifier, is to assign the new object $x$ to a relevant class $\omega_j \in \Omega$, where $\Omega = \{\omega_1, ..., \omega_C\}$, based on previous observations of labelled patterns $X_T = \{x, \omega\}$, the training data. The overall classification process can be broken down into four major components: model choice, data preprocessing,

training and testing or evaluation. Evaluation closes the classification part of the pattern recognition design, which then enters the post-processing and overall system evaluation stage. There is great flexibility of operation in this last phase: it may just involve risk or reliability analysis, or it could be system tuning aimed at minimising cost, or further context-based optimisation. There

    is also space for combining classifiers or in general for processing the outputs from

    many classifiers returned from the classification process. The diagram of pattern

    recognition design and the subset involving the classification cycle is shown in Figure

    2.1.

The major issue treated in this thesis, diversity among classifiers, narrows down the operational scope to just the last two components of the pattern recognition design: classification and post-processing. Classification, broken down into the design cycle, is presented in the following section, with particular emphasis put on the limitations of the individual model implementation. This is followed by a formal definition of classification error, pointing out its sources and indicating methods for its elimination, leading to the development of the combined system presented in Section 2.4.

    2.2.1 Classifier design cycle

Figure 2.1: Pattern recognition and classification design cycles

In the supervised pattern recognition task considered in this thesis, the classifier's goal is to assign the unlabelled object $x$ to a class label based on the evidence learned from the labelled training set $X_T: \{x_i, \omega_j\}$. Mathematically, classifiers

    represent simply a discriminative function trying to separate classes from each other

in the multidimensional input space. In the general case such a function provides class support vectors $w = [w_1, ..., w_C]$, which depending on the classification model may represent probabilities, fuzzy membership values or any other measures that can be understood, compared and handled in the post-processing phase. Classification can therefore be interpreted as a mapping:

$$D = f([x_1, ..., x_K]^T) = [y_1, ..., y_C]^T \qquad (2.1)$$

where $y_j$ denotes a degree of support for class $\omega_j$, estimating the probability $P(\omega_j|x)$. The difficulty of the classification problem depends on the variability in the feature

values within the same classes relative to the differences between feature values for patterns from different classes. Among other phenomena complicating the classification task, the major contributions are attributed to the lack or incompleteness of the data, the high complexity of the problem and, above all, noise, which accounts for all kinds of randomness in pattern variability that is not due to the underlying model [28], [148]. The performance of a classifier is thus the result of a trade-off between the conceptual adequacy of the classification model and its complexity control mechanisms. As mentioned before, the classification process can be segmented into four distinct operations: model choice, data preprocessing, training, and evaluation.
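To make the mapping of equation (2.1) concrete, the following minimal Python sketch builds a classifier, obtains its vector of class supports and reduces it to a crisp label. The dataset and model are illustrative stand-ins only, not the ones used in the experiments of this thesis:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative toy problem; any model producing class supports would do here.
X, labels = make_classification(n_samples=200, n_features=4, n_informative=3,
                                n_redundant=1, n_classes=3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

x = X[:1]                           # a single pattern x = [x_1, ..., x_K]^T
supports = clf.predict_proba(x)[0]  # D = f(x) = [y_1, ..., y_C]^T as in (2.1)
decision = np.argmax(supports)      # crisp decision: the best supported class
print(supports, decision)
```

The intermediate vector of supports is exactly what a later combiner can operate on, before the final reduction to a single class label.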

    Model choice

The decision regarding the selection of the classification model is very important and difficult, especially if there is little prior knowledge about the nature of the problem. Additional difficulties come from the fact that the classification process is to a large extent unpredictable and quite often nondeterministic, which means that the choice cannot be immediately justified. The only effective quantitative feedback comes from the evaluation of the overall classifier performance, which means that a designer has to go through the whole classification cycle to verify the choice. Sometimes the assumptions made by a classifier match the problem characteristics, or the problem is so specific that there is only one suitable method, in which cases the choice is straightforward. In general, however, according to the no free lunch theorem [28], [156], there is no individual method providing the best solution for all types of pattern recognition problems. In a typical scenario, given a classification problem, the designer has plenty of different classification models at hand and, optimistically, only a rough idea of which ones could be the most successful. Unless


there is clear evidence of the model matching the problem, a tedious "try all and choose the best" approach quite trivially seems to provide a justifiable strategy. Even then, due to limited evaluation capabilities, assigning a single classifier to the task puts the optimality of performance at risk. Another aspect arising from the model selection stage is the loss of valuable evidence provided by competitive classifiers ranked just behind the winner. These conceptual and practical difficulties in classifier selection contributed to the development of classifier fusion systems, where all the complementary evidence and knowledge is jointly incorporated into the decision process. Further details related to classifier fusion are presented in Sections 2.3 and 2.4.

    Data collection and preprocessing

Once the model is chosen, the input data are prepared to be passed on to a classifier. These data are in fact $k$-component feature vectors of the form $x = [x_1, ..., x_k]^T$ returned from the feature extraction stage of the pattern recognition design. Individual patterns represent points in the $k$-dimensional input space, examples of which are depicted for two-dimensional cases in Figure 2.2.

Although during the feature extraction phase the data may have already been preprocessed to enhance their class-discriminative power, the choice of the classification model usually dictates further adjustments. Various types of normalisation are routinely required. For example, to achieve invariance to displacements and scale changes, one might transform the data so that they have zero mean and unit variance [148]. Some models may require the data to lie within a specific range, for example (0, 1), in which case normalisation also has to be applied [32], [33], [34], [35], [36]. Normalisation may destroy the original data structure if there are some outliers, hence removal of outliers may be required prior to normalisation [132], [148]. Missing feature values are another common data problem that has to be treated to avoid failures [28], [148], [34], [106]. For some complex classifiers the number of features returned from the feature extraction process may lead to intractability. Various techniques aiming at reducing the data size may therefore be required. Applying various data editing or data condensation techniques [18], [83] would directly reduce the number of patterns while trying to preserve the structure of the data. Alternatively, data dimensionality may be targeted, and methods based on feature selection [28], principal/independent component analysis (PCA/ICA) [107], [63] or maximum mutual information (MMI) [119], [149] applied to reduce the number of dimensions with minimal impact on the discriminatory strength of the remaining features.
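A minimal sketch of such preprocessing, on synthetic data with illustrative variable names, might look as follows; it applies zero-mean/unit-variance normalisation, rescaling to the (0, 1) range, and PCA-based dimensionality reduction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 10))   # raw feature vectors

# Zero mean and unit variance per feature: invariance to shift and scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Rescaling to the (0, 1) range for models that require bounded inputs.
X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Dimensionality reduction: keep the 3 principal components of the data.
X_pca = PCA(n_components=3).fit_transform(X_std)
print(X_std.shape, X_01.shape, X_pca.shape)
```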

Further processing may be required if a multiple classifier system is to be applied. The input space may for instance be segmented and the training set effectively split into parts fed to different classifiers, as in dynamic classifier selection (DCS) systems [41], [43], [40]. For the same purpose, the data may be grouped into many subsets of features applied separately for building many versions of the model to be combined [84], [164]. Finally, there is yet another reason for data preprocessing prior to classification: different classifiers may be encouraged to be diverse by providing as much distinct evidence related to the same problem as possible. Alongside the already mentioned input space partitioning and selection of different feature subsets, there are also simpler methods like injecting noise or differentiating initial conditions [25], and many different linear and non-linear transformations [138] that could potentially be used to enforce diversity among classifiers.
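As a hedged illustration of such diversity-enforcing data manipulation, the following sketch combines bootstrap resampling, random feature subsets and mild noise injection; the function, its parameters and the data are hypothetical, not a method prescribed in this thesis:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))       # training patterns (illustrative)
y = rng.integers(0, 2, size=200)     # binary labels (illustrative)

def diverse_view(X, y, n_features, rng):
    """One diversified training view: bootstrap-resample the patterns,
    draw a random feature subset and inject mild noise."""
    idx = rng.integers(0, len(X), size=len(X))               # bootstrap
    feats = rng.choice(X.shape[1], size=n_features, replace=False)
    noise = rng.normal(scale=0.05, size=(len(X), n_features))
    return X[idx][:, feats] + noise, y[idx], feats

# Five differently perturbed views, one per member of a prospective ensemble.
views = [diverse_view(X, y, n_features=6, rng=rng) for _ in range(5)]
```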

Figure 2.2: Two examples of two-dimensional datasets: (a) artificial, 2-D, 8 classes; (b) artificial, 2-D, 3 classes.

    Training

    Training is the actual process of classifier learning. Although this thesis is only

    concerned with supervised learning, the training process is a good place to briefly

    discuss different learning models [28] as they directly affect the way training is

    carried out.

    Depending on the availability and reliability of the evidence one can distinguish

    three learning strategies: supervised, unsupervised and reinforcement learning. In

    supervised learning the classifier is given a labelled training set to build the model on.

It is called supervised as it can be thought of as a teacher providing the patterns and their true classes, on the basis of which the classifier model learns how to return


an optimal solution to the problem. In some cases training data of known classes may not be available, which eliminates the teacher. Such learning on the basis of unlabelled data is called unsupervised learning. In the intermediate case of reinforcement learning, although the true labels of patterns are not available, feedback is given on whether the classifier output is correct or incorrect, without specifying what the correct answer is.

    Classification models are normally fully learnt from labelled pattern examples.

The major fact to be realised is that the amount of labelled data is limited, usually very small, and costly to obtain. Another important fact is that these data also have to be used for performance evaluation. This implies that a part of the

    available data has to be left out for testing purposes, which further narrows down

    the amount of data to be used for a proper training of the classifier.

Given the set of all available labelled data $X$:

$$X: \{x_i = [x_1, ..., x_k]^T,\ \omega_j \in \Omega\} \qquad i = 1, ..., N \quad j = 1, ..., C \qquad (2.2)$$

we denote the training set by $X_T$, where $X_T \subset X$, and note that the remaining data $X_E = X \setminus X_T$¹ will be used for testing (see Section 2.2.1). Normally, the more training data is used, the more adequately the model reflects the problem and the better

its performance. Some characteristics of classifier training are captured in the form of a learning curve, showing the relation between the classifier's generalisation

    performance and the size of the training set used to train the model. Figure 2.3(a)

    shows examples of such learning curves for three typical classifiers. The examples

    present three types of learning behaviour. For the first linear classifier, adding more

    training data does not improve its performance as the data are simply highly non-

    linear. The decision tree classifier shows the optimal amount of training data above

which it becomes overtrained. The third, highly non-linear, k nearest neighbour classifier seems to benefit consistently from adding more training data, although at the level of 400 samples it seems to reach a plateau and adding lots of new training data does not improve the classifier's performance significantly. What it certainly does, though, is increase model complexity and reduce the size of the prospective testing set. If the size of the labelled data is seriously limited then some more elaborate splitting and error estimation techniques are required [154]. Figure 2.3 also provides a visualisation of the three classifiers after training. For the 2-dimensional problem it visualises the discriminative functions and shows the resulting decision boundaries.

¹ $\setminus$ denotes the set subtraction operator: $A \setminus B = C \Leftrightarrow C = A \cap \overline{B}$


Figure 2.3: Visualisation of the training process for 3 common classifiers: (a) learning curves (error rate vs. number of samples) for ldc, treec and knnc on an artificial 2-D, 8-class dataset; (b) linear discriminant classifier; (c) decision tree classifier; (d) k nearest neighbours classifier. Plots b, c, d show the superposition of discriminative functions within the 2-dimensional feature space.
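A learning curve of the kind shown in Figure 2.3(a) can be estimated along the following lines. This Python sketch uses a synthetic dataset and off-the-shelf stand-ins for the ldc, treec and knnc classifiers, so the exact numbers are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=400, random_state=0)

models = {"ldc": LinearDiscriminantAnalysis(),
          "treec": DecisionTreeClassifier(random_state=0),
          "knnc": KNeighborsClassifier(n_neighbors=5)}

# Error rate on a fixed test set as the training set size grows.
for name, clf in models.items():
    errors = [1.0 - clf.fit(X_tr[:n], y_tr[:n]).score(X_te, y_te)
              for n in range(50, 401, 50)]
    print(name, [round(e, 3) for e in errors])
```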

    Testing

The importance of model evaluation stems from the fact that it provides the most informative measure of classifier performance, which can then justify its use, leading to possible optimisation, redesign, or elimination if other models show better performance. The common belief that a more elaborate classifier producing complex non-linear class boundaries is better than simple linear models may not always be true. Complex models tend to overfit the training data, so that although their performance on the training set is usually much better than that of simple linear models, they

    could show very weak performance for new patterns [28]. Data overfitting is a typ-

    ical trap for sophisticated systems unless some complexity control mechanisms are

    incorporated in the design of such a classifier. It is believed that a model with

well-balanced complexity should perform similarly on the training and testing data, as well as on any other data from the problem domain [28], [148].

    Given the limited amount of training data, the precise estimation of the true


    model performance or error rate is quite a challenge. There is no issue if the size

    of available training data is huge compared to the number of classes. According

    to standard statistical analysis carried out in [154], 1000 testing samples should

provide a satisfactory error tolerance of the predicted performance in most cases. Problems start to emerge when less, or much less, data is available. Random

    multiple splitting into training and testing sets is the simplest method to enhance

    the reliability of performance estimation. For smaller testing sets multiple splitting

still holds a high risk that some regions of the input space may be sparsely covered, leading to substantial bias in the performance estimate. In such cases multiple cross-

    validation procedures show quite satisfactory results [154]. In cross-validation, the

    testing set is rotated over exclusive subsets exhaustively covering the whole dataset.

The extreme form of cross-validation, with a rotation of only a single pattern used for testing, is called leave-one-out [154] and is preferred whenever its application is computationally tractable. For sample sizes smaller than 50, leave-one-out can be supported by bootstrapping [154], [28], generating a test set by sampling with replacement from

    a training set. More precise guidelines for the use of true performance estimation

    methods depending on the size of the testing set can be found in [154].
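The rotation of the testing set described above can be sketched as follows, assuming a synthetic dataset and an arbitrary base classifier (illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=60, n_features=5, n_informative=3,
                           random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)

def rotated_error(splitter):
    """Rotate the test set over exclusive subsets and average the error."""
    errs = []
    for train_idx, test_idx in splitter.split(X):
        clf.fit(X[train_idx], y[train_idx])
        errs.append(1.0 - clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(errs))

print("10-fold CV error:   ", rotated_error(KFold(n_splits=10, shuffle=True,
                                                  random_state=0)))
print("leave-one-out error:", rotated_error(LeaveOneOut()))
```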

A final comment relates to combined systems, where the combiner may require individual classifier performance estimates to decide which ones to combine. In such a case, apart from the proper training and testing sets used for individual performance estimation, there is a need for an additional validation set to be used for the estimation of combiner performance. Normally, the combiner could be perceived as a more general classifier, which would require a separate set for building the combination model and a separate one for testing its performance. However, separating an additional set from the overall classification dataset would further limit the training and evaluation capabilities of the individual classifiers.

    Due to the large number of classifiers and datasets considered throughout the

    experimental parts of this thesis, estimation of individual performances is based on

random multiple splitting. The estimation of combiner performance is based on

    the same testing set as the one used for evaluation of individual classifier perfor-

    mances. These choices have been taken to maintain simplicity and uniformity of the

    experimental results and to ensure a coherent comparison between individual and

    combined performances.


    2.2.2 Classification error

    Pattern classification incorporates supervised learning mechanisms and therefore

    shares a similar description of the model error [28], [137]. The major objective

    of supervised learning is to construct a predictor which, given the limited amount

of training data, will be able to estimate a target function $T: x \rightarrow y$ with the minimum possible error. Excluding artificial data, the mapping $x \rightarrow y$ usually reflects a real-world learning problem, which is commonly dependent on a large number of factors. Due to a number of constraints the predictor tries to select only the minimum number of factors which jointly describe the problem and are sufficient to give reliable predictions. However, the fact that they never cover the whole knowledge space supporting the solution of the problem limits the ability to generate correct outputs, according to the following formula:

$$y = E(y|x) + \epsilon \qquad (2.3)$$

where $E(y|x)$ represents the expectation of $y$ given $x$ and $\epsilon$ stands for white noise. An additional portion of model error stems from the limited, usually small, training set. Instead of using the whole input space $X$, which is commonly unknown, the predictor uses only the selected known training data $X_T$ for generating predictions for unknown data, $f(x, X_T)$, with an unknown level of representativeness related to $x$. After this additional constraint, all considerations are forced to be targeted at the training dataset $X_T$, which could be additionally split in order to leave out some part for testing the accuracy of predictions. The total mean squared error of the model can now be formulated as [39], [137]:

$$e_f^2 = E_{X_T}\{[y - f(x, X_T)]^2\} = E(\epsilon^2) + E_{X_T}\{[E(y|x) - f(x, X_T)]^2\} \qquad (2.4)$$

    Some further algebra results in:

$$e_f^2 = \underbrace{E(\epsilon^2)}_{\text{noise}} + \underbrace{E_{X_T}^2[f(x, X_T) - E(y|x)]}_{\text{bias}} + \underbrace{E_{X_T}\{[f(x, X_T) - E[f(x, X_T)]]^2\}}_{\text{variance}} \qquad (2.5)$$

As is clear from equation (2.5), a simple decomposition leads to a separation of three independent components of model error. The first term is called white noise, and cannot be reduced unless further evidence is provided. The second term, bias, can be intuitively characterised as a measure of the predictor's ability to generalise well once trained. Finally, the third term, variance, can be similarly interpreted as a


measure of the sensitivity of predictor outputs over different training sets. The model error can therefore be rewritten in the concise form:

$$e_f^2 = \sigma^2 + B^2(f) + V(f) \qquad (2.6)$$
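The decomposition (2.6) can be illustrated numerically. The following sketch, under the assumption of a deliberately too-simple (linear) predictor of a sinusoidal target, estimates the noise, bias and variance terms at a fixed query point by refitting over many random training sets; all choices here are illustrative, not taken from the thesis experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                     # white-noise level: eps ~ N(0, sigma^2)
f_true = np.sin                 # the underlying target E(y|x)
x0 = 1.0                        # fixed query point

# Deliberately too-simple predictor: a straight line fitted to each X_T.
preds = []
for _ in range(2000):
    x_tr = rng.uniform(0, np.pi, size=20)
    y_tr = f_true(x_tr) + rng.normal(scale=sigma, size=20)
    coef = np.polyfit(x_tr, y_tr, deg=1)
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

noise = sigma ** 2                          # E(eps^2): irreducible
bias2 = (preds.mean() - f_true(x0)) ** 2    # B^2(f): systematic error
variance = preds.var()                      # V(f): sensitivity to X_T
print(noise, bias2, variance)               # e_f^2 = sigma^2 + B^2(f) + V(f)
```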

In the classification model the only difference from the general prediction model comes from the fact that classification operates on assignments to the crisp class labels $\omega_j$ as elements of the set $\Omega$. The individual error of the classifier thus occurs in the form of picking the wrong class label, not as a bias from some true value measured continuously as in regression problems. The variability of the classification outputs requires a specific description, leading to slight differences in error representation compared with the prediction models, as shown in (2.5). Considering classification error within a probabilistic frame of reference, each classifier produces probabilistic outputs supporting different classes: $D_i = [p(\omega_1), ..., p(\omega_C)]^T$. Denoting by $\omega_T = \arg\max_j [p(\omega_j|x)]$ the true class for a given input pattern $x$, and by $\omega_f$ the classifier choice arising from $\omega_f = \arg\max_\omega [f(x, X_T) = \omega]$, the error decomposition can be reformulated from (2.5) to the following form [137]:

$$e_f = \underbrace{1 - p(\omega_T|x)}_{\text{Bayes error}} + \underbrace{p(\omega_f|f, x)[p(\omega_T|x) - p(\omega_f|x)]}_{\text{bias}} + \underbrace{\sum_{\omega \neq \omega_f} p(\omega|f, x)[p(\omega_T|x) - p(\omega|x)]}_{\text{spread}} \qquad (2.7)$$

The Bayes error, appearing in equation (2.7) in place of the noise component in (2.5), forms the lower bound on the classification error and is only a function of the problem complexity and the available evidence. The bias expresses how well the classifier models the problem, while the spread (equivalent to the variance in (2.5)) describes the variability of the model outputs.

While the Bayes error component cannot be reduced by any means, the remaining bias and spread error components are fixed only for individual classifiers. In multiple classifier systems, the spread component is likely to be reduced by parallel combining of redundant classifiers [137]. In such a case the variability of classifier outputs is stabilised as a result of the applied aggregation [137]. On the other hand, bias can only be reduced as a result of a better classification model, which can potentially be achieved by applying modular decomposition of the classification task and assigning different classifiers to the subtasks for which they perform best [137]. A more detailed analysis of the error in combined multiple classifier systems is provided in Section 2.4.5, which discusses the different combining paradigms.
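A minimal numerical sketch of this stabilising effect, under the idealised assumption of independent, equally accurate classifiers combined by majority voting (a fuller treatment of majority voting follows in later chapters), might read:

```python
import numpy as np

rng = np.random.default_rng(0)
p_correct = 0.7      # accuracy of each independent redundant classifier
n_trials = 10000

# Each classifier votes correctly with probability p; an odd-sized team
# decides by majority vote, which increasingly stabilises the outcome.
for n_classifiers in (1, 3, 9, 21):
    votes = rng.random((n_trials, n_classifiers)) < p_correct
    majority_correct = votes.sum(axis=1) > n_classifiers / 2
    print(n_classifiers, majority_correct.mean())
```

Under the independence assumption the combined accuracy grows with the team size; real classifier ensembles rarely satisfy this assumption, which is precisely why diversity matters.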


    2.3 Information fusion

    Two important facts related to the reality of the end of the 20th century contributed

    to the enormous dynamics we observe today in the area of evidence fusion. The first

was the emergence of multi-modal detection systems providing coordinated data from multiple sensors of different types, facilitated by the immense information content of highly developed, interconnected information systems [49], [7]. Treating all

    types of evidence separately with a single method was an unsuccessful option, lead-

    ing to either complex hybridisation of the system or no gain in performance. What

    led to the breakthrough was the fusion of distinct evidence on many different levels

    from pure data to the decisions of individual experts operating on different parts

    of the available evidence [49], [7], [22]. Another important point to note was that

    individual classification methods provide alternative knowledge even in the absence

    of alternative data. It turned out that even if applied to the same task using the

    same data, a joint decision of combined classifiers is potentially more effective than

    any one individual [22], [15], [70], [137]. These facts, emerging in an environment

of rapidly growing technology, cheap computational power and exponentially expanding internet resources, led to a sudden turn towards fusion in the pattern recognition

    domain.

    Fusion of information can be carried out on many different levels of abstrac-

    tion closely connected with the flow of the classification process: data level fusion,

    feature level fusion, and classifier fusion [7]. There is little theory about the first

    two levels of information fusion. However, there have been successful attempts to

    transform numerical, interval and linguistic data into a single space of symmetric

    trapezoidal fuzzy numbers [54], [115], and some heuristic methods have been suc-

    cessfully used for feature level fusion [7], [68]. Classifier fusion has attracted most

    scientific attention and continues to expand under many different names includ-

ing classifier fusion, combining classifiers, mixtures of experts, ensemble systems, multiple classifier systems, composite classifiers, etc. [22], [137], [7], [25], [70], [122].

    2.3.1 Data fusion

    At the basic level of data sensing, the fusion of data from various modalities has been

    used to resolve the occlusion problem in vision systems [7]. In another application,

    fusion of differently sensed images improved object detection by overlapping many

    partially discriminative projections [54]. In [54], [115] a method of combining various

types of data is presented. The proposed new data model, called heterogeneous fuzzy data, incorporates characteristics of real numerical values, confidence intervals


    and linguistic information in a single representation. A generic neuro-fuzzy pattern

    recognition model in which data can be processed in a generalised form of confidence

    intervals has also been proposed in [32], [36]. These studies are supported by the

theory of fuzzy sets, details of which can be found in [163], [72], [114]. Emerging from this, fuzzy measures are considered a generalisation of probabilistic measures within the general theory of evidence [72], and provide various information modelling tools that can be used in data fusion.

    2.3.2 Feature fusion

There is little evidence of feature fusion in the literature. Fusion on this level is considered more general compared to data fusion and often resembles classifier fusion techniques. Some authors even suggest that the difference between feature fusion and combining classifiers is somewhat arbitrary [7]. It commonly involves combining multidimensional quantitative feature vectors, possibly supported

    by some qualitative measures. An example of feature fusion has been shown by

Keller and Gader [67], where the data features extracted from Geo-Centers' GPR system were combined by a fuzzy rule incorporating some shape characteristics of the raw data. Again, an improvement in the form of a reduction of false alarms was observed. Another example of what may be considered a fea-

    ture fusion has been proposed in [33], where the combination of multiple versions

    of neuro-fuzzy classifiers is performed at the classifier model level. In this approach

    hyperbox fuzzy sets representing clusters of data in different models are combined.

The resulting classifier complexity and transparency are comparable with those of classifiers generated during a single cross-validation procedure, while the improved classification performance and reduced variance are comparable to those of an ensemble of classifiers with combined decisions.

    2.3.