
DISSERTATION

Classifier Diversity in Combined Pattern Recognition Systems

A Thesis Presented to the School of Information and Communication Technologies
University of Paisley
In Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy

By
Dymitr Ruta, MSc Eng

Applied Computational Intelligence Research Unit
University of Paisley, Scotland

September 2003

- To Ola and Robert, my Mum and Dad -

Abstract

This work covers explorative investigations of diversity in relation to multiple classifier systems (MCS). The notion of diversity emerged as an attempt to explain the sources of the considerable performance improvement that can be observed when classifiers are combined. At this early stage of development of the young and promising discipline of classifier fusion, the decision as to whether to choose the best model or to combine, and if so which models, is unclear. With respect to these problems, the role of diversity as an explanatory and diagnostic tool guiding the optimal design of a multiple classifier system is addressed and thoroughly examined in three different contexts: majority voting performance and its limits; the relation between diversity measures and combined performance; and classifier selection guided by various criteria.

In the case of majority voting (MV), the behaviour of the combined performance is investigated and traced back to the specific distributions of classifier outputs, in an attempt to extract classifier characteristics that could explain the variability of combined performance. An in-depth parametric analysis of the impact of the classifier output distribution and various parameters of the MCS on combined performance is conducted. The results provide clear and comprehensive explanations of what makes majority voting work, facilitated by a number of novel findings related to MV error limits, the extendibility of MCS and optimal patterns of output distribution.

Given a clear picture of the mechanisms driving performance gain in combined systems, various models of diversity are evaluated in terms of their ability to explain the variability of combined performance and/or its improvement over individual classifiers. The complex co-involvement of individual performances and various relationships among classifier outputs in their relation with MV performance revealed a dissonance between traditionally perceived diversity and the performance of majority voting. The constructive conclusions from that analysis laid the grounds for the development of a new strategy for constructing diversity measures that are optimised with respect to the combiner. To that end, two novel diversity measures have been proposed using systematic and set-based analysis, and their advantages over existing diversity measures have been demonstrated experimentally. These promising results, together with the concept of ambiguity adopted from regression problems, provided an inspiration for extending the strategy of modelling the improvement of combiner performance, up to using the combined performance directly in order to satisfy the requirements set for diversity measures. It is demonstrated and experimentally justified that such a combiner-specific perception of diversity is more suitable for applications to the diagnostics and design of MCS.

Classifier selection represents the ultimate test of the usefulness of diversity in practical applications of multiple classifier systems. Complex though precise performance-driven classifier selection methods are confronted with simple diversity-guided selection techniques. Extensive experimental work with a number of novel search algorithms is carried out and its results are used for the development of an original multistage organisation system employing both classifier fusion and selection on many layers of its structure. A new mechanism of processing a number of the best classifier combinations at each layer is finally proposed, and its positive effects on the generalisation ability of the whole system are demonstrated over a number of standard datasets.

Declaration

The work contained in this thesis is the result of my own investigations and has not been accepted nor concurrently submitted in candidature for any other award.

Copyright © 2003 Dymitr Ruta

The copyright of the thesis belongs to the author under the terms of the United Kingdom Copyright Acts as qualified by the University of Paisley. Due acknowledgement must be made of the use of any material contained in, or derived from, this thesis. Power of discretion is granted to the depository libraries to allow the thesis to be copied in whole or in part without further reference to the author. This permission covers only single copies made for study purposes, subject to the normal conditions of acknowledgement.

Acknowledgments

I am deeply indebted to my supervisor Dr Bogdan Gabrys for his courage in taking me on as his first PhD student and attacking the very young and uncertain discipline of combined pattern recognition systems. With his passion for intelligent systems, combined with the emerging potential of the novel area of information fusion, he encouraged me to join the battle for an alternative route to improving pattern recognition systems - classifier fusion. His invaluable gift for filtering out and proposing good ideas, and our efficient brainstorming sessions, were important factors stimulating the successful accomplishment of this thesis. Full credit also goes to him for establishing the financial support for the whole project.

The stimulating discussions with my second supervisor Prof. Colin Fyfe, and also his great generosity in supporting my participation in a number of research conferences, are gratefully acknowledged.

It is a pleasure to express my gratitude to all the members of our Applied Computational Intelligence Research Unit, in particular to Lina Petrakieva, for their everlasting willingness to dispute computational, mathematical and philosophical issues, and for the excellent ambience in which doing research was a real pleasure.

Additionally, the input and interest of the Pattern Recognition Group of the Delft University of Technology, led by Robert Duin, who developed the Matlab Pattern Recognition Toolbox (PRTools), was of great help.

I have profited from numerous exchanges of views and e-mails with several experienced colleagues actively participating in the series of International Workshops on Multiple Classifier Systems.

Finally, I wish to send hugs and kisses to my wife Aleksandra for several private reasons, but particularly for her constant engagement with my son Robert, which was a necessary condition for this dissertation being completed.

Contents

Contents
List of Figures
List of Tables
Abbreviations

1 Introduction
1.1 Background
1.2 Project description
1.3 Original contributions
1.4 Organisation of the thesis

2 Overview of pattern recognition and classifier fusion
2.1 Introduction
2.2 Pattern classification
2.2.1 Classifier design cycle
2.2.2 Classification error
2.3 Information fusion
2.3.1 Data fusion
2.3.2 Feature fusion
2.3.3 Decision fusion
2.3.4 Classifier outputs
2.4 Classifier fusion systems
2.4.1 Combining based on classifier outputs
2.4.2 Combining based on training style
2.4.3 Coverage vs decision optimisation
2.4.4 Decomposition approaches
2.4.5 Properties of classifier fusion

3 Combining classifiers by majority voting
3.1 Theoretical background
3.2 Combining independent classifiers
3.2.1 Bernoulli model
3.2.2 Relaxation of the equal performance assumption
3.2.3 Parametric performance analysis
3.2.4 Beneficial system extendibility
3.3 Error limits for dependent classifiers
3.3.1 Patterns of boundary error distribution
3.3.2 Stable boundary error distributions
3.3.3 The limits of majority voting error
3.4 Multistage organisations
3.4.1 Optimal distribution of outputs for MOMV
3.4.2 Optimal permutation
3.4.3 Optimal structure
3.4.4 Error limits for MOMV
3.5 Performance stability of majority voting - experimental insight
3.6 Concluding remarks

4 The notion of diversity
4.1 Introduction
4.1.1 Software diversity
4.1.2 Classifier diversity
4.1.3 Perception of diversity
4.2 Measuring diversity
4.2.1 Pairwise diversity measures
4.2.2 Non-pairwise diversity measures
4.2.3 Diversity measure properties
4.2.4 Experiments
4.3 Analysis of error coincidences for majority voting
4.3.1 Error distributions
4.3.2 Set representation of coincident errors
4.3.3 Relations with majority voting
4.3.4 Experiments
4.3.5 Discussion
4.4 Combiner specific diversity
4.4.1 Usefulness of diversity
4.4.2 Relative error measure
4.4.3 Complexity reduction
4.4.4 Experiments
4.4.5 Concluding remarks

5 Classifier selection
5.1 Selection model
5.1.1 Static vs dynamic selection
5.1.2 Representation
5.1.3 Selection criterion
5.2 Search algorithms
5.2.1 Heuristic techniques
5.2.2 Greedy approaches
5.2.3 Evolutionary algorithms
5.2.4 Experimental investigations
5.3 Multistage selection-fusion model (MSF)
5.3.1 Network of outputs
5.3.2 Analysis of generalisation ability
5.4 Discussion

6 Conclusions
6.1 Justification for the line of research
6.2 Major findings and contributions
6.3 The role of diversity
6.4 Further research

A Datasets and classifiers used in experiments
A.1 Description of datasets
A.2 Description of classifiers

B Generation of classification outputs
B.1 The training methodology
B.2 Testing individual classifiers

Bibliography

List of Figures

2.1 Pattern recognition and classification design cycles.

2.2 Two examples of two-dimensional datasets.

2.3 Visualisation of the training process for 3 common classifiers. Plots b, c, d show the superposition of discriminative functions within a 2-dimensional feature space.

2.4 Operational scope of fusion in combining classifiers.

2.5 Classifier outputs. Transferability of one type into another (top). Different soft measures and their associations (bottom).

2.6 Training ability of the fusion operator.

2.7 Different variations of optimisation relations among data (D), classifiers (C) and fusion operator (F). Greyed examples represent optimisation models not yet designed.

2.8 Combining architectures. Different models of decision processing (top). Decision aggregation models - comparison between organisation and network (bottom).

3.1 Discrete error distribution with normal distribution approximation. 15 independent classifiers have been used with 40% error each. Shaded bars refer to errors in the majority voting sense. The majority voting error rate corresponds to the sum of all shaded bars.

3.2 Normalised continuous error distribution for 15 independent classifiers with 40% error each. The shaded area refers to the majority voting error rate.

3.3 A family of normalised continuous error distributions for an increasing number of classifiers with the same individual error rates of 40%. The decreasing shaded area corresponds to the reducing majority vote error.

3.4 Variability of the normalised variance and its effect on majority voting error. The continuous line represents the maximum variance limit subject to fixing the mean and the number of classifiers. The surfaces depict random variability of the normalised variance presented as a function of the normalised mean error rate and with correspondence to majority voting error.

3.5 The relationship (3.23) between majority voting error and the error rates e2 and e3 of a pair of classifiers added to a single classifier with error rate e1 = 0.2.

3.6 Extendibility curves for different errors of a single classifier. Dashed lines limit the area corresponding to individual errors of the joining classifiers e2, e3 greater than the error e1 but producing an MV error lower than e1.

3.7 Discrete error distributions for the Iris, Biomed, and Chromo datasets classified by 15 different classifiers (see Appendix A for details of datasets and classifiers). Shaded bars correspond to errors in the majority voting sense.

3.8 Visualisation of the distribution of success (DS) and failure (DF) for the Iris, Biomed, and Chromo datasets classified by 15 different classifiers (see Appendix A for details of datasets and classifiers). Shaded bars correspond to errors in the majority voting sense.

3.9 Visualisation of stable distributions of success and failure for the Iris, Biomed, and Chromo datasets classified by 15 different classifiers (see Appendix A for details of datasets and classifiers). Shaded bars correspond to errors in the majority voting sense.

3.10 Majority voting error limits presented as a function of the number of classifiers (M = 3 : 99) and mean classifier error rate. Dotted lines in the 2-D projection (b) represent the independent MV error and correspond to the internal surface in the 3-D plot (a).

3.11 Multistage organisation with 15 classifiers and structure S15 = (5, 3). The outputs from the classifiers are permutated and passed to layer 1. At each layer majority voting is applied to each group and the outputs are passed on to the next layer until the final output is obtained.

3.12 Multistage organisation with 27 classifiers and structure S27 = (3, 3, 3). The first four rows illustrate examples of optimal permutations of outputs for the given structure. Note that as few as 8 out of 27 1s at the first layer can propagate the correct decision up to the final layer.

3.13 Majority voting error limits for MOMV presented as a function of the number of classifiers (M = 3 : 2187) and mean classifier error rate. Dotted lines on the 2-D projection (b) represent the independent MV error and correspond to the internal surface in the 3-D plot (a).

3.14 Majority vote errors observed for different boundary error distributions expressed as a function of mutation rate and mean classifier error. Plots (a)-(f) correspond to DS, DF, SDS, SDF, DS_MOMV and DF_MOMV respectively.

3.15 Differences among majority voting errors for different boundary error distributions expressed as a function of mutation rate and mean classifier error. Plots (a), (b) correspond to DS-SDS and DF-SDF, plots (c), (d) show the differences DS-DS_MOMV and DF-DF_MOMV, and plots (e), (f) show the differences SDS-DS_MOMV and SDF-DF_MOMV.

4.1 Venn diagrams visualising the concept of diversity among classifiers. Classifiers - the grey thin-lined circles - are trying to estimate the true target classification function T - the empty thick-lined circle.

4.2 Diagrams depicting the relationship between diversity measures and (a) MVE, (b) MVI. The position of each cell determines the corresponding diversity measure (columns) and the dataset (rows) for which the analysis was carried out. The points in each cell depict a dependence between the diversity measure and MVE (a), MVI (b) obtained for all combinations of 3 out of 15 classifiers. Details of datasets and classifiers are provided in Appendix A.

4.3 Diagrams presenting correlation coefficients between diversity measures and (a) MVE, (b) MVI. Fields in a grid correspond to the various measures and datasets as in 4.2. The darker the field, the higher the corresponding correlation coefficient. The bars underneath the diagrams depict the correlation coefficients averaged over all datasets. Details of datasets and classifiers are provided in Appendix A.

4.4 Averaged evolution of the correlation coefficients between diversity measures and (a) MVE, (b) MVI. The graphs show the average correlation coefficient measured for all combinations of 3, 5, ..., 13 classifiers from the ensemble of 15 classifiers. Details of datasets and classifiers are provided in Appendix A.

4.5 Discrete error distributions presented for the ensembles of 15 classifiers on 27 datasets. The shapes of the error distributions (bars with a continuous line joining their tops) are compared with the equivalent distributions for independent classifiers (continuous lines). Details of datasets and classifiers are provided in Appendix A.

4.6 Error distribution (thick line) decomposed into 15 partial error distributions (thin lines) corresponding to 15 classifiers applied to the Chromo dataset. Details of datasets and classifiers are provided in Appendix A.

4.7 Relationship between the fault majority measure (FM) and the majority voting error obtained for the combinations of 3 out of 15 classifiers over 4 typical datasets. For comparison the same plots have been obtained for the MVE, F2 and ME measures analysed in Section 4.2.4. The correlation coefficient c is included for each graph.

4.8 Visualisation of a set representation of coincident errors. (A) Binary outputs from 3 classifiers (0 - correct, 1 - error). (B), (C) Venn diagrams showing all mutually exclusive subsets. (D) Venn diagram with the indices of samples put in the appropriate subset positions.

4.9 Venn diagrams for more than 3 classifiers. (A) 5 congruent ellipses. (B) 6 triangles. (C) 7 symmetrical sets - Grünbaum construction. (D) Bipartite plot of 8 sets - Edwards construction. See [124] for further details.

4.10 Two types of error coincidences for the classifiers D1 and D3 of the ensemble {D1, D2, D3}. (A) An example of error indices distribution. (B) General coincidences CG({D1, D3}) = {3, 5, 6}. (C) Exclusive coincidences CE({D1, D3}) = {6}.

4.11 Collection generation. A: Algorithm. B: Visualisation of the collection generation process.

4.12 Graphs associated with Venn diagrams. A: An ordered graph of exclusive coincidences for 3 classifier sets. B: Unordered graph for the Edwards construction of 5 sets. To order the graph, all vertices have to be directed towards the lower order coincidence.

4.13 Evolution of correlation coefficients along different levels of GC. Correlation coefficients were measured between MVE and GC grouped in series of 3, 5, 7, 9 out of 11 classifiers for the 8 considered datasets.

4.14 Evolution of correlation coefficients along different levels of EC. Correlation coefficients were measured between MVE and EC levels grouped in series of 3, 5, 7, 9 out of 11 classifiers for 2 representative datasets showing typical patterns of the relationship observed.

4.15 Evolution of correlation coefficients between MVE and the type 1 sum (from the 1st to the kth GC level) presented as a function of the number of GC levels taken into the sum (shown in bold lines). For comparison, correlation curves of the individual GC levels are also shown in thin lines. Plots are presented for 4 datasets corresponding to the most representative patterns of the relationship observed.

4.16 Evolution of correlation coefficients between MVE and the type 2 and type 3 sums of coincidence levels shown as a function of the number of levels taken into the sum. A: type 2 sum (from the kth to the Mth level) of EC levels (shown in bold lines). For comparison, correlation curves of the individual EC levels are also shown in thin lines. B: type 3 sum of GC levels (bold lines) with correlation curves of the individual GC levels shown in thin lines. Details of datasets and classifiers are provided in Appendix A.

4.17 Illustration of the importance of correlation coefficients for classifier selection on the example of the relation between majority voting error and general coincidence levels of 3 out of 11 classifiers applied to the Liver dataset. (A) Relation of the first general coincidence levels. (B) Relation of the second general coincidence levels. (C) Relation of the sum of the first and second general coincidence levels with majority voting error.

4.18 Graphical interpretation of the RE in two versions: with E0 as the independent majority voting error 4.18(a), and with E0 denoting the mean classifier error 4.18(b).

4.19 Linear regression of the normalised higher levels of general coincidence calculated as a result of the 11 classifier system applied to some typical real-world datasets. (a) The LGi values for increasing levels in logarithmic scale. (b) Lines matched in the logarithmic scale to the higher levels (6:11) of general coincidence.

4.20 Visualisation of correlations between the improvement of the majority voting error and the measures from Table 4.4. Coordinates of all points represent the measures examined for all 3-element combinations out of the 11 classifiers for which the measures were applied.

4.21 The diversity separation experiment. Majority voting error limits diagrams with the points corresponding to the classification results for increasingly trained teams of 5 classifiers. The suspected constant diversity of the data matches the lines representing the same values of the RE measure with the independent majority voting error as the 0-point (E0) 4.21(a), conversely to the second version of the RE measure with E0 denoting the mean classifier error 4.21(b).

5.1 Visualisation of the majority voting errors presented in Table 5.3. The lighter the field, the lower the majority voting error. Details of datasets and classifiers are provided in Appendix A.

5.2 Comparison of the errors from the 50 best combinations of classifiers found by four population-based searching methods: ES, SS, GS, PS.

5.3 Evolution of the MVE for the MSF model with a network of 5 layers and 15 nodes at each layer. The thick line shows the MVE values for the best combinations found by different search algorithms at each layer (1-5) of the MSF model. For comparison purposes this line starts from the error of the single best classifier (layer 0), the level of which is also marked by the dotted line. The thin line shows the analogous evolution of the mean MVE from all the combinations selected at each layer. Details of datasets and classifiers are provided in Appendix A.

5.4 The network (5 × 15) resulting from the application of the MSF model with M = 15 classifiers, majority voting and exhaustive search on the Phoneme dataset. Layer 0 represents individual classifiers and their individual errors are marked underneath. The best combination at each layer is marked by an enlarged black circle. The validation and testing errors of the best combination at each layer are marked respectively below the layer labels. Details of datasets and classifiers are provided in Appendix A.

List of Tables

4.1 Summary of the measures applied in the experiments.

4.2 Comparison of the time needed to extract the cardinalities of all general coincidences from a binary matrix of outputs and from a collection, for different numbers of classifiers.

4.3 Comparison between the real and approximated values of the majority voting error for all datasets, applying all 11 classifiers. The error rates are shown in percentages.

4.4 Correlations between the improvement of the majority voting error over the mean classifier error (MVE-ME) and both versions of the RE measure, compared against the Q statistic and double fault measures. The correlation coefficients were measured separately for the combinations of 3, 5, 7, and 9 out of 11 classifiers within each dataset.

5.1 Individual best classifier errors for the 27 available datasets. The first 3 columns correspond to majority voting errors obtained for SB applied to the validation matrix, the testing matrix, and the validation matrix tested on the testing matrix. The following two columns show the index of the best classifier evaluated separately in the BV and BT matrices. Details of datasets and classifiers are provided in Appendix A.

5.2 Summary of searching methods, selection criteria and datasets used in the experiments. A description of datasets and classifiers is provided in Appendix A.

5.3 Majority voting errors obtained for the best combinations of classifiers selected by various searching methods (columns) and selection criteria (rows). The results are averaged over 27 datasets. The bottom row and right-most column show the averaged values of MVE for the searching methods and selection criteria respectively. Details of datasets and classifiers are provided in Appendix A.

5.4 Best combination of classifiers found by the exhaustive search from the ensemble of 15 classifiers. Columns 2-4 present the MVE values for the best combination found in the validation matrix, the testing matrix, and the validation best tested on the testing matrix, respectively. Columns 4 and 5 show indices of the classifiers forming the best validation and testing combinations. Details of datasets and classifiers are provided in Appendix A.

5.5 Validation errors (obtained from the validation matrices) of the majority voting combiner obtained for the best combinations and the mean of the 50 best (where possible) combinations of classifiers found by 8 different search algorithms for 27 datasets. Details of datasets and classifiers are provided in Appendix A.

5.6 Generalisation errors (evaluated on the testing matrices) of the majority voting combiner obtained for the best combinations and the mean of the 50 best (where possible) combinations of classifiers found by 8 different search algorithms for 27 datasets. Details of datasets and classifiers are provided in Appendix A.

5.7 Generalisation errors (evaluated on the testing matrices) of the majority voting combiner obtained for the best combinations from the 5-layer selection-fusion model. The columns show the minimum errors obtained and the layer indices at which the minimum errors were observed. Details of datasets and classifiers are provided in Appendix A.

A.1 A list of datasets used in the experiments.

A.2 A list of classifiers used in the experiments.

B.1 Optimal classifier parameters found exhaustively for each dataset. The remaining classifiers (loglc, nmc, pfsvc, knnc, parzenc) have internal optimisation or work well with default parameters.

B.2 Individual classifier errors obtained during the classification of 27 datasets.

Abbreviations

ANN Artificial Neural Network
BKS Behaviour Knowledge Space
BS Backward Search
CC Computational Complexity
CFD Coincident Failure Diversity
DCS Dynamic Classifier Selection
DED Discrete Error Distribution
DF Boundary Distribution of Failure
DFD Distinct Failure Diversity
DI Difficulty Measure
DS Boundary Distribution of Success
ECOC Error Correcting Output Coding
EL Eckhardt and Lee
FM Fault Majority
FS Forward Search
GA Genetic Algorithm
GD Generalised Diversity
IA Interrater Agreement Measure
KW Kohavi-Wolpert
LM Littlewood and Miller
MCS Multiple Classifier System
ME Mean Error
MMI Maximum Mutual Information
MOMV Multistage Organisation with Majority Voting
MV Majority Voting
MVE Majority Voting Error
MVI Majority Voting Performance Improvement
NCED Normalised Continuous Error Distribution
NDM Non-Pairwise Diversity Measure
OWA Ordered Weighted Average
PBIL Population Based Incremental Learning
PCA Principal Component Analysis
PDED Partial Discrete Error Distribution
PDM Pairwise Diversity Measure
PK Partridge and Krzanowski
RSD Random Scatter Diversity
SB Single Best
SCS Static Classifier Selection
SD Specialisation Diversity
SDF Stable Distribution of Failure
SDS Stable Distribution of Success
SS Stochastic Hill-Climbing Search
TS Tabu Search

Chapter 1

Introduction

Endowed with a number of diverse senses, humans effortlessly tackle the astoundingly complex processes that underlie the act of pattern recognition. The astonishing ease with which we can recognise faces, understand spoken words, eliminate rotten eggs by smell, select the right coin from a pocket by touch or distinguish beer from champagne by taste is apparently in conflict with the overwhelming complexity of computer-based pattern recognition systems. The explanation of this superior performance seems to be related to highly specialised and complementary sensing models that work simultaneously and are combined by a decision mechanism in the human brain. Recent advances in combining pattern recognition systems seem to support this conjecture, although it is still not clear what exactly drives the improvement in their performance. Is it complementarity among individual diverse classification models, or are there some specific strengths of a particular combiner that cause the compensation for individual errors observed in classifier fusion systems? The unresolved co-involvement between diversity and classifier performances, and their joint impact on combined performance, remains another challenge. Multi-facet diversity is believed to be the key to the explanation of performance variability in combining classifiers. However, due to the multitude of perceptions and interpretations, and hence measuring methodologies, diversity still has no clear bonds with combined performance and is therefore not used in applications. These and many other related problems prevent a full explanation of the mechanisms ruling classifier fusion, and hence limit our ability to predict and control the behaviour of the combined performance so much appreciated in commercial applications.

One of the research project goals is the establishment of the relationship between the performance of the combined system and various properties of the multiple classifier system (MCS). Diversity, identified as a promising descriptive tool, is thoroughly investigated and the role it plays in classifier fusion examined, in an attempt to provide diagnostic tools invaluable during the complex process of designing an MCS. All these questions, doubts and challenges are to be addressed in this thesis within a general framework of diversity analysis for combined pattern recognition.

1.1 Background

Research efforts dedicated to supervised pattern recognition, invariably focussed on further improvement of the recognition rate, have recently been undergoing a significant change. The traditional continuous development of more and more sophisticated classification models turns out to provide some benefits only in specific problem domains where some prior background knowledge or new evidence can be exploited to further improve classification performance. In general, however, related research proves that no individual method can be shown to deal well with all kinds of classification tasks [148], [28], [7], [137]. Realisation of the inevitable imperfections of individual classifiers catalysed the emergence of a new model design strategy that assumes the combining of different classifiers to be a main source of performance improvement [137], [7], [158]. Classifier fusion methodology has recently exploded into a wide variety of models, some of which have been shown to be very successful [148], [28], [7], [137], [80], [81], [15], [165], [60], [158], [71], [53], [55], [65], [58], [27].

Although spectacular improvement of the recognition rate in combined pattern classification systems has been demonstrated on a number of problem domains, the explanation of that phenomenon remains vague and very general. On the one hand, the process of classifier fusion is explicit and definable. The complexity of individual classification models, however, limits the interpretability of the combined performance behaviour in terms of the various individual and relational characteristics exhibited among classifiers. Transparency of pattern recognition systems becomes a crucial property in commercial or industrial applications, where due to security or revenue maximisation the risk associated with employing a highly complex composite classification system is high and has to be minimised. To this end, various attempts at controlling or diagnosing the behaviour of combined performance have shown only partially positive and still confusing results [138], [164], [88], [122]. A reflection of that fact can be found in safety-critical pattern recognition systems, where simple yet well explained and easily controllable techniques, commonly based on trying all models and choosing the best, are preferred [139].

Research efforts towards explanations in combined classification systems focus on two approaches. One way is to analyse the specific combining method and use its characteristics, backpropagated into relations among classifiers, to model or directly measure the combining performance or its improvement [131], [128], [75], [164]. The other method assumes the existence of an underlying diversity among classifiers, which together with the individual classifier performances determines in some implicit way the combined performance. In this interpretation the notion of diversity embodies the concepts of team strength or complementarity among classifiers and is believed to have a key impact on combining performance [126], [89], [140]. There are, though, a number of uncertainties associated with diversity on both the conceptual and practical levels. First, it is not clear whether diversity as a concept is independent of the individual performances and the combining method used. These doubts directly translate into problems of measuring diversity in a consistent manner, independent of a number of variable parameters of the multiple classifier system [128], [126], [87], [140]. Another aspect which complicates the issue even more is the doubt whether diversity should be considered together with the combiner and its properties, or should consistently represent a fixed concept ignoring any bonds with the fusion system. In other words, it is not clear whether diversity should be perceived universally as an independent concept or whether it should be biased by the specific features of the particular combiner. The latter option would be particularly justified by the diagnostic and control requirements, so that diversity, being tuned to the combiner, could be applied during the design process. Both models of diversity pursuing explanations of the performance behaviour in combined classification systems form the main theme investigated in this thesis. Extensive experimental work attempts to justify the practical applicability of diversity during the process of composite classifier design and accordingly verify the usefulness of the diversity concept for combining classifiers.
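To make the measurement problem concrete, the sketch below computes one widely used pairwise measure - Yule's Q statistic, the "Q statistic" later used as a baseline in Table 4.4 - from the binary oracle outputs of two classifiers. It is a minimal Python illustration of the general idea under the 0 = correct, 1 = error coding convention (the same convention as in Figure 4.8), not one of the novel measures proposed in this thesis; the function name and example data are purely illustrative.

```python
import numpy as np

def q_statistic(a, b):
    """Yule's Q statistic for two classifiers, computed from binary
    oracle outputs over the same samples (0 = correct, 1 = error).
    Q near 1 indicates coinciding error patterns, Q near -1
    complementary ones; values around 0 suggest independence."""
    a, b = np.asarray(a), np.asarray(b)
    n11 = np.sum((a == 0) & (b == 0))  # both correct
    n00 = np.sum((a == 1) & (b == 1))  # both wrong (coincident errors)
    n10 = np.sum((a == 0) & (b == 1))  # only the second classifier wrong
    n01 = np.sum((a == 1) & (b == 0))  # only the first classifier wrong
    denom = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / denom if denom else 0.0

# Two classifiers erring on disjoint samples: maximally diverse, Q = -1.
d1 = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
d2 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(q_statistic(d1, d2))
```

Measures of this kind quantify only the pairwise relationship between error patterns; how well such numbers actually track the error of a concrete combiner is precisely the question examined in Chapter 4.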

1.2 Project description

The overall goal of the project is to explore the multi-modal concept of classifier diversity, broken down into various interdependencies among individual models from classifier ensembles, and to investigate its explanatory strength in the context of the performance variability of the combined system. Although the notion of diversity is approached on many distinct platforms, including perception, representation and measuring, particular emphasis is put on the potential applicability of diversity analysis in the process of designing multiple classifier systems. The research intends to exploit diversity as a diagnostic tool capable of guiding, or at least indicating, which classifier ensembles are most likely to show good combined results, as opposed to those classifiers which, if combined, do not show any improvement or even lead to a deterioration of the performance compared with the individually best model.

The initial investigations revealed a number of strategies for tackling diversity in relation to combining classifiers. However, due to the large size and complexity of the problem, the scope of the project is technically narrowed down to the phenomena observed and investigated only for the majority voting (MV) combiner operating on an ensemble of different classification models. Within this setup the notion of classifier diversity is targeted in three different contexts:

• Exploratory investigations of the behaviour of majority voting performance and its limits - looking at the mechanisms responsible for performance improvement in multiple classifier systems.

• Analysis of the relation between combined performance behaviour and various models of diversity - trying to identify the bonds between the two and investigate the possibilities of their enhancement.

• Diversity in classifier selection - an experimental study attempting to apply diversity measures as effective selection criteria capable of extracting optimal ensembles of classifiers.

These three issues consistently build up into a comprehensive evaluation of the role diversity plays in combined pattern recognition systems and directly justify the usefulness of diversity analysis in designing multiple classifier systems.
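For concreteness, the sketch below shows the plain majority voting combiner assumed by this setup: each classifier casts one crisp class label per sample and the ensemble decision is the most frequent label. It is a minimal Python sketch (the thesis's own experiments use Matlab and PRTools), not code from the thesis; the function name is illustrative and the tie-breaking rule (first label encountered wins) is an arbitrary choice.

```python
from collections import Counter

def majority_vote(label_matrix):
    """Combine crisp classifier decisions by simple majority voting.

    label_matrix: list of per-classifier decision lists, one label per
    sample; returns the combined label for each sample. Ties resolve
    to the label encountered first (a Python 3.7+ Counter guarantee).
    """
    combined = []
    for sample_labels in zip(*label_matrix):
        combined.append(Counter(sample_labels).most_common(1)[0][0])
    return combined

# Three classifiers, five samples: the ensemble recovers sensible
# labels even though each member makes one or two mistakes.
decisions = [
    ['A', 'B', 'B', 'A', 'C'],  # classifier 1
    ['A', 'B', 'A', 'A', 'A'],  # classifier 2
    ['B', 'B', 'A', 'A', 'C'],  # classifier 3
]
print(majority_vote(decisions))  # -> ['A', 'B', 'A', 'A', 'C']
```

Even this simplest combiner already exhibits the phenomenon studied in Chapter 3: the ensemble decision can be correct on samples where individual members fail, provided their errors do not coincide.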

1.3 Original contributions

This section provides a brief summary of the major original findings arising from the study. It serves both to give a clearer presentation throughout later chapters and to make the thesis's contributions to the field explicit. The study has been summarised in a number of peer reviewed publications [125], [130], [126], [131], [128], [129], [132], [127] encompassing both the theoretical and experimental material realising the project goals. The contributions concern three problem domains, following the investigative strategy of the project as outlined in the previous section, and are summarised in the following list:

• Proposition of a new systematic order and terminology describing in a uniform manner the wide family of classifier fusion systems, Section 2.4, [125].

• Introduction of the error distribution based analysis of majority voting performance behaviour for a large number of differently performing classifiers, Sections 3.2.2 and 3.2.3, [130].

• A new simple form of the ensemble extendibility condition for independent classifiers, Section 3.2.4, [130].

• Parametric analysis and extensive visualisation of majority voting error limits, Sections 3.3 and 3.3.3, [130].

• Definition of new patterns of the boundary distribution of classifier outputs, defined for the full range of mean classifier error, [0,1], and proposition of their stable alternatives justified by an analysis of classifier margins, Section 3.3, [130].

• Definition of a multistage organisation with majority voting system and presentation of the effect of error limits widening, along with the conditions necessary for its occurrence, Section 3.4, [130].

• Extensive analysis of the correlation between majority voting error and various binary operating diversity measures, Section 4.2, [126].

• Definition of the asymmetry property of diversity measures and a demonstration of its importance in the correlation analysis, Section 4.2, [126].

• Definition of the Fault Majority measure as an example of a measure optimised to the combiner, Section 4.3.1, [126].

• Presentation of the set-based analysis of error coincidences and its use for rapid extraction of error coincidences among classifiers, the definition of new measures of diversity and the decomposition of majority voting error, Section 4.3, [131].

• Definition of a robust Relative Error measure promoting the combiner-specific approach to diversity measures, justified experimentally, Section 4.4, [128].

• Development of a new methodology for pattern classification based on the concept of information fields, inspired by physical potential fields, [129], [132].

• Definition of gravity and electrostatic models of classification, showing their good performance in terms of both recognition rate and diversity, [132].

• Development and evaluation of a number of search algorithms applied to classifier selection with various selection criteria, Section 5.2, [127].

• Evaluation of diversity measures as classifier selection criteria, Section 5.2.

• Proposition of the network-based processing of the population of combinations of classifier outputs, Section 5.3.

• Development of a multilayer selection-fusion model, analysis of its structural optimality, and extensive evaluation showing an improvement of the generalisation performance, Section 5.3.

1.4 Organisation of the thesis

Chapter 2 outlines the context and theoretical background for this work. It provides a general overview of pattern recognition methodology and, on the grounds of advances in information fusion, illustrates the state of the art in multiple classifier systems. The material presented in the next three chapters covers the original contributions summarised in the previous section.

Chapter 3 attempts to uncover the various mechanisms driving performance improvement in majority voting. A parametric analysis of individual error coincidences is formalised and used to explain several aspects of the behaviour of MV error and its limits. In the second part, majority voting is presented in a multistage organisation setup and its interesting effects on the combined performance are discussed.

The next chapter summarises various models and perceptions of diversity and addresses the problem of its representation and measurement. The relation between diversity among classifiers and the performance of majority voting is investigated experimentally and the results compared with exhaustively extracted optimal ensembles of classifiers. The conclusions drawn from these experiments are directly exploited in promoting a new form of diversity, conceptually biased by the definition of the combiner's performance. Supported by a comprehensive analysis of the error coincidences, the combiner-specific diversity is presented and embodied in a series of novel measures, ultimately leading to the convergence between the concept of diversity and combined performance.

Chapter 5 focuses on the application side of diversity measures, presenting extensive experimental results of classifier selection guided by various measures of diversity and performance. Among many different selection algorithms and criteria, the best setup is analysed and expanded into a multilayer network preventing selection overfitting and improving the generalisation properties of the system.

The concluding chapter summarises the main findings of the project and indicates directions for further research.

Chapter 2

Overview of pattern recognition and classifier fusion

2.1 Introduction

In the early development of automated pattern recognition systems, inspiration was invariably found in the biological world, where we humans exhibit a remarkable blend of recognition skills. Humans seem to be more efficient in solving many complex, especially vaguely specified, classification tasks owing to the natural ability to cope with uncertain or ambiguous data coming in a variety of forms from different sources. In some more specific applications, like fingerprint recognition [118] or DNA sequence identification [101], automated pattern recognition systems have immensely outperformed humans, mainly due to the enormous size of the data and the interdependency between the factors to be analysed and processed. It seems, then, that a successful pattern recognition system has to exhibit both the efficiency of a biological cognitive system and the processing power of modern computing systems. Indeed, in cases like vision or speech recognition, understanding biological cognitive mechanisms and adopting them on fast computer systems would open enormous capabilities. However, there are also pattern recognition problems, like DNA identification [101], gas detection [62] or infra-red target tracking [7], which not only remain far beyond our cognitive and processing capabilities but also require specific mathematical models and sophisticated hardware sensing of a type unreachable for humans. In general, there is no single strategy or recipe for successful pattern recognition systems. Instead there is a rich variety of individual, problem-dependent methods dealing well with very specific problems but failing to generalise well to other tasks.

In parallel to the efforts at improving individual pattern recognition models, a completely new trend emerged recently, attracting a lot of scientific attention. Following the advances made in electronics and computer science, pattern recognition has been undergoing a rapid improvement encouraged by gradually relaxing complexity constraints. The pioneering efforts of Dasarathy [22], but also those of many other works reviewed in [22], initiated an entirely new branch of pattern recognition - classifier fusion. The inspiration can be traced back as far as ancient Greece, whose citizens were the first to reach decisions collectively in order to improve their quality and minimise the risk of individual failures [116]. Omnipresent in current societies, group decision making indeed proves to secure well balanced decisions crucial for the stability and prosperity of today's democracies [50], [134], [9]. In a similar fashion, it has been noticed that applying multiple classification models to the same task and combining their results can lead to spectacular performance improvements compared with the individual best model [158], [22], [121], [58]. It turned out that fusion may in fact be successful not only when applied to classifier decisions but also at other stages of the classification cycle, starting from data fusion [49], [7], [54], [32], [36], through feature (processed data) fusion [7], [68], [35], [33], up to the aforementioned classifier fusion [22], [137], [7]. Section 2.3 discusses in detail various issues related to information fusion.
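The scale of the possible gain is easiest to see under the simplifying assumption of M independent voters with a common error rate e - the Bernoulli model revisited in Section 3.2.1. A majority of M (odd) classifiers errs only when more than half of them err, so the combined error is

$$ e_{MV} \;=\; \sum_{k=\lfloor M/2 \rfloor + 1}^{M} \binom{M}{k}\, e^{k} (1-e)^{M-k}. $$

For instance, three independent classifiers that are each wrong 30% of the time give $e_{MV} = 3(0.3)^2(0.7) + (0.3)^3 = 0.216$, already below every individual error, and the gap widens rapidly with M. These worked numbers are a generic textbook illustration of the independence argument, not results quoted from this thesis.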

    These findings triggered development of very complex systems where mixtures

    of fusion, combining and selection of partial evidence applied to input data, features

    or classifier outputs cover uncountable variations and structures of potential pat-

    tern recognition systems. It is therefore not surprising that due to the potentially

    large variety of the combined pattern recognition designs, there is still no consis-

    tent and commonly agreed taxonomy naming and categorising different combining

    techniques. Some recent attempts at a very general classification of fusion methods

    into coverage and decision optimisation techniques [57] assume that either the clas-

    sifiers or the combiner is to be optimised while the other remains fixed. However,

    the state of the art in classifier fusion seems to be much wider and more complex

    with the multiplicity of classifiers used in many different ways beyond these two

    mentioned types of combining. One example could be a modular decomposition

    system where the single best or a number of best classifiers are applied to differ-

    ent classification subtasks controlled by the classifier selection process [137], [122].

Moreover, combining classifiers involves a number of other aspects, including architectures for combining and training abilities of the fusion operator, and may also relate to fusion at different levels of abstraction within the classification cycle [7]. On top of that, all these styles, paradigms and properties of combining may appear at the same time during the design process: there is nothing wrong, for example, with coverage and decision optimisation methods being combined together. Facing this profusion of varieties, rather than contributing to the overall non-specificity in the field, we present classifier fusion as a scheme uniformly described by three distinct properties and show in Section 2.4.5 that this noncompetitive approach covers all the different models and designs of combining.

The high complexity of classifier fusion systems, and hence their computational power demands, are among the reasons they are not yet widely applied. The other major problem is the lack of interpretability of complex systems. Unfortunately, these drawbacks usually eliminate such fusion systems from industrial applications, where suddenly emerging problems require a quick explanation and fix, while system performance should be predictable and stable. Although complexity can increasingly be dealt with and there is a prospect of stability gains, there is very little one can usually do with systems that occasionally do not work, or work beyond one's control. The major issue addressed in this thesis, diversity among classifiers, is believed to provide theoretical and practical answers, accounting for the diagnostic and explanative capabilities of diversity in the context of classifier fusion.

The term diversity, as related to combining evidence, originated from the software engineering domain [29], [73], [99], [112], where the reliability of conventionally coded programs was improved by combining independently written versions of the same algorithm. Appearing under many names in the literature, diversity is believed to be a major source of performance improvement in combined pattern recognition [110], [138], [111], [131], [87]. A large variety of representations, models and data types constitute some of the many faces of diversity related to classifiers. In this thesis the emphasis is put on the practical aspects of diversity: the ways it can be measured and understood, and eventually whether it can explain and possibly diagnose why and when combining classifiers could be an effective alternative to individual classification models. Detailed conceptual and experimental investigations related to diversity are undertaken in Chapter 4 and partially in Chapter 5.

    2.2 Pattern classification

Pattern recognition is a scientific discipline one of whose goals is to classify objects into a number of categories called classes. Objects represent compact data units specific to a particular problem, such as images, spoken words or handwritten characters, and are in general referred to as patterns. The process of pattern recognition normally entails a sequence of well-separated operations [28]. It begins with collecting the evidence acquired from various sensing devices. In the ideal situation the data is low-dimensional, independent and discriminative, so that its values are very similar for patterns in the same class but very different for patterns from different classes. Raw data rarely satisfies these conditions, and therefore a set of procedures called feature generation, extraction and selection is required to provide a relevant input for the classification system. Data sensing and feature extraction are beyond the scope of this thesis. It is noted, however, that the product of these two components of the pattern recognition design is a set of feature vectors representing the input data for classification systems.

Given the feature vectors $x \in X$ provided by a feature extractor, the objective of the supervised classification method, the classifier, is to assign the new object $x$ to a relevant class $\omega_j \in \Omega$, where $\Omega = \{\omega_1, ..., \omega_C\}$, based on previous observations of labelled patterns $X_T = \{x, \omega\}$, the training data. The overall classification process can be broken down into four major components: model choice, data preprocessing,

training and testing or evaluation. Evaluation closes the classification part of the pattern recognition design, which then enters the post-processing and overall system evaluation stage. There is great flexibility of operation in this last phase: it may just involve risk or reliability analysis, or it could be system tuning aimed at minimising cost, or further context-based optimisation. There

    is also space for combining classifiers or in general for processing the outputs from

    many classifiers returned from the classification process. The diagram of pattern

    recognition design and the subset involving the classification cycle is shown in Figure

    2.1.

The major issue treated in this thesis, diversity among classifiers, narrows down the operational scope to just the last two components of the pattern recognition design: classification and post-processing. Classification, broken down into the design cycle, is presented in the following section, with particular emphasis put on the limitations of the individual model implementation. This is followed by a formal definition of classification error, pointing out its sources and indicating methods for its elimination, leading to the development of the combined system presented in Section 2.4.

    2.2.1 Classifier design cycle

Figure 2.1: Pattern recognition and classification design cycles

In the supervised pattern recognition task considered in this thesis, the classifier's goal is to assign the unlabelled object $x$ to a class label based on the evidence learned from the labelled training set $X_T: \{x_i, \omega_j\}$. Mathematically, classifiers

    represent simply a discriminative function trying to separate classes from each other

in the multidimensional input space. In the general case such a function provides class support vectors $w = [w_1, ..., w_C]$, which depending on the classification model may represent probabilities, fuzzy membership values or any other measures that can be understood, compared and handled in the post-processing phase. Classification can therefore be interpreted as a mapping:

$$D = f([x_1, ..., x_K]^T) = [y_1, ..., y_C]^T \qquad (2.1)$$

where $y_j$ denotes a degree of support for class $\omega_j$, estimating the probability $P(\omega_j|x)$. The difficulty of the classification problem depends on the variability in the feature

values within the same classes relative to the differences between feature values for patterns from different classes. Among other phenomena complicating the classification task, the major contributions are attributed to the lack or incompleteness of the data, the high complexity of the problem and, above all, noise, which accounts for all kinds of randomness in pattern variability that is not due to the underlying model [28], [148]. The performance of a classifier is thus the result of a trade-off between the conceptual adequacy of the classification model and its complexity control mechanisms. As mentioned before, the classification process can be segmented into four distinct operations: model choice, data preprocessing, training, and evaluation.
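To make the mapping of equation (2.1) concrete, the following minimal Python sketch builds a classifier, obtains its vector of class supports and reduces it to a crisp label. The dataset and model are illustrative stand-ins only, not the ones used in the experiments of this thesis:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative toy problem; any model producing class supports would do here.
X, labels = make_classification(n_samples=200, n_features=4, n_informative=3,
                                n_redundant=1, n_classes=3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

x = X[:1]                           # a single pattern x = [x_1, ..., x_K]^T
supports = clf.predict_proba(x)[0]  # D = f(x) = [y_1, ..., y_C]^T as in (2.1)
decision = np.argmax(supports)      # crisp decision: the best supported class
print(supports, decision)
```

The intermediate vector of supports is exactly what a later combiner can operate on, before the final reduction to a single class label.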

    Model choice

The decision regarding the selection of the classification model is very important and difficult, especially if there is little prior knowledge about the nature of the problem. Additional difficulties come from the fact that the classification process is to a large extent unpredictable and quite often nondeterministic, which means that the choice cannot be immediately justified. The only effective quantitative feedback comes from the evaluation of the overall classifier performance, which means that a designer has to go through the whole classification cycle to verify the choice. Sometimes the assumptions made by a classifier match the problem characteristics, or the problem is so specific that there is only one suitable method, in which cases the choice is straightforward. In general, however, according to the no free lunch theorem [28], [156], there is no individual method providing the best solution for all types of pattern recognition problems. In a typical scenario, given a classification problem, the designer has plenty of different classification models at hand and, optimistically, only a rough idea of which ones could be the most successful. Unless


there is clear evidence of the model matching the problem, a tedious "try all and choose the best" approach quite trivially seems to provide a justifiable strategy. Even then, due to limited evaluation capabilities, assigning a single classifier to the task puts the optimality of performance at risk. Another aspect arising from the model selection stage is the loss of valuable evidence provided by competitive classifiers ranked just behind the winner. These conceptual and practical difficulties in classifier selection contributed to the development of classifier fusion systems, where all the complementary evidence and knowledge is jointly incorporated into the decision process. Further details related to classifier fusion are presented in Sections 2.3 and 2.4.

    Data collection and preprocessing

Once the model is chosen, the input data are prepared to be passed on to a classifier. These data are in fact $k$-component feature vectors of the form $x = [x_1, ..., x_k]^T$ returned from the feature extraction stage of the pattern recognition design. Individual patterns represent points in the $k$-dimensional input space, examples of which are depicted for two-dimensional cases in Figure 2.2.

Although during the feature extraction phase the data may have already been preprocessed to enhance their class-discriminative power, the choice of the classification model usually dictates further adjustments. Various types of normalisation are routinely required. For example, to achieve invariance to displacements and scale changes, one might transform the data so that they have zero mean and unit variance [148]. Some models may require the data to lie within a specific range, for example (0, 1), in which case normalisation also has to be applied [32], [33], [34], [35], [36]. Normalisation may destroy the original data structure if there are some outliers, hence removal of outliers may be required prior to normalisation [132], [148]. Missing feature values are another common data problem that has to be treated to avoid failures [28], [148], [34], [106]. For some complex classifiers the number of features returned from the feature extraction process may lead to intractability. Various techniques aiming at reducing the data size may therefore be required. Applying various data editing or data condensation techniques [18], [83] would directly reduce the number of patterns while trying to preserve the structure of the data. Alternatively, data dimensionality may be targeted, and methods based on feature selection [28], principal/independent component analysis (PCA/ICA) [107], [63] or maximum mutual information (MMI) [119], [149] applied to reduce the number of dimensions with minimal impact on the discriminatory strength of the remaining features.
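A minimal sketch of such preprocessing, on synthetic data with illustrative variable names, might look as follows; it applies zero-mean/unit-variance normalisation, rescaling to the (0, 1) range, and PCA-based dimensionality reduction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 10))   # raw feature vectors

# Zero mean and unit variance per feature: invariance to shift and scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Rescaling to the (0, 1) range for models that require bounded inputs.
X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Dimensionality reduction: keep the 3 principal components of the data.
X_pca = PCA(n_components=3).fit_transform(X_std)
print(X_std.shape, X_01.shape, X_pca.shape)
```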

Further processing may be required if a multiple classifier system is to be applied. The input space may for instance be segmented and the training set effectively split into parts fed to different classifiers, as in dynamic classifier selection (DCS) systems [41], [43], [40]. For the same purpose, the data may be grouped into many subsets of features applied separately for building many versions of the model to be combined [84], [164]. Finally, there is yet another reason for data preprocessing prior to classification: different classifiers may be encouraged to be diverse by providing as much distinct evidence related to the same problem as possible. Alongside the already mentioned input space partitioning and selection of different feature subsets, there are also simpler methods like injecting noise or differentiating initial conditions [25], and many different linear and non-linear transformations [138] that could potentially be used to enforce diversity among classifiers.
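As a hedged illustration of such diversity-enforcing data manipulation, the following sketch combines bootstrap resampling, random feature subsets and mild noise injection; the function, its parameters and the data are hypothetical, not a method prescribed in this thesis:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))       # training patterns (illustrative)
y = rng.integers(0, 2, size=200)     # binary labels (illustrative)

def diverse_view(X, y, n_features, rng):
    """One diversified training view: bootstrap-resample the patterns,
    draw a random feature subset and inject mild noise."""
    idx = rng.integers(0, len(X), size=len(X))               # bootstrap
    feats = rng.choice(X.shape[1], size=n_features, replace=False)
    noise = rng.normal(scale=0.05, size=(len(X), n_features))
    return X[idx][:, feats] + noise, y[idx], feats

# Five differently perturbed views, one per member of a prospective ensemble.
views = [diverse_view(X, y, n_features=6, rng=rng) for _ in range(5)]
```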

Figure 2.2: Two examples of two-dimensional datasets: (a) artificial, 2-D, 8 classes; (b) artificial, 2-D, 3 classes.

    Training

    Training is the actual process of classifier learning. Although this thesis is only

    concerned with supervised learning, the training process is a good place to briefly

    discuss different learning models [28] as they directly affect the way training is

    carried out.

    Depending on the availability and reliability of the evidence one can distinguish

    three learning strategies: supervised, unsupervised and reinforcement learning. In

    supervised learning the classifier is given a labelled training set to build the model on.

It is called supervised as it can be thought of as a teacher providing the patterns and their true classes, on the basis of which the classifier model learns how to return


an optimal solution to the problem. In some cases training data of known classes may not be available, which eliminates the teacher. Such learning on the basis of unlabelled data is called unsupervised learning. In the intermediate case of reinforcement learning, although the true labels of patterns are not available, feedback is given on whether the classifier output is correct or incorrect, without specifying what the correct answer is.

    Classification models are normally fully learnt from labelled pattern examples.

The major fact to be realised is that the amount of labelled data is limited, usually very small, and costly to obtain. Another important fact is that these data also have to be used for performance evaluation. This implies that a part of the

    available data has to be left out for testing purposes, which further narrows down

    the amount of data to be used for a proper training of the classifier.

Given the set of all available labelled data $X$:

$$X: \{x_i = [x_1, ..., x_k]^T,\ \omega_j \in \Omega\} \qquad i = 1, ..., N \quad j = 1, ..., C \qquad (2.2)$$

we denote the training set by $X_T$, where $X_T \subset X$, and note that the remaining data $X_E = X \setminus X_T$¹ will be used for testing (see Section 2.2.1). Normally, the more training data is used, the more adequately the model reflects the problem and the better

its performance. Some characteristics of classifier training are captured in the form of a learning curve, showing the relation between the classifier's generalisation

    performance and the size of the training set used to train the model. Figure 2.3(a)

    shows examples of such learning curves for three typical classifiers. The examples

    present three types of learning behaviour. For the first linear classifier, adding more

    training data does not improve its performance as the data are simply highly non-

    linear. The decision tree classifier shows the optimal amount of training data above

which it becomes overtrained. The third, highly non-linear, k nearest neighbour classifier seems to benefit consistently from adding more training data, although at the level of 400 samples it seems to reach a plateau and adding lots of new training data does not improve the classifier's performance significantly. What it certainly does, though, is increase model complexity and reduce the size of the prospective testing set. If the size of the labelled data is seriously limited then some more elaborate splitting and error estimation techniques are required [154]. Figure 2.3 also provides a visualisation of the three classifiers after training. For the 2-dimensional problem it visualises the discriminative functions and shows the resulting decision boundaries.

¹ $\setminus$ denotes the set subtraction operator: $A \setminus B = C \Leftrightarrow C = A \cap \overline{B}$


Figure 2.3: Visualisation of the training process for 3 common classifiers: (a) learning curves (error rate vs. number of samples) for ldc, treec and knnc on an artificial 2-D, 8-class dataset; (b) linear discriminant classifier; (c) decision tree classifier; (d) k nearest neighbours classifier. Plots b, c, d show the superposition of discriminative functions within the 2-dimensional feature space.
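A learning curve of the kind shown in Figure 2.3(a) can be estimated along the following lines. This Python sketch uses a synthetic dataset and off-the-shelf stand-ins for the ldc, treec and knnc classifiers, so the exact numbers are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=400, random_state=0)

models = {"ldc": LinearDiscriminantAnalysis(),
          "treec": DecisionTreeClassifier(random_state=0),
          "knnc": KNeighborsClassifier(n_neighbors=5)}

# Error rate on a fixed test set as the training set size grows.
for name, clf in models.items():
    errors = [1.0 - clf.fit(X_tr[:n], y_tr[:n]).score(X_te, y_te)
              for n in range(50, 401, 50)]
    print(name, [round(e, 3) for e in errors])
```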

    Testing

The importance of model evaluation stems from the fact that it provides the most informative measure of classifier performance, which can then justify its use, leading to possible optimisation, redesign, or elimination if other models show better performance. The common belief that a more elaborate classifier producing complex non-linear class boundaries is better than simple linear models may not always be true. Complex models tend to overfit the training data, so that although their performance on the training set is usually much better than that of simple linear models, they

    could show very weak performance for new patterns [28]. Data overfitting is a typ-

    ical trap for sophisticated systems unless some complexity control mechanisms are

    incorporated in the design of such a classifier. It is believed that a model with

well-balanced complexity should perform similarly on the training and testing data, as well as on any other data from the problem domain [28], [148].

    Given the limited amount of training data, the precise estimation of the true


    model performance or error rate is quite a challenge. There is no issue if the size

    of available training data is huge compared to the number of classes. According

    to standard statistical analysis carried out in [154], 1000 testing samples should

provide a satisfactory error tolerance of the predicted performance in most cases. Problems start to emerge when less, or much less, data is available. Random

    multiple splitting into training and testing sets is the simplest method to enhance

    the reliability of performance estimation. For smaller testing sets multiple splitting

still holds a high risk that some regions of the input space may be sparsely covered, leading to substantial bias in the performance estimate. In such cases multiple cross-

    validation procedures show quite satisfactory results [154]. In cross-validation, the

    testing set is rotated over exclusive subsets exhaustively covering the whole dataset.

The extreme form of cross-validation, with a rotation of only a single pattern used for testing, is called leave-one-out [154] and is preferred whenever its application is computationally tractable. For sample sizes smaller than 50, leave-one-out can be supported by bootstrapping [154], [28], generating a test set by sampling with replacement from

    a training set. More precise guidelines for the use of true performance estimation

    methods depending on the size of the testing set can be found in [154].
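The rotation of the testing set described above can be sketched as follows, assuming a synthetic dataset and an arbitrary base classifier (illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=60, n_features=5, n_informative=3,
                           random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)

def rotated_error(splitter):
    """Rotate the test set over exclusive subsets and average the error."""
    errs = []
    for train_idx, test_idx in splitter.split(X):
        clf.fit(X[train_idx], y[train_idx])
        errs.append(1.0 - clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(errs))

print("10-fold CV error:   ", rotated_error(KFold(n_splits=10, shuffle=True,
                                                  random_state=0)))
print("leave-one-out error:", rotated_error(LeaveOneOut()))
```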

A final comment relates to combined systems, where the combiner may require individual classifier performance estimates to decide which ones to combine. In such a case, apart from the proper training and testing sets used for individual performance estimation, there is a need for an additional validation set to be used for the estimation of combiner performance. Normally, the combiner could be perceived as a more general classifier, which would require a separate set for building the combination model and a separate one for testing its performance. However, separating an additional set from the overall classification dataset would further limit the training and evaluation capabilities of the individual classifiers.

    Due to the large number of classifiers and datasets considered throughout the

    experimental parts of this thesis, estimation of individual performances is based on

random multiple splitting. The estimation of combiner performance is based on

    the same testing set as the one used for evaluation of individual classifier perfor-

    mances. These choices have been taken to maintain simplicity and uniformity of the

    experimental results and to ensure a coherent comparison between individual and

    combined performances.


    2.2.2 Classification error

    Pattern classification incorporates supervised learning mechanisms and therefore

    shares a similar description of the model error [28], [137]. The major objective

    of supervised learning is to construct a predictor which, given the limited amount

of training data, will be able to estimate a target function $T: x \rightarrow y$ with the minimum possible error. Excluding artificial data, the mapping $x \rightarrow y$ usually reflects a real-world learning problem, which is commonly dependent on a large number of factors. Due to a number of constraints the predictor tries to select only the minimum number of factors which jointly describe the problem and are sufficient to give reliable predictions. However, the fact that they never cover the whole knowledge space supporting the solution of the problem limits the ability to generate correct outputs, according to the following formula:

$$y = E(y|x) + \epsilon \qquad (2.3)$$

where $E(y|x)$ represents the expectation of $y$ given $x$ and $\epsilon$ stands for white noise. An additional portion of model error stems from the limited, usually small, training set. Instead of using the whole input space $X$, which is commonly unknown, the predictor uses only the selected known training data $X_T$ for generating predictions for unknown data, $f(x, X_T)$, with an unknown level of representativeness related to $x$. After this additional constraint, all considerations are forced to be targeted at the training dataset $X_T$, which could be additionally split in order to leave out some part for testing the accuracy of predictions. The total mean squared error of the model can now be formulated as [39], [137]:

$$e_f^2 = E_{X_T}\{[y - f(x, X_T)]^2\} = E(\epsilon^2) + E_{X_T}\{[E(y|x) - f(x, X_T)]^2\} \qquad (2.4)$$

    Some further algebra results in:

$$e_f^2 = \underbrace{E(\epsilon^2)}_{\text{noise}} + \underbrace{E_{X_T}^2[f(x, X_T) - E(y|x)]}_{\text{bias}} + \underbrace{E_{X_T}\{[f(x, X_T) - E[f(x, X_T)]]^2\}}_{\text{variance}} \qquad (2.5)$$

As is clear from equation (2.5), a simple decomposition leads to a separation of three independent components of model error. The first term is called white noise, and cannot be reduced unless further evidence is provided. The second term, bias, can be intuitively characterised as a measure of the predictor's ability to generalise well once trained. Finally, the third term, variance, can be similarly interpreted as a


measure of the sensitivity of predictor outputs over different training sets. The model error can therefore be rewritten in the concise form:

$$e_f^2 = \sigma^2 + B^2(f) + V(f) \qquad (2.6)$$
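The decomposition (2.6) can be illustrated numerically. The following sketch, under the assumption of a deliberately too-simple (linear) predictor of a sinusoidal target, estimates the noise, bias and variance terms at a fixed query point by refitting over many random training sets; all choices here are illustrative, not taken from the thesis experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                     # white-noise level: eps ~ N(0, sigma^2)
f_true = np.sin                 # the underlying target E(y|x)
x0 = 1.0                        # fixed query point

# Deliberately too-simple predictor: a straight line fitted to each X_T.
preds = []
for _ in range(2000):
    x_tr = rng.uniform(0, np.pi, size=20)
    y_tr = f_true(x_tr) + rng.normal(scale=sigma, size=20)
    coef = np.polyfit(x_tr, y_tr, deg=1)
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

noise = sigma ** 2                          # E(eps^2): irreducible
bias2 = (preds.mean() - f_true(x0)) ** 2    # B^2(f): systematic error
variance = preds.var()                      # V(f): sensitivity to X_T
print(noise, bias2, variance)               # e_f^2 = sigma^2 + B^2(f) + V(f)
```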

In the classification model the only difference from the general prediction model comes from the fact that classification operates on assignments to the crisp class labels $\omega_j$ as elements of the set $\Omega$. The individual error of the classifier thus occurs in the form of picking the wrong class label, not as a bias from some true value measured continuously as in regression problems. The variability of the classification outputs requires a specific description, leading to slight differences in error representation compared with the prediction models, as shown in (2.5). Considering classification error within a probabilistic frame of reference, each classifier produces probabilistic outputs supporting different classes: $D_i = [p(\omega_1), ..., p(\omega_C)]^T$. Denoting by $\omega_T = \arg\max_j [p(\omega_j|x)]$ the true class for a given input pattern $x$, and by $\omega_f$ the classifier choice arising from $\omega_f = \arg\max_\omega [f(x, X_T) = \omega]$, the error decomposition can be reformulated from (2.5) to the following form [137]:

$$e_f = \underbrace{1 - p(\omega_T|x)}_{\text{Bayes error}} + \underbrace{p(\omega_f|f, x)[p(\omega_T|x) - p(\omega_f|x)]}_{\text{bias}} + \underbrace{\sum_{\omega \neq \omega_f} p(\omega|f, x)[p(\omega_T|x) - p(\omega|x)]}_{\text{spread}} \qquad (2.7)$$

The Bayes error, appearing in equation (2.7) in place of the noise component in (2.5), forms the lower bound on the classification error and is only a function of the problem complexity and the available evidence. The bias expresses how well the classifier models the problem, while the spread (equivalent to the variance in (2.5)) describes the variability of the model outputs.

While the Bayes error component cannot be reduced by any means, the remaining bias and spread error components are fixed only for individual classifiers. In multiple classifier systems, the spread component is likely to be reduced by parallel combining of redundant classifiers [137]. In such a case the variability of classifier outputs is stabilised as a result of the applied aggregation [137]. On the other hand, bias can only be reduced as a result of a better classification model, which can potentially be achieved by applying modular decomposition of the classification task and assigning different classifiers to the subtasks for which they perform best [137]. A more detailed analysis of the error in combined multiple classifier systems is provided in Section 2.4.5, which discusses the different combining paradigms.
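A minimal numerical sketch of this stabilising effect, under the idealised assumption of independent, equally accurate classifiers combined by majority voting (a fuller treatment of majority voting follows in later chapters), might read:

```python
import numpy as np

rng = np.random.default_rng(0)
p_correct = 0.7      # accuracy of each independent redundant classifier
n_trials = 10000

# Each classifier votes correctly with probability p; an odd-sized team
# decides by majority vote, which increasingly stabilises the outcome.
for n_classifiers in (1, 3, 9, 21):
    votes = rng.random((n_trials, n_classifiers)) < p_correct
    majority_correct = votes.sum(axis=1) > n_classifiers / 2
    print(n_classifiers, majority_correct.mean())
```

Under the independence assumption the combined accuracy grows with the team size; real classifier ensembles rarely satisfy this assumption, which is precisely why diversity matters.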


    2.3 Information fusion

    Two important facts related to the reality of the end of the 20th century contributed

    to the enormous dynamics we observe today in the area of evidence fusion. The first

was the emergence of multi-modal detection systems providing coordinated data from multiple sensors of different types, facilitated by the immense information content of highly developed, interconnected information systems [49], [7]. Treating all

    types of evidence separately with a single method was an unsuccessful option, lead-

    ing to either complex hybridisation of the system or no gain in performance. What

    led to the breakthrough was the fusion of distinct evidence on many different levels

    from pure data to the decisions of individual experts operating on different parts

    of the available evidence [49], [7], [22]. Another important point to note was that

    individual classification methods provide alternative knowledge even in the absence

    of alternative data. It turned out that even if applied to the same task using the

    same data, a joint decision of combined classifiers is potentially more effective than

    any one individual [22], [15], [70], [137]. These facts, emerging in an environment

of rapidly growing technology, cheap computational power and exponentially expanding internet resources, led to a sudden turn towards fusion in the pattern recognition

    domain.

    Fusion of information can be carried out on many different levels of abstrac-

    tion closely connected with the flow of the classification process: data level fusion,

    feature level fusion, and classifier fusion [7]. There is little theory about the first

    two levels of information fusion. However, there have been successful attempts to

    transform numerical, interval and linguistic data into a single space of symmetric

    trapezoidal fuzzy numbers [54], [115], and some heuristic methods have been suc-

    cessfully used for feature level fusion [7], [68]. Classifier fusion has attracted most

    scientific attention and continues to expand under many different names includ-

ing classifier fusion, combining classifiers, mixtures of experts, ensemble systems, multiple classifier systems, composite classifiers, etc. [22], [137], [7], [25], [70], [122].

    2.3.1 Data fusion

    At the basic level of data sensing, the fusion of data from various modalities has been

    used to resolve the occlusion problem in vision systems [7]. In another application,

    fusion of differently sensed images improved object detection by overlapping many

    partially discriminative projections [54]. In [54], [115] a method of combining various

types of data is presented. The proposed new data model, called heterogeneous fuzzy data, incorporates characteristics of real numerical values, confidence intervals


    and linguistic information in a single representation. A generic neuro-fuzzy pattern

    recognition model in which data can be processed in a generalised form of confidence

    intervals has also been proposed in [32], [36]. These studies are supported by the

theory of fuzzy sets, details of which can be found in [163], [72], [114]. Emerging from this, fuzzy measures are considered a generalisation of probabilistic measures within the general theory of evidence [72], and provide various information modelling tools that can be used in data fusion.

    2.3.2 Feature fusion

There is little evidence of feature fusion in the literature. Fusion on this level is considered more general compared to data fusion and often resembles classifier fusion techniques. Some authors even suggest that the difference between feature fusion and combining classifiers is somewhat arbitrary [7]. It commonly involves combining multidimensional quantitative feature vectors, possibly supported

    by some qualitative measures. An example of feature fusion has been shown by

Keller and Gader [67], where the data features extracted from Geo-Centers' GPR system were combined by a fuzzy rule incorporating some shape characteristics of the raw data. Again, an improvement in the form of a reduction of false alarms was observed. Another example of what may be considered a fea-

    ture fusion has been proposed in [33], where the combination of multiple versions

    of neuro-fuzzy classifiers is performed at the classifier model level. In this approach

    hyperbox fuzzy sets representing clusters of data in different models are combined.

The resulting classifier complexity and transparency are comparable with those of classifiers generated during a single cross-validation procedure, while the improved classification performance and reduced variance are comparable to those of an ensemble of classifiers with combined decisions.

    2.3.