PhD Dissertation
International Doctorate School in Information and Communication Technologies
DISI - University of Trento
Sampling Strategies for Expensive Data
Emanuele Olivetti
Advisor:
Paolo Avesani
Co-Advisor:
Sriharsha Veeramachaneni
March 2008
Abstract
In domains where data collection can be guided in order to reduce collection costs, many issues arise: which sampling strategy to adopt, how to estimate quantities in the presence of missing values (a byproduct of incremental data collection), how to deal with sampling bias, and many others.
We present two kinds of results on this topic. First, a principled criterion and algorithm to select the missing data that are most likely to provide useful information for estimating a given target concept. The criterion can be intuitively summarized as “sample where the maximum estimated change is expected”. Second, a set of examples and applications that implement that criterion in practical problems under different assumptions. The applications presented focus on the machine learning problem of feature relevance estimation interleaved with incremental data collection.
We show experimentally the efficacy of the proposed criterion and its implementations, mainly on two data collection tasks. In the first we estimate the change in prediction accuracy when adding one new variable to a class-labelled dataset. In the second we extend the problem from one to many new competing variables and their concurrent relevance estimation. The experiments are performed on various datasets, both pseudo-randomly generated and benchmark. Moreover, the first task is studied on two real-life problems that motivated our interest in this research topic. The first is about efficient assessment of new variables describing a disease in the agricultural domain, and the second is about estimation of the importance of biochemical markers for cancer prediction. Comparison through experiments against random sampling and other state-of-the-art sampling methods demonstrates the superiority of the proposed solution.
Keywords
active learning, incremental sampling, Bayesian experimental design, miss-
ing data, feature relevance estimation.
Acknowledgment
I am deeply grateful to my wife Laura and my daughter Nausicaa for their wonderful support during these three years. We shared many joys and troubles together. But I am greatly indebted to them for the very large amount of time I spent working and subtracted from their care. This achievement is theirs as much as it is mine.
I wish to thank my friend Sriharsha Veeramachaneni, with whom I made this journey. I walked with him along the hard paths of this research. His amazing mathematical skills and intellectual honesty always drove us in the right direction. I am grateful to Paolo Avesani for his wise leadership during the six years I worked in his group. He gave me the opportunity to work in scientific research. Together we glided along many challenges. And now we are ready for the next one.
I would like to thank Prof. George Nagy for his direct and indirect help. His vast experience, intuition and commitment to scientific research are of great inspiration to me. I am grateful to Prof. Donato Malerba and Prof. Enrico Blanzieri for evaluating this work.
Many other people had influence on me and my work, but some of them
more than others. I wish to thank them. First my parents Franco and
Lina, who always encouraged me in pursuing my inclinations. Second my
aikido sensei Donatella Lagorio, who taught me the solution to almost all
problems: taking one step forward.
This work is dedicated to the memory of Elisabetta Vindimian. You
were with us when we began, and all these were just seeds. Now they
blossom into beautiful flowers.
Contents
1 Introduction 1
1.1 Sampling Expensive Data . . . . . . . . . . . . . . . . . . 2
1.1.1 Introductory Example . . . . . . . . . . . . . . . . 2
1.1.2 Extended Example . . . . . . . . . . . . . . . . . . 6
1.1.3 Abstraction . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Agricultural Domain . . . . . . . . . . . . . . . . . 10
1.2.2 Biomedical Domain . . . . . . . . . . . . . . . . . . 11
1.3 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Related Work 15
2.1 Experimental Design . . . . . . . . . . . . . . . . . . . . . 15
2.2 Active Learning . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Active Feature Sampling . . . . . . . . . . . . . . . . . . . 17
2.4 Missing Data and Selection Bias . . . . . . . . . . . . . . . 20
2.5 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Problem and Algorithm 23
3.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Missing Data . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Estimation Theory . . . . . . . . . . . . . . . . . . 29
3.2.3 Bayesian Exp. Design . . . . . . . . . . . . . . . . 30
3.3 MAC: Derivation . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1 MAC: Information Gain . . . . . . . . . . . . . . . 35
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 MAC: implementation 39
4.1 Restrictive Assumptions . . . . . . . . . . . . . . . . . . . 40
4.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Example 1: Number . . . . . . . . . . . . . . . . . . . . . 43
4.4 Example 2: Cond.Prob. . . . . . . . . . . . . . . . . . . . . 45
4.4.1 Formal Description of the Urn Experiment . . . . . 46
4.4.2 MAC sampling algorithm . . . . . . . . . . . . . . 46
4.4.3 Explicit Benefit Function . . . . . . . . . . . . . . . 47
4.5 Application 1 . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . 51
4.5.2 Implementation . . . . . . . . . . . . . . . . . . . . 54
4.5.3 Comparison . . . . . . . . . . . . . . . . . . . . . . 56
4.6 Application 2 . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . 59
4.6.2 Implementation . . . . . . . . . . . . . . . . . . . . 59
4.6.3 Mixture model . . . . . . . . . . . . . . . . . . . . 61
4.6.4 Class-Conditional Mixture of Product Distributions 62
4.6.5 Feature Relevances . . . . . . . . . . . . . . . . . . 63
4.6.6 Conditional Prob. . . . . . . . . . . . . . . . . . . . 64
4.6.7 Parameter Estimation . . . . . . . . . . . . . . . . 65
4.6.8 Comparison with Application 1 . . . . . . . . . . . 68
4.7 Computational Issues . . . . . . . . . . . . . . . . . . . . . 69
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5 Experiments 73
5.1 Conditional Prob. . . . . . . . . . . . . . . . . . . . . . . . 75
5.1.1 Detailed Results . . . . . . . . . . . . . . . . . . . . 77
5.1.2 Summary of Results . . . . . . . . . . . . . . . . . 82
5.2 Single Rel.Est. . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . 88
5.2.2 UCI Benchmark Data . . . . . . . . . . . . . . . . 89
5.2.3 Data from Agriculture Domain . . . . . . . . . . . 92
5.2.4 Data from Biomedical Domain . . . . . . . . . . . . 100
5.3 Multiple Feat.Rel. . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . 110
5.3.2 UCI Datasets . . . . . . . . . . . . . . . . . . . . . 111
5.3.3 Computational Complexity Issues . . . . . . . . . . 113
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6 Conclusions and Future Work 121
6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.3 New Directions . . . . . . . . . . . . . . . . . . . . . . . . 125
A EM 129
A.1 Complete Data . . . . . . . . . . . . . . . . . . . . . . . . 129
A.2 Incomplete Data . . . . . . . . . . . . . . . . . . . . . . . 132
Bibliography 137
List of Tables
4.1 Conditional probabilities P (X2 = x2|X1 = x1) parametrized
by a and b . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Joint probability P (X1 = x1, X2 = x2) parametrized by a, b
and c. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Binary counts: nij is the number of observations for which
X1 = i and X2 = j. . . . . . . . . . . . . . . . . . . . . . . 47
5.1 Brief description of the six groups of results. The first column shows the group number. The following columns, under a, b and c, show for each group whether the value of each parameter is different from, equal/close to, greater than or less than 1/2. . . . . . . . . . . . . . . . . . . 76
5.2 Summary of the results about the comparison of MAC sam-
pling algorithm vs. random policy over 125 generated datasets. 86
5.3 Agriculture Data. The number of samples required (out
of a total of 520) for the rms difference between the true
and estimated error rate to be less than 0.005 for various
sampling algorithms. Each row corresponds to one selection
of the previous feature X and the candidate feature X. The
rms values are computed over 1000 runs. . . . . . . . . . . 100
5.4 Features describing biomarkers data. . . . . . . . . . . . . 101
5.5 Configurations of experiments on biomarkers data. . . . . . 102
List of Figures
1.1 Introductory example: class-conditional distributions of tem-
perature and chemical tests. . . . . . . . . . . . . . . . . . 4
1.2 Introductory example: description of the active sampling
process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Extended example: description of the active sampling process. 9
5.1 Average root mean square difference between true and esti-
mated conditional probabilities. Results for Group 1 and
2 comparing MAC policy to random policy. . . . . . . . . 78
5.2 Average root mean square difference between true and esti-
mated conditional probabilities. Results for Group 3 and
4 comparing MAC policy to random policy. . . . . . . . . 79
5.3 Average root mean square difference between true and esti-
mated conditional probabilities. Results for Group 5 and
6 comparing MAC policy to random policy. . . . . . . . . 80
5.4 Example 2: Plots of the average estimates of parameters
a and b while sampling new data. MAC algorithm (•) is
compared to random algorithm (+). Results of the case
a = 0.1, b = 0.9 and c = 0.9 (representing Group 1) are
plotted on top panel. Results for a = 0.1, b = 0.9 and
c = 0.5 (representing Group 2) are on bottom panel . . . 83
5.5 Example 2: Plots of the average estimates of parameters
a and b while sampling new data. MAC algorithm (•) is
compared to random algorithm (+). Results of the case
a = 0.1, b = 0.5 and c = 0.9 (representing Group 3) are
plotted on top panel. Results for a = 0.1, b = 0.5 and
c = 0.1 (representing Group 4) are on bottom panel . . . 84
5.6 Example 2: Plots of the average estimates of parameters
a and b while sampling new data. MAC algorithm (•) is
compared to random algorithm (+). Results of the case
a = 0.5, b = 0.5 and c = 0.1 (representing Group 5) are
plotted on top panel. Results for a = 0.5, b = 0.5 and
c = 0.5 (representing Group 6) are on bottom panel . . . 85
5.7 True class-conditional (X, X1) and (X, X2) variables distri-
butions. The data points before and after measuring the
candidate features are also shown. . . . . . . . . . . . . . . 88
5.8 Synthetic Data. The root mean square difference between
the estimated and true error rate after the candidate feature
is added as a function of the number of samples probed for
all sampling policies. Rms difference is averaged over 1000
repetitions of the experiment. . . . . . . . . . . . . . . . . 90
5.9 Solar Flares dataset. The average root mean square differ-
ence between the estimated and true error rate after the
candidate feature is added as a function of the number of
samples probed for all sampling policies. Rms difference is
averaged over 100 repetitions of the experiment. . . . . . . 93
5.10 Balance Scale dataset. The average root mean square dif-
ference between the estimated and true error rate after the
candidate feature is added as a function of the number of
samples probed for all sampling policies. Rms difference is
averaged over 100 repetitions of the experiment. Due to the
large size of the feature space only the first 100 samples are
acquired. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.11 Solar Flares dataset. The average root mean square differ-
ence between the estimated and true error rate after the
candidate feature is added as a function of the number of
samples probed for all sampling policies. Rms difference is
averaged over 100 repetitions of the experiment. . . . . . . 95
5.12 Breast Cancer Wisconsin dataset. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 100 repetitions of the experiment.
Due to the large size of the feature space only the first 100
samples are acquired. . . . . . . . . . . . . . . . . . . . . . 96
5.13 Mushroom dataset. The average root mean square differ-
ence between the estimated and true error rate after the
candidate feature is added as a function of the number of
samples probed for all sampling policies. Rms difference is
averaged over 100 repetitions of the experiment. Due to the
large size of the feature space only the first 100 samples are
acquired. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.14 Zoo dataset. The average root mean square difference be-
tween the estimated and true error rate after the candidate
feature is added as a function of the number of samples
probed for all sampling policies. Rms difference is averaged
over 100 repetitions of the experiment. . . . . . . . . . . . 98
5.15 Apple Proliferation dataset. The average root mean square
difference between the estimated and true error rate after
the candidate feature is added as a function of the number
of samples probed for all sampling policies. Rms difference
is averaged over 100 repetitions of the experiment. . . . . . 99
5.16 Biomarkers experiment I and II. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 104
5.17 Biomarkers experiment I and II. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 105
5.18 Biomarkers experiment III and IV. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 106
5.19 Biomarkers experiment V and VI. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 107
5.20 Biomarkers experiment VII and VIII. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 108
5.21 Biomarkers experiment IX and X. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 109
5.22 Average rms differences between estimated and true rele-
vances at each sampling step on artificial data for random
and MAC policies. Average performed over 100 repetitions
of the experiment. Only the first 100 sampling steps are
shown to improve readability. Note that true relevances are
those computed from the full dataset, and not the theoreti-
cal ones which are slightly different. . . . . . . . . . . . . . 112
5.23 Estimated relevances at each sampling step for every single
feature on artificial data. Random (dashed line) and MAC
(solid-dotted line) policies are compared. Since there are
three features and 200 instances the x axis goes to 600. . 113
5.24 Average cumulative sampling counts at each sampling step
for each feature on artificial data. The more relevant fea-
tures are sampled more frequently than less relevant features
in case of MAC policy. As a comparison, the random policy
samples features independently of their relevance. . . . . . 114
5.25 The normalized difference between final relevances and esti-
mated relevances at each sampling step is plotted for random
and MAC policies on Zoo dataset (top panel) and Monks
datasets (bottom panel). The value at step 0 (all feature
values unknown) is normalized to 1.0. . . . . . . . . . . . . 115
5.26 The normalized difference between final relevances and es-
timated relevances at each sampling step is plotted for ran-
dom and MAC policies on Solar Flares dataset (top panel)
and Cars dataset (bottom panel). The value at step 0 (all
feature values unknown) is normalized to 1.0. . . . . . . . 116
5.27 Average squared sum of the differences between estimated and true relevances at each sampling step on artificial and UCI Solar Flares datasets. Random and MAC policies are compared to the active policy that considers only a small random subset of the missing entries at every sampling step. . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Chapter 1
Introduction
In data modeling and analysis it is commonly assumed that data is provided in advance, and we do not control the sampling process. But in the class of problems where sampling is expensive this is often not the case: we need to select which data to collect or measure in order to understand the problem under investigation while reducing the sampling cost as much as possible. In this context we aim to have an efficient sampling method to trade off the need for more information against the cost of obtaining it.
This work focuses on the class of problems where we can incrementally
select which data to collect at the next step with the target of learning a
given concept (i.e. a function to estimate from the data). In the proposed
solution, sampling is interleaved with learning on currently available data
at each step in order to infer which could be the best next sampling step.
The aim is to estimate the target concept efficiently as new samples are
added.
The applications presented in this dissertation are mainly devoted to
the learning problem of feature relevance estimation from a dataset of la-
beled instances, where some (or all) feature values are missing and have to
be collected incrementally. In this setting sampling is performed sequen-
tially, selecting the instances to monitor and collecting some of the missing
information about them at each step.
Often, a side-effect of incremental data collection is the occurrence of
missing data patterns, since only partial information is disclosed at each
step. Handling missing data in learning tasks is itself a complex prob-
lem that can be addressed by different techniques. Even though this sub-
problem can be decoupled from the general problem of deciding where to
sample, we will manage both in all practical applications presented here.
In the following we introduce the core problem of this research using an intuitive example and then generalize it to a much broader scope. We then illustrate the motivations that support this work and led us to study sampling strategies for expensive data. A precise statement of the goals of this dissertation, together with an outline of the organization of the remaining chapters, will follow.
1.1 Sampling Expensive Data
We illustrate the problem of incrementally sampling expensive data with
a limited budget. In this context we aim at deciding carefully what to
sample at the next step based on the current information. We begin with
an introductory example related to the medical domain to help the reader
get acquainted with the general picture. We then generalize the problem
to a broader scope (even though this work addresses just a part of it).
Note that many details in the example are deliberately omitted. At this point of the presentation they are not essential to understanding the problem.
1.1.1 Introductory Example
Assume that we have a method to predict (detect) the presence of a disease
in patients, based just on a single variable: the temperature of the body.
Since the accuracy of the prediction is poor, we want to conduct a medical study to improve it using a more complex model, based on one more variable. We aim to add a new variable that, together with the temperature, better predicts the presence of the disease. Assume that we have two new candidate variables, describing the results of two different chemical tests performed on the patient. We denote the chemical tests as CT 1 and CT 2. The amount of information each of them provides for improving the prediction of the presence of the disease is initially unknown. Estimating the amounts of these improvements is the final target of the study. It will allow us to decide which of the two tests should be added to the new medical protocol to assess the presence of the disease.
For this medical study a given number of subjects is monitored and
their temperatures together with the presence or absence of the disease
are known in advance. Assume that the cost of performing each chemical
test on a patient is high, compared to the limited budget available. The
problem is to carefully select on which patients to perform which chemical
tests. The choice, in principle, should take into account all the information
known in advance characterizing each patient (her temperature and the
presence of the disease, in this case), since this is relevant information for
the final goal of improving the quality of the predictions.
We assume that both chemical tests have the same cost and the budgets
associated with each are equal. We will perform CT 1 and CT 2 the same
number of times. The process of testing is assumed to be incremental.
Whenever we test CT 1 and measure its outcome on a patient, we can
exploit this new information to select the patient on which to perform the
next test. In parallel and independently we perform this same process
with CT 2. At the end we will have the two (possibly different) sets of
patients on which the results of the first and second tests are known. The
size of the two sets will be the same.
Assume that the disease is known to appear only in people whose temperature is above 37°C and always when it is above 39°C. We can group the temperatures of patients into three groups: below 37°C (group I), between 37°C and 39°C (group II) and above 39°C (group III), as shown in Figure 1.1. We assume that the outcome of the chemical tests can be discretized: CT 1 at two levels (high, low) and CT 2 at three levels (low, medium and high). Assume that the underlying joint distribution of the variable temperature and one chemical test is the one depicted in Figure 1.1, meaning that one of the two tests can improve the prediction and the other is useless. Since the joint distribution is not known before performing the tests, the goal is to collect incrementally the values of the tests on selected patients in order to efficiently estimate the prediction error (i.e. the overlapping areas in Figure 1.1).
Figure 1.1: Class-conditional distributions of the values of the temperature (T) across the three groups (below). Class-conditional joint distributions of temperature and chemical tests (above) showing that CT 1 is a useless test and CT 2 is useful. Absence of the disease is denoted by ×, presence by ◦.
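To make the estimation target concrete, the following minimal sketch (with purely illustrative numbers, not taken from the study) computes the error rate of a maximum a posteriori classifier from a discrete joint distribution over temperature group, test level and class. The overlapping probability mass in Figure 1.1 corresponds exactly to this quantity.

```python
import numpy as np

# Hypothetical joint probabilities P(group, level, class) for one chemical test.
# Axes: temperature group (I, II, III) x test level (low, high) x class (absent, present).
# The numbers are illustrative only; in the study they would be estimated from the data.
joint = np.array([
    [[0.20, 0.00], [0.10, 0.00]],   # group I: the disease is never present
    [[0.15, 0.05], [0.05, 0.15]],   # group II: the test outcome carries information
    [[0.00, 0.10], [0.00, 0.20]],   # group III: the disease is always present
])

def map_error_rate(joint):
    """Error rate of the MAP classifier that predicts, in each (group, level) cell,
    the class with the larger probability mass; the error is the mass of the losing
    class in each cell, i.e. the overlap between the two class distributions."""
    return joint.min(axis=-1).sum()

print(map_error_rate(joint))  # 0.10 for the numbers above
```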
A naïve approach to the sampling problem can be to disregard all infor-
mation known in advance, and just randomly sample some patients with
one of the two chemical tests. We call this approach random sampling
policy. Another strategy could sample the same number of patients having
the disease and not having it. A cleverer strategy could note that there is
no gain in sampling patients in group I or group III, since in those cases
the temperature by itself is enough to decide the presence or absence of
the disease. Many other ad hoc improvements can be implemented with
similar ideas.
The method proposed in this research exploits the available information in a principled way. To summarize our main result in the context of this example, we prove (under mild assumptions) that, given a chemical test, the optimal strategy is to select the patient for whom the expected change with respect to the current estimate of the error rate of the prediction method is largest. In other words, we estimate the expected change in the predicted error rate when we perform that chemical test on a given patient. Inside the expectation we do not take into account just one possible outcome of the test (e.g. the most likely), but all possible results weighted by their probability of occurrence. Those probabilities are estimated from the data already acquired.
From the point of view of the experimental process, after choosing one chemical test, we compute the expected change in error rate for each patient not tested so far and select the one having the highest expected change. We then perform the actual chemical test on this patient and add the outcome as a new value to the current dataset. Then we update the current error rate estimate. At the next step we recompute the expected change for every patient left (one fewer than before). The process is described in Figure 1.2.
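A minimal sketch of this loop is given below. The helpers estimate_error_rate, p_outcome, with_value and query are hypothetical placeholders for the estimators and data-access routines developed later in the dissertation; the sketch only fixes the control flow of the sampling process.

```python
def expected_change(patient, outcomes, dataset, estimate_error_rate, p_outcome):
    """Expected absolute change of the error-rate estimate if the chemical test
    were performed on `patient`, averaged over its possible outcomes."""
    current = estimate_error_rate(dataset)
    change = 0.0
    for value in outcomes:                                  # e.g. ["low", "high"]
        prob = p_outcome(patient, value, dataset)           # estimated from the data acquired so far
        hypothetical = dataset.with_value(patient, value)   # dataset as if `value` had been observed
        change += prob * abs(estimate_error_rate(hypothetical) - current)
    return change

def active_sampling(untested, outcomes, dataset, budget, query,
                    estimate_error_rate, p_outcome):
    """Repeatedly test the patient with the largest expected change until the budget is spent."""
    while budget > 0 and untested:
        best = max(untested, key=lambda p: expected_change(
            p, outcomes, dataset, estimate_error_rate, p_outcome))
        value = query(best)                                 # perform the actual chemical test
        dataset = dataset.with_value(best, value)
        untested.remove(best)
        budget -= 1
    return estimate_error_rate(dataset)
```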
At the end of the experimental session, when the budget is exhausted,
we perform the same process on the other chemical test. After that we can
compare the reduction of error rates by both tests and decide to adopt the
best one in the new improved protocol to detect the disease. A possible outcome of that comparison could be that both new candidate chemical tests are discarded because the estimated improvement in accuracy is too small in both cases, compared to the costs of including either.
1.1.2 Extended Example
Based on the previous example we illustrate an extended version [45]. Assume now that each patient can be described by many variables: temperature, weight, age, marital status, result of chemical test CT 1, of CT 2, etc., but we know the values of only some (or none) of them. This means that the initial dataset, where each record describes a patient along with those variables, has some (or no) known values. We want to assess the importance of each variable for predicting the presence of the disease and we have a limited budget to spend on measuring the missing values in the dataset. Unlike the previous example, this time we want to allocate the total budget in a more complex way, deciding at each step for which patient and for which variable to acquire a new value. This new setting lets the variables compete against one another in the allocation. In this new example it could happen that some variables are measured more often than others. In this case the sampling strategies to be applied for an efficient estimation of the target concept are more complex, due to the possible correlations between variables.
As in the first example, after a value of a variable is measured (e.g. CT 1 is performed on patient 35) and its result is added to the dataset, we exploit it together with all the information available up to that point to decide which variable to extract on which patient next. At the end of the experimental session we will have more (or all) entries known and can make the final assessment of the importance of each variable. The process is described in Figure 1.3, where the feature relevance is defined as the mutual information between a variable and the class labels (see Section 4.6.5).
Figure 1.2: Active sampling process for error rate estimation. The entries shown in gray in the dataset in the top left corner indicate where the value of chemical test CT 2 is missing. The right side shows the process of computing the benefit of sampling at a particular missing value. The missing value with the highest benefit is chosen and actually sampled. The process ends when the budget is over (Stopping Criterion).
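For reference, the following small sketch computes this relevance measure, the empirical mutual information between a discrete feature and the class labels, from the currently observed values; it simply skips missing entries, which is a simplification with respect to the estimators developed in Chapter 4.

```python
import numpy as np
from collections import Counter

def mutual_information(feature, labels):
    """Empirical mutual information I(X; C) between a discrete feature and the class
    labels, in nats. Pairs where the feature value is missing (None) are skipped."""
    pairs = [(x, c) for x, c in zip(feature, labels) if x is not None]
    n = len(pairs)
    n_xc = Counter(pairs)                     # joint counts
    n_x = Counter(x for x, _ in pairs)        # feature marginal counts
    n_c = Counter(c for _, c in pairs)        # class marginal counts
    return sum((k / n) * np.log(k * n / (n_x[x] * n_c[c]))
               for (x, c), k in n_xc.items())

# Toy example: a feature perfectly aligned with the class is maximally relevant.
print(mutual_information([0, 0, 1, 1, None], [0, 0, 1, 1, 1]))  # log(2) ≈ 0.693
```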
In this example we introduce a cost model. We assume that each variable has its own extraction cost (equal across all patients). As an example, measuring the temperature of a patient has a much lower cost than performing a chemical test. This means that, in case of equal importance, it is desirable that temperature be collected more frequently. We will define the cost model precisely only later.
1.1.3 Abstraction: Class of Problems of Interest
The two examples discussed so far describe two scenarios that we actually studied and implemented in this research. We can generalize them to a class of problems, defined by:
• A set of monitored instances on which we want to estimate a target concept, e.g. a set of patients on which we study the importance of two chemical tests for predicting the presence of a disease.
• A target concept to be estimated, in terms of point estimation1, e.g. a classifier error rate, the relevance of a variable with respect to the class label, etc.
• Sampling constraints, defining the sample space from which we have to choose one element to sample at each step. In the first example we allow one chemical test on one patient at a time, so at each step the sample space is the set of single patients on whom the test has not yet been conducted. If we assume different constraints, e.g. we have to perform batches of 3 chemical tests at a time, then the sample space becomes the set of all possible triplets of patients not yet tested. These kinds of constraints are domain dependent.
1see Chapter 3 for motivations about this restriction to point estimates.
Figure 1.3: Active sampling process for feature relevance estimation. The entries shown in gray in the dataset in the top left corner indicate the ones that are missing at a particular instance. The bottom right hand corner shows the process of computing the benefit of sampling at a particular missing value. The missing value with the highest benefit is chosen and actually sampled.
This abstract problem will be addressed by proposing a single general
criterion. Some implementations of the criterion will be provided for a few
cases, where precise assumptions and constraints allow full derivations.
1.2 Motivation
In the following we present two practical problems in two domains of application to motivate the need for scientific investigation of this research topic. These two problems are related to the examples introduced earlier. We will derive full implementations of the proposed method for these problems later.
1.2.1 Agricultural Domain
Our interest in the topic of this research was aroused by a research project
(SMAP)2 in the domain of agriculture, dealing with the Apple Prolifera-
tion [51] disease in apple trees. Biologists monitor a distributed collection
of apple trees affected by the disease with the goal of determining the symp-
toms that indicate the presence of the disease-causing phytoplasma. A data
archive is arranged with a finite set of records, each describing a single ap-
ple tree. All the instances are labeled as infected or not infected. Each year
the biologists propose new candidate features (e.g. color of leaves, altitude
of the tree, new chemical tests, etc.) that could be extracted (or measured)
to extend the archive, so as to arrive at more accurate models. Since the
data collection in the field can be very expensive or time consuming, a data
acquisition plan needs to be developed by selecting a subset of the most
relevant candidate features that are to be eventually acquired on all trees.
2This work was funded by Fondo Progetti PAT, SMAP (Scopazzi del Melo - Apple Proliferation), art.
9, Legge Provinciale 3/2000, DGP n. 1060 dd. 04/05/01.
1.2.2 Biomedical Domain
More recently we investigated a problem in cancer characterization. In the
biomedical domain, the acquisition of data is often expensive. The cost
constraints generally limit the amount of data that is available for analysis
and knowledge discovery.
In biological domains molecular tests, called biomarkers, are studied. They are conducted on samples of tumor tissue. Biomarkers add predictive information to pre-existing clinical data. The development of molecular biomarkers for clinical application is a long process that must go through many phases, from early discovery to more formalized clinical trials. This process involves the retrospective analysis of biological samples obtained from a large population of patients. The biological samples, which need to be properly preserved, are collected together with the corresponding clinical data over time and are therefore very valuable. There is therefore a need to carefully optimize the use of the samples while studying new biomarkers. We address the issue of cost-constrained biomarker evaluation for developing diagnostic/prognostic models for cancer.
In our problem new biomarkers are tested on biological samples from
patients who are labeled according to their disease and survival status.
Moreover, for each patient we have additional information such as grade of the disease, tumor dimensions and lymph node status. That is, the
samples are class labeled as well as described by some previously observed
features. The goal is to choose the new feature (the biomarker) that is
most correlated with the class label given the previous features. Since the
cost involved in the evaluation of all the biomarkers on all the available
data is prohibitive, we need to actively choose the samples on which the
new features (biomarkers) are tested. Therefore our objective at every
sampling step is to choose the query (the sample on which the new feature
is measured) so as to learn the efficacy of a biomarker most accurately.
1.3 Goals
How do we decide which data to collect next among multiple options, given that we already have some (or none) of the data and want to learn a target concept? This work presents a general criterion to address this kind of question and provides practical implementations when the concept to learn is feature relevance.
In this research we will provide a formal description of the sampling
problem introduced so far, a general solution derived from the theory of
experimental design, a set of examples and applications on which the gen-
eral solution will be implemented, together with simulation experiments to
demonstrate its effectiveness.
Implementation and experimental evaluation of the method proposed in this research have been carried out only in a few simple cases. We focus on simple cases to allow a deeper analysis of the problem. In particular, we present implementations of:
• Learning a step function.
• Estimating the conditional probability between two binary variables.
• Estimating the importance of new candidate variables, one at a time,
when added to a class labelled dataset. The importance is defined as
the improvement in the error rate of a maximum a posteriori classifier
built on all the available data.
• Estimating concurrently the importance of new candidate variables,
when added to a class-labelled dataset. The importance is defined as
the mutual information between class labels and variable values.
In all cases the proposed sampling method will require fewer samples to reach the target concept than other methods in common practice or proposed in the literature.
1.4 Organization
The structure of this dissertation is as follows: in Chapter 2 we review the main works in the many areas addressed by this research topic. In Chapter 3 we formally define the generalized problem, introduce the necessary background notions and derive the proposed method, together with an intuitive interpretation related to information theory. In Chapter 4 we focus on two examples and two applications and derive full implementations of the proposed criterion. In Chapter 5 we present the results of experiments conducted on the examples and applications introduced earlier, to support the theoretical achievements of this research. In Chapter 6 we summarize the results and propose interesting directions for future research.
Chapter 2
Related Work
Data acquisition has traditionally been studied in statistics under the topic of experimental design and in machine learning under the topic of active learning. The main topic of this research deals with the acquisition of missing values in a partially filled dataset, constrained by limited resources and an incremental acquisition policy, in order to learn a given concept. We can call this topic active feature sampling. Below a review of previous related work is presented, starting from experimental design and machine learning. Since the problem of active feature value acquisition raises many issues, a brief review of related topics is also presented: handling missing data, sample selection bias and feature selection.
2.1 Experimental Design
The design of experiments involves decisions about all the aspects controlling an experiment before and during its life. Usually some information is available in advance, motivating the use of Bayesian methods and leading to a branch of Bayesian statistics called Bayesian experimental design that dates back to the 1970s [17]. Non-Bayesian approaches are present in the literature but they are less popular. We do not review the non-Bayesian experimental design literature since our research follows the Bayesian approach. A com-
prehensive discussion on non-Bayesian experimental design can be found
in [16, 18].
The basic idea of experimental design is that statistical inference about
the quantities of interest can be improved by appropriate selection of the
values of the control variables of the experiment. According to Lindley [32],
designing experiments in the Bayesian setting means defining a utility
(or benefit) function that reflects the purpose of the experiment and is
parametrized by data and control variables. The target is then to max-
imize this utility function. Many criteria (expressed as different utility
functions) have been proposed in experimental design literature, depend-
ing on the optimization target and leading to an alphabetical taxonomy
when applied to the normal linear regression model. Among them the most
popular (and relevant for this research) are:
• The expected gain in Shannon information [9] between prior and pos-
terior distribution (i.e. before sampling and after sampling). This
utility function leads to the Bayesian D-optimal design.
• The expected weighted distance between the true and expected es-
timates after sampling. This utility function leads to the Bayesian
A-optimal design.
A thorough review of Bayesian experimental design and its taxonomy for
linear and nonlinear models is [6]. For Bayes D-optimal design see [2, 17,
50].
2.2 Active Learning: Sampling Labels
The application of the theory of optimal experiments to machine learn-
ing leads to interesting problems related to implementation issues such
as finding good approximations to the theory and learning with sampling
bias (the side effect of non-i.i.d. sampling). Traditionally this area, called active learning, is dominated by the problem of minimizing the number of class labels that have to be supplied for training. In [35] MacKay
studies sampling bias and expected informativeness of candidates for ac-
quiring class labels for neural networks and in [7] active learning is used
for learning Gaussian mixtures. A Bayesian formulation of active learning
for function approximation is presented in [56]. Seung et al. present the
query by committee algorithm and prove that actively selecting samples to
label can lead to exponentially faster decrease in generalization error than
random selection [53, 47]. Active learning has been successfully applied to
text classification using support vector machines [58]. In contrast to this
traditional definition of active learning we study the incremental acquisi-
tion of feature values which presents new implementation issues related to
learning with missing data and dealing with continuous features.
2.3 Active Feature Sampling
A new branch of active learning, different from the previous one, addresses the problem of minimizing feature samples (instead of label samples) for feature selection, classifier induction and related learning problems: this branch is exactly the area of our research, and the problem described in the introduction of this dissertation (see Chapter 1) falls within it. Here the
class labels are all known in advance but some (or all) feature values are
unknown and a strategy has to be set up in order to acquire the values.
The main works that started to give solutions to this class of problems
address the question of inducing a classifier using active policies for feature
values acquisition. Even though the main target of this research is feature
evaluation (and not classifier induction), the comparison against that lit-
erature still makes sense since the core steps of the decision processes are
equivalent. Therefore some of the methods available in the literature can
be adapted to the feature evaluation problem.
Zheng and Padmanabhan in [65] analyze the problem of user modeling and propose a goal-oriented data acquisition (GODA) policy that is independent of the model and based on this idea: estimate the variance of imputed (missing) data and select, as the next instance on which to measure all missing features, the one that has the highest variance. The underlying principle is that probing where uncertainty is higher leads to a better improvement of the model. Zheng and Padmanabhan, however, lack a principled approach to the measurement of the uncertainty. They take into account previously known feature values only where data was fully collected, losing valuable information. In addition, feature extraction is performed on all missing features each time, without any consideration of the different feature relevances. In our proposed solution features are sampled independently, on the basis of their estimated relevance for improving the prediction. Other limitations of [65] are the use of imputation as the only answer to the problem of missing values and the absence of a cost model for different features.
Lizotte et al., in [34], present a method to sequentially select instance-feature pairs to be measured for classifier induction under cost constraints. They base their solution on a Naïve Bayes classifier and compare different acquisition policies; they then propose a look-ahead strategy. This is a multi-step policy that estimates the expected benefit of probing all possible sequences (of a given length) of available instance-feature pairs to find out where it is most interesting to sample next, rather than using a myopic (i.e., single-step) method. One may criticize the choice of the classifier, which solves some computational issues but relies on strong assumptions (feature independence). The look-ahead strategy is computationally heavy due to the exponential explosion of the configuration space.
In [38, 39] Melville et al. address the same problem as [34] with a different approach, where instance-feature pairs are ranked at each step according to the difference between the current probabilities of the two most likely classes. This approach, called Error Sampling, is proposed together with the use of decision trees as the classifier (even though the method is general and not bound to it), and a comparison is made against [65] showing an increase in performance. What is missing from this work is a principled approach to the solution of the problem, even though the method is claimed to be inspired by optimum experimental design. A model for different feature costs is absent.
In [30] Krishnapuram et al. propose a method for active acquisition of feature values and class labels in a setting where both of them are (partially) missing and the goal is learning a classifier. The method implements logistic regression via maximum likelihood estimation, independently on each variable, and selects the next sample to measure by estimating the mutual information between the classifier weight vector and each candidate, whether a feature value or a class label. The motivation for an information-based criterion for active data acquisition merely refers to the previously mentioned article about label acquisition by MacKay [35], without justifying the different context (feature value acquisition). Other objections are the missing cost model and the restriction to classification tasks only.
Williams in [62] and Williams et al. [63] extend Krishnapuram’s approach, proposing a framework to guide data collection (on both features and class labels) based on risk minimization. The benefit of sampling a given missing entry, or set of missing entries, is computed as the difference between the current expected risk of misclassification and the expected risk of misclassification after sampling. The classification step is implemented using logistic regression and the distributions are modeled as a Gaussian mixture model. An extension to kernel methods [4, 49] is provided to take into account non-linear classification tasks. An extension to the semi-supervised case is provided as well. A main limitation of this approach is that the target is restricted to classification: it is unclear how to extend it to other learning tasks since the method is intimately based on classification risk. Moreover, Williams claims that the goal of active data acquisition ends when the classifier is learned, not taking into account the need for active acquisition during deployment of the classifier. A further limitation is the requirement that misclassification costs and extraction costs must be specified in the same units.
To the best of our knowledge, few other works are related to this exact topic. They focus on different aspects of the problem (see [54], on active feature acquisition for testing) or are covered by the literature discussed so far (see for example [66]).
2.4 Missing Data and Sample Selection Bias
The study of active feature sampling must address the problem of handling
missing values in datasets when making inferences. At every sampling step
the dataset has missing values corresponding to what has not yet been
acquired. Even though some classifiers are able to deal with that problem (e.g. Bayesian MAP, decision trees, etc.), the common solution is to fill in (impute) the missing values in some way. The main reference in this area
is the book of Little and Rubin [33] which proposes a simple schema to
discriminate situations in which data is missing at random or not. This
difference affects the solution of the inference problem. The book describes
many methods to perform imputation as well. More recent work [29, 63]
presents more complex methods for specific classifiers.
A byproduct of active sampling policies is that sampling does not follow
the underlying distributions of the data. This introduces a bias which can
strongly affect all inferences. In the machine learning area the concept of sample selection bias has been discussed by Zadrozny et al. [64]: the problem of bias during the selection of samples is analyzed formally for many popular learning methods (Bayesian, logistic, decision tree, SVM), an issue relevant to active sampling. Along with demonstrating the effect of sample selection bias on most of these methods, the article shows how to take it into account and correct the estimation.
2.5 Feature Selection
Although one of the goals of active feature value acquisition might be feature selection, we emphasize that the focus of our research is data acqui-
sition. Traditionally in the feature selection domain data is already fully
collected over all instances and features. Those features are then ranked
or grouped based on the predictive power they have on class labels. Com-
prehensive surveys on feature selection are [27, 22]. Since the core problem
of acquiring data incrementally is not addressed we do not review research
in this area.
2.6 Summary
In this chapter we reviewed the literature of the main scientific areas involved in or related to this research. The problem of guiding data acquisition has traditionally been studied in statistics, under the topic of experimental design. The Bayesian approach has been called Bayesian experimental design. A list of references has been provided.
In the area of machine learning, a related problem (i.e., for which records of a dataset to acquire labels for efficient classification) has been studied under the name of Active Learning. In this research we are interested in
querying the variables describing the records, instead of only the class labels. This problem has been studied only recently (and by few authors), under the name of Active Feature Sampling. We discussed all the relevant works in this area, highlighting the limits of current approaches.
Other related topics are touched upon by our research: handling missing data, sample selection bias and feature selection. We reviewed them briefly and gave pointers to the main literature.
Chapter 3
Problem Statement and Sampling Algorithm
In this chapter we give a detailed definition of the problem under investigation and introduce the necessary notation together with some background knowledge. Then we derive one of the main results of this research: a general sampling algorithm to decide where to sample among the missing data. This algorithm ranks each candidate to be sampled, thus providing the experimenter with the information needed to select the most promising one. Under mild assumptions, this best candidate is expected to yield the highest contribution to improving the estimation of the target concept.
The basic idea of the proposed algorithm, called Maximum Average
Change (MAC) sampling algorithm, is to assess the changes in the value
of the target concept between the current estimate and the expected es-
timates obtained by measuring each candidate missing entry. The change
is computed as an expectation over all possible values that the candidate
missing entry can assume. Then, the candidate yielding the highest change
is actually sampled.
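A minimal, concept-agnostic sketch of this ranking step follows. The callables estimate_concept and predictive_prob, and the with_value accessor, are hypothetical placeholders for the problem-specific estimators instantiated in Chapter 4, and the Euclidean distance between successive estimates is only one possible choice of change measure.

```python
import numpy as np

def mac_benefit(candidate, values, dataset, estimate_concept, predictive_prob):
    """MAC benefit of a candidate missing entry: the expected change of the concept
    estimate, averaged over the possible values of the entry under the current
    predictive distribution (estimated from the data collected so far)."""
    g_now = np.asarray(estimate_concept(dataset))
    benefit = 0.0
    for v in values:
        p = predictive_prob(candidate, v, dataset)
        g_new = np.asarray(estimate_concept(dataset.with_value(candidate, v)))
        benefit += p * np.linalg.norm(g_new - g_now)
    return benefit

def mac_select(candidates, values, dataset, estimate_concept, predictive_prob):
    """Return the candidate missing entry with the largest expected change."""
    return max(candidates, key=lambda z: mac_benefit(
        z, values, dataset, estimate_concept, predictive_prob))
```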
The proposed algorithm is derived from an optimality criterion from
Bayesian experimental design, in a form that is convenient for incremental
sampling problems. At the end of the Chapter, an intuitive interpreta-
tion of the sampling algorithm is given in terms of information theory.
The reader is assumed to be familiar with basic concepts of mathematical
statistics and estimation theory.
3.1 Definitions and Notation
• Pattern instance s: a pattern instance is a monitored subject under
investigation. Example: a patient on which we conduct the medical
study.
• Instance space S: the set of all possible pattern instances. Example:
the set of all patients in a medical study.
• Sample space (or observation space), Ω: the set of all possible out-
comes of an experiment, a measurement, or a random trial. Example:
in a medical study the sample space can be the set comprising two
outcomes: presence and absence of the disease under investigation.
• Variable or Feature space, X : the set of distinct numerical values
representing all possible outcomes of an observation/experiment. Ex-
ample: the absence or presence of a disease can be represented using {0, 1}. In the following we will use the terms “variable” and “feature” with the same meaning.
• Variable (or Feature), X: a random variable describing a specific prop-
erty of the pattern instance. It maps the outcome of an experiment
to a given value, X : Ω → X . Example: in the experiment of finding
the presence of a disease in a patient, X returns 0 in case of absence
and 1 in case of presence.
• Dataset D: a dataset is a set of records describing pattern instances.
Each record corresponds to an ordered set of values describing a fea-
ture/variable value for a given pattern instance. We can represent
the dataset as an N × F matrix D, where N is the number of pattern
instances under observation and F is the number of features/variables
we can collect for each pattern instance. A dataset can have missing
values, meaning that some entries of the matrix can be unknown. Dur-
ing the process of incremental sampling we will call Dk the dataset at
sampling step k (before sampling). We assume the rows of the dataset
to be independently and identically distributed.
• Parameter space Θ: the set of all possible parameter values of the
probability distribution of a random variable. Example: for a random
variable following the Bernoulli distribution on parameter p, Θ =
[0, 1].
• Decision rule, strategy or decision function, φ: it is a function that
states which decision (e.g. estimate) to take when the experiment
yields a given outcome, φ : X → ∆. Example: consider the exper-
iment that generates N independent measures of a quantity as out-
come; then the arithmetic mean is a strategy that associates a decision
(i.e. an estimate) with the outcome of the experiment.
• Concept G: a concept learnt (or estimated) from currently available
data is a random vector function G : D → R^Q. Example 1: the error
rate of a given classifier on the data collected (Q = 1). Example 2:
the vector of relevances (see Section 1.1.2) of the variables described
by records (Q = F ).
• Missing and observed entries, Dm and Do: in a dataset D we use
the superscript m (i.e. Dm) to denote the missing entries, and the
superscript o (i.e. Do) for the observed ones.
• Set of candidates (or designs), $\mathcal{Z} = \{\mathcal{Z}_1, \dots, \mathcal{Z}_M\}$: in a dataset with
missing entries, it is the set of all possible candidates to be measured
at the current sampling step. Each candidate is a subset of the current
missing entries, defined by domain constraints. Example 1: if we are
allowed to measure only one missing entry at a time, then Zi is just
the i-th missing entry and Z is equal to the set of current missing
entries Dm. Example 2: if we can only measure batches of three missing entries at a time, then $\mathcal{Z}$ is the set of $\binom{N^m}{3}$ triplets made from the current $N^m$ missing values. Example 3: if we can access each
instance only once, then we have to measure all missing entries of
that instance at the same time; in this case Z is the set of instances
having at least one missing value. In general Z ⊂ P(Dm), where
P(Dm) is the powerset of the missing entries.
3.2 Incremental Feature Sampling for Learning a Con-
cept
We consider the problem of incremental data collection on a given set of
pattern instances in order to estimate a concept from data. Each pattern
instance can be described by a given set of variables. Initially the dataset
is partially filled (or empty) and we have to decide how to fill it under
some constraints, with the goal of efficiently improving the estimation of
the concept under investigation. These constraints are:
• Limited resources: measuring variables over subjects is costly and we
assume that it is not feasible to measure every missing value but just
a subset.
• Domain constraints: in practical applications it is not feasible to col-
lect any possible subset of the missing entries but only some specific
subsets. Limits in instrumentation, or in the measurement process in
general, restrict the set of choices. It is common to be constrained
to collect just one missing entry at a time, or a batch of missing en-
tries of given size, or all missing entries for a given pattern instance
(e.g. if the measurement destroys the pattern instance, as in the case
illustrated in Section 4.5.1).
Furthermore, we are allowed to measure all new data in a sequence of
steps instead of all at once. Every time we collect some new information
we can use it to help decide what to collect next.
3.2.1 Missing Data Mechanism
Consider a rectangular dataset D where rows are drawn independently and
identically distributed (i.i.d.) from a probability distribution. We define
an indicator matrix I such that
$$I_{ij} = \begin{cases} 1 & \text{if } x_{ij} \text{ is observed} \\ 0 & \text{if } x_{ij} \text{ is missing} \end{cases} \qquad (3.1)$$
where xij is the j-th feature value of i-th instance in the dataset. Let
the joint probability distribution of the data and the indicator matrix be
parametrized by θD, for the process generating the data, and θI for the
missingness mechanism:
$$p(D, I \mid \theta_D, \theta_I) = p(D \mid \theta_D)\, p(I \mid D, \theta_I) \qquad (3.2)$$
and let Do be the observed data and Dm the missing data. Little et al. [33,
48] define three types of missingness mechanisms:
• Missing completely at random (MCAR): when p(I|D, θI) = p(I|θI), meaning that the missingness of a value is independent of both the observed data and the missing values.
• Missing at random (MAR): when p(I|D, θI) = p(I|Do, θI), meaning
that a missing value may depend on the available values but not on
the missing ones.
• Not missing at random (NMAR): when p(I|D, θI) = p(I|Do, Dm, θI),
meaning that missing values may depend both on available data and
missing data.
In the MAR and MCAR cases it can be proved [33, 48, 19] that the joint probability distribution of the observed data and the indicator matrix factorizes as
$$p(D^o, I \mid \theta_D, \theta_I) = p(D^o \mid \theta_D)\, p(I \mid D^o, \theta_I) \qquad (3.3)$$
meaning that the estimation of θD can ignore θI , so we can estimate the
parameters governing the distribution of data without taking into account
the missingness mechanism. In the NMAR case this is not true and the
dependence between the two sets of parameters has to be taken into ac-
count.
In this research we assume that missing data is either missing completely
at random (MCAR) or missing at random (MAR). Since the proposed
method for incremental sampling exploits just the known data in order to
decide which is the best next sampling step, we can safely claim that the
MAR assumption holds. We have to assume that the initial data are MAR only when an already partially filled dataset is provided to the method at the first sampling step. When we conduct experiments on the asymptotic behavior of the sampling policies starting from an empty dataset, the MAR assumption on the initial dataset is trivially true. See Chapter 5 for details about the experimental protocols.
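To make the distinction between the mechanisms concrete, the following minimal Python sketch (not from the thesis; the dependence of missingness on an observed column is an arbitrary illustrative choice, and all names are hypothetical) generates indicator matrices I, as in Equation 3.1, under MCAR and under MAR.

    # Illustrative sketch only: indicator matrices under MCAR and MAR, where
    # the missingness of column 1 may depend on the fully observed column 0.
    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.normal(size=(100, 2))                 # complete data, 2 features

    # MCAR: p(I | D) = p(I), each entry of column 1 missing with probability 0.3
    I_mcar = np.ones_like(D, dtype=int)
    I_mcar[:, 1] = rng.random(100) > 0.3

    # MAR: missingness of column 1 depends only on the observed column 0
    I_mar = np.ones_like(D, dtype=int)
    p_missing = 1 / (1 + np.exp(-D[:, 0]))        # larger D[:, 0] -> more likely missing
    I_mar[:, 1] = rng.random(100) > p_missing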
3.2.2 Bayesian Estimation Theory
We consider the problem of estimating a random variable X whose dis-
tribution depends on a parameter vector θ. Given a parameter space Θ
containing θ and a decision space ∆ of all possible decisions (or estimates),
we can define a loss function L : Θ×∆ → R that expresses the loss we in-
cur when we decide to take a decision δ ∈ ∆ (i.e. we choose δ as estimate),
when the true state of nature (i.e., the true value to estimate) is θ ∈ Θ.
Let φ : X → ∆ be a decision function assigning decision δ = φ(X) when
X is observed. We define the risk function R : Θ×∆ → R as the expected
loss with respect to X:
$$R(\theta, \phi) = E\big[L(\theta, \phi(X))\big] = \int_{\mathcal{X}} L(\theta, \phi(x))\, p_{X|\theta}(x \mid \theta)\, dx \qquad (3.4)$$
Bayes estimation theory [10, 41, 14] defines the Bayes risk as the ex-
pectation of the risk with respect to an assumed prior distribution Pθ on
θ:
$$r(P_\theta, \phi) = E\big[R(\theta, \phi)\big] = \int_{\Theta} R(\vartheta, \phi)\, p_\theta(\vartheta)\, d\vartheta. \qquad (3.5)$$
(where Pθ has p.d.f. pθ(ϑ)) and prescribes that in order to minimize the
loss due to incorrect estimation we have to minimize the associated Bayes
risk with the decision function φ∗ s.t.:
$$\phi^* = \arg\min_{\phi}\, r(P_\theta, \phi) \qquad (3.6)$$
The Bayes risk is minimized when, for each x, the action φ∗(x) is taken, where φ∗(x) is given by
$$\phi^*(x) = \arg\min_{\phi} \int_{\Theta} L(\vartheta, \phi(x))\, p_{\theta|X}(\vartheta \mid x)\, d\vartheta. \qquad (3.7)$$
The Bayes decision rule minimizes the posterior conditional expected
loss given the observations.
In case of squared-error loss:
$$L(\theta, \delta) = (\theta - \delta)^2 \qquad (3.8)$$
the posterior expected loss given X = x:
$$E\big[L(\theta, \delta) \mid X = x\big] = \int_{\Theta} (\vartheta - \delta)^2\, p_{\theta|X}(\vartheta \mid x)\, d\vartheta \qquad (3.9)$$
is minimized by taking δ as the mean of the posterior distribution [10, 41]:
$$\hat{\theta} = \phi(x) = \delta = E[\theta \mid X = x] = \int_{\Theta} \vartheta\, p_{\theta|X}(\vartheta \mid x)\, d\vartheta. \qquad (3.10)$$
This estimate $\hat{\theta}$ is called the minimum mean-square estimate of the true value θ and is denoted by $\hat{\theta}_{MS}$.
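As a concrete illustration (not part of the original derivation; the prior hyperparameters A, B and the counts are hypothetical), the following minimal Python sketch computes the Bayes MMSE (posterior-mean) estimate of a Bernoulli parameter under a Beta prior, the same estimator used later in Section 4.4.

    # Minimal sketch: Bayes MMSE estimate of a Bernoulli parameter theta under a
    # Beta(A, B) prior and squared-error loss. With n1 ones and n0 zeros, the
    # posterior is Beta(A + n1, B + n0) and its mean is the MMSE estimate (Eq. 3.10).
    def bernoulli_mmse(n1, n0, A=1.0, B=1.0):
        """Posterior mean of theta given n1 ones and n0 zeros (hypothetical counts)."""
        return (n1 + A) / (n1 + n0 + A + B)

    if __name__ == "__main__":
        # Example: 7 ones and 3 zeros observed, uniform Beta(1, 1) prior.
        print(bernoulli_mmse(7, 3))   # 0.666..., shrunk toward the prior mean 0.5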
3.2.3 Bayesian Experimental Design
Bayesian experimental design [31, 6] provides a recipe for maximizing a utility function U in order to find, among the alternatives available to the experimenter, the optimal choice (or design) for the most effective experiment.
Let η ∈ H be a design and x ∈ X be the results of an experiment defined
by η. Let δ ∈ ∆ be a decision to take after observing x. Let θ ∈ Θ be
the unknown parameters of the problem. Then, a general utility function
is of the form U(δ, θ, η, x). The optimal design, according to the framework of Bayesian experimental design, is the one that maximizes the expected utility of the best decision, that is:
$$\eta^* = \arg\max_{\eta \in H} \int_{\mathcal{X}} \max_{\delta \in \Delta} \int_{\Theta} U(\delta, \theta, \eta, x)\, p(\theta \mid x, \eta)\, p(x \mid \eta)\, d\theta\, dx \qquad (3.11)$$
Many utility functions have been proposed in the literature. Among the most
popular are:
• The expected gain in Shannon information obtained by sampling a
design. Maximizing this utility function means maximizing the ex-
pected Kullback-Leibler divergence $D(\cdot\|\cdot)$ between the posterior (after sampling) and the prior (before sampling) distribution [31]:
$$\eta^* = \arg\max_{\eta \in H} \int_{\mathcal{X}} \int_{\Theta} \log\frac{p(\theta \mid x, \eta)}{p(\theta)}\, p(x, \theta \mid \eta)\, d\theta\, dx = \arg\max_{\eta \in H} \int_{\mathcal{X}} D\big(p(\theta \mid x, \eta)\,\|\,p(\theta)\big)\, p(x \mid \eta)\, dx. \qquad (3.12)$$
This kind of utility function leads to Bayesian D-optimal design when
derived on the normal linear regression model.
• According to Chaloner et al. [6], in the case of experiments aimed at obtaining point estimates, a quadratic loss function may be appropriate, where the loss is between the true and estimated parameter values. Maximizing the corresponding utility function gives:
$$\eta^* = \arg\max_{\eta \in H} \left( - \int_{\mathcal{X}} \int_{\Theta} (\hat{\theta} - \theta)^T A\, (\hat{\theta} - \theta)\, p(x, \theta \mid \eta)\, d\theta\, dx \right)$$
where A is a symmetric non negative definite matrix. This utility
leads to Bayesian A-optimal design when derived on the normal linear
regression model.
3.3 Maximum Average Change (MAC): Derivation
Let $D = \{d_i\}_{i=1,\dots,N}$ be a possibly incomplete dataset of records $d_i$, each corresponding to a pattern instance. The missing data pattern is assumed to be missing at random (MAR). Let the random vector $X = (X_1, \dots, X_F)$ correspond to the F variables that can be measured on any pattern instance, taking values in $\mathcal{X}_1 \times \dots \times \mathcal{X}_F$. So each record is a random vector $d_i = (X_1, \dots, X_F)$.
Let $\theta \in \Theta$ be a random vector parametrizing the joint distribution over $\mathcal{X}_1 \times \dots \times \mathcal{X}_F$. Let the random vector $G(\theta) = (G_1(\theta), \dots, G_Q(\theta)) : \Theta \to \mathbb{R}^Q$ be a vector function representing a concept we want to estimate from data.
At sampling step k we denote the partially filled dataset as $D_k$. The Bayesian minimum mean squared error (MMSE) estimate of G given $D_k$ is given by $G_k = G(D_k) = E_G[G \mid D_k]$. The mean quadratic error (MQE) of the estimate with respect to the true value at step k is then:
$$\mathrm{MQE}_k = \int_{\mathcal{G}} (G_k - G)^T A\, (G_k - G)\, p(G \mid D_k)\, dG \qquad (3.13)$$
where A is a symmetric non-negative definite matrix introduced for similarity to the utility functions of Bayesian experimental design (see Section 3.2.3). In our case, the practical meaning of A depends on the exact
definition of the random vector function G; it can be used to weight each
component of the target concept. In the simplest case A would be a di-
agonal matrix where each element represents the cost of measuring the
corresponding variable. When G is a function to assess the importance of
the variables (i.e. Q = F ), A can be used to embed a cost model inside
the sampling algorithm. Non-diagonal terms can be used to express more
complex cost models, e.g. measuring two features together has a different
cost than sampling them separately. If A is the identity matrix I, the MQE
is the mean squared error (MSE):
$$\mathrm{MSE}_k = \int_{\mathcal{G}} |G_k - G|^2\, p(G \mid D_k)\, dG \qquad (3.14)$$
where |.| is the Euclidean norm.
Let $D^m_k$ be the set of all missing entries in $D_k$. Let $\mathcal{Z}^{k+1} = \{\mathcal{Z}_1, \dots, \mathcal{Z}_M\}$ be an indexed subset of the power set $\mathcal{P}(D^m_k)$ representing the set of different candidates (designs) among which the experimenter has to decide for sampling at step k + 1. The set of constraints defining $\mathcal{Z}$ is problem dependent. Each $\mathcal{Z}^{k+1}_l$ can be represented as a set of pairs (i, f). Each
pair (i, f) represents one missing entry in the current dataset, about the
record di for which the value of Xf is currently unknown.
If we assume that at step k + 1 the subset of missing entries $\mathcal{Z}^{k+1}_l$ of the dataset is measured, obtaining the values $\mathbf{v} = (v_1, \dots, v_{N^{k+1}_l}) \in \mathcal{V}^{k+1}_l \subset (\mathcal{X}_1 \times \dots \times \mathcal{X}_F)$, then the new dataset, denoted as $D_{k+1}$ or $(D_k, \mathcal{Z}^{k+1}_l = \mathbf{v})$, has the values $\mathbf{v}$ in the missing entries described by $\mathcal{Z}^{k+1}_l$. The new MMSE estimate of G, as a function of $(l, \mathbf{v})$, will be $G_{k+1} = G(D_{k+1}) = E_G[G \mid D_k, \mathcal{Z}^{k+1}_l = \mathbf{v}]$. From now on we rename $\mathcal{Z}^{k+1}_l \to \mathcal{Z}_l$ and $\mathcal{V}^{k+1}_l \to \mathcal{V}_l$ for brevity.
The new mean quadratic error, as a function of $(l, \mathbf{v})$, is then:
$$\mathrm{MQE}_{k+1}(l, \mathbf{v}) = \int_{\mathcal{G}} (G_{k+1} - G)^T A\, (G_{k+1} - G)\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, dG \qquad (3.15)$$
Since we do not know in advance what value would be obtained if we
did sample at Zl, we need to average the above quantity over all possible
outcomes in order to estimate the predicted mean quadratic error, denoted $\mathrm{MQE}_{k+1}(l)$:
$$\mathrm{MQE}_{k+1}(l) = \int_{\mathcal{V}_l} \int_{\mathcal{G}} (G_{k+1} - G)^T A\, (G_{k+1} - G)\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, dG\, d\mathbf{v} \qquad (3.16)$$
The set of missing entries $\mathcal{Z}_{l^*}$ that minimizes the quantity above is the one yielding the lowest predicted mean quadratic error, where:
$$l^* = \arg\min_{l \in \{1,\dots,M_{k+1}\}} \mathrm{MQE}_{k+1}(l) \qquad (3.17)$$
This criterion is an application of the theory of Bayesian experimental
design (see Equation 3.11) when the utility function is Equation 3.13 where
θ = G, x = v and the design η is defined by the index l.
Now we illustrate an equivalent formulation of Equation 3.16 that is
more convenient for problems dealing with incremental sampling.
Lemma 3.3.1 In order to minimize the predicted mean quadratic error of
the next sampling at step k + 1 described in Equation 3.16, we can equiv-
alently maximize the quadratic difference B(l) between the Bayes MMSE
estimates of the concept before and after the measure is performed, averaged over the possible outcomes. That is
$$B(l) = \int_{\mathcal{V}_l} (G_{k+1} - G_k)^T A\, (G_{k+1} - G_k)\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, d\mathbf{v}. \qquad (3.18)$$
In order to show the equivalence we perform some manipulation of Equa-
tion 3.16. First we add and subtract Gk = EG[G|Dk] inside the terms of
the bilinear form and expand, so MQEk+1(l) becomes:
$$\begin{aligned}
\mathrm{MQE}_{k+1}(l) = &\int_{\mathcal{V}_l} \int_{\mathcal{G}} (G_{k+1} - G_k)^T A\, (G_{k+1} - G_k)\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, dG\, d\mathbf{v} \\
+\ 2 &\int_{\mathcal{V}_l} \int_{\mathcal{G}} (G_{k+1} - G_k)^T A\, (G_k - G)\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, dG\, d\mathbf{v} \\
+\ &\int_{\mathcal{V}_l} \int_{\mathcal{G}} (G_k - G)^T A\, (G_k - G)\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, dG\, d\mathbf{v}
\end{aligned} \qquad (3.19)$$
Since $G_{k+1}$ and $G_k$ do not have a functional dependence on G, the first term can be rewritten as:
$$\int_{\mathcal{V}_l} (G_{k+1} - G_k)^T A\, (G_{k+1} - G_k) \left( \int_{\mathcal{G}} p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, dG \right) p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, d\mathbf{v} \qquad (3.20)$$
and the second term as:
$$2 \int_{\mathcal{V}_l} (G_{k+1} - G_k)^T p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, A \left( G_k \int_{\mathcal{G}} p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, dG - \int_{\mathcal{G}} G\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, dG \right) d\mathbf{v}. \qquad (3.21)$$
Of the three integrals inside brackets in the last two expressions, the first two evaluate to one (because some value of the concept must be true) and the last to $E_G[G \mid D_k, \mathcal{Z}_l = \mathbf{v}] = G_{k+1}$, showing that the second term is −2 times the first. About the third term of the sum in Equation 3.19, we note that neither $G_k$ nor G depends on where and what the new sample is (they are independent of l and $\mathbf{v}$), so this last part of the expression is constant.
The estimate of the mean quadratic error can be rewritten as:
$$\mathrm{MQE}_{k+1}(l) = \mathrm{const} - \int_{\mathcal{V}_l} (G_{k+1} - G_k)^T A\, (G_{k+1} - G_k)\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, d\mathbf{v} \qquad (3.22)$$
which shows that in order to minimize it we need to maximize the expected quadratic difference between the current estimate $G_k$ and the one at the next step
Gk+1.
We call B(l) the benefit function, which has to be maximized in order to minimize $\mathrm{MQE}_{k+1}(l)$:
$$B(l) = \int_{\mathcal{V}_l} (G_{k+1} - G_k)^T A\, (G_{k+1} - G_k)\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, d\mathbf{v}. \qquad (3.23)$$
The Bayes-optimal choice among all candidates is
$$l^* = \arg\max_{l \in \{1,\dots,M\}} B(l) \qquad (3.24)$$
when the utility function is Equation 3.13. We call this algorithm the Maximum Average Change (MAC) sampling algorithm.
3.3.1 MAC: Information Gain
Intuitively the optimal strategy to select the best candidate is the one that
selects the set of missing entries Zl that will provide the most information
regarding G. If Zl is sampled, then the information gained about G when
data Dk has already been acquired, is given by
I(Zl;G|Dk) =
∫
Vl
D(p(G|Dk,Zl = v)||p(G|Dk))p(Zl = v|Dk)dv (3.25)
where $D(\cdot\|\cdot)$ is the Kullback-Leibler divergence. Therefore the most informative set of missing entries $\mathcal{Z}_l$ to probe at iteration k + 1 is given by
$$l^* = \arg\max_{l \in \{1,\dots,M\}} I(\mathcal{Z}_l; G \mid D_k) \qquad (3.26)$$
We note that the objective function to be optimized is similar to Equa-
tion 3.12 in Bayesian experimental design. MacKay [35] calls it the total
information gain criterion.
Even when the prior distribution of G is known, this optimization is
often intractable except by expensive Monte-Carlo methods. Since the
Kullback-Leibler divergence is a measure of distance between distributions
of G, we can approximate it by the squared difference between estimates
of the function G from the data, before and after adding the new data
point. Thus to choose the most informative candidate Zl to measure, the
approximated information gain to be maximized is given by:
$$I(\mathcal{Z}_l; G \mid D_k) \simeq \int_{\mathcal{V}_l} (G_{k+1} - G_k)^T (G_{k+1} - G_k)\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, d\mathbf{v}. \qquad (3.27)$$
That is, the best point to sample for new data is where we have the most
expected change in our current estimate. Intuitively, since we do not know
the true value of the variable we are estimating, we can learn the most by
trying to maximize the change from our current estimate. Equation 3.27
is exactly Equation 3.23 when A = I, which means that, in the limit of the
approximations made, the MAC algorithm selects the candidate providing
the highest information at each step.
3.4 Summary
In this Chapter we described the basic notation needed and stated the sam-
pling problem in detail. The necessary background knowledge is briefly
introduced (together with references to the related literature), about a
taxonomy for patterns of missing data, Bayesian estimation theory and
Bayesian experimental design. The proposed solution, called Maximum
Average Change (MAC) sampling algorithm, has been derived from the
Bayesian experimental design framework. The MAC algorithm samples
the missing entry which minimizes the squared loss between the true value of the concept to learn and its current estimate. The minimization is done by maximizing a more manageable benefit function which depends just on
the current estimate and the next one, in an expected sense. An intuitive
interpretation of this result, in terms of information theory, was provided.
Chapter 4
MAC Implementation: Examples
and Applications
In this Chapter we deal with the problem of implementing the Maximum
Average Change (MAC) sampling algorithm, derived in Section 3.3, to
solve practical problems.
The MAC algorithm presents several challenges to implementation. First, it deals with Bayes MMSE estimates of the concept G to learn ($E[G \mid D_k]$), which assume knowledge of the prior distribution of G. Furthermore it is
necessary to compute the conditional probability pV(Zl = v|Dk) of mea-
suring given values in given missing entries of the dataset, at each sampling
step. In both cases the dataset has missing values and proper techniques
to compute probability distributions and estimates need to be applied.
In the following we show some practical uses of the MAC algorithm
as the main contributions of this research. They range from simple examples
to more complex applications. We will provide solutions to the issues
described previously using different techniques and assumptions.
As it will be shown later, the challenges to implement the MAC algo-
rithm depend on the exact target concept to learn and the constraints on
the incremental data filling scheme.
4.1 MAC: More Restrictive Assumptions
In all implementations illustrated in this chapter we restrict the assump-
tions given in the derivation of the MAC algorithm (see Chapter 3). In all
cases, from now on, we assume that there are:
1. Constraints on what can be sampled at each step: at each sampling
step the experimenter is allowed to sample only one missing entry at
a time, so she has to decide which of the remaining missing values most improves the estimate of the concept to be learned.
2. Constraints on cost model: the cost model is assumed to be flat,
meaning that every variable has the same unitary cost.
3. Availability of reasonable approximations of the Bayes MMSE estimate G of the concept to learn, when its prior probability is not available.
Even though these assumptions seem very restrictive, they arise from several needs. They are necessary to compare our algorithm to other algorithms proposed in the literature, they partly reflect the restrictions imposed by the application domains (biomedical and agricultural) and they reduce the complexity of the search space. See Chapter 6 for a description of future work aimed at relaxing some of these assumptions.
The new assumptions lead to a simplified version of the MAC algorithm.
The flat cost model implies that A = I, at least for the examples and
applications of this Chapter. Restricting what the experimenter can sample
at each step to one missing entry allows us to rewrite the set of allowed candidates at each step as
$$\mathcal{Z}^{k+1}_l \in \{(i_l, f_l)\}_{l=1,\dots,M-k} \qquad (4.1)$$
where M is the initial number of missing values, k is the sampling step,
and (i, f) denotes the missing entry of instance i on variable Xf. The number of
candidates is M − k in Equation 4.1 because after k sampling steps there
are M −k missing values remaining. The benefit function of Equation 3.23
then becomes
$$B(i, f) = \int_{\mathcal{X}_f} (G_{k+1} - G_k)^T (G_{k+1} - G_k)\, p(X_{if} = x \mid D_k)\, dx = \sum_{j=1}^{Q} \int_{\mathcal{X}_f} (G_{j,k+1} - G_{j,k})^2\, p(X_{if} = x \mid D_k)\, dx \qquad (4.2)$$
where Gj,k is the j-th component of Gk and Xif : Ω → Xf is the ran-
dom variable describing the outcomes of the missing entry (i, f) of pattern
instance i on variable Xf .
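As a concrete illustration of Equation 4.2, the following Python sketch (not from the thesis) computes the benefit B(i, f) for a single missing entry and selects the entry with the highest benefit. The functions estimate_concept and cond_prob are hypothetical placeholders for the concept estimator and for p(Xif = x | Dk); they are only specified in the concrete applications described below.

    # Sketch of the simplified (single-entry) MAC benefit of Equation 4.2.
    import numpy as np

    def mac_benefit(i, f, D, values_of, estimate_concept, cond_prob):
        """B(i, f) for a dataset D given as an (N x F) NumPy array with np.nan
        for missing entries; estimate_concept(D) returns the current estimate G_k
        as a NumPy vector and cond_prob(i, f, x, D) returns p(X_if = x | D_k)."""
        g_k = estimate_concept(D)
        benefit = 0.0
        for x in values_of[f]:               # sum over the sample space of feature X_f
            D_temp = D.copy()
            D_temp[i, f] = x                 # hypothetically fill the entry with value x
            g_k1 = estimate_concept(D_temp)  # re-estimate the concept on the filled dataset
            benefit += float(np.sum((g_k1 - g_k) ** 2)) * cond_prob(i, f, x, D)
        return benefit

    def mac_select(D, values_of, estimate_concept, cond_prob):
        """Return the missing entry (i, f) with the highest benefit."""
        missing = list(zip(*np.where(np.isnan(D))))
        return max(missing, key=lambda e: mac_benefit(int(e[0]), int(e[1]), D,
                                                      values_of, estimate_concept, cond_prob))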
4.2 Summary of Examples and Applications
Here follows a brief summary of each example and application given in this
chapter.
Examples
1. Guess my number [1]. We apply the MAC algorithm to the game of guessing a secret number, given that we know whether previous attempts were smaller or greater than the unknown value.
2. Conditional probability of two binary variables. Given two binary vari-
ables, we sample them in pairs with the target of estimating the con-
ditional probability of the second given the first. We are allowed to
choose the value of the first variable and measure the corresponding
value of the second one. We apply MAC algorithm to decide which ac-
tual value of the first variable to choose in order to estimate efficiently
the conditional probability.
Applications to Machine Learning Problems. As explained in the motiva-
tions given in Section 1.2, the main practical problem we address within
this research is estimating the relevances of variables in a class-labeled
dataset. Relevance is defined as the contribution a variable gives to pre-
dicting class labels. More detailed definitions will be given in the following
sections. Note that, even though the MAC algorithm does not explicitly
mention class labels in its formulation, it is sufficient to consider them as
another variable describing the pattern instances. The special role played
by class labels is meaningful only when computing the concept to learn G
and its estimates, whose form and formulation are domain dependent.
In this research we present two Machine Learning applications:
1. Relevance estimation of one new variable. Given a class-labelled dataset
(describing a set of monitored instances) and the error rate of a pre-
diction method on it, we want to estimate efficiently the reduction of
the error rate induced by adding one new variable. Initially the values
of this variable are unknown on all instances of the dataset. The prob-
lem is to decide on which instances to measure these missing entries,
assuming that each measurement is costly, only a limited budget is
available and only one measurement at a time is feasible. If we assume
that we have a pool of candidate new variables (instead of just one),
then we repeat the previous relevance assessment on each of them
and, eventually, select the one which improves most the prediction.
An example of this application has been introduced in Section 1.1.1.
2. Concurrent relevance estimation of multiple variables. How to effi-
ciently estimate multiple feature relevances in a labelled dataset when
some (or all) feature values are missing, and sampling is performed
incrementally and concurrently on all features? As stated before, we
assume that only one missing entry at a time can be measured and
decide which missing entry to measure through the MAC algorithm.
4.3 Example 1: Guess My Number, or Learning a
Step Function
Number [1, 61] is an early 1970s text-based computer game where the
player is requested to iteratively guess a secret number selected from a
known integer interval. After each attempt an evaluation is given back to
the player stating whether the guess is bigger or smaller than the number
to guess. The game ends when the guess matches the secret number.
This game describes an active learning problem where the player needs to
estimate an unknown value (the secret number) using incremental sampling
and based on the information of outcomes of previous attempts. In this
section we study an extension of the original game where the set of available
values for the secret number is the real interval (0, 1). We show that the
application of the MAC algorithm in this context allows us to derive the well
known binary search [8] algorithm.
In terms of the active learning problems described in this research the
game Number can be modeled as learning a step function f:
$$y = f(x) = \begin{cases} 0 & \text{if } x < \theta \\ 1 & \text{if } x \ge \theta \end{cases} \qquad (4.3)$$
where the parameter θ is the secret number to guess. We assume the prior
distribution of θ to be uniform on (0, 1): θ ∼ U(0, 1) . We are allowed to
iteratively query (sample) the value of y at any value of x of our choosing to
estimate θ. The set of monitored pattern instances is then the continuous
interval (0, 1).
After k sampling steps let $x_L$ be the highest sample point with y = 0, i.e. $x_L = \max\{x_i : f(x_i) = 0,\ i = 1,\dots,k\}$, and $x_R$ be the lowest sample point with y = 1, i.e. $x_R = \min\{x_i : f(x_i) = 1,\ i = 1,\dots,k\}$. Since the posterior distribution of θ given the data is uniform on $(x_L, x_R)$, i.e. $\theta \sim U(x_L, x_R)$, the Bayes MMSE estimate of θ given this data is
$$\hat{\theta}_{MS} = E[\theta \mid x_L, x_R] = \int_{x_L}^{x_R} \vartheta\, p(\vartheta \mid x_L, x_R)\, d\vartheta = \frac{x_L + x_R}{2} \qquad (4.4)$$
Let us now compute the MAC benefit B(x) of sampling at a point $x \in (x_L, x_R)$ at step k + 1. First, since the posterior distribution of θ is $U(x_L, x_R)$, the probability of obtaining y = 0 when x is sampled is
$$p(y = 0 \mid x) = (x_R - x)/(x_R - x_L) \qquad (4.5)$$
and the new estimate of θ, if y = 0, would be
$$\theta^0_{k+1} = (x + x_R)/2. \qquad (4.6)$$
Similarly
$$p(y = 1 \mid x) = (x - x_L)/(x_R - x_L) \qquad (4.7)$$
and the new estimate, if y = 1, would be
$$\theta^1_{k+1} = (x + x_L)/2. \qquad (4.8)$$
We can rewrite the benefit function of Equation 4.2 as
$$\begin{aligned}
B(x) &= \big(\theta^0_{k+1} - \theta_k\big)^2 p(y = 0 \mid x) + \big(\theta^1_{k+1} - \theta_k\big)^2 p(y = 1 \mid x) \\
&= \left(\frac{x + x_R}{2} - \frac{x_L + x_R}{2}\right)^2 \frac{x_R - x}{x_R - x_L} + \left(\frac{x + x_L}{2} - \frac{x_L + x_R}{2}\right)^2 \frac{x - x_L}{x_R - x_L} \\
&= \frac{(x_R - x)(x - x_L)}{4}
\end{aligned} \qquad (4.9)$$
and we note that the maximum of B(x) is attained at
$$x = \frac{x_L + x_R}{2}. \qquad (4.10)$$
Therefore the value of x that maximizes B(x) is (xL+xR)/2 implying that,
for this problem, the MAC sampling algorithm behaves identically to the
familiar binary search algorithm.
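The following short simulation (not from the thesis; function and variable names are hypothetical) plays the Number game with the MAC choice derived above: since B(x) peaks at the midpoint of (x_L, x_R), the policy coincides with binary search.

    # Minimal simulation of the Number game under the MAC policy: the benefit
    # B(x) = (x_R - x)(x - x_L)/4 is maximized at the midpoint of (x_L, x_R).
    import random

    def mac_number_game(theta, steps=20):
        x_left, x_right = 0.0, 1.0           # current interval known to contain theta
        for _ in range(steps):
            x = (x_left + x_right) / 2.0     # MAC choice: argmax of B(x)
            if x < theta:                    # observed y = f(x) = 0
                x_left = x
            else:                            # observed y = f(x) = 1
                x_right = x
        return (x_left + x_right) / 2.0      # Bayes MMSE estimate of theta

    if __name__ == "__main__":
        secret = random.random()
        estimate = mac_number_game(secret)
        print(abs(secret - estimate))        # error shrinks as 2**(-steps)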
4.4 Example 2: Estimation of Conditional Probabil-
ities
Consider the urn experiment where a set of N balls is present, an unknown fraction being black and the remainder being white. At each step
the experimenter picks one ball from the urn. Assume that the experi-
menter can request to examine one ball of the desired color (if available).
After receiving the ball the experimenter opens it: inside each ball there
is a number: 0 or 1. The goal of this experiment is to estimate the condi-
tional probability of getting 0 (or 1) given the color of the ball, examining
as few balls as possible.
Since at each step the experimenter has to decide which ball to pick
from the urn (black or white), a policy has to be defined. A very inefficient sampling policy could be to pick black balls first and white balls only
later, after the black ones are exhausted. The estimate of the conditional
probability of getting 1 (or 0) given black balls would converge extremely
fast to the true value; but the conditional probabilities given white balls
would converge extremely slowly, since no useful information would be
available before all black balls are picked. A more efficient policy would be
the random one, that picks one ball at random from the urn without taking
into account the color but according to the distribution of colors. Other
policies can be proposed. In the following we derive the MAC sampling
policy for this problem.
4.4.1 Formal Description of the Urn Experiment
Let $X_1$ and $X_2$ be binary random variables taking values in {0, 1} and having an unknown relation. Assume we can sample a pair $\langle X_1, X_2 \rangle$ by deciding which value of $X_1$ we like and then measuring the corresponding $X_2$ value. Table 4.1 shows that the p.m.f. $P(X_2 = x_2 \mid X_1 = x_1)$ depends upon two parameters, a and b.
            X2 = 0     X2 = 1
X1 = 0      a          1 − a
X1 = 1      1 − b      b

Table 4.1: Conditional probabilities P(X2 = x2 | X1 = x1) parametrized by a and b.
Assuming that the marginal distribution of X1 is P (X1 = 0) = c,
P(X1 = 1) = 1 − c, we derive the joint p.m.f. P(X1, X2), as shown in Table 4.2.
            X2 = 0             X2 = 1            marginal
X1 = 0      ca                 c(1 − a)          c
X1 = 1      (1 − c)(1 − b)     (1 − c)b          1 − c

Table 4.2: Joint probability P(X1 = x1, X2 = x2) parametrized by a, b and c; the last column reports the marginal P(X1 = x1).
4.4.2 MAC sampling algorithm
We now compute the benefit Bk+1 of sampling at X1 = 0 or X1 = 1 after k
sampling steps, using Equation 4.2. We denote ak and bk the Bayes MMSE
estimates of parameters a and b computed from data collected until step
k. Then the target concept to learn is G = (a, b) and G0,k = ak, G1,k = bk.
The benefit function when sampling at step k + 1 is then:
$$\begin{aligned}
B_{k+1}(X_1 = x_1) &= \sum_{j=1}^{Q} \int_{\mathcal{X}_f} (G_{j,k+1} - G_{j,k})^2\, p(X_{if} = x \mid D_k)\, dx \\
&= (a_{k+1,s=(x_1,0)} - a_k)^2\, P(X_2 = 0 \mid D_k, X_1 = x_1) \\
&\quad + (b_{k+1,s=(x_1,0)} - b_k)^2\, P(X_2 = 0 \mid D_k, X_1 = x_1) \\
&\quad + (a_{k+1,s=(x_1,1)} - a_k)^2\, P(X_2 = 1 \mid D_k, X_1 = x_1) \\
&\quad + (b_{k+1,s=(x_1,1)} - b_k)^2\, P(X_2 = 1 \mid D_k, X_1 = x_1)
\end{aligned} \qquad (4.11)$$
where ak+1,s=(x1,x2) denotes an estimate of a using current data Dk plus
the sample < X1 = x1, X2 = x2 > and P (X2 = x2|Dk, X1 = x1) is the
probability of obtaining X2 = x2 when sampling at X1 = x1 after observing
data Dk. Note that
$$\begin{aligned}
P(X_2 = 0 \mid D_k, X_1 = 0) &= a_k \\
P(X_2 = 1 \mid D_k, X_1 = 0) &= 1 - a_k \\
P(X_2 = 0 \mid D_k, X_1 = 1) &= 1 - b_k \\
P(X_2 = 1 \mid D_k, X_1 = 1) &= b_k.
\end{aligned} \qquad (4.12)$$
4.4.3 Explicit Benefit Function
Let $n_{ij}$, where $i, j \in \{0, 1\}$, denote the total number of samples observed at step k for which $X_1 = i$ and $X_2 = j$, as shown in Table 4.3.
            X2 = 0     X2 = 1
X1 = 0      n00        n01
X1 = 1      n10        n11

Table 4.3: Binary counts: nij is the number of observations for which X1 = i and X2 = j.
Under the assumption of squared-error loss and a Beta(A, B) prior¹, the Bayes MMSE estimators of the parameters a and b of the conditional distribution are the posterior means of Bernoulli trials under a Beta(A, B) prior (see [3]):
$$a_{MS} = \frac{n_{00} + A}{n_{00} + n_{01} + A + B} \qquad b_{MS} = \frac{n_{11} + A}{n_{10} + n_{11} + A + B}. \qquad (4.13)$$
Since the estimate of a is not affected by samples having $X_1 = 1$ and the estimate of b is not affected by samples having $X_1 = 0$, the benefit of Equation 4.11 can be rewritten as:
$$B_{k+1}(X_1 = 0) = (a_{k+1,s=(0,0)} - a_k)^2\, a_k + (a_{k+1,s=(0,1)} - a_k)^2\, (1 - a_k)$$
and
$$B_{k+1}(X_1 = 1) = (b_{k+1,s=(1,0)} - b_k)^2\, (1 - b_k) + (b_{k+1,s=(1,1)} - b_k)^2\, b_k.$$
Inserting the explicit formulas of the estimators we can derive the final expression for the benefit function at step k + 1:
$$\begin{aligned}
B_{k+1}(X_1 = 0) &= \left(\frac{n_{00} + A + 1}{n_{00} + n_{01} + A + B + 1} - \frac{n_{00} + A}{n_{00} + n_{01} + A + B}\right)^2 \frac{n_{00} + A}{n_{00} + n_{01} + A + B} \\
&\quad + \left(\frac{n_{00} + A}{n_{00} + n_{01} + A + B + 1} - \frac{n_{00} + A}{n_{00} + n_{01} + A + B}\right)^2 \left(1 - \frac{n_{00} + A}{n_{00} + n_{01} + A + B}\right) \\
&= \frac{(n_{00} + A)(n_{01} + B)}{(n_{00} + n_{01} + A + B)^2 (n_{00} + n_{01} + A + B + 1)^2}
\end{aligned} \qquad (4.14)$$
¹ Note that, in principle, parameters A and B of the prior on a are different from those of b. Here we
assume they are equal since we have no prior knowledge to assume a different belief.
when sampling at $X_1 = 0$, and
$$B_{k+1}(X_1 = 1) = \frac{(n_{11} + A)(n_{10} + B)}{(n_{10} + n_{11} + A + B)^2 (n_{10} + n_{11} + A + B + 1)^2} \qquad (4.15)$$
when sampling at $X_1 = 1$.
The previous results show some interesting aspects of the benefit formulas:
• $B_{k+1}(X_1 = 0)$ and $B_{k+1}(X_1 = 1)$ have the same form.
• The benefit of sampling at $X_1 = x_1$ always decreases if sampling is actually done, otherwise it remains constant. The reason is that the denominator of Equations 4.14 and 4.15 grows much faster ($O(n^4)$) than the numerator ($O(n^2)$).
Since the benefit of one candidate decreases when it is actually sampled,
the MAC sampling policy will alternate sampling at X1 = 0 and X1 = 1,
even though not strictly.
When the number of samples with $X_1 = 0$ is the same as the number with $X_1 = 1$, the denominators of $B_{k+1}(X_1 = 0)$ and $B_{k+1}(X_1 = 1)$ are equal. Then, the MAC policy will sample where the numerator is greater, i.e. at $X_1 = 0$ if
$$(n_{00} + A)(n_{01} + B) > (n_{11} + A)(n_{10} + B)$$
otherwise at $X_1 = 1$². If A and B are negligible with respect to $n_{ij}$ then the MAC policy will sample where the product $n_{i0} n_{i1}$ is bigger. Since the maximum of the product is attained when $n_{i0} = n_{i1}$, we claim that the MAC policy will sample where the conditional probability is closer to 1/2.
We note that the MAC policy is not affected by the marginal probability
P (X1). This is different from the random sampling policy which selects the
value of X1 only according to P (X1) and could become very inefficient in
² In case the numerators are equal, the MAC policy will choose with equal probability X1 = 0 or X1 = 1.
estimating both a and b when P (X1 = 0) ≫ P (X1 = 1). See experimental
results in Section 5.1 for a comparison of the MAC policy against the
random sampling policy.
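A minimal sketch (not part of the thesis; the counts and the Beta prior hyperparameters A, B are hypothetical) of the resulting MAC policy for the urn experiment, using the closed-form benefits of Equations 4.14 and 4.15:

    # Closed-form MAC benefits for the urn example; n[i][j] follows Table 4.3.
    def benefit(n_diag, n_off, A=1.0, B=1.0):
        """(n_diag + A)(n_off + B) / [s^2 (s + 1)^2] with s = n_diag + n_off + A + B;
        n_diag is n00 (for X1 = 0) or n11 (for X1 = 1), as in Eqs. 4.14 and 4.15."""
        s = n_diag + n_off + A + B
        return (n_diag + A) * (n_off + B) / (s ** 2 * (s + 1) ** 2)

    def mac_choose_x1(n, A=1.0, B=1.0):
        """Return the value of X1 (0 or 1) whose sampling yields the larger benefit."""
        b0 = benefit(n[0][0], n[0][1], A, B)   # Equation 4.14
        b1 = benefit(n[1][1], n[1][0], A, B)   # Equation 4.15
        return 0 if b0 >= b1 else 1

    if __name__ == "__main__":
        counts = [[3, 1], [2, 2]]              # hypothetical counts n_ij after some steps
        print(mac_choose_x1(counts))           # picks the side whose estimate can change most

With these counts the policy selects X1 = 1, where the observed conditional frequencies are closest to 1/2, consistently with the discussion above.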
4.5 Application 1: Single Feature Relevance Estima-
tion
In this application of the MAC algorithm we study the learning problem
of evaluating new candidate variables describing a set of pattern instances
with respect to given class labels. Assume that the pattern instances are
described by a known set of variables whose values, together with class
labels, are fully known in advance. A given predictor is built over this
known data and its error rate is computed. The final goal of this application
is to add one new variable to those already in use selecting it from a pool
of new candidates whose values are initially completely unknown. The
selected new variable will be the one that, together with the initial data,
will most reduce the error rate of the predictor.
Since measuring new candidate variables is assumed to be costly we
have to carefully select on which instance to perform a measurement at
each step. After collecting the new value the error rate of the prediction is
recomputed.
The whole process can be described as follows:
1. Select one new candidate variable (at random).
2. Iteratively select the most interesting missing entry (using a sampling
policy) and measure it, spending a part of the per-variable budget.
We assume that each new candidate variable has the same budget,
meaning that each one will be measured the same number of times.
3. Update the relevance of the candidate variable.
4. Loop on step 2 until budget for that variable is exhausted.
5. Go back to step 1 and select another candidate variable until all can-
didates are evaluated and the global budget is exhausted.
6. Rank candidate variables according to their estimated relevances.
7. Add the most relevant variable to the set of known variables in a new
predictor and deploy it.
If a given variable were measured on all pattern instances, then we would know the exact change in error rate due to the contribution of that variable on that dataset. The target of this process is then to efficiently estimate the change in error rate of the predictor by measuring the new variable a small number of times on the most informative instances. At the end of the process another candidate variable undergoes the same process and its reduction of the error rate is estimated. The variable that most reduces the initial error rate will be added to the initial
set of variables for building a more efficient predictor.
4.5.1 Motivation
In the following we motivate this problem in the biomedical and agricultural domains. The motivation has already been mentioned in Section 1.2. Here
we add the necessary details in order to be able to elicit an abstract for-
mulation of the underlying process. After formalization we will derive the
specific implementation of the MAC algorithm. Furthermore, we will de-
rive the implementation of a baseline random sampling algorithm and other
algorithms from the literature to enable experimental comparisons.
Cancer Characterization
Current models for cancer characterization, which lead to diagnostic/prognostic models, mainly involve histological parameters (such as grade of the disease, tumor dimensions, lymph node status) and biochemical parame-
ters (such as the estrogen receptor). The diagnostic models used for clinical
cancer care are not yet definitive and the prognostic and therapy response
models do not accurately predict patient outcome and follow up. For ex-
ample, for lung cancer, individuals affected by the same disease and equally
treated demonstrate different treatment responses, evidencing that still un-
known tumor subclasses (different histotypes) exist. This incomplete view
results, at times, in the unnecessary over-treatment of patients, that is
some patients do not benefit from the treatment they undertake. Diagnos-
tic and prognostic models used in clinical cancer care can be improved by
embedding new biomedical knowledge. The ultimate goal is to improve the
diagnostic and prognostic ability of the pathologists and clinicians leading
to better decisions for treatment and care.
Ongoing research in the study and characterization of cancer is aimed
at the refinement of the current diagnostic and prognostic models. As
disease development and progression are governed by gene and protein
behavior, new biomarkers associated with patient diagnosis or prognosis
are investigated. The identification of new potential biomarkers is recently
driven by high throughput technologies, called microarrays. They enable
the identification of genes that provide information with a potential impact
on understanding disease development and progression [21].
Although the initial high throughput discovery techniques are rapid,
they often only provide qualitative data. Promising genes are further ana-
lyzed by using other experimental approaches (focusing on DNA, RNA or
proteins), to test specific hypotheses. Usually a well characterized dataset
of tumor samples from a retrospective population of patients is identified
and the experimental process of analyzing one biomarker (feature) on one
sample at a time is conducted. These analyses are usually based on com-
parisons between 1)non-diseased (i.e., normal) and diseased (i.e., tumors)
biological samples, 2)between diseased samples pharmacologically treated
and untreated at variable time points or 3)between samples of different
diseases. The efficacy of specific biomakers can for example be determined
based on their discriminative power in distinguishing between patients with
poor or good prognosis, meaning patients with short or long overall sur-
vival respectively or cancer recurring or not recurring. This process can be
time consuming, depending on the type of experimental technique which
is adopted.
More importantly, well annotated tissue samples are very precious. Mon-
itoring the status of patients over years, even decades, and storing tissue
samples so as to be useful for studies is not trivial and requires organiza-
tional efforts. It is not uncommon, for example, that patients who received
the treatment at a hospital, will be monitored during the follow up period
in another hospital and even in another country. Therefore keeping track
of their status may become quite difficult. When the biomarker is tested
on a biological sample, a portion of the sample is consumed, implying that
each sample can be used for only a finite number of experiments. This
motivates the need to develop an active sampling approach to conserve the
valuable biological sample resource [59].
Apple Proliferation
Our interest in sampling strategies began in a research project (SMAP)3 in
the domain of agriculture, dealing with the Apple Proliferation disease [51]
3This work was funded by Fondo Progetti PAT, SMAP (Scopazzi del Melo - Apple Proliferation), art.
9, Legge Provinciale 3/2000, DGP n. 1060 dd. 04/05/01.
in apple trees. Biologists monitor a distributed collection of apple trees
affected by the disease with the goal of determining the symptoms that
indicate the presence of the disease causing phytoplasma. A data archive
is arranged with a finite set of records, each describing a single apple tree.
All the instances are labeled as infected or not infected. Each year the
biologists propose new candidate features (e.g., color of leaves, altitude of
the tree, new chemical tests etc.) that could be extracted (or measured)
to extend the archive, so as to arrive at more accurate models. Since the
data collection on the field can be very expensive or time consuming, a
data acquisition plan needs to be developed by selecting a subset of the
most relevant candidate features that are to be acquired on all trees [60].
The selection of a new candidate variable among many for the next,
extensive, data collection campaign, is accomplished through a process
equivalent to the biomarkers case.
4.5.2 Implementation: Error Rate Estimation
In order to decide how relevant each new candidate feature $X_i \in \{X_1, \dots, X_W\}$ is, in the sense introduced so far, we need to evaluate the benefit of adding it to the current set of features $\mathbf{X} = \{X_1, \dots, X_N\}$ when predicting the class label C. Therefore the concept G to be estimated is the classification error rate (denoted by ε) of a given classifier on the feature space
comprising the known features and the candidate new feature.
We assume features and class labels to be categorical and compute the
error rate by summing over the joint feature and class probabilities of all
but the winning class over the entire feature space. Although we do not
make the Naive Bayes assumption of class-conditional feature indepen-
dence, it would be easy to incorporate into the algorithm any independence
information supplied by the domain expert.
We note that our sampling algorithm can be applied with any other
appropriate measure for feature relevance (such as the width of the classi-
fication margin [55]) and estimate of the conditional feature probability.
We now briefly describe how the active sampling algorithm is imple-
mented and provide the equations for the estimation of the probability
distribution and of the classification error rate. All probability distribu-
tions are assumed multivariate categorical whose parameters are estimated
from data using Bayes MMSE estimators under uniform Dirichlet priors.
Due to the difficulty in obtaining the exact Bayes MMSE estimate of the
error rate, we approximate it by the error rate computed from the Bayes
estimate of the distribution p(C,X, X) over C × X1 × . . .XN × X , where
X ∈ X is the new candidate variable.
At a given iteration k of the active sampling process some of the in-
stances have feature value X missing. Moreover because of the active sam-
pling, the missing values are not uniformly distributed. MacKay in [35]
asserts that the biases introduced in our estimates because of non-random
sampling can be avoided by taking into account how we gathered the data.
Therefore to construct the estimator p(C,X, X) over C ×X1 × . . .XN × X
it is necessary to consider the sampling process. Since all the examples
in the database are completely described with respect to C and X, we
already have the density p(C,X). Although at any iteration of the active
sampling algorithm the X values are missing non-uniformly across various
configurations of (C,X), for each pattern instance s = (c,x) the samples
for X are independent and identically distributed. We incorporate this in-
formation in the estimator of the probability density from incomplete data
Dk as follows. We first calculate
$$p(X = x \mid D_k) = p_{D_k}(x \mid c, \mathbf{x}) = \frac{n_{c,\mathbf{x},x} + 1}{\sum_{x' \in \mathcal{X}} n_{c,\mathbf{x},x'} + |\mathcal{X}|} \qquad (4.16)$$
where $n_{c,\mathbf{x},x}$ is the number of instances of the particular combination of $(c, \mathbf{x}, x)$ among all the completely described instances in $D_k$, and $|\mathcal{X}|$ is the
size of the sample space $\mathcal{X}$. The probability density over $C \times \mathcal{X}_1 \times \dots \times \mathcal{X}_N \times \mathcal{X}$ is then calculated as
$$p_{D_k}(c, \mathbf{x}, x) = p_{D_k}(X = x \mid c, \mathbf{x}) \times p(c, \mathbf{x}) \qquad (4.17)$$
Once we have the estimate $p_{D_k}(c, \mathbf{x}, x)$, the error rate $\epsilon(D_k)$ can be estimated as
$$\epsilon(D_k) = 1 - \sum_{\mathcal{X}_1 \times \dots \times \mathcal{X}_N \times \mathcal{X}} \max_{c \in C}\, p_{D_k}(c, \mathbf{x}, x) \qquad (4.18)$$
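The following minimal Python sketch (not from the thesis; data structures and names are assumptions, and it targets only small categorical spaces) puts Equations 4.16, 4.17 and 4.18 together.

    # Plug-in error-rate estimate from a partially observed new feature X.
    from collections import Counter
    from itertools import product

    def error_rate(records, x_values, p_cx):
        """records: list of (c, x_tuple, x) for instances where X is already known.
        x_values: sample space of the new feature X.
        p_cx: dict mapping (c, x_tuple) -> p(c, x), known exactly from the full data.
        Returns the estimate of Equation 4.18."""
        counts = Counter((c, xs, x) for c, xs, x in records)
        totals = Counter((c, xs) for c, xs, _ in records)
        classes = {c for c, _ in p_cx}
        configs = {xs for _, xs in p_cx}

        def p_joint(c, xs, x):
            # Laplace-smoothed conditional (Eq. 4.16) times the known p(c, x) (Eq. 4.17)
            cond = (counts[(c, xs, x)] + 1) / (totals[(c, xs)] + len(x_values))
            return cond * p_cx[(c, xs)]

        # Eq. 4.18: one minus the mass captured by the winning class over the whole space
        return 1.0 - sum(max(p_joint(c, xs, x) for c in classes)
                         for xs, x in product(configs, x_values))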
4.5.3 Sampling Algorithms Compared
We compare the MAC sampling algorithm with others previously proposed
in the literature that can be adapted to this context (See Section 2.3).
These algorithms differ only in the benefit criterion (B) of sampling at a
given instance s = (c,x). In the following we illustrate the exact benefit
functions under comparison for each case, and start with our proposed one:
1. Maximum Average Change (MAC)
Given a candidate feature $X \in \{X_i\}_{i=1,\dots,W}$, we measure its values iteratively. In order to decide which entry to probe, we loop over all $s = (c, \mathbf{x})$ where X is missing and compute the benefit of sampling at s given by Equation 4.2, which can be rewritten as
$$B(s) = \int_{\mathcal{X}} \big(\epsilon(D_k, X_s = x) - \epsilon(D_k)\big)^2\, p(X_s = x \mid D_k)\, dx \qquad (4.19)$$
where Xs = x means that instance s has value x ∈ X on variable X.
We then perform the sampling on s with the maximum value of B(s).
2. Single Feature Lookahead (SFL)
The Single Feature Lookahead active feature sampling heuristic was
proposed in [34]. Their algorithm, which chooses an instance on which
a feature value is queried only by its class label, was developed for bud-
geted learning of Naive Bayes classifiers. We implemented a straight-
forward extension to their algorithm for choosing the combination of
both the class label and the previous feature value. The SFL algorithm
works as follows.
For each s ∈ C × X1 × . . . × XN it is assumed that d samples with
the particular s are to be probed, where d is called max-depth. The
expected error rate of the classifier is computed and the instance s with
the lowest error rate is selected and the feature value of X is collected
once. We refer the reader to [34] for more details on the algorithm.
When the max-depth parameter is set to d = 1, the SFL heuristic
reduces to the Greedy Loss Reduction heuristic, also presented in [34].
The benefit function used for active sampling can be written as
$$B(s) = -E\big[\epsilon(D_k, (X = x)_d)\big] \qquad (4.20)$$
where $(X = x)_d$ indicates that instead of just one, d samples are
assumed to be acquired. The expectation is over all the possible results
of collecting X on d samples with the particular s.
3. Impute & Minimum Error (IME)
This heuristic is identical to the Goal Oriented Data Acquisition (GODA)
algorithm proposed in [65] (albeit using a different estimate of the er-
ror rate). They first impute the value of X for the s in question from
the available data and then compute the error rate of the classifier
built after adding the imputed sample to Dk. The sample s that
yields the lowest error rate (given current data) is chosen for probing.
$$B(s) = -\epsilon(D_k, X = x^*) \qquad (4.21)$$
where
$$x^* = \arg\max_{x \in \mathcal{X}} p(X = x \mid D_k)$$
4. Random Sampling. At sampling step k the random sampling scheme
just chooses a missing entry randomly from Dk to be probed. The
results obtained from the random sampling method serve as a baseline
against which the other sampling heuristics are evaluated.
For results on comparing these different sampling algorithms and the
proposed method over synthetic and benchmark datasets, as well as on
biomarkers and agricultural domains, see Section 5.2.
4.6 Application 2: Concurrent Estimation of Multi-
ple Feature Relevances
In this application of the MAC algorithm we study again the learning prob-
lem of evaluating variables that describe pattern instances with respect to
given class labels, with the final goal of variable selection. Assume that
pattern instances are described by a set of variables and class labels in a
dataset. Assume that initially the actual class labels are fully known for
every instance, but feature values are just partially known (or completely
unknown). The experimenter is allowed to iteratively probe the missing
value of one instance over one variable at a time; measuring new values is
assumed to be costly and only a limited budget is available. The aim of this
application is then to estimate the relevances of all variables that describe
the instances when predicting class labels, while allocating efficiently the
budget for acquiring new values. In contrast to Application 1 (see Section 4.5), missing values are present in more than one variable in the dataset at each sampling step, meaning that deciding where to measure the next value means deciding not only which pattern instance to sample but also which fea-
ture. This application then defines a process where each variable competes
against the others to be selected for sampling for concurrent relevance esti-
mation. As an example of the differences with respect to Application 1, it
could happen that some features will be sampled more times than others.
The schema of this sampling process is shown in Figure 1.2.
We will clarify only later the exact method to compute the variables’
relevances. The aim is to provide a plug-in architecture for the feature
evaluation process where the experimenter can, in principle, decide among
different feature raters to better suit domain needs.
4.6.1 Motivation
The same motivation explained in Section 4.5.1 for the case of Application
1 applies here. In the case of cancer characterization using biomarkers we
refer now to a slightly different scenario where all biomarkers are evaluated
concurrently; then, the dataset comprises class labels, all known variables
and all new candidate variables. Since the missing values are present only
in this last group, the competition is just among new candidate variables.
We refer to a similar scenario for the agricultural domain investigating
Apple Proliferation disease.
4.6.2 Implementation of the Active Sampling Algorithm
We describe the sampling process of this application using Algorithm 4.6.1.
This Algorithm for active feature value acquisition is general in the sense
that it can incorporate any measure for feature relevance for which the
squared-error loss is reasonable. That is, the function EstimateRelevances(D)
in the pseudocode can be any estimate of feature relevance that can be es-
timated from a dataset with missing values.
In the following we illustrate the details of the assumptions made to
derive an implementation of the Equation 4.2. First we present the model
for data generation (i.e., the joint class-and-feature distribution), then we
explain how the conditional probabilities and feature relevances (the two
main ingredients of the MAC algorithm) can be computed given the joint
distribution.
Our model is applicable to problems with categorical-valued features. That is, we assume that every feature $X_f$ takes on a discrete and finite set of values $\mathcal{X}_f = \{1, \dots, V_f\}$.
Algorithm 4.6.1: AcquireOneMissingValue(Dk)

    G(Dk) ← EstimateRelevances(Dk)
    for each (i, f) such that record i has feature value f missing
        B[i, f] ← 0                                   comment: initialize the benefit to zero
        for each x ∈ Xf
            Dtemp ← Dk.FillValue(Xif = x)             comment: hypothetically fill the entry with x
            G(Dtemp) ← EstimateRelevances(Dtemp)
            B[i, f] ← B[i, f] + ComputeBenefit(G(Dtemp), G(Dk)) · p(Xif = x | Dk)
                                                      comment: weight by the conditional probability, as in Equation 4.2
        end
    end
    comment: find the missing entry with the highest benefit
    (i∗, f∗) ← arg max(i,f) B[i, f]
    comment: query the actual value of the selected missing entry
    x∗ ← SampleMissingValue(i∗, f∗)
    comment: fill the missing value
    Dk+1 ← Dk.FillValue(Xi∗f∗ = x∗)
    return Dk+1
4.6.3 Mixture model
A convex combination of probability distributions is called a probability mixture model, or mixture model:
$$P(X = x) = \sum_{m=1}^{M} \alpha_m\, p_m(X = x) \qquad (4.22)$$
where X is a random vector taking values $x \in \mathcal{X}$, $0 \le \alpha_m \le 1$ are the coefficients expressing the contribution of each component $p_m(X = x)$, and $\sum_{m=1}^{M} \alpha_m = 1$. Usually the probability distributions $p_m$ come from a parametric family with unknown parameter $\theta_m$. In this case the probability mixture model is
$$P(X = x) = \sum_{m=1}^{M} \alpha_m\, p(X = x \mid \theta_m). \qquad (4.23)$$
Even though continuous mixture models
$$P(X = x) = \int_{\Theta} \alpha(\theta)\, p(X = x \mid \theta)\, d\theta \qquad (4.24)$$
are considered in the literature, we do not address them in this research. See [37, 36] for a thorough review of mixture models.
(Non-)Identifiability
Due to the summation in Equation 4.22 the components of mixture mod-
els are exchangeable, in the sense that we cannot distinguish a component
from another component when we want to interpret the parameters (αm)
discovered by fitting the model. Moreover, with certain parametric fam-
ilies of probability distributions (e.g. multivariate Bernoulli) completely
different sets of components [23] lead to exactly the same mixture model. The whole problem is known in the literature as identifiability (or
non-identifiability) [37]; the first issue (exchangeability) is known as triv-
ial identifiability and the latter, specific to certain families of probability
distributions, as non-trivial identifiability.
Since we are not concerned with interpretation of the parameters of the
mixture model in this research we do not investigate the techniques to
overcome the identifiability issue. We claim that neither trivial nor non-
trivial identifiability are relevant to the use of mixture models made in this
research.
Further information on techniques to overcome the identifiability issues
can be found in the literature of mixture models, e.g. in the book of
McLachlan and Peel [37] and in Carreira-Perpinan et al. [5].
4.6.4 Class-Conditional Mixture of Product Distributions
We assume that each class-conditional feature distribution is a mixture
of M product distributions over the features. Although for our imple-
mentation it is not necessary that the number of components be constant
across classes, we make this assumption for simplicity. That is, the class-
conditional feature distribution for class c ∈ C is
$$P(X_1 = x_1, \dots, X_F = x_F \mid C = c) = \sum_{m=1}^{M} \alpha_{cm} \prod_{f=1}^{F} \prod_{x=1}^{V_f} \theta_{cmfx}^{\delta(x, x_f)} \qquad (4.25)$$
where αcm is the mixture weight of component m for class c, θcmfx is the
probability that the feature f takes on the value x for component m and
class c, and δ(.) is the Kronecker delta function. Note that if M = 1 our
model is equivalent to the Naive Bayes model.
Therefore the full class-and-feature joint distribution can be written as
$$P(C = c, X_1 = x_1, \dots, X_F = x_F) = p(C = c) \sum_{m=1}^{M} \alpha_{cm} \prod_{f=1}^{F} \prod_{x=1}^{V_f} \theta_{cmfx}^{\delta(x, x_f)} \qquad (4.26)$$
where p(C = c) is the class probability. The class-and-feature joint distribution
is completely specified by the parameters αs, θs and the class probabilities.
Before we describe how the α and θ parameters can be estimated from
a dataset with missing values, we will explain how feature relevances and
the conditional probability p(Xif = x|Dk) are calculated if the parameters
are known.
4.6.5 Calculation of Feature Relevances
We use the mutual information between a feature and the class variable as
our measure of the relevance of that feature. That is
$$G_f = I(X_f; C) = H(X_f) - H(X_f \mid C) \qquad (4.27)$$
Although we are aware of the shortcomings of mutual information as a
feature relevance measure, especially for problems where there are inter-
feature correlations, we chose it because it is easy to interpret and to
compute given the joint class-and-feature distribution. We did not use
approaches such as Relief [26], SIMBA [20] or I-Relief [55], that provide
feature weights (that can be interpreted as relevances), because they do
not easily generalize to data with missing values. See Chapter 6 for future
plans on this topic.
The entropies in Equation 4.27 can be computed as follows.
$$H(X_f) = -\sum_{x=1}^{V_f} p(X_f = x) \log p(X_f = x), \qquad p(X_f = x) = \sum_{c \in C} p(C = c, X_f = x) \qquad (4.28)$$

$$H(X_f \mid C) = -\sum_{c \in C} p(C = c) \sum_{x=1}^{V_f} p(X_f = x \mid C = c) \log p(X_f = x \mid C = c) \qquad (4.29)$$
If the α and θ parameters and p(c) of the model are known, the mutual
information can be computed as follows.
$$H(X_f) = -\sum_{x=1}^{V_f} \left( \sum_{c \in C} p(c) \sum_{m=1}^{M} \alpha_{cm}\,\theta_{cmfx} \right) \log \left( \sum_{c \in C} p(c) \sum_{m=1}^{M} \alpha_{cm}\,\theta_{cmfx} \right) \qquad (4.30)$$

$$H(X_f \mid C) = -\sum_{c \in C} p(c) \sum_{x=1}^{V_f} \left( \sum_{m=1}^{M} \alpha_{cm}\,\theta_{cmfx} \right) \log \left( \sum_{m=1}^{M} \alpha_{cm}\,\theta_{cmfx} \right) \qquad (4.31)$$
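A minimal sketch (array shapes and names are assumptions, not from the thesis, and strictly positive probabilities are assumed so the logarithms stay finite) of Equations 4.27, 4.30 and 4.31, computing the relevance of feature f directly from the mixture parameters:

    # Mutual-information relevance of feature f from the mixture parameters.
    import numpy as np

    def feature_relevance(f, p_c, alpha, theta):
        """p_c: (n_classes,) class probabilities; alpha: (n_classes, M) mixture
        weights; theta: (n_classes, M, F, V) component/feature p.m.f.s theta_cmfx."""
        # p(X_f = x | C = c) = sum_m alpha_cm * theta_cmfx  -> shape (n_classes, V)
        p_x_given_c = np.einsum("cm,cmx->cx", alpha, theta[:, :, f, :])
        p_x = p_c @ p_x_given_c                                   # marginal, Eq. 4.30
        h_x = -np.sum(p_x * np.log(p_x))                          # H(X_f)
        h_x_given_c = -np.sum(p_c[:, None] * p_x_given_c * np.log(p_x_given_c))  # Eq. 4.31
        return h_x - h_x_given_c                                  # Equation 4.27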
4.6.6 Calculation of Conditional Probabilities
For simplicity in the following we omit the subscript k in all notations even
though the values we refer to will change at each sampling step.
Since the instances in the dataset D are assumed to be drawn indepen-
dently, we have
$$p(X_{if} = x \mid D_k) = p(X_{if} = x \mid X_{obs(i)} = x_{obs(i)}, C = c_i) = \frac{p(X_{if} = x, X_{obs(i)} = x_{obs(i)} \mid C = c_i)}{p(X_{obs(i)} = x_{obs(i)} \mid C = c_i)} \qquad (4.32)$$
where obs(i) is the set of indices of the variables/features that are observed for instance i, so $X_{obs(i)}$ are the features that are observed for that instance, which take on values $x_{obs(i)}$; and $c_i$ is the class label for instance i.
Therefore the conditional probability in Equation 4.32 can be written
in terms of the parameters of the joint distribution as
p(Xif = x|Dk) =
∑Mm αcimθcimfx
∏φ∈obs(i) θcimφxiφ∑M
m αcim
∏φ∈obs(i) θcimφxiφ
(4.33)
and Xif is the random variable describing the value of feature Xf on in-
stance i, and x is actual (observed) value.
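A corresponding sketch of Equation 4.33 for a single instance, again with illustrative names, could look as follows.

```python
import numpy as np

def p_missing_value(alpha_c, theta_c, obs, f, V_f):
    """p(X_if = x | D_k) for each of the V_f possible values x (Equation 4.33).

    alpha_c : array (M,)            mixture weights of the instance's class c_i
    theta_c : array (M, F, max_V)   theta_c[m, j, x] = theta_{c_i m j x}
    obs     : dict {feature index j: observed value x_ij} for the observed features
    f       : index of the missing feature under consideration
    """
    # per-component weight: alpha_m times the likelihood of the observed values
    w = alpha_c.copy()
    for j, x_j in obs.items():
        w = w * theta_c[:, j, x_j]
    numerator = np.array([np.sum(w * theta_c[:, f, x]) for x in range(V_f)])
    return numerator / np.sum(w)     # one probability per candidate value x
```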
4.6.7 Parameter Estimation
Since after each sampling step we only have a dataset with missing values
and not the parameters αs, θs and p(c) that describe our model, they
need to be estimated from the data. Once we have the estimates, the
conditional probabilities and feature relevances can be computed by using
the estimates in place of the parameters in Equations 4.33, 4.30 and 4.31.
We will now describe how these parameters are estimated.
Estimation of p(c) : Since we have class labels of all the records in the
dataset, the estimates of the class probabilities are obtained from the
(Laplace smoothed [11, 24, 46]) relative frequencies of the classes in the
data set.
Estimation of αs and θs : We need to estimate the parameters of the class-
conditional mixture distribution for all classes. Since we have class labeled
instances we can perform the estimation separately for each class, consid-
ering only the data from that particular class. We therefore suppress the
subscript c for the parameters corresponding to the class variable in the
following equations.
Expectation-Maximization
The expectation-maximization (EM) algorithm [13, 3] is a technique for
finding maximum likelihood solutions in probability models with latent
variables. A latent variable is a variable not directly observed but inferred
from observed variables through a mathematical model. As an example
related to our context, the αm coefficients of the mixture model are latent
variables.
Assume that D is a dataset without missing data, D = {x_1, \ldots, x_N}, where each record x_i ∈ R^F. We denote the set of latent variables by Z = {z_1, \ldots, z_N}. Assume then that the joint probability distribution over all
F variables of the dataset is parametrized by θ. The log-likelihood of the
joint distribution is then
\ln p(D \mid \theta) = \ln \sum_{Z} p(D, Z \mid \theta)    (4.34)
In order to find the maximum likelihood estimate of the parameters θ we would like to maximize this log-likelihood, but the sum over the latent variables inside the logarithm makes direct maximization difficult. If Z were known we could simply maximize the complete-data likelihood p(D, Z|θ) directly, which we assume to be straightforward. The expectation-maximization algorithm provides a solution following these steps:
1. At t = 0 guess an initial value for the parameters θ^{t=0}.

2. Compute the expected value of the complete log-likelihood l_c given the current θ^t and the data D, called Q(θ, θ^t):

   Q(θ, θ^t) = E_Z[ l_c(θ \mid D, Z) \mid D, θ^t ].    (4.35)

   The expectation is taken over all possible values of the latent variables Z.

3. Find θ^{t+1} such that

   θ^{t+1} = \arg\max_{θ \in Θ} Q(θ, θ^t)    (4.36)

4. Check whether the parameter values or the likelihood have converged; if not, set

   θ^t ← θ^{t+1}    (4.37)

   and return to step 2.
Step 2 is called expectation step or E-step and step 3 is called maximization
step or M-step.
Dempster et al. [13] proved that the EM algorithm does not decrease the observed likelihood at each step, which implies that the algorithm converges to a maximum. This does not guarantee convergence to the global maximum: when the likelihood is multimodal the EM algorithm may converge only to a local maximum. Various heuristics have been proposed to escape local maxima, such as restarting from different random initial guesses.
EM: Mixture of Product Distributions
Let D_c be the part of the dataset corresponding to class c. The data log-likelihood in the case of a mixture of product distributions is given by

l(D_c, θ) = \sum_{i=1}^{N} \log \sum_{m=1}^{M} p(x_i \mid \alpha_m, \theta_m) \, p(\alpha_m)    (4.38)
When the dataset D_c has no missing values the EM update equations for the θs and αs can be shown to be

θ_{mfx}^{t+1} = \frac{\sum_{n=1}^{N} \delta(x, x_{nf}) \, h_{nm}}{\sum_{n=1}^{N} h_{nm}}    (4.39)

α_{m}^{t+1} = \frac{1}{N} \sum_{n=1}^{N} h_{nm}    (4.40)

where

h_{nm} = E[Z_{nm} = 1 \mid x_n, θ^t] = \frac{\alpha_m \prod_{f=1}^{F} \theta_{mfx_{nf}}^{t}}{\sum_{m'=1}^{M} \alpha_{m'} \prod_{f=1}^{F} \theta_{m'fx_{nf}}^{t}}    (4.41)
See Appendix A.1 for the derivation of Equation 4.39.
EM and Missing Data
The EM algorithm has another important application: learning from datasets
with missing data [33, 19].
Assume that D can be divided in the observed part D^o and the missing part D^m, meaning that each record is made of some observed values and some missing values, x_i = (x_i^o, x_i^m), and the missingness pattern is record-dependent. In this case the expectation of the E-step has to take into account both the latent variables and the missing values

Q(θ, θ^t) = E_{Z, D^m}[ l_c(θ \mid D^o, D^m, Z) \mid D^o, θ^t ].    (4.42)
EM: Mixture of Product Distributions with Missing Data
Since in our problem there are missing values, we derived the EM update equations for the mixture of product distributions, obtaining

θ_{mfx}^{t+1} = \frac{\sum_{n=1}^{N} h_{nm}^{o} \left( θ_{mfx}^{t} (1 - I_{nf}) + \delta(x, x_{nf}) \, I_{nf} \right)}{\sum_{n=1}^{N} h_{nm}^{o}}    (4.43)

α_{m}^{t+1} = \frac{1}{N} \sum_{n=1}^{N} h_{nm}^{o}    (4.44)

where

h_{nm}^{o} = E[Z_{nm} \mid x_{obs(n)}] = \frac{\alpha_m \prod_{j \in obs(n)} \theta_{mjx_{nj}}^{t}}{\sum_{m'=1}^{M} \alpha_{m'} \prod_{j \in obs(n)} \theta_{m'jx_{nj}}^{t}}    (4.45)
and where Inf = 1 when the feature f for record n is observed, otherwise
Inf = 0. Note that in the actual implementation of Equation 4.43 we per-
form Laplace smoothing to reduce estimation variance. See Appendix A.2
for the derivation of Equation 4.43.
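As an illustration, a minimal sketch of these update equations for a single class (in Python/NumPy, with illustrative names, without the Laplace smoothing used in the actual implementation, and with a fixed number of iterations instead of a convergence test) could be:

```python
import numpy as np

def em_mixture_missing(X, I, M, V, n_iter=100, seed=0):
    """EM for a mixture of product distributions with missing data for one class
    (a sketch of Equations 4.43-4.45).

    X : int array (N, F)   feature values; entries where I == 0 are ignored
    I : 0/1 array (N, F)   I[n, f] = 1 if feature f is observed for record n
    M : number of mixture components
    V : number of values each feature can take (assumed equal for all features)
    """
    rng = np.random.default_rng(seed)
    N, F = X.shape
    alpha = np.full(M, 1.0 / M)
    theta = rng.dirichlet(np.ones(V), size=(M, F))      # theta[m, f, x]

    for _ in range(n_iter):
        # E-step: responsibilities from the observed part of each record (Eq. 4.45)
        h = np.tile(alpha, (N, 1))                       # (N, M)
        for n in range(N):
            for f in range(F):
                if I[n, f]:
                    h[n] *= theta[:, f, X[n, f]]
        h /= h.sum(axis=1, keepdims=True)

        # M-step: update alpha (Eq. 4.44) and theta (Eq. 4.43)
        alpha = h.mean(axis=0)
        new_theta = np.zeros_like(theta)
        for f in range(F):
            observed = I[:, f].astype(bool)
            for x in range(V):
                num = (h[observed] * (X[observed, f] == x)[:, None]).sum(axis=0) \
                    + (h[~observed] * theta[:, f, x]).sum(axis=0)
                new_theta[:, f, x] = num / h.sum(axis=0)
        theta = new_theta
    return alpha, theta
```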
4.6.8 Comparison with Application 1
Since many assumptions in the underlying probability model and the fea-
ture relevance evaluation algorithm are different between the two appli-
cations presented in this chapter we do not compare their results in this
research. See Chapter 6 for our future plans in making this comparison
more feasible.
4.7 Computational Complexity Issues
As mentioned at the beginning of this chapter the implementation of the
MAC algorithm is the source of many challenges. Some of them are about
the computational complexity of finding the design having the highest ben-
efit. Equation 4.2 is often difficult to maximize directly: Example 1 illus-
trated in Section 4.3 is one of the very few cases for which we were able to
find an explicit form of the design having maximum benefit. In almost all
cases the benefit value has to be computed for each design and the number
of designs could be excessively large. Consider, for example, an extension
of Application 2 where at a given sampling step there are N missing val-
ues: if the target is finding the most effective subset of any size, then the number of designs is 2^N. Often this number can be reduced using domain knowledge. In all practical examples we met in our research and in the literature, the number of designs that are meaningful to evaluate for a given problem is much smaller due to domain constraints. Common examples of
such domain constraints on what is feasible to measure at each sampling
step were mentioned in Section 3.1. We report them here:
• Only one missing entry.
• All missing entries of a given pattern instance.
• A given number of missing entries (a batch).
• All missing entries of one variable.
With the help of these kinds of constraints the number of designs to evaluate becomes more tractable, but even then the computation can require considerable resources.
Another strategy to reduce the number of designs to evaluate is based
on the fact that some designs could be equivalent, i.e. leading to the same
benefit4. In some cases it is possible to know in advance whether two designs are equivalent, avoiding computing the same benefit twice. A simple example, related to Application 1 (see Section 4.5) and Application 2 (see Section 4.6), is when two pattern instances have exactly the same missing variables and identical values of the known variables; due to the independence assumption, sampling one of these pattern instances on a variable whose value is currently missing, or sampling the other pattern instance on the same variable, will lead to exactly the same benefit.
4Note that equivalent here means only that in a given sampling step the expected improvement of the estimates of the target concept is the same, given the data already collected. At a different step, or with different data previously collected, those designs could be not equivalent at all.
A further strategy to reduce the number of designs to evaluate at a given sampling step is to select uniformly at random a subset of the designs and to compute the benefit, and find its maximum, only for this subset. Even though the solution provided by this subsampling strategy is clearly sub-optimal, and the effect of the subset size on the quality of the solution is not yet well understood, we obtained preliminary experimental evidence (see Section 5.3.3) that this strategy is extremely effective. More research on this strategy is part of our future activities.
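As a simple illustration of this strategy, a sketch of the design-selection step with optional uniform subsampling (function and argument names are ours, not part of the MAC formulation) could be:

```python
import random

def select_design(designs, benefit, subset_size=None, rng=random):
    """Return the design with the highest estimated benefit.

    If subset_size is given, the benefit is evaluated only on a uniformly
    random subset of the candidate designs (the subsampling strategy above).
    """
    if subset_size is not None and subset_size < len(designs):
        designs = rng.sample(designs, subset_size)
    return max(designs, key=benefit)
```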
4.8 Summary
In this Chapter we implemented the MAC sampling algorithm on four
different problems. We added further restrictions to the assumptions made
for the general sampling problem described in Chapter 3. These restrictions
reflect common constraints among the four problems examined. These new
restrictive assumptions allowed us to derive practical solutions.
The first problem (Example 1) deals with the simple game (Number)
of guessing a number in a sequence of attempts: the application of the
MAC algorithm yields the binary search algorithm. The second problem
(Example 2) is about estimating the conditional probability of a binary
variable given another binary variable whose sampling we can control. The
third and fourth problems (Application 1 and 2) deal with estimating the
importance of new variables in prediction tasks. In Application 1 the MAC
algorithm, together with other algorithms from literature, is derived for the
target concept of learning the error rate of a MAP classifier while sampling
one new variable. In Application 2 the previous problem is extended to
the case where multiple variables are sampled and evaluated concurrently.
Experiments on the efficacy of sampling with the MAC algorithm with
respect to other policies are deferred to Chapter 5.
Chapter 5
Experiments
In this Chapter we describe the experiments conducted to support the
theoretical results and implementations shown in Chapter 3 and Chapter 4.
We start with experiments on Example 2 of Section 4.4 about estimating
the conditional probability between binary variables. Then we show results
about the two applications described in Section 4.5 and Section 4.6 on
evaluating new variables in a learning task.
We performed experiments using synthetic datasets and real-life datasets. Synthetic datasets allow us to study the behavior of the proposed algorithms in known and controlled settings; on these datasets the results are expected to confirm the theory (within the limits of the approximations made) or are used to investigate boundary behavior. Datasets from real-life problems are necessary to assess how well the method generalizes and performs in realistic cases.
In all experiments we will start with no values acquired on the vari-
ables under investigation. At each sampling step one missing value will
be selected by the sampling policy and then measured to get the actual
value. The new value together with those already collected will be used to
improve the estimate of the target concept to be learned. After measuring one value the number of unknown values is reduced by one.
The experiment ends when all missing values are disclosed. This means
that, for a dataset having initially N missing values, the sampling process
will consist of N sampling steps leading to N + 1 estimates of the con-
cept to learn (the additional estimate is made at the beginning, before any
measurement is made).
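The experimental protocol can be summarized by the following sketch (object and function names are illustrative, not part of our actual implementation):

```python
def run_experiment(dataset, policy, estimate_target):
    """Incremental sampling experiment: N missing values yield N + 1 estimates.

    dataset         : exposes the missing entries and can disclose their hidden values
    policy          : returns the missing entry to measure next
    estimate_target : computes the current estimate of the target concept
    """
    estimates = [estimate_target(dataset)]        # estimate before any measurement
    while dataset.missing_entries():
        entry = policy(dataset)                    # where to sample next
        dataset.disclose(entry)                    # measure and reveal the value
        estimates.append(estimate_target(dataset))
    return estimates
```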
In the following Sections we introduce the details of the experiments: dataset descriptions, experimental protocols and results. Final comments on the results are deferred to Chapter 6.
Note that these experiments cover all the data on which our method was tested. In a few cases only a selection of the resulting plots is shown, for lack of space; the selection is meant to be representative of all the results obtained.
The experiments and algorithms were implemented in Python on top of the open-source packages NumPy [44] and SciPy [25].
Performance Assessment and practical use. Every sampling policy consid-
ered in this research computes the benefit of sampling each missing value
at each sampling step1. Frequently many different missing entries share exactly the same benefit. Since we assume that the experimenter is allowed
to measure just one missing entry at a time, she picks one uniformly at
random. This random choice introduces a non-deterministic step in the
process. Since we are interested in investigating the expected performance
of the sampling policies, we need to repeat each experiment many times to
average out random fluctuations due to this random selection. The repe-
titions needed by the performance assessment task lead to a huge increase
of computation during experiments (usually a factor of 100 or 1000). If we
were interested in using a sampling policy just to make the actual decision on what to sample next, without being interested in average performance assessment, there would be no need to repeat each experiment. This means that the practical use of an active sampling policy is more feasible than doing performance assessment.
1In the case of the random sampling policy we can simply assume that it assigns an equal benefit to all missing values at each step.
5.1 Estimation of Conditional Probabilities
We compare the performance of MAC and random sampling policies when
learning the conditional probabilities of two binary variables X1 and X2,
while sampling data incrementally, as described in Section 4.4.
We generate datasets of N = 100 instances; each dataset is defined by
three parameters:
• a = P (X2 = 0|X1 = 0)
• b = P (X2 = 1|X1 = 1)
• c = P (X1 = 0) = 1 − P (X1 = 1)
We considered 125 instances of the parameters (a, b, c) covering uniformly the parameter space. Each parameter takes 5 values (0.1, 0.3, 0.5, 0.7, 0.9) and every combination of the three parameters is tested, leading to 5^3 = 125 generated datasets and experiments.
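For illustration, one of the 125 datasets could be generated with a sketch like the following (in Python/NumPy; names and the seeding convention are ours):

```python
import numpy as np

def generate_dataset(a, b, c, N=100, seed=0):
    """Generate N records <x1, x2> with a = P(X2=0|X1=0), b = P(X2=1|X1=1), c = P(X1=0)."""
    rng = np.random.default_rng(seed)
    x1 = (rng.random(N) >= c).astype(int)         # X1 = 0 with probability c
    p_x2_one = np.where(x1 == 0, 1.0 - a, b)      # P(X2 = 1 | X1)
    x2 = (rng.random(N) < p_x2_one).astype(int)
    return np.column_stack([x1, x2])
```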
The experiment proceeds as follows. Given a dataset and a sampling
policy we start with all values unknown, i.e., we hide all values and disclose
one at a time according to the requests of the sampling policy. The initial
estimates of conditional probabilities, i.e., a0 and b0, are then defined just
by the Beta(A, B) prior distribution as described in Equation 4.13. In these
experiments we assume that A = B = 1, meaning that a0 = b0 = 1/2. We
compute the root mean square (rms) difference between these estimates
and the true a and b values. At this point the sampling policy decides
whether to sample at X1 = 0 or X1 = 1. The corresponding value of X2 is
disclosed and a full record < x1, x2 > is revealed in the dataset. Using this
new information, new estimates a1 and b1 are computed followed by the
new rms error. After this first step, iteratively, the sampling policy decides
again where to sample next: another new pair < x1, x2 > is disclosed and
new estimates of the conditional probabilities and of the rms error can be
computed. After N sampling steps all records in the dataset are disclosed,
the estimates of a and b converge to the true values2 and the experiment
ends.
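Assuming that Equation 4.13 is the usual posterior-mean estimate under the Beta(A, B) prior (which is consistent with a0 = b0 = 1/2 for A = B = 1), the estimates a_k and b_k at a given sampling step can be sketched as:

```python
def beta_posterior_mean(n_match, n_total, A=1, B=1):
    """Posterior-mean estimate of a conditional probability under a Beta(A, B) prior.

    For a_k: n_match = #{disclosed records with X1 = 0 and X2 = 0},
             n_total = #{disclosed records with X1 = 0}; analogously for b_k.
    """
    return (n_match + A) / (n_total + A + B)
```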
For each experiment conducted we observe how the rms error between
the estimates of the parameters and the true values evolves at each sam-
pling step. See Figures 5.1, 5.2 and 5.3 as an example of these results.
We observe that the 125 results of the experiments can be summarized in six groups, as described in Table 5.1: experiments having similar values of the parameters a, b and c show similar plots of the rms error across the sampling steps.
Group#    a        b        c
1         ≠ 1/2    ≠ 1/2    ≠ 1/2
2         ≠ 1/2    ≠ 1/2    ≃ 1/2
3         ≠ 1/2    ≃ 1/2    > 1/2
4         ≠ 1/2    ≃ 1/2    < 1/2
5         ≃ 1/2    ≃ 1/2    ≠ 1/2
6         ≃ 1/2    ≃ 1/2    ≃ 1/2
Table 5.1: Brief description of the six groups of results. The first column shows the group number. The columns under a, b and c show, for each group, whether the value of the parameter is different from, close to, greater than or less than 1/2.
2Since the target of this experiment is to estimate a and b with a finite number of samples, we call
true values of a and b those obtained when all samples in the dataset are disclosed.
Representative plots of the rms errors at each sampling step for each of
the six groups are shown in Figures 5.1, 5.2 and 5.3.
In the case of the random policy all missing entries are considered equivalent: at each step one record (a pair <X1, X2>) is selected uniformly at random from the list of the unknown records and then disclosed. In the case of the MAC sampling policy the benefit of sampling at X1 = 0 or X1 = 1 has been computed using Equations 4.14 and 4.15. Two benefits are considered equivalent if they differ by less than 10^{-9}, to take into account numerical instabilities.
Note that each experiment was repeated 1000 times to average out the
random fluctuations due to random choices of the sampling policies.
5.1.1 Detailed Results
The following six groups of results describe six different patterns of behavior when comparing the evolution of the rms errors of the random and MAC sampling policies.
Group 1: a ≠ 1/2, b ≠ 1/2, c ≠ 1/2. In this group the true values of a and b differ from the initial estimates a0 = b0 = 1/2. In the case a = 0.1, b = 0.9 and c = 0.9 (see top panel of Figure 5.1) the random policy picks X1 = 0 more frequently since P(X1 = 0) > P(X1 = 1). This quickly reduces the estimation error on a but only extremely slowly on b (see top panel of Figure 5.4). Nearly 40 records have to be sampled in order to reduce the initial estimation error by 90%. The MAC sampling policy outperforms the random policy, requiring fewer than 20 records.
Group 2: a ≠ 1/2, b ≠ 1/2, c ≃ 1/2. The behavior of the MAC sampling policy is identical to the one seen in Group 1 since this policy is not affected by P(X1) (i.e., the value of c). The random policy does not incur the
[Plots for Figure 5.1: r.m.s. error vs. number of records sampled; top panel N=100, a=0.1, b=0.9, c=0.9, 1000 iterations; bottom panel N=100, a=0.1, b=0.9, c=0.5, 1000 iterations.]
Figure 5.1: The plots show the rms difference between the true conditional probabilities
and their estimates as the number of sampled records increases (averaged on 1000 rep-
etitions). The upper plot represents the typical behavior of Group 1 where the true
values of a, b and c are far from 1/2. The lower plot represents the behavior of Group 2
where a and b are far from 1/2 but c ≃ 1/2. MAC policy (•) is compared against random
policy (+).
[Plots for Figure 5.2: r.m.s. error vs. number of records sampled; top panel N=100, a=0.1, b=0.5, c=0.9, 1000 iterations; bottom panel N=100, a=0.1, b=0.5, c=0.1, 1000 iterations.]
Figure 5.2: The plots show the rms difference between the true conditional probabilities and their estimates as the number of sampled records increases (averaged on 1000 repetitions). The upper plot represents the typical behavior of Group 3 where a is far from 1/2, b ≃ 1/2 and c > 1/2. The lower plot represents the behavior of Group 4 where a is far from 1/2, b ≃ 1/2 and c < 1/2. MAC policy (•) is compared against random policy (+).
[Plots for Figure 5.3: r.m.s. error vs. number of records sampled; top panel N=100, a=0.5, b=0.5, c=0.1, 1000 iterations; bottom panel N=100, a=0.5, b=0.5, c=0.5, 1000 iterations.]
Figure 5.3: The plots show the rms difference between the true conditional probabilities
and their estimates as the number of sampled records increases (averaged on 1000 repeti-
tions). The upper plot represents the typical behavior of Group 5 where the true values
of a = b = 1/2 and c is far from 1/2. The lower plot represents the behavior of Group
6 where a = b = c = 1/2. MAC policy (•) is compared against random policy (+).
penalty of sampling at X1 = 0 more often than at X1 = 1, since P(X1 = 0) = P(X1 = 1) = c ≃ 1/2. The gain in performance of the MAC policy becomes almost negligible, showing that it approximates a policy that alternates sampling at X1 = 0 and X1 = 1. See the bottom panels of Figures 5.1 and 5.4.
Group 3: a ≠ 1/2, b ≃ 1/2, c > 1/2. Knowing in advance that the initial estimate of b is good, i.e., b0 ≃ b ≃ 1/2, a cheating sampling policy would sample initially (and for some time) only at X1 = 0 in order to immediately improve just the estimate of a. Since P(X1 = 0) = c > 1/2, the number of records having X1 = 0 is greater than the number having X1 = 1. This unbalanced dataset forces the random policy to behave like the cheating policy just described. This is the reason why in the first sampling steps of Figures 5.2 and 5.5 (top panels, a = 0.1, b = 0.5 and c = 0.9) the random policy is clearly more efficient than the MAC policy. Since the random sampling policy samples X1 = 1 less frequently than the MAC policy, its squared difference (b_k − b)^2 is greater on average after a few sampling steps. For this reason MAC sampling performs better at later sampling steps.
Group 4: a ≠ 1/2, b ≃ 1/2, c < 1/2. The behavior observed in this group of experiments is exactly the opposite of Group 3: the random policy is the opposite of the ideal (cheating) sampling policy. The MAC policy is unaffected by P(X1) and performs the same way as in Group 3. The result is analogous to Group 1, where the MAC policy outperforms the random policy. See Figures 5.2 and 5.5 (bottom panels, a = 0.1, b = 0.5 and c = 0.1).
Group 5: a ≃ b ≃ 1/2, c ≠ 1/2. Since the initial estimates a0, b0 are already good, sampling new values can only increase the rms difference between them and the true values. In the long term the unbalanced dataset drives
the random policy to an average rms estimation error larger than that of the MAC policy (see Figure 5.3, top panel). The overall behavior is similar to that of Group 1, where the MAC policy outperforms the random policy. The difference in the first sampling steps, where the random policy performs better than MAC, is explained by the fact that the random policy decreases more rapidly the squared difference between the estimate and the true value of a (if c > 1/2, otherwise of b), since it samples X1 = 0 more frequently than the MAC policy does (see Figure 5.6, top panel). This aspect is analogous to the behavior seen in Group 3.
Group 6: a ≃ b ≃ c ≃ 1/2. Since the dataset is balanced (P(X1 = 0) = P(X1 = 1) = c = 1/2), the random sampling policy samples X1 = 0 and X1 = 1 equally often (on average). We observe that the difference between the two policies is negligible, showing again that the MAC sampling policy approximates a policy that alternates sampling on the two values of X1.
5.1.2 Summary of Results
As a final result we quantify and characterize in which cases the MAC sampling policy is better than, equivalent to, or (partially) worse than the random policy across the 125 generated datasets. Table 5.2 summarizes the figures.
In no case is the MAC policy consistently worse than the random policy across all sampling steps. In a few cases (Groups 3 and 5) the random policy estimates the conditional probability more efficiently than the MAC policy, but only in the very first steps (always before step 15). After these initial steps, the MAC policy is definitely more efficient than the random policy. These cases are those of Group 3 characterized by b = 0.5, c = 0.1, 0.3 (10 cases), plus those of Group 5 where a = 0.5, b = 0.5, c = 0.1, 0.3, 0.7, 0.9 (4 cases, 2 of which are in common with Group 3).
[Plots for Figure 5.4: average estimates of a and b vs. number of records sampled, with one-standard-deviation error bars; top case N=100, a=0.1, b=0.9, c=0.9; bottom case N=100, a=0.1, b=0.9, c=0.5; 1000 iterations.]
Figure 5.4: Plots of the average estimates of parameters a and b while sampling new
data. MAC algorithm (•) is compared to random algorithm (+). Error bars, representing
one standard deviation, are plotted. Results of the case a = 0.1, b = 0.9 and c = 0.9
(representing Group 1) are plotted on top panel. Results for a = 0.1, b = 0.9 and c = 0.5
(representing Group 2) are on bottom panel. Average and error bars are computed over
1000 repetitions of the experiments.
[Plots for Figure 5.5: average estimates of a and b vs. number of records sampled, with one-standard-deviation error bars; top case N=100, a=0.1, b=0.5, c=0.9; bottom case N=100, a=0.1, b=0.5, c=0.1; 1000 iterations.]
Figure 5.5: Plots of the average estimates of parameters a and b while sampling new
data. MAC algorithm (•) is compared to random algorithm (+). Error bars, representing
one standard deviation, are plotted. Results of the case a = 0.1, b = 0.5 and c = 0.9
(representing Group 3) are plotted on top panel. Results for a = 0.1, b = 0.5 and c = 0.1
(representing Group 4) are on bottom panel. Average and error bars are computed over
1000 repetitions of the experiments.
[Plots for Figure 5.6: average estimates of a and b vs. number of records sampled, with one-standard-deviation error bars; top case N=100, a=0.5, b=0.5, c=0.1; bottom case N=100, a=0.5, b=0.5, c=0.5; 1000 iterations.]
Figure 5.6: Plots of the average estimates of parameters a and b while sampling new
data. MAC algorithm (•) is compared to random algorithm (+). Error bars, representing
one standard deviation, are plotted. Results of the case a = 0.5, b = 0.5 and c = 0.1
(representing Group 5) are plotted on top panel. Results for a = 0.5, b = 0.5 and c = 0.5
(representing Group 6) are on bottom panel. Average and error bars are computed over
1000 repetitions of the experiments.
The MAC algorithm and the random algorithm behave equivalently in Group 2 and Group 6. Group 2 is characterized by c = 0.5 (25 cases); Group 6 by a = 0.5, b = 0.5 and c = 0.5 (1 case, already included in Group 2).
In all other configurations of the experiments (88 cases) the MAC sampling algorithm outperforms random sampling.
MAC performance      # cases   % of 125   description
always worse            0         0%      never
worse, then better     12        10%      b = 0.5 with c = 0.1, 0.3; a = 0.5, b = 0.5 with c ≠ 0.5
equivalent             25        20%      c = 0.5
better                 88        70%      all other cases
Table 5.2: Summary of the results of the comparison of the MAC sampling algorithm vs. the random policy over the 125 generated datasets. The first column indicates the possible outcomes of the comparison: “always worse”, i.e., the MAC algorithm is always less efficient than the random policy across all sampling steps; “worse, then better”, i.e., MAC is less efficient only in the first sampling steps (fewer than 15 steps) and then becomes more efficient; “equivalent”, i.e., the difference between the two policies is always negligible; “better”, i.e., the MAC algorithm is consistently better than the random algorithm across all steps. The second column indicates the number of cases falling in each class, the third the percentage over the 125 cases, and the last column describes each class briefly.
5.2 Single Feature Relevance Estimation
We evaluate the performance of the MAC sampling policy compared to other sampling policies in a learning task. The target is to estimate the contribution of adding a new variable X̃ when building a classifier on a class-labelled dataset. Each instance in the dataset is described by the class label C and some already known variables X. See Section 4.5 for details.
We conducted experiments on many datasets: synthetic data generated
according to the distributions presented in Figure 5.7, on several datasets
from the University of California - Irvine (UCI) repository [42], on data
from agriculture describing Apple Proliferation disease and on data from
biomedical experiments on cancer biomarkers.
We compare five sampling policies described in Section 4.5.3:
1. random
2. MAC
3. IME
4. SFL with max-depth = 1
5. SFL with max-depth = 5.
The Single Feature Lookahead (SFL) algorithm was not investigated at
higher max-depth values because the performance was comparable to the
case with max-depth = 5, but required excessive computation.
The evaluation metric is computed as follows. For each choice of the candidate feature X̃ we calculated the expected error rate ε_full of a maximum a posteriori classifier trained on the entire database (i.e., with all the values of C, X and X̃ known). Then, for a given sample size L, we sampled X̃ values on L samples from the database (by each sampling policy) and calculated the predicted error rate ε_L for each method. We then computed the root mean square difference between ε_L and ε_full over several runs of the sampling scheme. Under the assumption of unit cost for feature value acquisition, the rms difference measures the efficacy of a sampling scheme in estimating the error rate of a classifier trained on both X and X̃ as a function of the number of feature values acquired.
We note that ε_full can be viewed as the true error rate of the classifier that uses the new variable X̃, and ε_L as the estimate of that error rate after sampling X̃ on L samples. Since our goal is to predict the error rate
accurately while minimizing L, we can measure the effectiveness of our sampling algorithm by the rms difference between ε_full and ε_L.
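In code, the evaluation metric amounts to the following sketch (names are illustrative):

```python
import numpy as np

def rms_curve(eps_full, eps_L_runs):
    """Root mean square difference between predicted and true error rate per sample size L.

    eps_full   : error rate of the classifier trained on the fully observed dataset
    eps_L_runs : array (n_runs, n_steps) with eps_L for every run and sample size L
    """
    return np.sqrt(np.mean((eps_L_runs - eps_full) ** 2, axis=0))
```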
In the following we describe each dataset in detail and provide results
of the experiments made.
Figure 5.7: True class-conditional (X, X̃1) and (X, X̃2) variable distributions. The data points before and after measuring the candidate features are also shown.
5.2.1 Synthetic Data
This dataset implements the introductory example of Section 1.1.1. We created two datasets of N = 80 instances (or records), each with two columns corresponding to the class label C and the variable X, plus a third column corresponding to the new unknown variable X̃i, i = 1, 2. The values in the datasets follow the distributions shown in Figure 5.7. This means that the classifier built on (C, X) has a non-null error rate, that X̃1 does not improve the accuracy of the classifier and that X̃2 does improve classification accuracy. X̃1 is called the useless feature and X̃2 the useful feature. The variables were discretized (indicated by dashed lines in Figure 5.7). For each dataset and each sampling policy we performed one experiment as described before.
The plots of the rms difference as a function of the number of samples probed (L) are shown in Figure 5.8. The top plot shows the average rms errors in the error rates estimated from data obtained by the random and active sampling algorithms for feature X̃1 and the bottom plot is for feature X̃2. The averages are computed over the 1000 repetitions of each experiment. In each plot, to compare the different sampling schemes for cost effectiveness, we must compare the number of feature values sampled for a required rms error. The value of the rms error at L = 0 indicates the utility of the new feature X̃, i.e., the total reduction in error rate obtained by adding the new feature.
In both cases shown in Figure 5.8 the MAC policy outperforms all other policies, except IME in the case of the useless variable. Note that for X̃1 (Figure 5.8, top panel) the rms difference climbs from zero to a higher value before converging to zero again because when no samples have been extracted the feature is deemed useless (which it really is). With a few samples, however, the estimate of the relevance is incorrect.
5.2.2 UCI Benchmark Data
We experimented with several UCI machine learning datasets with cate-
gorical features and class labels. The aim of using these datasets is to test the sampling policies on benchmark data frequently used by the Machine Learning community. Here is a list with a brief description of each:
• Solar Flare: data collected counting the number of solar flares of a
certain class that occur in a 24 hour period. Number of instances:
1389. Number of attributes: 10.
• Balance Scale: data generated to model psychological experimental
results. Each example is classified as having the balance scale tip to
the right, tip to the left, or be balanced. Number of instances: 625.
[Plots for Figure 5.8: rms difference vs. sample size (L); top panel "Synthetic Data X_new=1 (Useless)", bottom panel "Synthetic Data X_new=2 (Useful)"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.8: Synthetic Data. The root mean square difference between the estimated and
true error rate after the candidate feature is added as a function of the number of samples
probed for all sampling policies. Rms difference is averaged over 1000 repetitions of the
experiment.
90
CHAPTER 5. EXPERIMENTS 5.2. SINGLE REL.EST.
Number of attributes: 4
• MONK's problems: artificially generated data for the first international comparison of learning algorithms [57]. Number of instances: 432. Number of attributes: 7.
• Breast Cancer Wisconsin: data describing histological information on
benign and malignant breast cancer samples. Number of instances:
699 (with missing values). Number of attributes: 10.
• Mushroom: mushrooms described in terms of physical characteristics
and class labelled as poisonous or edible. Number of instances: 8124.
Number of attributes: 22.
• Zoo: artificial dataset describing 7 classes of animals. Number of
instances: 101. Number of attributes: 17.
For each dataset we performed the sampling experiment for some configurations of randomly chosen pairs of features. From each pair, the first variable was used as the known feature X and the second as the candidate feature X̃3.
3For a few cases, we chose three features and used the first two as already known features and the last one as the candidate feature.
For each configuration and sampling method we plotted the rms difference between the true and estimated error rate for different numbers of acquired samples. Results are shown in Figures 5.9, 5.10, 5.11, 5.12, 5.13 and 5.14. Since the computational complexity of the active sampling algorithms is at least linear in the number of instances, the number of experiments we performed is larger for small datasets and small feature spaces. For the same reason the experiments on larger datasets end before acquiring all samples.
For lack of space we present only a selection of the most representative plots in Figures 5.9, 5.10, 5.11, 5.12, 5.13 and 5.14. As those
figures show, the MAC sampling policy performs better than the random policy in every case and better than any other policy in the majority of the cases. The MAC policy is always better for assessing the relevance of useless variables.
5.2.3 Data from Agriculture Domain
We evaluated the different sampling strategies on a small dataset provided
by the biologists which, after preprocessing, has 520 instances (apple trees)
with one binary class variable: the presence or absence of the phytoplasma
determined by the Enzyme-Linked ImmunoSorbent Assay (ELISA) chem-
ical test [15, 52]. Three binary features were considered:
• Presence of witches' brooms on the trees, i.e., a proliferation of secondary shoots near the apex of the main shoot, in summer.
• Reddening of leaves.
• Presence of enlarged stipulae.
We performed the sampling experiments using the six possible combi-
nations of pairs of features. For each pair the first feature was used as
previously known feature X and the second as new feature X . In Table 5.3
we show the number of values of the candidate feature that have to be
acquired for the rms difference between the estimated and true error rate
to be less than 0.005. For comparison we note that the true error rate is
calculated using all 520 samples.
An example of the results is plotted in Figure 5.15, showing the performance of the sampling policies when the presence of witches' brooms is known and the reddening of leaves is acquired incrementally. In all of these experiments the IME and SFL (d = 5) sampling policies performed worse than the random policy. In all six cases the MAC policy outperforms all other policies.
[Plots for Figure 5.9: rms difference vs. sample size; top panel "UCI Solar Flares dataset : X = 3 - X_new = 4", bottom panel "UCI Solar Flares dataset : X = 7 - X_new = 8"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.9: Solar Flares dataset. The average root mean square difference between the
estimated and true error rate after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms difference is averaged over 100
repetitions of the experiment.
[Plots for Figure 5.10: rms difference vs. sample size; top panel "UCI Balance dataset : X = 1 - X_new = 3", bottom panel "UCI Balance dataset : X = 3 - X_new = 2"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.10: Balance Scale dataset. The average root mean square difference between the
estimated and true error rate after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms difference is averaged over 100
repetitions of the experiment. Due to the large size of the feature space only the first 100
samples are acquired.
[Plots for Figure 5.11: rms difference vs. sample size; top panel "UCI Monks dataset : X = 5 - X_new = 6", bottom panel "UCI Monks dataset : X = 6 - X_new = 2"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.11: Monks dataset. The average root mean square difference between the
estimated and true error rate after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms difference is averaged over 100
repetitions of the experiment.
[Plots for Figure 5.12: rms difference vs. sample size; top panel "UCI Breast Cancer dataset : X = 3 - X_new = 4", bottom panel "UCI Breast Cancer dataset : X = 5 - X_new = 3"; policies: random, MAC, IME, SFL d=1.]
Figure 5.12: Breast Cancer Wisconsin dataset. The average root mean square difference
between the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
100 repetitions of the experiment. Due to the large size of the feature space only the first
100 samples are acquired.
[Plots for Figure 5.13: rms difference vs. sample size; top panel "UCI Mushroom dataset : X = 21 - X_new = 5", bottom panel "UCI Mushroom dataset : X = 15 - X_new = 6"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.13: Mushroom dataset. The average root mean square difference between the
estimated and true error rate after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms difference is averaged over 100
repetitions of the experiment. Due to the large size of the feature space only the first 100
samples are acquired.
[Plots for Figure 5.14: rms difference vs. sample size; top panel "UCI Zoo dataset : X = 13 - X_new = 15", bottom panel "UCI Zoo dataset : X = 2,11 - X_new = 1"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.14: Zoo dataset. The average root mean square difference between the estimated
and true error rate after the candidate feature is added as a function of the number of
samples probed for all sampling policies. Rms difference is averaged over 100 repetitions
of the experiment.
[Plot for Figure 5.15: rms difference vs. sample size (L); panel "S.M.A.P. project : X = 1 - X_new = 2"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.15: Apple Proliferation dataset. The average root mean square difference be-
tween the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
100 repetitions of the experiment.
X    X̃     Random   MAC   IME   SFL (d = 1)   SFL (d = 5)
1    2      414      379   470   420           463
1    3      394      308   467   398           470
2    1      444      332   483   444           485
2    3      393      320   462   401           464
3    1      445      333   484   443           486
3    2      414      394   464   420           458
Table 5.3: Agriculture Data. The number of samples required (out of a total of 520)
for the rms difference between the true and estimated error rate to be less than 0.005
for various sampling algorithms. Each row corresponds to one selection of the previous
feature X and the candidate feature X̃. The rms values are computed over 1000 runs.
5.2.4 Data from Biomedical Domain
To provide evidence that our method is effective in reducing costs in the biomedical domain we experimentally evaluated it on a breast cancer Tissue Microarray dataset4. Although the biomarker evaluation problem is not relevant for this particular dataset, we use it to demonstrate the utility of our approach.
4The data used for experimentation was collected by the Department of Histopathology and the Division of Medical Oncology, St. Chiara Hospital, Trento, Italy. Tissue Microarray experiments were conducted at the Department of Histopathology [12].
The dataset was acquired using the recently developed technique of
Tissue Microarray [28] that improves the in-situ experimentation process
by enabling the placement of hundreds of samples on the same glass slide.
Core tissue biopsies are carefully selected in morphologically representative
areas of original samples and then arrayed into a new "recipient" paraffin block, in an ordered array allowing for high-throughput in situ experiments.
For each patient there is a record that describes her clinical, histological and biomarker information. The entire dataset consisted of 400 records defined by 11 features. Each of the clinical features is described by a binary
status value and a time value. Some of the records have missing values.
The data are described by features as in Table 5.4.
Clinical Features
1. the status of the patient (binary, dead/alive) after a certain amount of time
(in months, integer from 1 to 160)
2. the presence/absence of tumor relapse (binary value) after a certain amount
of time (in months, integer from 1 to 160 months)
Histological Features
3. diagnosis of tumor type made by pathologists (nominal, 14 values)
4. pathologist’s evaluation of metastatic lymph nodes (integer valued)
5. pathologist’s evaluation of morphology (called grading, ordinal, 4 values)
Biomarker Features (manually measured by experts in TMA)
6. Percentage of nuclei expressing the ER (estrogen receptor) marker.
7. Percentage of nuclei expressing the PGR (progesterone receptor) marker.
8. Score value (combination of color intensity and percentage of stained area measurements) of the P53 (tumor suppressor protein) marker in cell nuclei.
9. Score value (combination of color intensity and percentage of stained area measurements) of the cerbB marker in cell membranes.
Table 5.4: Features describing biomarkers data.
The learning task defined on this dataset is the prediction of the status
of the patient (dead/alive or relapse) given some previous knowledge (his-
tological information or known biomarkers). The goal is to choose the new biomarker which, used along with the histological features, provides accurate prediction. The experiments address the issue of learning which additional feature has to be sampled.
The dataset was preprocessed as follows. Continuous features were discretized to reduce the level of detail and to narrow the configuration space for the sampling problem. Discretized features were encoded into binary variables according to the convention suggested by experts in the domain.
We designed 10 experiments corresponding to different learning situa-
tions. The experiments differ in the choice of attribute for the class label
(C), the attributes used as the previous features (X) and the feature used as the new candidate feature (X̃). The various configurations are shown in Table 5.5.
Exp. Class Label (C) Known Features (X) New Feature (X̃) Size (#)
I dead/alive all histological information PGR 160
II dead/alive all histological information P53 164
III dead/alive all histological information ER 152
IV dead/alive all histological information cerbB 170
V relapse all histological information PGR 157
VI relapse all histological information P53 161
VII relapse all histological information ER 149
VIII relapse all histological information cerbB 167
IX dead/alive PGR, P53, ER cerbB 196
X relapse PGR, P53, ER cerbB 198
Table 5.5: Configurations of experiments on biomarkers data.
For the empirical evaluation we performed an additional preprocessing
step of removing all the records with missing values for each experiment
separately. For this reason the sizes of datasets used for different experi-
ments are different.
Experiments and Results
For each of the 10 experimental configurations described above, all sampling policies are compared for different numbers of acquired samples.
For each experiment we plotted the average rms value against the number of samples probed; the plots are shown in Figures 5.17, 5.18, 5.19, 5.20 and 5.21. The average is performed over 500 repetitions of each experiment. In each plot, to compare the MAC sampling scheme to the other methods for cost effectiveness, we must compare the number of feature values sampled for
a required rms error.
We observe from the plots that our MAC active sampling algorithm is significantly better than all other sampling policies, in almost all cases, in reducing the number of samples needed for error rate estimation.
In terms of the biomedical problem, this implies that using MAC active sampling we can evaluate a larger number of biomarkers with the same amount of bio-sample resources than with the standard random sampling method or with other methods available in the literature.
5.3 Multiple Feature Relevance Estimation: Experiments
We evaluate the performance of the MAC sampling policy compared to the random sampling policy in a learning task. The target is to assess the
relevance of a set of variables with respect to predicting the class labels
describing a set of instances. The relevance assessment is done on all
variables while sampling them concurrently as described in Section 4.6.
We conducted experiments on synthetic data and on datasets from the
University of California - Irvine (UCI) repository [42]. We plan to conduct
experiments on datasets from apple proliferation disease in agriculture and
from biomarkers in medicine (see Section 4.5.1) only as a future activity.
For a particular dataset, the experimental setup is as follows. We start
with the assumption that the class labels for all the samples are initially
known and all of the feature values are missing. At each sampling step a
single missing entry in the dataset is selected by the sampling policy and
the actual value in the dataset is disclosed. The experiment ends when
all entries of the dataset are sampled and all the original feature values
are fully disclosed. After each sample is disclosed we estimate the feature
relevances from all the data that is currently available, which are compared
[Plots for Figure 5.17: rms difference vs. sample size (L); top panel "Biomarkers exp.1", bottom panel "Biomarkers exp.2"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.17: Biomarkers experiment I and II. The average root mean square difference
between the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
500 repetitions of the experiment.
[Plots for Figure 5.18: rms difference vs. sample size (L); top panel "Biomarkers exp.3", bottom panel "Biomarkers exp.4".]
Figure 5.18: Biomarkers experiment III and IV. The average root mean square difference
between the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
500 repetitions of the experiment.
[Plots for Figure 5.19: rms difference vs. sample size (L); top panel "Biomarkers exp.5", bottom panel "Biomarkers exp.6".]
Figure 5.19: Biomarkers experiment V and VI. The average root mean square difference
between the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
500 repetitions of the experiment.
[Plots for Figure 5.20: rms difference vs. sample size (L); top panel "Biomarkers exp.7", bottom panel "Biomarkers exp.8".]
Figure 5.20: Biomarkers experiment VII and VIII. The average root mean square differ-
ence between the estimated and true error rate after the candidate feature is added as
a function of the number of samples probed for all sampling policies. Rms difference is
averaged over 500 repetitions of the experiment.
[Plots for Figure 5.21: rms difference vs. sample size (L); top panel "Biomarkers exp.9", bottom panel "Biomarkers exp.10".]
Figure 5.21: Biomarkers experiment IX and X. The average root mean square difference
between the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
500 repetitions of the experiment.
to the true feature relevance values (the feature relevances estimated from
the entire dataset). The comparison measure is the average rms error,
which is plotted as a function of the number of missing entries filled thus
far. The average is computed over 100 sampling runs to reduce fluctuations
introduced by random selection of entries, in case of multiple equivalent
choices occurring at certain steps. The plots show the comparison of our
active sampling algorithm (MAC) to the random sampling algorithm.
Although the models we presented are general, we only experimented with mixture distributions (cf. Section 4.6.4) with one component per class (i.e., a Naïve Bayes model). We did not perform experiments with a higher number of components because of estimation problems during the initial sampling steps and also because of computational issues. In the future we intend to develop methods to adjust the number of components depending on the amount of data available at any sampling step.
5.3.1 Synthetic Data
We now describe how the synthetic dataset was generated. We created a
dataset of size N = 200 samples with binary class labels and three binary
features with exactly 100 records per class (i.e., p(c = 0) = p(c = 1) =
0.5). The features are mutually class-conditionally independent and have different relevances to the class labels.
The feature values are generated randomly according to the following scheme. For feature Xi we generate the feature values according to the probability P(Xi = 0|C = 0) = P(Xi = 1|C = 1) = pi. Clearly, if pi is closer to 0 or 1 the feature is more relevant for classification than if pi is closer to 0.5. For our three features we chose p1 = 0.9, p2 = 0.7 and p3 = 0.5, meaning that the first feature is highly relevant and the third is completely irrelevant for classification. The true feature relevances (mutual information values) are r1 = 0.37, r2 = 0.08 and r3 = 0, respectively.
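The generation scheme and the true relevances can be sketched as follows (Python/NumPy; names are illustrative; mutual information is in natural units, consistent with the values above):

```python
import numpy as np

def make_synthetic(ps=(0.9, 0.7, 0.5), n_per_class=100, seed=0):
    """Dataset with binary class labels and binary features X_i such that
    P(X_i = 0 | C = 0) = P(X_i = 1 | C = 1) = ps[i]."""
    rng = np.random.default_rng(seed)
    c = np.repeat([0, 1], n_per_class)                    # balanced class labels
    X = np.empty((2 * n_per_class, len(ps)), dtype=int)
    for i, p in enumerate(ps):
        agree = rng.random(2 * n_per_class) < p           # feature "agrees" with its rule
        X[:, i] = np.where(c == 0, 1 - agree, agree)      # C=0: X_i=0 w.p. p; C=1: X_i=1 w.p. p
    return X, c

def true_relevance(p):
    """I(X_i; C) in nats: H(X_i) = log 2 because P(X_i = 0) = 1/2, minus H(X_i | C)."""
    if p in (0.0, 1.0):
        return np.log(2)
    return np.log(2) + p * np.log(p) + (1 - p) * np.log(1 - p)
```

For p1 = 0.9, p2 = 0.7 and p3 = 0.5 this yields approximately 0.37, 0.08 and 0, matching the values reported above.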
Since by construction there is no inter-feature dependence given the
class, we conducted experiments using a product distribution for each class
(i.e., a mixture of just one component). The average rms distance between the estimated and true feature relevances is plotted as a function of the number of feature values sampled in Figure 5.22, for both the random and MAC sampling policies.
The graph in Figure 5.22 shows that our proposed active scheme clearly
outperforms the random acquisition policy. For example, note that in order
to reduce the difference between estimated and true relevances to a fourth
of the initial value (when all feature values are missing), the random policy
requires 45 samples, whereas the MAC policy requires only 30.
In Figure 5.23 we show, separately, the estimates of each of the indi-
vidual feature relevances. In Figure 5.24 we show the average number of
times each feature is sampled as a function of the number of samples. We
observe that the frequency with which a feature is sampled is correlated
to its relevance. This is a desirable property because the least relevant
features will eventually be discarded and therefore sampling them would
be wasteful.
5.3.2 UCI Datasets
We performed experiments on the Zoo, Solar Flares, Monks and Cars
datasets from the UCI repository. See Section 5.2.2 for a brief description
of them. These datasets present larger class label spaces (from 2 to 6
classes) and an increased number of features (from 6 to 16). Also some of
the features take on more values (from 2 to 6 values) than our artificial
datasets. Figures 5.25 and 5.26 show the plots of the average rms error
between the estimated and ’true’ feature relevances as a function of the
number of samples acquired for both the MAC and the random sampling
schemes. The error values are normalized such that at step 0 (i.e., when
none of the missing entries has been filled) the error is 1.0.
[Figure 5.22, panel 'Synthetic'; x-axis: sample size, y-axis: rms difference; curves: random, active.]
Figure 5.22: Average rms differences between estimated and true relevances at each sam-
pling step on artificial data for random and MAC policies. Average performed over 100
repetitions of the experiment. Only the first 100 sampling steps are shown to improve
readability. Note that true relevances are those computed from the full dataset, and not
the theoretical ones which are slightly different.
[Figure 5.23, panel 'Synthetic'; x-axis: sample size, y-axis: relevance; curves: feature 1, feature 2, feature 3, each for the random and active policies.]
Figure 5.23: Estimated relevances at each sampling step for every single feature on ar-
tificial data. Random (dashed line) and MAC (solid-dotted line) policies are compared.
Since there are three features and 200 instances the x axis goes to 600.
Figures 5.25 and 5.26 illustrate the advantage of the active sampling
policy over the random scheme in reducing the number of feature samples
necessary for achieving comparable accuracy. We note that in order to
reduce the estimation error of feature relevances to one fourth of the initial
value, the number of samples required is 25% – 75% lower for the MAC
policy than for the random policy. Again, we observed in all datasets
that the most relevant features are sampled more frequently than the less
relevant ones.
5.3.3 Computational Complexity Issues
The computational complexity of the MAC sampling algorithm, due to the
expensive EM estimation (which is repeated for every missing entry, every
possible feature value and every iteration), limits its applicability to large
datasets.
[Figure 5.24, panel 'Synthetic'; x-axis: sample size, y-axis: sampling counts; curves: features 1, 2, 3 - random; feature 1 - active; feature 2 - active; feature 3 - active.]
Figure 5.24: Average cumulative sampling counts at each sampling step for each feature
on artificial data. The more relevant features are sampled more frequently than less
relevant features in case of MAC policy. As a comparison, the random policy samples
features independently of their relevance.
[Figure 5.25, two panels: Zoo and Monks; x-axis: sample size, y-axis: rms difference; curves: random, active.]
Figure 5.25: The normalized difference between final relevances and estimated relevances
at each sampling step is plotted for random and MAC policies on Zoo dataset (top panel)
and the Monks dataset (bottom panel). The value at step 0 (all feature values unknown) is
normalized to 1.0.
[Figure 5.26, two panels: Solar Flares and Cars; x-axis: sample size, y-axis: rms difference; curves: random, active.]
Figure 5.26: The normalized difference between final relevances and estimated relevances
at each sampling step is plotted for random and MAC policies on Solar Flares dataset
(top panel) and Cars dataset (bottom panel). The value at step 0 (all feature values
unknown) is normalized to 1.0.
One way we reduced the computational expense was to memoize5
the calculation of the benefit function for equivalent entries (i.e., entries
having the same non-missing feature values and therefore the same benefit
value). Another strategy to reduce computation is to perform sub-optimal
active sampling by considering only a random subset of the missing entries
at each time step. This latter strategy can be used to trade off sampling
cost against computational cost.
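As an illustration of the memoization idea (the benefit callable, the data layout and the cache key below are hypothetical placeholders, since the real benefit computation depends on the EM-based model of Section 4.6), a minimal sketch could be:

benefit_cache = {}  # maps (feature index, observed feature pattern) -> benefit value

def memoized_benefit(n, f, data, benefit):
    """Return the benefit of probing feature f of record n, reusing cached values.

    data is an N x F list of lists with None marking missing entries; benefit is
    the expensive EM-based scoring function.  Entries with the same observed
    pattern (and the same probed feature) share the same benefit, so the
    computation is done once per distinct key."""
    key = (f, tuple(data[n]))
    if key not in benefit_cache:
        benefit_cache[key] = benefit(n, f, data)
    return benefit_cache[key]

# The cache must be cleared after every actual acquisition, because the model
# parameters (and hence all benefit values) change:
# benefit_cache.clear()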
In Figure 5.27 (upper panel) the MAC and random policies are shown
together with the MAC policy that considers 0.1% and 1% of the missing
entries (randomly selected) at each sampling step; results are based on the
artificial dataset described in Section 5.3.1. We observe that the dominance
of the MAC policy over the random one increases with the sub-sample
size. A similar experiment was performed on the Solar Flares dataset (see
Figure 5.27, bottom panel), where the MAC and random policies are plotted
together with the MAC policy that considers 0.05%, 0.25% and 1% of the
missing entries (randomly selected). Again we observe that running the MAC
policy on a random sub-portion of the candidates (0.25% of the total
number of missing entries at any step) is an effective strategy to obtain
a reduction in the number of samples acquired at reduced computational
cost.
More investigation of this preliminary evidence is needed in order to
confirm and better understand this desirable property.
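A sketch of the sub-optimal variant, under the same hypothetical interface as the previous snippet, could select the next entry as follows; the 1% default mirrors the order of magnitude of the fractions explored in Figure 5.27.

import random

def select_entry_subsampled(missing_entries, data, benefit, fraction=0.01, rng=random):
    """Pick the next entry to probe, scoring only a random subset of the candidates.

    missing_entries is a list of (record, feature) pairs that are still unobserved;
    fraction is the portion of candidates evaluated at this step (0.01 = 1%)."""
    k = max(1, int(fraction * len(missing_entries)))
    candidates = rng.sample(missing_entries, k)
    # Score only the sampled candidates with the (possibly memoized) benefit function.
    return max(candidates, key=lambda nf: benefit(nf[0], nf[1], data))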
5Memoization [40, 43] is a programming technique that stores the input and the output of a deter-
ministic function in a hash table, in order to avoid repeating the exact same computation multiple times
for performance reasons. The next time the same computation is requested, the result is retrieved from
the hash table instead of being computed again.
[Figure 5.27, two panels: Synthetic and Solar Flares; x-axis: sample size, y-axis: rms difference; curves (top): random, active 0.1%, active 1%, active; curves (bottom): random, active 0.05%, active 0.25%, active.]
Figure 5.27: Average squared sum of the differences between estimated and true relevances
at each sampling step on the artificial and UCI Solar Flares datasets. Random and MAC
policies are compared to the active policy that considers only a small random subset of
the missing entries at every sampling step.
5.4 Summary
In this chapter we experimentally evaluated the efficacy of the MAC sam-
pling algorithm with respect to other policies on three problems described
in Chapter 4 (i.e., Example 2, Application 1 and 2). Example 2, estima-
tion of the conditional probability, was tested on 125 synthetic datasets.
The MAC sampling policy outperformed the random sampling policy in most
of the cases; in all other cases they showed an equivalent behavior. In
Application 1, assessing the relevance of a new variable, the comparison
of the MAC algorithm to other methods (random sampling and others from
the literature) was carried out on multiple datasets, from a simple syn-
thetic dataset to more complex cases. The latter are benchmark datasets
(from the UCI repository [42]) and two real life domains (agricultural data and
biomarkers data). In all cases the MAC algorithm outperformed random sam-
pling and proved to be clearly the best among the compared algorithms. In
Application 2, assessing the relevance of multiple variables concurrently, the
comparison of the MAC algorithm and the random algorithm was tested on syn-
thetic and UCI datasets. The MAC algorithm required significantly fewer
data acquisitions to reach the same accuracy as random
sampling. Preliminary evidence of the efficacy of a technique to reduce the
computational costs of the MAC policy was also described.
Chapter 6
Conclusions and Directions for
Future Work
In this Chapter we summarize the main achievements of this research and
illustrate new directions for future work.
6.1 Main Contributions
In Chapter 1 we introduced the problem of active feature sampling using
two examples on feature relevance estimation. A generalization of the
problem was presented as well. We motivated this work on the basis of two
domains of application: studying apple proliferation disease in agriculture
and biomarkers for cancer characterization in medicine.
Maximum Average Change (MAC) Algorithm. In Chapter 3 we introduced
a novel sampling algorithm derived from the theory of Bayesian experimen-
tal design. It rates each sampling candidate according to its expected con-
tribution to estimating a target concept. The algorithm, called Maximum
Average Change (MAC) sampling algorithm, is based on the computation
of a benefit function, which estimates the squared difference between the
current MMSE estimate of the concept and the one at the next sampling
step. In Section 3.3 we prove that by maximizing this quantity we mini-
mize the mean quadratic error of the current estimate with respect to the
true value of the concept. As a result, from the hard problem of comput-
ing expectations over the true value of the target concept, we move to a
simpler problem of averaging over the possible results at the next sampling
step.
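A schematic rendering of this selection rule in Python follows; estimate, posterior_predictive and the filled_with method are hypothetical stand-ins for the model-specific MMSE estimator and predictive distribution, so this is a sketch of the criterion rather than the implementation used in the thesis.

def mac_benefit(candidate, data, outcomes, posterior_predictive, estimate):
    """Expected squared change of the target estimate if `candidate` is probed.

    For each possible outcome y, the benefit accumulates the probability of y
    (under the current posterior predictive distribution) times the squared
    difference between the current estimate and the estimate recomputed with
    the candidate entry filled in by y."""
    current = estimate(data)
    value = 0.0
    for y in outcomes:
        p_y = posterior_predictive(candidate, y, data)       # P(outcome = y | data)
        updated = estimate(data.filled_with(candidate, y))   # estimate at the next step
        value += p_y * (updated - current) ** 2
    return value

def mac_select(candidates, data, outcomes, posterior_predictive, estimate):
    """Choose the candidate with Maximum Average Change."""
    return max(candidates,
               key=lambda c: mac_benefit(c, data, outcomes, posterior_predictive, estimate))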
Information Theoretic Interpretation. In Section 3.3.1 we provided an in-
tuitive interpretation of the MAC sampling algorithm in terms of information
theory. Intuitively, the most interesting missing entry to sample is the one
that will give the most information about the target concept, given the
data collected so far. Under the approximation that maximizing the distance
between two distributions is equivalent to maximizing the distance
between their expected values, we obtain the MAC algorithm.
Example 1: Learning a Step Function. As a first implementation of the
MAC algorithm we derived its application to the game of guessing a num-
ber, iteratively. In Section 4.3 we showed that the familiar binary search
algorithm maximizes the benefit function derived from the problem, an
indication of the MAC algorithm’s effectiveness.
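A minimal sketch of the resulting strategy, assuming an oracle that answers whether the hidden number is greater than or equal to a query (the query interface is our own illustration), is the familiar bisection loop:

def guess_number(lo, hi, is_greater_or_equal):
    """Locate the hidden number in [lo, hi] by always probing the midpoint.

    is_greater_or_equal(q) answers whether the hidden number is >= q; under a
    uniform prior the midpoint is the query with maximum expected change of
    the current estimate, i.e. the MAC choice."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_greater_or_equal(mid + 1):
            lo = mid + 1           # hidden number is in the upper half
        else:
            hi = mid               # hidden number is in the lower half
    return lo

# Example: the hidden number is 37 within [0, 99].
print(guess_number(0, 99, lambda q: 37 >= q))   # -> 37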
Example 2: Estimation of Conditional Probability. A second implementa-
tion of the MAC sampling algorithm was derived for the problem of es-
timating the conditional probability between two binary variables. Full
derivation of the benefit functions was provided together with a brief anal-
ysis of their properties. See Section 4.4.
Application 1: Single Feature Relevance Estimation. The first main appli-
cation of the MAC algorithm, that actually initiated our interest in this
topic, deals with the estimation of the error rate reduction caused by a
new variable when added to a class labelled dataset. We present this ap-
plication with a more detailed introduction to the problem of biomarkers
for cancer characterization and the data collection issues of studying the
apple proliferation phytoplasma. In Section 4.5, efficient assessment of the
error rate reduction, i.e., using as few measurements of the new variable
as possible, is derived using the MAC algorithm and other sampling algo-
rithms adapted from the literature. The latter are Single Feature Looka-
head (SFL) algorithm (for depth equal to 1 and 5) and the Goal Oriented
Data Acquisition (GODA) algorithm. A MAP classifier over categorical
data is postulated to represent the prediction problem and to evaluate the
comparison between MAC and the competing algorithms.
Application 2: Concurrent Estimation of Multiple Feature Relevances. The
second main application of the MAC algorithm, investigated in this re-
search, deals with assessing the relevance of multiple variables concurrently.
This application is presented in Section 4.6. Relevance of a feature is de-
fined as the contribution of a variable to predicting the class labels. In
this application we model the underlying joint class-and-features probabil-
ity distribution with a class-dependent mixture model of product distri-
butions. We infer the parameters from incomplete data via the Expectation-
Maximization algorithm and compute the components of the benefit func-
tion with the inferred parameters. A brief analysis of the (non-)identifiability
problems of the data generation model and the computational issues re-
lated to this application are provided.
Experiments on Simulated and Real Test Data. We conducted several ex-
periments on synthetic and real data. In the experiments on Example 2
(conditional probability of two binary variables) the MAC algorithm out-
performed random sampling in 70% of the cases, performed equivalently
in 20%, and in the remaining 10% MAC was less efficient than random only
in the very beginning of the sampling process (in the first 10-15 steps) and
then outperformed random sampling again.
In Application 1 (single feature relevance estimation) MAC policy al-
most always outperformed the random policy on every tested dataset; on
these datasets it performed significantly better than any other method
found in the literature. Thus, the MAC sampling algorithm demonstrated
its usefulness in both domains of application (agriculture and medicine).
Experiments on Application 2 (concurrent estimation of the relevances
of multiple features) showed that the MAC algorithm outperforms random
sampling both on synthetic data and real data from benchmark datasets.
Experimentally we observe two useful properties of the MAC algorithm: it
samples the most relevant features more often and the irrelevant features less often.
To reduce the computations we can evaluate just a small (1%) random
subset of the candidates (i.e., missing entries) and still obtain most of the
efficiency of the full method. This last property needs more investigation.
6.2 Conclusions
This research introduces a novel strategy for collecting data in problems
where collection is expensive and can be done incrementally. It demon-
strates the effectiveness of this strategy for reducing the number of mea-
surements needed to make inferences in comparison to those required by
other sampling methods.
A drawback of goal-oriented sampling policies is that reusing the data
for different targets requires careful evaluation. The random sam-
pling policy does not suffer from this drawback. Throughout the investigations
made in this research, the random sampling policy has been an enduring
competitor, much more than expected.
The proposed sampling strategy, resulting in the Maximum Average
Change sampling algorithm, was derived in its general form from the theory
of Bayesian experimental design. Even though its practical implementation
can be challenging even in simple cases, we propose its use in all data
collection problems under hard budgets.
6.3 Directions for Future Work
The novelty of our approach and the scarcity of scientific literature on
this topic meant that many unexplored problems and ideas
emerged during this study. In the following we describe some of these
directions which, we believe, are the main topics that should be addressed
from now on.
Relaxing Restrictive Assumptions. In Chapter 4 we introduced some re-
strictive assumptions in order to match the requirements of the problems
presented there and to allow a detailed implementation of the MAC algo-
rithm. These restrictions were about:
• Using categorical data.
• Measuring one missing value of the dataset at a time.
• Knowing class labels in advance (only for class labelled datasets).
• Using a flat cost model.
Each of these items defines a new direction for future research, both from
the point of view of the results of this work and of the current literature.
Reduce Computational Complexity. The computational burden of comput-
ing the MAC benefit for each candidate to be sampled is a major issue for
extending experiments for average performance assessment. Moreover, all of
the proposed new directions, which aim at relaxing some of the assumptions
just described, have a huge impact in terms of increased computational
complexity. This is a major issue for this research and only preliminary
attempts to solve it were provided (i.e., the subsampling strategy).
Stopping Criterion. A question that arises when sampling expensive data
is when to stop sampling. This research does not address this issue and
relies on the domain of application for an answer. But this issue can be
thought of as part of the data collection process. We found evidence that, in the initial
sampling steps, the MAC algorithm could perform worse than random
sampling in a few cases. Knowing when (and until when) this undesirable
effect holds would improve the efficacy of using the MAC algorithm. To
the best of our knowledge only Williams in [62] has provided a solution to
this kind of problem, but his solution is applicable only to his proposed
method and context. A generally applicable criterion to stop sampling is
still unknown.
Besides the new directions for medium and long term research described
so far, we plan short term activities as well. They involve mainly Applica-
tion 2, which is the most recent problem we studied. The activities we plan
to work on are: first, a comparison of Application 2 with Application 1,
motivated by the fact that they provide a solution to the same question
(how to improve a prediction model); second, extending the feature rater
to other algorithms like Relief [26], SIMBA [20] or I-Relief [55]; third,
testing Application 2 on the datasets from agriculture and medicine presented
in this research; fourth, exploring and characterizing the subsampling
strategy introduced in Section 5.3.3; and fifth, extending the experiments
to the case in which the data generation model is a mixture of more than
one component.
Appendix A
EM for a Mixture of Product
Distributions
In the following we derive Equations 4.39 and 4.43 as an application of the
Expectation-Maximization (EM) algorithm (see Section 4.6.7) to the esti-
mation of the class-conditional mixture of product distributions (Equation 4.25)
introduced in Section 4.6.4. The two derivations refer to the cases of com-
plete and missing data.
A.1 Complete Data
The class-conditional mixture of product distributions of Section 4.6.4 is
defined by
P(X_1 = x_1, \dots, X_F = x_F \mid C = c) = \sum_{m=1}^{M} \alpha_{cm} \prod_{f=1}^{F} \prod_{x=1}^{V_f} \theta_{cmfx}^{\delta(x, x_f)}    (A.1)
where the αs and θs are the parameters defining the model. From now on we
focus on one class, without loss of generality:
P(X_1 = x_1, \dots, X_F = x_F) = \sum_{m=1}^{M} \alpha_{m} \prod_{f=1}^{F} \prod_{x=1}^{V_f} \theta_{mfx}^{\delta(x, x_f)}    (A.2)
and call θ the whole set of parameters defining the distribution, i.e.,

\theta = \{\theta_{mfx}\}_{m \in M,\, f \in F,\, x \in V_f} \cup \{\alpha_m\}_{m \in M}.    (A.3)
In the case of complete data (i.e., no missing values) the latent variables
z of the complete model denote which of the components of the mixture
actually generated each record in the dataset. We define Z as the N × M
matrix of the latent variables, where Z_{nm} = 1 if record x_n has been gener-
ated by component m and Z_{nm} = 0 otherwise. Note that \sum_{m=1}^{M} Z_{nm} = 1.
Then

p(x \mid Z_{\bullet m} = 1, \theta) = \prod_{f=1}^{F} \prod_{x=1}^{V_f} \theta_{mfx}^{\delta(x, x_f)}    (A.4)

and

\alpha_m = p(Z_{\bullet m} = 1 \mid \theta) = \frac{1}{N} \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta).    (A.5)
The E-step of the EM algorithm involves the calculation of

Q(\theta, \theta^t) = E_Z[\, l_c(\theta \mid D, Z) \mid D, \theta^t \,],    (A.6)
which is given by

Q(\theta, \theta^t) = \sum_{m=1}^{M} \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t)\, \ln\!\left[ p(x_n \mid Z_{nm} = 1, \theta)\, p(Z_{nm} = 1 \mid \theta) \right]    (A.7)

= \sum_{m=1}^{M} \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t) \left[ \sum_{f=1}^{F} \sum_{x=1}^{V_f} \delta(x, X_{nf}) \ln \theta_{mfx} + \ln p(Z_{nm} = 1 \mid \theta) \right],
and the M-step is

\theta^{t+1} = \arg\max_{\theta \in \Theta} Q(\theta, \theta^t).    (A.8)
In Equation A.7 the last part of Q(\theta, \theta^t) involves two terms,

A = \sum_{f=1}^{F} \sum_{x=1}^{V_f} \delta(x, X_{nf}) \ln \theta_{mfx}
\qquad
B = \ln p(Z_{nm} = 1 \mid \theta),    (A.9)
which can be maximized separately. We call Q_A(\theta, \theta^t) and Q_B(\theta, \theta^t) the
two parts of Q. In both cases it is a constrained maximization and we
use Lagrange multipliers. The respective constraints are

\sum_{x=1}^{V_f} \theta_{mfx} = 1
\qquad
\sum_{m=1}^{M} p(Z_{nm} = 1 \mid \theta) = 1.
Consider the first term:

L_A(\theta) = Q_A(\theta, \theta^t) - \sum_{m=1}^{M} \sum_{f=1}^{F} \lambda_{mf} \sum_{x=1}^{V_f} \theta_{mfx}    (A.10)

is maximized when

\frac{\partial L_A}{\partial \theta} = 0,    (A.11)

which means

\sum_{n=1}^{N} \delta(x, X_{nf}) \frac{\partial \ln \theta_{mfx}}{\partial \theta_{mfx}}\, p(Z_{nm} = 1 \mid x_n, \theta^t) - \lambda_{mf} = 0.    (A.12)
Rewriting, summing over x, and taking the constraint into account we get

\lambda_{mf} = \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t),    (A.13)

which, reinserted in the previous equations, gives the first of the two update
equations:

\theta^{t+1}_{mfx} = \frac{\sum_{n=1}^{N} \delta(x, X_{nf})\, h_{nm}}{\sum_{n=1}^{N} h_{nm}},    (A.14)

where

h_{nm} = p(Z_{nm} = 1 \mid x_n, \theta^t) = \frac{\alpha_m \prod_{f=1}^{F} \theta^t_{mfx_{nf}}}{\sum_{m'=1}^{M} \alpha_{m'} \prod_{f=1}^{F} \theta^t_{m'fx_{nf}}}.    (A.15)
Considering Q_B(\theta, \theta^t), again we maximize it when

\frac{\partial L_B}{\partial \theta} = 0,    (A.16)

which means

\frac{\partial}{\partial\, p(Z_{nm} = 1 \mid \theta)} \left[ Q_B(\theta, \theta^t) - \sum_{m=1}^{M} \lambda_m\, p(Z_{nm} = 1 \mid \theta) \right] = 0.    (A.17)

Expanding the partial derivative we get

\sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t)\, \frac{\partial \ln p(Z_{nm} = 1 \mid \theta)}{\partial\, p(Z_{nm} = 1 \mid \theta)} - \lambda_m = 0.    (A.18)
Rewriting, summing over m and considering the constraint, we obtain

\lambda_m = \sum_{m=1}^{M} \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t),    (A.19)

which, reinserted in the last equation, gives

p(Z_{nm} = 1 \mid \theta) = \frac{\sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t)}{\sum_{m=1}^{M} \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t)},    (A.20)
which means

\alpha^{t+1}_m = \frac{1}{N} \sum_{n=1}^{N} h_{nm}.    (A.21)
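As an illustration only (not the thesis code), the following NumPy sketch performs one complete-data EM iteration using the responsibilities of Equation A.15 and the updates A.14 and A.21; for simplicity it assumes that every feature takes values in {0, ..., V-1}.

import numpy as np

def em_step_complete(X, alpha, theta):
    """One EM iteration for a mixture of product distributions with complete data.

    X     : (N, F) integer array of categorical feature values in {0, ..., V-1}
    alpha : (M,) mixing proportions
    theta : (M, F, V) categorical parameters, theta[m, f, x]
    Returns the updated (alpha, theta)."""
    N, F = X.shape
    M, _, V = theta.shape
    # h[n, m] = p(Z_nm = 1 | x_n, theta)                        (Equation A.15)
    lik = np.ones((N, M))
    for f in range(F):
        lik *= theta[:, f, X[:, f]].T        # theta[m, f, x_nf] for every n and m
    h = alpha * lik
    h /= h.sum(axis=1, keepdims=True)
    # delta[n, f, x] = 1 if X[n, f] == x                         (Equation A.14)
    delta = (X[:, :, None] == np.arange(V)).astype(float)
    theta_new = np.einsum('nm,nfx->mfx', h, delta) / h.sum(axis=0)[:, None, None]
    alpha_new = h.mean(axis=0)                                   # Equation A.21
    return alpha_new, theta_new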
A.2 Incomplete Data
When some entries in the dataset are missing, we consider them as new
latent variables. The application of the EM algorithm is identical to the
previous case, except that expectations are taken over the missing data
as well.
The E-step is

Q(\theta, \theta^t) = E[\, l_c(\theta \mid D^o, D^m, Z) \mid D^o, \theta^t \,],    (A.22)
where D^o is the observed part of the dataset and D^m the missing part.
The complete likelihood

l_c(\theta \mid D^o, D^m, Z) = \ln\!\left[ p(D^o, D^m \mid Z, \theta)\, p(Z \mid \theta) \right] = \ln p(D^o, D^m \mid Z, \theta) + \ln p(Z \mid \theta)    (A.23)

is thus made of two parts, A = p(D^o, D^m \mid Z, \theta) and B = p(Z \mid \theta). Q can
then be maximized by considering each of the two terms separately:

Q(\theta, \theta^t) = E[\ln p(D^o, D^m \mid Z, \theta)] + E[\ln p(Z \mid \theta)] = Q_A + Q_B.    (A.24)
We consider the first term and rewrite it by taking into account the
MAR assumption (see Section 3.2.1):

p(D^o, D^m \mid Z, \theta) = p(D^o \mid Z, \theta)\, p(D^m \mid Z, \theta),    (A.25)

so Q_A becomes

Q_A(\theta, \theta^t) = \sum_{n=1}^{N} \sum_{m=1}^{M} E[\, Z_{nm} \ln p(x_n \mid Z_{nm} = 1, \theta) \,].    (A.26)
Each expectation in the double summation of Equation A.26 can be rewrit-
ten as the sum of its observed and missing parts:

E[Z_{nm} \ln p(x_n \mid Z_{nm} = 1, \theta)] = \sum_{f \in obs(n)} \sum_{x=1}^{V_f} E[Z_{nm} \mid D^o, \theta^t]\, \delta(x_{nf}, x)\, \ln \theta_{mfx} + \sum_{f \in mis(n)} \sum_{x=1}^{V_f} E[Z_{nm}\, \delta(x_{nf}, x) \mid D^o, \theta^t]\, \ln \theta_{mfx},    (A.27)
where obs(n) is the set of observed variables for record x_n and mis(n)
the set of missing ones. Note that in the second term x_{nf} is unknown
and cannot be brought outside the expectation. We define h^o_{nm} as the first
expectation, i.e., the expected value of Z_{nm} given the observed data and
\theta^t:

h^o_{nm} = E[Z_{nm} \mid D^o, \theta^t] = \frac{p(x^o_n \mid Z_{nm} = 1, \theta^t)\, p(Z_{nm} = 1 \mid \theta^t)}{\sum_{m'=1}^{M} p(x^o_n \mid Z_{nm'} = 1, \theta^t)\, p(Z_{nm'} = 1 \mid \theta^t)},    (A.28)
which can be computed since we know all the parameters at step t and
because

p(D^o \mid Z, \theta) = \prod_{n=1}^{N} \prod_{f \in obs(n)} \prod_{x=1}^{V_f} \theta_{m(n),f,x}^{\delta(x_{nf}, x)} = \prod_{n=1}^{N} \prod_{m=1}^{M} \prod_{f \in obs(n)} \prod_{x=1}^{V_f} \theta_{mfx}^{\delta(x_{nf}, x)\, Z_{nm}},    (A.29)

where m(n) denotes the component that generated record x_n.
The final formula for h^o_{nm} is then

h^o_{nm} = \frac{\alpha^t_m \prod_{f \in obs(n)} \theta^t_{mfx_{nf}}}{\sum_{m'=1}^{M} \alpha^t_{m'} \prod_{f \in obs(n)} \theta^t_{m'fx_{nf}}}.    (A.30)
For the second expectation, which corresponds to the missing part of the
dataset, we note that

p(Z_{nm}\, \delta(x_{nf}, x) \mid D^o, \theta^t) = p(Z_{nm} \mid D^o, \theta^t)\, p(\delta(x_{nf}, x) \mid Z_{nm}, \theta^t)    (A.31)

by Bayes' formula and the independence of the values of different features inside
each component. Then

E[Z_{nm}\, \delta(x_{nf}, x) \mid D^o, \theta^t] = E[Z_{nm} \mid D^o, \theta^t]\, E[\delta(x_{nf}, x) \mid Z_{nm}, \theta^t] = h^o_{nm}\, p(x_{nf} = x \mid Z_{nm} = 1, \theta^t).    (A.32)
Taking into account that

p(D^m \mid Z, \theta) = \prod_{n=1}^{N} \prod_{f \in mis(n)} \prod_{x=1}^{V_f} \theta_{m(n),f,x}^{\delta(x_{nf}, x)} = \prod_{n=1}^{N} \prod_{m=1}^{M} \prod_{f \in mis(n)} \prod_{x=1}^{V_f} \theta_{mfx}^{\delta(x_{nf}, x)\, Z_{nm}},    (A.33)
and inserting Equations A.28 and A.32 into Equation A.27, we obtain the
explicit form of Q_A:

Q_A(\theta, \theta^t) = \sum_{n=1}^{N} \sum_{m=1}^{M} \sum_{f=1}^{F} \sum_{x=1}^{V_f} \left[ \theta^t_{mfx}\,(1 - I_{nf}) + \delta(x_{nf}, x)\, I_{nf} \right] h^o_{nm}\, \ln \theta_{mfx},
where I is the indicator matrix, i.e., Inf = 1 if the variable f of record n is
observed, otherwise Inf = 0.
In order to maximize Q_A under the constraint that \sum_{x=1}^{V_f} \theta_{mfx} = 1,
we use Lagrange multipliers in the same way we did for the case of
complete data. It is straightforward to obtain

\theta^{t+1}_{mfx} = \frac{\sum_{n=1}^{N} h^o_{nm} \left[ \theta^t_{mfx}\,(1 - I_{nf}) + \delta(x_{nf}, x)\, I_{nf} \right]}{\sum_{n=1}^{N} h^o_{nm}}.    (A.34)
We note that the maximization of Q_B is analogous to the case of com-
plete data, yielding the equivalent update equation for the α parameters:

\alpha^{t+1}_m = \frac{1}{N} \sum_{n=1}^{N} h^o_{nm}.    (A.35)
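Analogously, a NumPy sketch of one EM iteration with missing entries, implementing Equations A.30, A.34 and A.35 under the same simplifying assumptions as before (again our own illustration, not the thesis code), is:

import numpy as np

def em_step_incomplete(X, I, alpha, theta):
    """One EM iteration with missing entries (Equations A.30, A.34 and A.35).

    X : (N, F) integer array; entries where I == 0 are ignored (missing)
    I : (N, F) indicator array, I[n, f] = 1 if feature f of record n is observed
    alpha, theta : current parameters with shapes (M,) and (M, F, V)."""
    N, F = X.shape
    M, _, V = theta.shape
    # h_o[n, m] = p(Z_nm = 1 | observed part of x_n, theta)      (Equation A.30)
    lik = np.ones((N, M))
    for f in range(F):
        obs = I[:, f] == 1
        lik[obs] *= theta[:, f, X[obs, f]].T
    h_o = alpha * lik
    h_o /= h_o.sum(axis=1, keepdims=True)
    # Observed entries contribute delta(x_nf, x), missing ones theta[m, f, x]  (A.34)
    delta = (X[:, :, None] == np.arange(V)).astype(float)        # (N, F, V)
    obs_part = np.einsum('nm,nfx,nf->mfx', h_o, delta, I.astype(float))
    mis_weight = np.einsum('nm,nf->mf', h_o, 1.0 - I)            # (M, F)
    theta_new = (obs_part + theta * mis_weight[:, :, None]) / h_o.sum(axis=0)[:, None, None]
    alpha_new = h_o.mean(axis=0)                                  # Equation A.35
    return alpha_new, theta_new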
Bibliography
[1] David Ahl. What to Do After You Hit Return or P.C.C.’s First Book
of Computer Games. Nowels Publications, Menlo Park, CA, 1975.
[2] Jose M. Bernardo. Expected information as expected utility. The
Annals of Statistics, 7(3):686–690, May 1979.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning
(Information Science and Statistics). Springer, August 2006.
[4] Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training
algorithm for optimal margin classifiers. In Computational Learning
Theory, pages 144–152, 1992.
[5] Miguel A. Carreira-Perpinan and Steve Renals. Practical identifiabil-
ity of finite mixtures of multivariate bernoulli distributions. Neural
Computation, 12(1):141–152, 2000.
[6] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review.
Statistical Science, 10:273–304, 1995.
[7] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with
statistical models. In G. Tesauro, D. Touretzky, and T. Leen, editors,
Advances in Neural Information Processing Systems, volume 7, pages
705–712. MIT Press, 1995.
[8] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. In-
troduction to Algorithms. MIT Press and McGraw-Hill, 1990.
[9] Thomas M. Cover and Joy A. Thomas. Elements of Information The-
ory. Wiley-Interscience, New York, 1991.
[10] Harald Cramér. Mathematical Methods of Statistics. Princeton Uni-
versity Press, Princeton, NJ, 1946.
[11] Pierre-Simon de Laplace. Philosophical essay on probabilities.
Springer-Verlag, New York, 1995. Translated by A.I. Dale from the
fifth French edition, 1825.
[12] F. Demichelis, A. Sboner, M. Barbareschi, and R. Dell’Anna. Tma-
boost: an integrated system for comprehensive management of tissue
microarray data. IEEE Trans Inf Technol Biomed, 10(1):19–27, 2006.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood
estimation from incomplete data via the em algorithm (with discus-
sion). Journal of the Royal Statistical Society Series B, 39:1–38, 1977.
[14] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Clas-
sification. Wiley-Interscience, October 2000.
[15] E. Engvall and P. Perlman. Enzyme-linked immunosorbent assay
(elisa). quantitative assay of immunoglobulin g. Immunochemistry,
8(9):871–874, 1971.
[16] F. Pukelsheim. Optimal Design of Experiments. John Wiley and Sons,
New York, 1993.
[17] V. V. Fedorov. Theory of optimal experiments. Academic Press, New
York, 1972.
[18] I. Ford, C.P. Kitsos, and D.M. Titterington. Recent advances in
nonlinear experimental design. Technometrics, 31(1):49–60, February
1989.
[19] Zoubin Ghahramani and Michael I. Jordan. Supervised learning from
incomplete data via an EM approach. In Jack D. Cowan, Gerald
Tesauro, and Joshua Alspector, editors, Advances in Neural Informa-
tion Processing Systems, volume 6, pages 120–127. Morgan Kaufmann
Publishers, Inc., 1994.
[20] Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. Margin based
feature selection - theory and algorithms. In Proceedings of the
twenty-first international conference on Machine learning (ICML-04),
page 43, New York, NY, USA, 2004. ACM Press.
[21] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P.
Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D.
Bloomfield, and E.S. Lander. Molecular classification of cancer: class
discovery and class prediction by gene expression monitoring. Science,
286(5439):531–537, 1999.
[22] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi Zadeh, edi-
tors. Feature Extraction, Foundations and Applications. Series Studies
in Fuzziness and Soft Computing. Springer, 2006.
[23] Mats Gyllenberg, Timo Koski, Edwin Reilink, and Martin Verlaan.
Non-uniqueness in probabilistic numerical identification of bacteria.
Journal of Applied Probability, 31(2):542–548, June 1994.
[24] Harold Jeffreys. Theory of Probability. Clarendon Press, Oxford, 1948.
[25] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source
scientific tools for Python, 2001–. http://www.scipy.org/.
[26] Kenji Kira and Larry A. Rendell. A practical approach to feature
selection. In Proceedings of the Ninth International Workshop on Ma-
chine Learning (ML-92), pages 249–256, San Francisco, CA, USA,
1992. Morgan Kaufmann Publishers Inc.
[27] R. Kohavi and G. H. John. Wrappers for Feature Subset Selection.
Artificial Intelligence, 97(1-2):273–324, 1997.
[28] J. Kononen, L. Bubendorf, A. Kallioniemi, M. Barlund, P. Schraml,
S. Leighton, J. Torhorst, M. Mihatsch, G. Sauter, and O. P. Kallioniemi.
Tissue microarrays for high-throughput molecular profiling of tumor
specimens. Nature Medicine, 4(7):844–847, 1998.
[29] K. Pelckmans, J. De Brabanter, J. A. K. Suykens, and B. De Moor. Han-
dling missing values in support vector machine classifiers. Neural Net-
works, 18:684–692, 2005.
[30] Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin,
Mario A. T. Figueiredo, and Alexander J. Hartemink. Active learn-
ing of features and labels. In Proceedings of the 22nd International
Conference on Machine Learning (ICML-05), pages 43–50, 2005.
[31] D.V. Lindley. On a measure of the information provided by an experi-
ment. The Annals of Mathematical Statistics, 27(4):986–1005, Decem-
ber 1956.
[32] D.V. Lindley. Bayesian Statistics - A Review. SIAM, Philadelphia,
1972.
[33] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing
Data. John Wiley and Sons, New York, 1987.
[34] D. Lizotte, O. Madani, and R. Greiner. Budgeted learning of naive-
bayes classifiers. In Proceedings of the 19th Annual Conference on
Uncertainty in Artificial Intelligence (UAI-03), pages 378–385, 2003.
[35] D. J. C. MacKay. Information-based objective functions for active
data selection. Neural Computation, 4(4):590–604, 1992.
[36] J.M. Marin, K. Mengersen, and C.P. Robert. Bayesian modelling and
inference on mixtures of distributions, volume 25 of Handbook of Statis-
tics (D. Dey and C.R. Rao eds.). Elsevier-Sciences, November 2005.
[37] Geoffrey McLachlan and David Peel. Finite Mixture Models. Wiley,
1st edition, October 2000.
[38] Prem Melville, Maytal Saar-Tsechansky, Foster Provost, and Ray-
mond Mooney. Active feature-value acquisition for classifier induc-
tion. In Proceedings of the Fourth IEEE International Conference on
Data Mining (ICDM’04), pages 483–486, Washington, DC, USA, 2004.
IEEE Computer Society.
[39] Prem Melville, Maytal Saar-Tsechansky, Foster Provost, and Ray-
mond Mooney. Economical active feature-value acquisition through
expected utility estimation. In Proceedings of the First International
Workshop on Utility-Based Data Mining (KDD 2005), pages 10–16,
Chicago (IL), USA, August 2005.
[40] Donald Michie. Memo functions and machine learning. Nature,
218:19–22, 1968.
[41] Todd K. Moon and Wynn C. Stirling. Mathematical Methods and
Algorithms for Signal Processing. Prentice Hall, Inc., 2000.
[42] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI reposi-
tory of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[43] Peter Norvig. Techniques for automatic memoization with applica-
tions to context-free parsing. Computational Linguistics, 17(1):91–98,
March 1991.
[44] T. E. Oliphant. Python for scientific computing. Computing in Science
& Engineering, 9(3):10–20, 2007.
[45] Emanuele Olivetti, Sriharsha Veeramachaneni, and Paolo Avesani.
Computational Methods of Feature Selection, chapter 5, pages 89–107.
Chapman & Hall / CRC, 2008.
[46] Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang. Always
good turing: Asymptotically optimal probability estimation. Science,
302(5644):427–431, October 2003.
[47] P. Melville and R. Mooney. Diverse ensembles for active learning. In
Proceedings of the 21st International Conference on Machine Learning
(ICML-2004), pages 584–591, 2004.
[48] J.L. Schafer. Analysis of Incomplete Multivariate Data. Number 72
in Monographs on Statistics and Applied Probability. Chapman &
Hall/CRC, 1997.
[49] B. Scholkopf and A.J. Smola. Learning with Kernels. MIT Press,
Cambridge, MA, USA, 2002.
[50] P. Sebastiani and H. P. Wynn. Maximum entropy sampling and opti-
mal Bayesian experimental design. Journal of Royal Statistical Society,
pages 145–157, 2000.
[51] E. Seemuller. Apple proliferation. In Compendium of apple and pear
diseases, pages 67–68. American Phytopathological Society, St. Paul,
Minnesota, USA, 1990.
[52] E. Seemuller and B.C. Kirkpatrick. Detection of phytoplasma infec-
tions in plants. Molecular and Diagnostic Procedures in Mycoplasmol-
ogy, 2:299–311, 1996.
[53] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee.
In Proceedings of the fifth annual workshop on Computational learning
theory (COLT-92), pages 287–294, 1992.
[54] Victor S. Sheng and Charles X. Ling. Feature value acquisition in
testing: A sequential batch test algorithm. In Proceedings of the 23rd
international conference on Machine learning (ICML-06), pages 809–
816, New York, NY, USA, 2006. ACM Press.
[55] Yijun Sun. Iterative relief for feature weighting: Algorithms, theo-
ries, and applications. IEEE Trans. on Pattern Analysis and Machine
Intelligence (TPAMI), 29(6):1035–1051, June 2007.
[56] Kah Kay Sung and Partha Niyogi. Active learning for function approx-
imation. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances
in Neural Information Processing Systems, volume 7, pages 593–600.
The MIT Press, 1995.
[57] S.B. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cestnik, J. Cheng,
K. De Jong, S. Dzeroski, S.E. Fahlman, D. Fisher, R. Hamann,
K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R.S. Michalski,
T. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de Velde,
W. Wenzel, J. Wnek, and J. Zhang. The monk’s problems - a perfor-
mance comparison of different learning algorithms. Technical Report
CS-CMU-91-197, Carnegie Mellon University, December 1991.
[58] S. Tong and D. Koller. Support vector machine active learning with
applications to text classification. In Proceedings of the Seventeenth In-
ternational Conference on Machine Learning (ICML-00), pages 999–
1006, 2000.
[59] S. Veeramachaneni, F. Demichelis, E. Olivetti, and P. Avesani. Active
sampling for knowledge discovery from biomedical data. In Proceeding
of the 9th European Conference on Principles and Practice of Knowl-
edge Discovery in databeses (PKDD-05), pages 343–354, 2005.
[60] Sriharsha Veeramachaneni, Emanuele Olivetti, and Paolo Avesani. Ac-
tive sampling for detecting irrelevant features. In Proceedings of the
23rd international conference on Machine learning (ICML-06), pages
961–968, New York, NY, USA, 2006. ACM Press.
[61] Wikipedia. Number (game), 2007. http://en.wikipedia.org/wiki/
Number.
[62] David Williams. Classification and Data Acquisition with Incomplete
Data. PhD thesis, Duke University, Department of Electrical and
Computer Engineering, May 2006.
[63] David Williams, Xuejun Liao, Balaji Krishnapuram, Ya Xue, and
Lawrence Carin. On incomplete-data classification. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 29(3):427–436,
March 2007.
[64] Bianca Zadrozny. Learning and evaluating classifiers under sample
selection bias. In C.E.Brodley, editor, Proceedings of the 21st Interna-
tional Conference on Machine Learning (ICML-2004), pages 114–121,
2004.
[65] Z. Zheng and B. Padmanabhan. On active learning for data acquisi-
tion. In Proceedings of the International Conference on Datamining
(ICDM-02), pages 562–570, 2002.
[66] Xingquan Zhu and Xindong Wu. Data acquisition with active and
impact-sensitive instance selection. In Proceedings of the 16th Inter-
national Conference on Tools with Artificial Intelligence (ICTAI-04),
pages 721–726, 2004.