PhD Dissertation
International Doctorate School in Information and Communication Technologies
DISI - University of Trento
Sampling Strategies for Expensive Data
Emanuele Olivetti
Advisor:
Paolo Avesani
Co-Advisor:
Sriharsha Veeramachaneni
March 2008
Abstract
In domains where data collection can be guided in order to reduce collection costs, many issues arise: which sampling strategy to adopt, how to estimate quantities in the presence of missing values (a byproduct of incremental data collection), how to deal with sampling bias, and many others.
We present two kinds of results on this topic. First, a principled criterion and algorithm to select the missing data that are most likely to provide useful information for estimating a given target concept. The criterion can be intuitively summarized as “sample where the maximum estimated change is expected”. Second, a set of examples and applications that implement that criterion in practical problems under different assumptions. The applications presented focus on the machine learning problem of feature relevance estimation interleaved with incremental data collection.
We show experimentally the efficacy of the proposed criterion and its implementations, mainly on two data collection tasks. In the first we estimate the change in prediction accuracy when adding one new variable to a class-labelled dataset. In the second we extend the problem from one to many new competing variables and their concurrent relevance estimation. The experiments are performed on various datasets, both pseudo-randomly generated and benchmark. Moreover, the first task is studied on two real-life problems that motivated our interest in this research topic. The first is about efficient assessment of new variables describing a disease in the agricultural domain, and the second is about estimation of the importance of biochemical markers for cancer prediction. Comparison through experiments against random sampling and other state-of-the-art sampling methods demonstrates the superiority of the proposed solution.
Keywords
active learning, incremental sampling, Bayesian experimental design, miss-
ing data, feature relevance estimation.
Acknowledgment
I am deeply grateful to my wife Laura and my daughter Nausicaa for their wonderful support during these three years. We shared many joys and troubles together. But I am greatly indebted to them for the very large amount of time I spent working and subtracted from their care. This achievement is theirs as much as it is mine.
I wish to thank my friend Sriharsha Veeramachaneni, with whom I made this journey. I walked with him along the hard paths of this research. His amazing mathematical skills and intellectual honesty always drove us in the right direction. I am grateful to Paolo Avesani for his wise leadership during the six years I worked in his group. He gave me the opportunity to work in scientific research. Together we glided along many challenges. And now we are ready for the next one.
I would like to thank Prof. George Nagy for his direct and indirect help. His vast experience, intuition and commitment to scientific research are of great inspiration to me. I am grateful to Prof. Donato Malerba and Prof. Enrico Blanzieri for evaluating this work.
Many other people had influence on me and my work, but some of them
more than others. I wish to thank them. First my parents Franco and
Lina, who always encouraged me in pursuing my inclinations. Second my
aikido sensei Donatella Lagorio, who taught me the solution to almost all
problems: taking one step forward.
This work is dedicated to the memory of Elisabetta Vindimian. You
were with us when we began, and all these were just seeds. Now they
blossom into beautiful flowers.
Contents
1 Introduction 1
1.1 Sampling Expensive Data . . . . . . . . . . . . . . . . . . 2
1.1.1 Introductory Example . . . . . . . . . . . . . . . . 2
1.1.2 Extended Example . . . . . . . . . . . . . . . . . . 6
1.1.3 Abstraction . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Agricultural Domain . . . . . . . . . . . . . . . . . 10
1.2.2 Biomedical Domain . . . . . . . . . . . . . . . . . . 11
1.3 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Related Work 15
2.1 Experimental Design . . . . . . . . . . . . . . . . . . . . . 15
2.2 Active Learning . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Active Feature Sampling . . . . . . . . . . . . . . . . . . . 17
2.4 Missing Data and Selection Bias . . . . . . . . . . . . . . . 20
2.5 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Problem and Algorithm 23
3.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Missing Data . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Estimation Theory . . . . . . . . . . . . . . . . . . 29
3.2.3 Bayesian Exp. Design . . . . . . . . . . . . . . . . 30
3.3 MAC: Derivation . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1 MAC: Information Gain . . . . . . . . . . . . . . . 35
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 MAC: implementation 39
4.1 Restrictive Assumptions . . . . . . . . . . . . . . . . . . . 40
4.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Example 1: Number . . . . . . . . . . . . . . . . . . . . . 43
4.4 Example 2: Cond.Prob. . . . . . . . . . . . . . . . . . . . . 45
4.4.1 Formal Description of the Urn Experiment . . . . . 46
4.4.2 MAC sampling algorithm . . . . . . . . . . . . . . 46
4.4.3 Explicit Benefit Function . . . . . . . . . . . . . . . 47
4.5 Application 1 . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . 51
4.5.2 Implementation . . . . . . . . . . . . . . . . . . . . 54
4.5.3 Comparison . . . . . . . . . . . . . . . . . . . . . . 56
4.6 Application 2 . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . 59
4.6.2 Implementation . . . . . . . . . . . . . . . . . . . . 59
4.6.3 Mixture model . . . . . . . . . . . . . . . . . . . . 61
4.6.4 Class-Conditional Mixture of Product Distributions 62
4.6.5 Feature Relevances . . . . . . . . . . . . . . . . . . 63
4.6.6 Conditional Prob. . . . . . . . . . . . . . . . . . . . 64
4.6.7 Parameter Estimation . . . . . . . . . . . . . . . . 65
4.6.8 Comparison with Application 1 . . . . . . . . . . . 68
4.7 Computational Issues . . . . . . . . . . . . . . . . . . . . . 69
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5 Experiments 73
5.1 Conditional Prob. . . . . . . . . . . . . . . . . . . . . . . . 75
5.1.1 Detailed Results . . . . . . . . . . . . . . . . . . . . 77
5.1.2 Summary of Results . . . . . . . . . . . . . . . . . 82
5.2 Single Rel.Est. . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . 88
5.2.2 UCI Benchmark Data . . . . . . . . . . . . . . . . 89
5.2.3 Data from Agriculture Domain . . . . . . . . . . . 92
5.2.4 Data from Biomedical Domain . . . . . . . . . . . . 100
5.3 Multiple Feat.Rel. . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . 110
5.3.2 UCI Datasets . . . . . . . . . . . . . . . . . . . . . 111
5.3.3 Computational Complexity Issues . . . . . . . . . . 113
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6 Conclusions and Future Work 121
6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.3 New Directions . . . . . . . . . . . . . . . . . . . . . . . . 125
A EM 129
A.1 Complete Data . . . . . . . . . . . . . . . . . . . . . . . . 129
A.2 Incomplete Data . . . . . . . . . . . . . . . . . . . . . . . 132
Bibliography 137
List of Tables
4.1 Conditional probabilities P (X2 = x2|X1 = x1) parametrized
by a and b . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Joint probability P (X1 = x1, X2 = x2) parametrized by a, b
and c. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Binary counts: nij is the number of observations for which
X1 = i and X2 = j. . . . . . . . . . . . . . . . . . . . . . . 47
5.1 Brief description of the six groups of results. The first column shows the group number. The following columns, under a, b and c, show for each group whether the value of each parameter is different from, equal/close to, greater than or less than 1/2. . . . . . . . . . . . . . . . . . . 76
5.2 Summary of the results about the comparison of MAC sam-
pling algorithm vs. random policy over 125 generated datasets. 86
5.3 Agriculture Data. The number of samples required (out
of a total of 520) for the rms difference between the true
and estimated error rate to be less than 0.005 for various
sampling algorithms. Each row corresponds to one selection
of the previous feature X and the candidate feature X. The
rms values are computed over 1000 runs. . . . . . . . . . . 100
5.4 Features describing biomarkers data. . . . . . . . . . . . . 101
5.5 Configurations of experiments on biomarkers data. . . . . . 102
List of Figures
1.1 Introductory example: class-conditional distributions of tem-
perature and chemical tests. . . . . . . . . . . . . . . . . . 4
1.2 Introductory example: description of the active sampling
process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Extended example: description of the active sampling process. 9
5.1 Average root mean square difference between true and esti-
mated conditional probabilities. Results for Group 1 and
2 comparing MAC policy to random policy. . . . . . . . . 78
5.2 Average root mean square difference between true and esti-
mated conditional probabilities. Results for Group 3 and
4 comparing MAC policy to random policy. . . . . . . . . 79
5.3 Average root mean square difference between true and esti-
mated conditional probabilities. Results for Group 5 and
6 comparing MAC policy to random policy. . . . . . . . . 80
5.4 Example 2: Plots of the average estimates of parameters
a and b while sampling new data. MAC algorithm (•) is
compared to random algorithm (+). Results of the case
a = 0.1, b = 0.9 and c = 0.9 (representing Group 1) are
plotted on top panel. Results for a = 0.1, b = 0.9 and
c = 0.5 (representing Group 2) are on bottom panel . . . 83
5.5 Example 2: Plots of the average estimates of parameters
a and b while sampling new data. MAC algorithm (•) is
compared to random algorithm (+). Results of the case
a = 0.1, b = 0.5 and c = 0.9 (representing Group 3) are
plotted on top panel. Results for a = 0.1, b = 0.5 and
c = 0.1 (representing Group 4) are on bottom panel . . . 84
5.6 Example 2: Plots of the average estimates of parameters
a and b while sampling new data. MAC algorithm (•) is
compared to random algorithm (+). Results of the case
a = 0.5, b = 0.5 and c = 0.1 (representing Group 5) are
plotted on top panel. Results for a = 0.5, b = 0.5 and
c = 0.5 (representing Group 6) are on bottom panel . . . 85
5.7 True class-conditional (X, X1) and (X, X2) variables distri-
butions. The data points before and after measuring the
candidate features are also shown. . . . . . . . . . . . . . . 88
5.8 Synthetic Data. The root mean square difference between
the estimated and true error rate after the candidate feature
is added as a function of the number of samples probed for
all sampling policies. Rms difference is averaged over 1000
repetitions of the experiment. . . . . . . . . . . . . . . . . 90
5.9 Solar Flares dataset. The average root mean square differ-
ence between the estimated and true error rate after the
candidate feature is added as a function of the number of
samples probed for all sampling policies. Rms difference is
averaged over 100 repetitions of the experiment. . . . . . . 93
5.10 Balance Scale dataset. The average root mean square dif-
ference between the estimated and true error rate after the
candidate feature is added as a function of the number of
samples probed for all sampling policies. Rms difference is
averaged over 100 repetitions of the experiment. Due to the
large size of the feature space only the first 100 samples are
acquired. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.11 Solar Flares dataset. The average root mean square differ-
ence between the estimated and true error rate after the
candidate feature is added as a function of the number of
samples probed for all sampling policies. Rms difference is
averaged over 100 repetitions of the experiment. . . . . . . 95
5.12 Breast Cancer Wisconsin dataset. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 100 repetitions of the experiment.
Due to the large size of the feature space only the first 100
samples are acquired. . . . . . . . . . . . . . . . . . . . . . 96
5.13 Mushroom dataset. The average root mean square differ-
ence between the estimated and true error rate after the
candidate feature is added as a function of the number of
samples probed for all sampling policies. Rms difference is
averaged over 100 repetitions of the experiment. Due to the
large size of the feature space only the first 100 samples are
acquired. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.14 Zoo dataset. The average root mean square difference be-
tween the estimated and true error rate after the candidate
feature is added as a function of the number of samples
probed for all sampling policies. Rms difference is averaged
over 100 repetitions of the experiment. . . . . . . . . . . . 98
5.15 Apple Proliferation dataset. The average root mean square
difference between the estimated and true error rate after
the candidate feature is added as a function of the number
of samples probed for all sampling policies. Rms difference
is averaged over 100 repetitions of the experiment. . . . . . 99
5.16 Biomarkers experiment I and II. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 104
5.17 Biomarkers experiment I and II. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 105
5.18 Biomarkers experiment III and IV. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 106
5.19 Biomarkers experiment V and VI. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 107
5.20 Biomarkers experiment VII and VIII. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 108
5.21 Biomarkers experiment IX and X. The average root mean
square difference between the estimated and true error rate
after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms
difference is averaged over 500 repetitions of the experiment. 109
5.22 Average rms differences between estimated and true rele-
vances at each sampling step on artificial data for random
and MAC policies. Average performed over 100 repetitions
of the experiment. Only the first 100 sampling steps are
shown to improve readability. Note that true relevances are
those computed from the full dataset, and not the theoreti-
cal ones which are slightly different. . . . . . . . . . . . . . 112
5.23 Estimated relevances at each sampling step for every single
feature on artificial data. Random (dashed line) and MAC
(solid-dotted line) policies are compared. Since there are
three features and 200 instances the x axis goes to 600. . 113
5.24 Average cumulative sampling counts at each sampling step
for each feature on artificial data. The more relevant fea-
tures are sampled more frequently than less relevant features
in case of MAC policy. As a comparison, the random policy
samples features independently of their relevance. . . . . . 114
5.25 The normalized difference between final relevances and esti-
mated relevances at each sampling step is plotted for random
and MAC policies on Zoo dataset (top panel) and Monks
datasets (bottom panel). The value at step 0 (all feature
values unknown) is normalized to 1.0. . . . . . . . . . . . . 115
5.26 The normalized difference between final relevances and es-
timated relevances at each sampling step is plotted for ran-
dom and MAC policies on Solar Flares dataset (top panel)
and Cars dataset (bottom panel). The value at step 0 (all
feature values unknown) is normalized to 1.0. . . . . . . . 116
5.27 Average squared sum of the differences between estimated and true relevances at each sampling step on artificial and UCI Solar Flares datasets. Random and MAC policies are compared to the active policy that considers only a small random subset of the missing entries at every sampling step. . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Chapter 1
Introduction
In data modeling and analysis it is commonly assumed that data is provided in advance, and we do not control the sampling process. But in the class of problems where sampling is expensive this is often not the case: we need to select which data to collect or measure in order to understand the problem under investigation while reducing the sampling cost as much as possible. In this context we aim to have an efficient sampling method to trade off the need for more information against the cost of obtaining it.
This work focuses on the class of problems where we can incrementally
select which data to collect at the next step with the target of learning a
given concept (i.e. a function to estimate from the data). In the proposed
solution, sampling is interleaved with learning on currently available data
at each step in order to infer which could be the best next sampling step.
The aim is to estimate the target concept efficiently as new samples are
added.
The applications presented in this dissertation are mainly devoted to
the learning problem of feature relevance estimation from a dataset of la-
beled instances, where some (or all) feature values are missing and have to
be collected incrementally. In this setting sampling is performed sequen-
tially, selecting the instances to monitor and collecting some of the missing
information about them at each step.
Often, a side-effect of incremental data collection is the occurrence of
missing data patterns, since only partial information is disclosed at each
step. Handling missing data in learning tasks is itself a complex prob-
lem that can be addressed by different techniques. Even though this sub-
problem can be decoupled from the general problem of deciding where to
sample, we will manage both in all practical applications presented here.
In the following we introduce the core problem of this research using an intuitive example and then generalize it to a much broader scope. We then illustrate the motivations that support this work and led us to study sampling strategies for expensive data. A precise statement of the goals of this dissertation, together with an outline of the organization of the remaining chapters, will follow.
1.1 Sampling Expensive Data
We illustrate the problem of incrementally sampling expensive data with
a limited budget. In this context we aim at deciding carefully what to
sample at the next step based on the current information. We begin with
an introductory example related to the medical domain to help the reader
get acquainted with the general picture. We then generalize the problem
to a broader scope (even though this work addresses just a part of it).
Note that many details in the example are deliberately omitted. At this point of the presentation they are not essential to understanding the problem.
1.1.1 Introductory Example
Assume that we have a method to predict (detect) the presence of a disease
in patients, based just on a single variable: the temperature of the body.
Since the accuracy of the prediction is poor, we want to conduct a medical study to improve it using a more complex model, based on one more variable. We aim to add a new variable that, together with the temperature, better predicts the presence of the disease. Assume that we have two new candidate variables, describing the results of two different chemical tests performed on the patient. We denote the chemical tests as CT 1 and CT 2. The amount of information each of them provides for improving the prediction of the presence of the disease is initially unknown. Estimating the amounts of these improvements is the final target of the study. It will allow us to decide which of the two tests should be added to the new medical protocol to assess the presence of the disease.
For this medical study a given number of subjects is monitored and
their temperatures together with the presence or absence of the disease
are known in advance. Assume that the cost of performing each chemical
test on a patient is high, compared to the limited budget available. The
problem is to carefully select on which patients to perform which chemical
tests. The choice, in principle, should take into account all the information
known in advance characterizing each patient (her temperature and the
presence of the disease, in this case), since this is relevant information for
the final goal of improving the quality of the predictions.
We assume that both chemical tests have the same cost and the budgets
associated with each are equal. We will perform CT 1 and CT 2 the same
number of times. The process of testing is assumed to be incremental.
Whenever we test CT 1 and measure its outcome on a patient, we can
exploit this new information to select the patient on which to perform the
next test. In parallel and independently we perform this same process
with CT 2. At the end we will have the two (possibly different) sets of
patients on which the results of the first and second tests are known. The
size of the two sets will be the same.
Assume that the disease is known to appear only in people whose temperature is above 37°C and always when it is above 39°C. We can group the temperatures of patients into three groups: below 37°C (group I), between 37°C and 39°C (group II) and above 39°C (group III), as shown in Figure 1.1. We assume that the outcome of the chemical tests can be discretized: CT 1 at two levels (high, low) and CT 2 at three levels (low, medium and high). Assume that the underlying joint distribution of the variable temperature and one chemical test is the one depicted in Figure 1.1, meaning that one of the two tests can improve the prediction and the other is useless. Since the joint distribution is not known before performing the tests, the goal is to collect incrementally the values of the tests on selected patients in order to efficiently estimate the prediction error (i.e. the overlapping areas in Figure 1.1).
Figure 1.1: Class-conditional distributions of the values of the temperature (T) across the three groups (below). Class-conditional joint distributions of temperature and chemical tests (above) showing that CT 1 is a useless test and CT 2 is useful. Absence of the disease is denoted by ×, presence by ◦.
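To make the estimation target concrete, the following minimal sketch (with purely illustrative numbers, not taken from the study) computes the error rate of a maximum a posteriori classifier from a discrete joint distribution over temperature group, test level and class. The overlapping probability mass in Figure 1.1 corresponds exactly to this quantity.

```python
import numpy as np

# Hypothetical joint probabilities P(group, level, class) for one chemical test.
# Axes: temperature group (I, II, III) x test level (low, high) x class (absent, present).
# The numbers are illustrative only; in the study they would be estimated from the data.
joint = np.array([
    [[0.20, 0.00], [0.10, 0.00]],   # group I: the disease is never present
    [[0.15, 0.05], [0.05, 0.15]],   # group II: the test outcome carries information
    [[0.00, 0.10], [0.00, 0.20]],   # group III: the disease is always present
])

def map_error_rate(joint):
    """Error rate of the MAP classifier that predicts, in each (group, level) cell,
    the class with the larger probability mass; the error is the mass of the losing
    class in each cell, i.e. the overlap between the two class distributions."""
    return joint.min(axis=-1).sum()

print(map_error_rate(joint))  # 0.10 for the numbers above
```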
A naïve approach to the sampling problem can be to disregard all infor-
mation known in advance, and just randomly sample some patients with
one of the two chemical tests. We call this approach random sampling
policy. Another strategy could sample the same number of patients having
the disease and not having it. A cleverer strategy could note that there is
no gain in sampling patients in group I or group III, since in those cases
the temperature by itself is enough to decide the presence or absence of
the disease. Many other ad hoc improvements can be implemented with
similar ideas.
The method proposed in this research exploits the available information in a principled way. To summarize our main result in the context of this example, we prove (under mild assumptions) that, given a chemical test, the optimal strategy is to select the patient for whom the expected change with respect to the current estimate of the error rate of the prediction method is largest. In other words, we estimate the expected change in the predicted error rate when we perform that chemical test on a given patient. Inside the expectation we do not take into account just one possible outcome of the test (e.g. the most likely), but all possible results weighted by their probability of occurrence. Those probabilities are estimated from the data already acquired.
From the point of view of the experimental process, after choosing one chemical test, we compute the expected change in error rate for each patient not tested so far and select the one having the highest expected change. We then perform the actual chemical test on this patient and add the outcome as a new value to the current dataset. Then we update the current error rate estimate. At the next step we recompute the expected change for every patient left (one fewer than before). The process is described in Figure 1.2.
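A minimal sketch of this loop is given below. The helpers estimate_error_rate, p_outcome, with_value and query are hypothetical placeholders for the estimators and data-access routines developed later in the dissertation; the sketch only fixes the control flow of the sampling process.

```python
def expected_change(patient, outcomes, dataset, estimate_error_rate, p_outcome):
    """Expected absolute change of the error-rate estimate if the chemical test
    were performed on `patient`, averaged over its possible outcomes."""
    current = estimate_error_rate(dataset)
    change = 0.0
    for value in outcomes:                                  # e.g. ["low", "high"]
        prob = p_outcome(patient, value, dataset)           # estimated from the data acquired so far
        hypothetical = dataset.with_value(patient, value)   # dataset as if `value` had been observed
        change += prob * abs(estimate_error_rate(hypothetical) - current)
    return change

def active_sampling(untested, outcomes, dataset, budget, query,
                    estimate_error_rate, p_outcome):
    """Repeatedly test the patient with the largest expected change until the budget is spent."""
    while budget > 0 and untested:
        best = max(untested, key=lambda p: expected_change(
            p, outcomes, dataset, estimate_error_rate, p_outcome))
        value = query(best)                                 # perform the actual chemical test
        dataset = dataset.with_value(best, value)
        untested.remove(best)
        budget -= 1
    return estimate_error_rate(dataset)
```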
At the end of the experimental session, when the budget is exhausted,
we perform the same process on the other chemical test. After that we can
compare the reduction of error rates by both tests and decide to adopt the
best one in the new improved protocol to detect the disease. A possible outcome of that comparison could be that both new candidate chemical tests are discarded because the estimated improvement in accuracy is too small in both cases, compared to the costs of including either.
1.1.2 Extended Example
Based on the previous example we illustrate an extended version [45]. Assume now that each patient can be described by many variables: temperature, weight, age, marital status, result of chemical test CT 1, of CT 2, etc., but we know the values of only some (or none) of them. This means that the initial dataset, where each record describes a patient along with those variables, has some (or no) known values. We want to assess the importance of each variable for predicting the presence of the disease and we have a limited budget to spend on measuring the missing values in the dataset. Unlike the previous example, this time we want to allocate the total budget in a more complex way, deciding at each step for which patient and for which variable to acquire a new value. This new setting lets the variables compete against one another in the allocation. In this new example it could happen that some variables are measured more often than others. In this case the sampling strategies to be applied for an efficient estimation of the target concept are more complex, due to the possible correlations between variables.
As in the first example, after a value of a variable is measured (e.g. CT 1 is performed on patient 35) and its result is added to the dataset, we exploit it together with all the information available up to that point to decide which variable to extract on which patient next. At the end of the experimental session we will have more (or all) entries known and can make the final assessment of the importance of each variable. The process is described in Figure 1.3, where the feature relevance is defined as the mutual information between a variable and the class labels (see Section 4.6.5).
Figure 1.2: Active sampling process for error rate estimation. The entries shown in gray in the dataset in the top left corner indicate where the value of chemical test CT 2 is missing. The right side shows the process of computing the benefit of sampling at a particular missing value. The missing value with the highest benefit is chosen and actually sampled. The process ends when the budget is over (Stopping Criterion).
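For reference, the following small sketch computes this relevance measure, the empirical mutual information between a discrete feature and the class labels, from the currently observed values; it simply skips missing entries, which is a simplification with respect to the estimators developed in Chapter 4.

```python
import numpy as np
from collections import Counter

def mutual_information(feature, labels):
    """Empirical mutual information I(X; C) between a discrete feature and the class
    labels, in nats. Pairs where the feature value is missing (None) are skipped."""
    pairs = [(x, c) for x, c in zip(feature, labels) if x is not None]
    n = len(pairs)
    n_xc = Counter(pairs)                     # joint counts
    n_x = Counter(x for x, _ in pairs)        # feature marginal counts
    n_c = Counter(c for _, c in pairs)        # class marginal counts
    return sum((k / n) * np.log(k * n / (n_x[x] * n_c[c]))
               for (x, c), k in n_xc.items())

# Toy example: a feature perfectly aligned with the class is maximally relevant.
print(mutual_information([0, 0, 1, 1, None], [0, 0, 1, 1, 1]))  # log(2) ≈ 0.693
```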
In this example we introduce a cost model. We assume that each variable has its own extraction cost (equal across all patients). As an example, measuring the temperature of a patient has a much lower cost than performing a chemical test. This means that, in case of equal importance, it is desirable that temperature be collected more frequently. We will define the cost model precisely only later.
1.1.3 Abstraction: Class of Problems of Interest
The two examples discussed so far describe two scenarios that we actually studied and implemented in this research. We can generalize them to a class of problems, defined by:
• A set of monitored instances on which we want to estimate a target concept, e.g. a set of patients on which we study the importance of two chemical tests for predicting the presence of a disease.
• A target concept to be estimated, in terms of point estimation1, e.g. a classifier error rate, the relevance of a variable with respect to the class label, etc.
• Sampling constraints, defining the sample space from which we have to choose one element to sample at each step. In the first example we allow one chemical test on one patient at a time, so at each step the sample space is the set of single patients on whom the test has not yet been conducted. If we assume different constraints, e.g. we have to perform batches of 3 chemical tests at a time, then the sample space becomes the set of all possible triplets of patients not yet tested. These kinds of constraints are domain dependent.
1see Chapter 3 for motivations about this restriction to point estimates.
Figure 1.3: Active sampling process for feature relevance estimation. The entries shown in gray in the dataset in the top left corner indicate the ones that are missing at a particular instance. The bottom right hand corner shows the process of computing the benefit of sampling at a particular missing value. The missing value with the highest benefit is chosen and actually sampled.
This abstract problem will be addressed by proposing a single general
criterion. Some implementations of the criterion will be provided for a few
cases, where precise assumptions and constraints allow full derivations.
1.2 Motivation
In the following we present two practical problems in two domains of application to motivate the need for scientific investigation of this research topic. These two problems are related to the examples introduced earlier. We will derive full implementations of the proposed method for these problems later.
1.2.1 Agricultural Domain
Our interest in the topic of this research was aroused by a research project
(SMAP)2 in the domain of agriculture, dealing with the Apple Prolifera-
tion [51] disease in apple trees. Biologists monitor a distributed collection
of apple trees affected by the disease with the goal of determining the symp-
toms that indicate the presence of the disease-causing phytoplasma. A data
archive is arranged with a finite set of records, each describing a single ap-
ple tree. All the instances are labeled as infected or not infected. Each year
the biologists propose new candidate features (e.g. color of leaves, altitude
of the tree, new chemical tests, etc.) that could be extracted (or measured)
to extend the archive, so as to arrive at more accurate models. Since the
data collection in the field can be very expensive or time consuming, a data
acquisition plan needs to be developed by selecting a subset of the most
relevant candidate features that are to be eventually acquired on all trees.
2This work was funded by Fondo Progetti PAT, SMAP (Scopazzi del Melo - Apple Proliferation), art.
9, Legge Provinciale 3/2000, DGP n. 1060 dd. 04/05/01.
1.2.2 Biomedical Domain
More recently we investigated a problem in cancer characterization. In the
biomedical domain, the acquisition of data is often expensive. The cost
constraints generally limit the amount of data that is available for analysis
and knowledge discovery.
In biological domains molecular tests, called biomarkers, are studied. They are conducted on samples of tumor tissue. Biomarkers add predictive information to pre-existing clinical data. The development of molecular biomarkers for clinical application is a long process that must go through many phases, from early discovery to more formalized clinical trials. This process involves the retrospective analysis of biological samples obtained from a large population of patients. The biological samples, which need to be properly preserved, are collected together with the corresponding clinical data over time and are therefore very valuable. There is therefore a need to carefully optimize the use of the samples while studying new biomarkers. We address the issue of cost-constrained biomarker evaluation for developing diagnostic/prognostic models for cancer.
In our problem new biomarkers are tested on biological samples from
patients who are labeled according to their disease and survival status.
Moreover, for each patient we have additional information such as grade of the disease, tumor dimensions and lymph node status. That is, the
samples are class labeled as well as described by some previously observed
features. The goal is to choose the new feature (the biomarker) that is
most correlated with the class label given the previous features. Since the
cost involved in the evaluation of all the biomarkers on all the available
data is prohibitive, we need to actively choose the samples on which the
new features (biomarkers) are tested. Therefore our objective at every
sampling step is to choose the query (the sample on which the new feature
is measured) so as to learn the efficacy of a biomarker most accurately.
1.3 Goals
How do we decide which data to collect next among multiple options, given that we already have some (or none) of the data and want to learn a target concept? This work presents a general criterion to address this kind of question and provides practical implementations when the concept to learn is feature relevance.
In this research we will provide a formal description of the sampling
problem introduced so far, a general solution derived from the theory of
experimental design, a set of examples and applications on which the gen-
eral solution will be implemented, together with simulation experiments to
demonstrate its effectiveness.
Implementation and experimental evaluation of the method proposed in this research have been carried out only in a few simple cases. We focus on simple cases to allow a deeper analysis of the problem. In particular, we present implementations of:
• Learning a step function.
• Estimating the conditional probability between two binary variables.
• Estimating the importance of new candidate variables, one at a time,
when added to a class labelled dataset. The importance is defined as
the improvement in the error rate of a maximum a posteriori classifier
built on all the available data.
• Estimating concurrently the importance of new candidate variables,
when added to a class-labelled dataset. The importance is defined as
the mutual information between class labels and variable values.
In all cases the proposed sampling method will require fewer samples to reach the target concept than other methods in common practice or proposed in the literature.
1.4 Organization
The structure of this dissertation is as follows: in Chapter 2 we review the main works in the many areas addressed by this research topic. In Chapter 3 we formally define the generalized problem, introduce the necessary background notions and derive the proposed method, together with an intuitive interpretation related to information theory. In Chapter 4 we focus on two examples and two applications and derive full implementations of the proposed criterion. In Chapter 5 we present the results of experiments conducted on the examples and applications introduced earlier, to support the theoretical achievements of this research. In Chapter 6 we summarize the results and propose interesting directions for future research.
Chapter 2
Related Work
Data acquisition has traditionally been studied in statistics under the topic of experimental design and in machine learning under the topic of active learning. The main topic of this research deals with the acquisition of missing values in a partially filled dataset, constrained by limited resources and an incremental acquisition policy, in order to learn a given concept. We can call this topic active feature sampling. Below a review of previous related work is presented, starting from experimental design and machine learning. Since the problem of active feature value acquisition raises many issues, a brief review of related topics is also presented: handling missing data, sample selection bias and feature selection.
2.1 Experimental Design
The design of experiments involves decisions about all the aspects controlling an experiment before and during its life. Usually some information is available in advance, motivating the use of Bayesian methods and leading to a branch of Bayesian statistics called Bayesian experimental design that dates back to the 1970s [17]. Non-Bayesian approaches are present in the literature but they are less popular. We do not review the non-Bayesian experimental design literature since our research follows the Bayesian approach. A com-
prehensive discussion on non-Bayesian experimental design can be found
in [16, 18].
The basic idea of experimental design is that statistical inference about
the quantities of interest can be improved by appropriate selection of the
values of the control variables of the experiment. According to Lindley [32],
designing experiments in the Bayesian setting means defining a utility
(or benefit) function that reflects the purpose of the experiment and is
parametrized by data and control variables. The target is then to max-
imize this utility function. Many criteria (expressed as different utility
functions) have been proposed in experimental design literature, depend-
ing on the optimization target and leading to an alphabetical taxonomy
when applied to the normal linear regression model. Among them the most
popular (and relevant for this research) are:
• The expected gain in Shannon information [9] between prior and pos-
terior distribution (i.e. before sampling and after sampling). This
utility function leads to the Bayesian D-optimal design.
• The expected weighted distance between the true and expected es-
timates after sampling. This utility function leads to the Bayesian
A-optimal design.
A thorough review of Bayesian experimental design and its taxonomy for
linear and nonlinear models is [6]. For Bayes D-optimal design see [2, 17,
50].
2.2 Active Learning: Sampling Labels
The application of the theory of optimal experiments to machine learn-
ing leads to interesting problems related to implementation issues such
as finding good approximations to the theory and learning with sampling
bias (the side effect of non-i.i.d. sampling). Traditionally this area, called active learning, is dominated by the problem of minimizing the number of class labels that have to be supplied for training. In [35] MacKay
studies sampling bias and expected informativeness of candidates for ac-
quiring class labels for neural networks and in [7] active learning is used
for learning Gaussian mixtures. A Bayesian formulation of active learning
for function approximation is presented in [56]. Seung et al. present the
query by committee algorithm and prove that actively selecting samples to
label can lead to exponentially faster decrease in generalization error than
random selection [53, 47]. Active learning has been successfully applied to
text classification using support vector machines [58]. In contrast to this
traditional definition of active learning we study the incremental acquisi-
tion of feature values which presents new implementation issues related to
learning with missing data and dealing with continuous features.
2.3 Active Feature Sampling
A new branch of active learning, different from the previous one, addresses the problem of minimizing feature samples (instead of label samples) for feature selection, classifier induction and related learning problems: this branch is exactly the area of our research, and the problem described in the introduction of this dissertation (see Chapter 1) falls within it. Here the
class labels are all known in advance but some (or all) feature values are
unknown and a strategy has to be set up in order to acquire the values.
The main works that started to give solutions to this class of problems
address the question of inducing a classifier using active policies for feature
values acquisition. Even though the main target of this research is feature
evaluation (and not classifier induction), the comparison against that lit-
erature still makes sense since the core steps of the decision processes are
equivalent. Therefore some of the methods available in the literature can
be adapted to the feature evaluation problem.
Zheng and Padmanabhan in [65] analyze the problem of user modeling and propose a goal-oriented data acquisition (GODA) policy that is independent of the model and based on this idea: estimate the variance of imputed (missing) data and select, as the next instance on which to measure all missing features, the one that has the highest variance. The underlying principle is that probing where uncertainty is higher leads to a better improvement of the model. Zheng and Padmanabhan, however, lack a principled approach to the measurement of the uncertainty. They take into account previously known feature values only where data was fully collected, losing valuable information. In addition, feature extraction is performed on all missing features each time, without any consideration of the different feature relevances. In our proposed solution features are sampled independently, on the basis of their estimated relevance for improving the prediction. Other limitations of [65] are the use of imputation as the only answer to the problem of missing values and the absence of a cost model for different features.
Lizotte et al., in [34], present a method to sequentially select instance-feature pairs to be measured for classifier induction under cost constraints. They base their solution on a Naïve Bayes classifier and compare different acquisition policies; they then propose a look-ahead strategy. This is a multi-step policy that estimates the expected benefit of probing all possible sequences (of a given length) of available instance-feature pairs to find out where it is most interesting to sample next, rather than using a myopic (i.e., single-step) method. One may criticize the choice of the classifier, which solves some computational issues but relies on strong assumptions (feature independence). The look-ahead strategy is computationally heavy due to the exponential explosion of the configuration space.
In [38, 39] Melville et al. address the same problem as [34] with a different approach, where instance-feature pairs are ranked at each step according to the difference between the current probabilities of the two most likely classes. This approach, called Error Sampling, is proposed together with the use of decision trees as the classifier (even though the method is general and not bound to it), and a comparison is made against [65] showing an increase in performance. What is missing from this work is a principled approach to the solution of the problem, even though the method is claimed to be inspired by optimum experimental design. A model for different feature costs is absent.
In [30] Krishnapuram et al. propose a method for active acquisition of feature values and class labels in a setting where both of them are (partially) missing and the goal is learning a classifier. The method implements logistic regression via maximum likelihood estimation, independently on each variable, and selects the next sample to measure by estimating the mutual information between the classifier weight vector and each candidate, whether a feature value or a class label. The motivation for an information-based criterion for active data acquisition merely refers to the previously mentioned article about label acquisition by MacKay [35], without justifying the different context (feature value acquisition). Other objections are the missing cost model and the restriction to classification tasks only.
Williams in [62] and Williams et al. [63] extend Krishnapuram’s approach, proposing a framework to guide data collection (on both features and class labels) based on risk minimization. The benefit of sampling a given missing entry, or set of missing entries, is computed as the difference between the current expected risk of misclassification and the expected risk of misclassification after sampling. The classification step is implemented using logistic regression and the distributions are modeled as a Gaussian mixture model. An extension to kernel methods [4, 49] is provided to take into account non-linear classification tasks. An extension to the semi-supervised case is provided as well. A main limitation of this approach is that the target is restricted to classification: it is unclear how to extend it to other learning tasks since the method is intimately based on classification risk. Moreover, Williams claims that the goal of active data acquisition ends when the classifier is learned, not taking into account the need for active acquisition during deployment of the classifier. A further limitation is the requirement that misclassification costs and extraction costs must be specified in the same units.
To the best of our knowledge, few other works are related to this exact topic. They focus on different aspects of the problem (see [54], on active feature acquisition for testing) or are covered by the literature discussed so far (see for example [66]).
2.4 Missing Data and Sample Selection Bias
The study of active feature sampling must address the problem of handling
missing values in datasets when making inferences. At every sampling step
the dataset has missing values corresponding to what has not yet been
acquired. Even though some classifiers are able to deal with that problem (e.g. Bayesian MAP, decision trees, etc.), the common solution is to fill in (impute) the missing values in some way. The main reference in this area
is the book of Little and Rubin [33] which proposes a simple schema to
discriminate situations in which data is missing at random or not. This
difference affects the solution of the inference problem. The book describes
many methods to perform imputation as well. More recent work [29, 63]
presents more complex methods for specific classifiers.
A byproduct of active sampling policies is that sampling does not follow
the underlying distributions of the data. This introduces a bias which can
strongly affect all inferences. In the machine learning area the concept of sample selection bias has been discussed by Zadrozny et al. [64]: the problem of bias during the selection of samples is analyzed formally for many popular learning methods (Bayesian, logistic, decision tree, SVM), an issue relevant to active sampling. Along with demonstrating the effect of sample selection bias on most of these methods, the article shows how to take it into account and correct the estimation.
2.5 Feature Selection
Although one of the goals of active feature value acquisition might be feature selection, we emphasize that the focus of our research is data acqui-
sition. Traditionally in the feature selection domain data is already fully
collected over all instances and features. Those features are then ranked
or grouped based on the predictive power they have on class labels. Com-
prehensive surveys on feature selection are [27, 22]. Since the core problem
of acquiring data incrementally is not addressed we do not review research
in this area.
2.6 Summary
In this chapter we reviewed the literature of the main scientific areas involved in or related to this research. The problem of guiding data acquisition has traditionally been studied in statistics, under the topic of experimental design. The Bayesian approach has been called Bayesian experimental design. A list of references has been provided.
In the area of machine learning, a related problem (i.e., for which records of a dataset to acquire labels for efficient classification) has been studied under the name of Active Learning. In this research we are interested in
querying the variables describing the records, instead of only the class labels. This problem has been studied only recently (and by few authors), under the name of Active Feature Sampling. We discussed all the relevant works in this area, highlighting the limits of current approaches.
Other related topics are touched upon by our research: handling missing data, sample selection bias and feature selection. We reviewed them briefly and gave pointers to the main literature.
Chapter 3
Problem Statement and Sampling Algorithm
In this chapter we give a detailed definition of the problem under investigation and introduce the necessary notation together with some background knowledge. Then we derive one of the main results of this research: a general sampling algorithm to decide where to sample among the missing data. This algorithm ranks each candidate to be sampled, thus providing the experimenter with the information needed to select the most promising one. Under mild assumptions, this best candidate is expected to yield the highest contribution to improving the estimation of the target concept.
The basic idea of the proposed algorithm, called Maximum Average
Change (MAC) sampling algorithm, is to assess the changes in the value
of the target concept between the current estimate and the expected es-
timates obtained by measuring each candidate missing entry. The change
is computed as an expectation over all possible values that the candidate
missing entry can assume. Then, the candidate yielding the highest change
is actually sampled.
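A minimal, concept-agnostic sketch of this ranking step follows. The callables estimate_concept and predictive_prob, and the with_value accessor, are hypothetical placeholders for the problem-specific estimators instantiated in Chapter 4, and the Euclidean distance between successive estimates is only one possible choice of change measure.

```python
import numpy as np

def mac_benefit(candidate, values, dataset, estimate_concept, predictive_prob):
    """MAC benefit of a candidate missing entry: the expected change of the concept
    estimate, averaged over the possible values of the entry under the current
    predictive distribution (estimated from the data collected so far)."""
    g_now = np.asarray(estimate_concept(dataset))
    benefit = 0.0
    for v in values:
        p = predictive_prob(candidate, v, dataset)
        g_new = np.asarray(estimate_concept(dataset.with_value(candidate, v)))
        benefit += p * np.linalg.norm(g_new - g_now)
    return benefit

def mac_select(candidates, values, dataset, estimate_concept, predictive_prob):
    """Return the candidate missing entry with the largest expected change."""
    return max(candidates, key=lambda z: mac_benefit(
        z, values, dataset, estimate_concept, predictive_prob))
```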
The proposed algorithm is derived from an optimality criterion from
Bayesian experimental design, in a form that is convenient for incremental
sampling problems. At the end of the Chapter, an intuitive interpreta-
tion of the sampling algorithm is given in terms of information theory.
The reader is assumed to be familiar with basic concepts of mathematical
statistics and estimation theory.
3.1 Definitions and Notation
• Pattern instance s: a pattern instance is a monitored subject under
investigation. Example: a patient on which we conduct the medical
study.
• Instance space S: the set of all possible pattern instances. Example:
the set of all patients in a medical study.
• Sample space (or observation space), Ω: the set of all possible out-
comes of an experiment, a measurement, or a random trial. Example:
in a medical study the sample space can be the set comprising two
outcomes: presence and absence of the disease under investigation.
• Variable or Feature space, X : the set of distinct numerical values
representing all possible outcomes of an observation/experiment. Ex-
ample: the absence or presence of a disease can be represented using {0, 1}. In the following we will use the terms “variable” and “feature” with the same meaning.
• Variable (or Feature), X: a random variable describing a specific prop-
erty of the pattern instance. It maps the outcome of an experiment
to a given value, X : Ω → X . Example: in the experiment of finding
the presence of a disease in a patient, X returns 0 in case of absence
and 1 in case of presence.
• Dataset D: a dataset is a set of records describing pattern instances.
Each record corresponds to an ordered set of values describing a fea-
ture/variable value for a given pattern instance. We can represent
the dataset as an N × F matrix D, where N is the number of pattern
instances under observation and F is the number of features/variables
we can collect for each pattern instance. A dataset can have missing
values, meaning that some entries of the matrix can be unknown. Dur-
ing the process of incremental sampling we will call Dk the dataset at
sampling step k (before sampling). We assume the rows of the dataset
to be independently and identically distributed.
• Parameter space Θ: the set of all possible parameter values of the
probability distribution of a random variable. Example: for a random
variable following the Bernoulli distribution on parameter p, Θ =
[0, 1].
• Decision rule, strategy or decision function, φ: it is a function that
states which decision (e.g. estimate) to take when the experiment
yields a given outcome, φ : X → ∆. Example: consider the exper-
iment that generates N independent measures of a quantity as out-
come; then the arithmetic mean is a strategy that associates a decision
(i.e. an estimate) with the outcome of the experiment.
• Concept G: a concept learnt (or estimated) from currently available
data is a random vector function G : D → R^Q. Example 1: the error
rate of a given classifier on the data collected (Q = 1). Example 2:
the vector of relevances (see Section 1.1.2) of the variables described
by records (Q = F ).
• Missing and observed entries, Dm and Do: in a dataset D we use
the superscript m (i.e. Dm) to denote the missing entries, and the
superscript o (i.e. Do) for the observed ones.
• Set of candidates (or designs), $\mathcal{Z} = \{\mathcal{Z}_1, \dots, \mathcal{Z}_M\}$: in a dataset with
missing entries, it is the set of all possible candidates to be measured
at the current sampling step. Each candidate is a subset of the current
missing entries, defined by domain constraints. Example 1: if we are
allowed to measure only one missing entry at a time, then Zi is just
the i-th missing entry and Z is equal to the set of current missing
entries Dm. Example 2: if we can only measure batches of three missing entries at a time, then $\mathcal{Z}$ is the set of $\binom{N^m}{3}$ triplets made from the current $N^m$ missing values. Example 3: if we can access each
instance only once, then we have to measure all missing entries of
that instance at the same time; in this case Z is the set of instances
having at least one missing value. In general Z ⊂ P(Dm), where
P(Dm) is the powerset of the missing entries.
3.2 Incremental Feature Sampling for Learning a Con-
cept
We consider the problem of incremental data collection on a given set of
pattern instances in order to estimate a concept from data. Each pattern
instance can be described by a given set of variables. Initially the dataset
is partially filled (or empty) and we have to decide how to fill it under
some constraints, with the goal of efficiently improving the estimation of
the concept under investigation. These constraints are:
• Limited resources: measuring variables over subjects is costly and we
assume that it is not feasible to measure every missing value but just
a subset.
• Domain constraints: in practical applications it is not feasible to col-
lect any possible subset of the missing entries but only some specific
subsets. Limits in instrumentation, or in the measurement process in
general, restrict the set of choices. It is common to be constrained
to collect just one missing entry at a time, or a batch of missing en-
tries of given size, or all missing entries for a given pattern instance
(e.g. if the measurement destroys the pattern instance, as in the case
illustrated in Section 4.5.1).
Furthermore, we are allowed to measure all new data in a sequence of
steps instead of all at once. Every time we collect some new information
we can use it to help decide what to collect next.
3.2.1 Missing Data Mechanism
Consider a rectangular dataset D where rows are drawn independently and
identically distributed (i.i.d.) from a probability distribution. We define
an indicator matrix I such that
$$I_{ij} = \begin{cases} 1 & \text{if } x_{ij} \text{ is observed} \\ 0 & \text{if } x_{ij} \text{ is missing} \end{cases} \qquad (3.1)$$
where xij is the j-th feature value of i-th instance in the dataset. Let
the joint probability distribution of the data and the indicator matrix be
parametrized by θD, for the process generating the data, and θI for the
missingness mechanism:
$$p(D, I \mid \theta_D, \theta_I) = p(D \mid \theta_D)\, p(I \mid D, \theta_I) \qquad (3.2)$$
and let Do be the observed data and Dm the missing data. Little et al. [33,
48] define three types of missingness mechanisms:
• Missing completely at random (MCAR): when p(I|D, θI) = p(I|θI), meaning that the missingness of a value is independent of both the observed data and the missing values.
• Missing at random (MAR): when p(I|D, θI) = p(I|Do, θI), meaning
that a missing value may depend on the available values but not on
the missing ones.
• Not missing at random (NMAR): when p(I|D, θI) = p(I|Do, Dm, θI),
meaning that missing values may depend both on available data and
missing data.
In the MAR and MCAR cases it can be proved [33, 48, 19] that the joint probability distribution of the observed data and the indicator matrix factorizes as
$$p(D^o, I \mid \theta_D, \theta_I) = p(D^o \mid \theta_D)\, p(I \mid D^o, \theta_I) \qquad (3.3)$$
meaning that the estimation of θD can ignore θI , so we can estimate the
parameters governing the distribution of data without taking into account
the missingness mechanism. In the NMAR case this is not true and the
dependence between the two sets of parameters has to be taken into ac-
count.
In this research we assume that missing data is either missing completely
at random (MCAR) or missing at random (MAR). Since the proposed
method for incremental sampling exploits just the known data in order to
decide which is the best next sampling step, we can safely claim that the
MAR assumption holds. We have to assume that the initial data are MAR only when an already partially filled dataset is provided to the method at the first sampling step. When we conduct experiments on the asymptotic behavior of the sampling policies starting from an empty dataset, the MAR assumption on the initial dataset is trivially true. See Chapter 5 for details about the experimental protocols.
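To make the distinction between the mechanisms concrete, the following minimal Python sketch (not from the thesis; the dependence of missingness on an observed column is an arbitrary illustrative choice, and all names are hypothetical) generates indicator matrices I, as in Equation 3.1, under MCAR and under MAR.

    # Illustrative sketch only: indicator matrices under MCAR and MAR, where
    # the missingness of column 1 may depend on the fully observed column 0.
    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.normal(size=(100, 2))                 # complete data, 2 features

    # MCAR: p(I | D) = p(I), each entry of column 1 missing with probability 0.3
    I_mcar = np.ones_like(D, dtype=int)
    I_mcar[:, 1] = rng.random(100) > 0.3

    # MAR: missingness of column 1 depends only on the observed column 0
    I_mar = np.ones_like(D, dtype=int)
    p_missing = 1 / (1 + np.exp(-D[:, 0]))        # larger D[:, 0] -> more likely missing
    I_mar[:, 1] = rng.random(100) > p_missing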
3.2.2 Bayesian Estimation Theory
We consider the problem of estimating a random variable X whose dis-
tribution depends on a parameter vector θ. Given a parameter space Θ
containing θ and a decision space ∆ of all possible decisions (or estimates),
we can define a loss function L : Θ×∆ → R that expresses the loss we in-
cur when we decide to take a decision δ ∈ ∆ (i.e. we choose δ as estimate),
when the true state of nature (i.e., the true value to estimate) is θ ∈ Θ.
Let φ : X → ∆ be a decision function assigning decision δ = φ(X) when
X is observed. We define the risk function R : Θ×∆ → R as the expected
loss with respect to X:
$$R(\theta, \phi) = E\big[L(\theta, \phi(X))\big] = \int_{\mathcal{X}} L(\theta, \phi(x))\, p_{X|\theta}(x \mid \theta)\, dx \qquad (3.4)$$
Bayes estimation theory [10, 41, 14] defines the Bayes risk as the ex-
pectation of the risk with respect to an assumed prior distribution Pθ on
θ:
$$r(P_\theta, \phi) = E\big[R(\theta, \phi)\big] = \int_{\Theta} R(\vartheta, \phi)\, p_\theta(\vartheta)\, d\vartheta. \qquad (3.5)$$
(where Pθ has p.d.f. pθ(ϑ)) and prescribes that in order to minimize the
loss due to incorrect estimation we have to minimize the associated Bayes
risk with the decision function φ∗ s.t.:
$$\phi^* = \arg\min_{\phi}\, r(P_\theta, \phi) \qquad (3.6)$$
The Bayes risk is minimized when, for each x, the action φ∗(x) is taken, where φ∗(x) is given by
$$\phi^*(x) = \arg\min_{\phi} \int_{\Theta} L(\vartheta, \phi(x))\, p_{\theta|X}(\vartheta \mid x)\, d\vartheta. \qquad (3.7)$$
The Bayes decision rule minimizes the posterior conditional expected
loss given the observations.
In case of squared-error loss:
$$L(\theta, \delta) = (\theta - \delta)^2 \qquad (3.8)$$
the posterior expected loss given X = x:
$$E\big[L(\theta, \delta) \mid X = x\big] = \int_{\Theta} (\vartheta - \delta)^2\, p_{\theta|X}(\vartheta \mid x)\, d\vartheta \qquad (3.9)$$
is minimized by taking δ as the mean of the posterior distribution [10, 41]:
$$\hat{\theta} = \phi(x) = \delta = E[\theta \mid X = x] = \int_{\Theta} \vartheta\, p_{\theta|X}(\vartheta \mid x)\, d\vartheta. \qquad (3.10)$$
This estimate $\hat{\theta}$ is called the minimum mean-square estimate of the true value θ and is denoted by $\hat{\theta}_{MS}$.
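As a concrete illustration (not part of the original derivation; the prior hyperparameters A, B and the counts are hypothetical), the following minimal Python sketch computes the Bayes MMSE (posterior-mean) estimate of a Bernoulli parameter under a Beta prior, the same estimator used later in Section 4.4.

    # Minimal sketch: Bayes MMSE estimate of a Bernoulli parameter theta under a
    # Beta(A, B) prior and squared-error loss. With n1 ones and n0 zeros, the
    # posterior is Beta(A + n1, B + n0) and its mean is the MMSE estimate (Eq. 3.10).
    def bernoulli_mmse(n1, n0, A=1.0, B=1.0):
        """Posterior mean of theta given n1 ones and n0 zeros (hypothetical counts)."""
        return (n1 + A) / (n1 + n0 + A + B)

    if __name__ == "__main__":
        # Example: 7 ones and 3 zeros observed, uniform Beta(1, 1) prior.
        print(bernoulli_mmse(7, 3))   # 0.666..., shrunk toward the prior mean 0.5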
3.2.3 Bayesian Experimental Design
Bayesian experimental design [31, 6] provides a recipe for maximizing a utility function U in order to find, among the alternatives available to the experimenter, the optimal choice (or design) for the most effective experiment.
Let η ∈ H be a design and x ∈ X be the results of an experiment defined
by η. Let δ ∈ ∆ be a decision to take after observing x. Let θ ∈ Θ be
the unknown parameters of the problem. Then, a general utility function
is of the form U(δ, θ, η, x). The optimal design, according to the framework of Bayesian experimental design, is the one that maximizes the expected utility of the best decision, that is:
$$\eta^* = \arg\max_{\eta \in H} \int_{\mathcal{X}} \max_{\delta \in \Delta} \int_{\Theta} U(\delta, \theta, \eta, x)\, p(\theta \mid x, \eta)\, p(x \mid \eta)\, d\theta\, dx \qquad (3.11)$$
Many utility functions have been proposed in the literature. Among the most
popular are:
• The expected gain in Shannon information obtained by sampling a
design. Maximizing this utility function means maximizing the ex-
pected Kullback-Leibler divergence $D(\cdot\|\cdot)$ between the posterior (after sampling) and the prior (before sampling) distribution [31]:
$$\eta^* = \arg\max_{\eta \in H} \int_{\mathcal{X}} \int_{\Theta} \log\frac{p(\theta \mid x, \eta)}{p(\theta)}\, p(x, \theta \mid \eta)\, d\theta\, dx = \arg\max_{\eta \in H} \int_{\mathcal{X}} D\big(p(\theta \mid x, \eta)\,\|\,p(\theta)\big)\, p(x \mid \eta)\, dx. \qquad (3.12)$$
This kind of utility function leads to Bayesian D-optimal design when
derived on the normal linear regression model.
• According to Chaloner et al. [6], in the case of experiments aimed at obtaining point estimates, a quadratic loss function may be appropriate, where the loss is between the true and estimated parameter values. Maximizing the corresponding utility function gives:
$$\eta^* = \arg\max_{\eta \in H} \left( - \int_{\mathcal{X}} \int_{\Theta} (\hat{\theta} - \theta)^T A\, (\hat{\theta} - \theta)\, p(x, \theta \mid \eta)\, d\theta\, dx \right)$$
where A is a symmetric non negative definite matrix. This utility
leads to Bayesian A-optimal design when derived on the normal linear
regression model.
3.3 Maximum Average Change (MAC): Derivation
Let $D = \{d_i\}_{i=1,\dots,N}$ be a possibly incomplete dataset of records $d_i$, each corresponding to a pattern instance. The missing data pattern is assumed to be missing at random (MAR). Let the random vector $X = (X_1, \dots, X_F)$ correspond to the F variables that can be measured on any pattern instance, taking values in $\mathcal{X}_1 \times \dots \times \mathcal{X}_F$. So each record is a random vector $d_i = (X_1, \dots, X_F)$.
Let $\theta \in \Theta$ be a random vector parametrizing the joint distribution over $\mathcal{X}_1 \times \dots \times \mathcal{X}_F$. Let the random vector $G(\theta) = (G_1(\theta), \dots, G_Q(\theta)) : \Theta \to \mathbb{R}^Q$ be a vector function representing a concept we want to estimate from data.
At sampling step k we denote the partially filled dataset as $D_k$. The Bayesian minimum mean squared error (MMSE) estimate of G given $D_k$ is given by $G_k = G(D_k) = E_G[G \mid D_k]$. The mean quadratic error (MQE) of the estimate with respect to the true value at step k is then:
$$\mathrm{MQE}_k = \int_{\mathcal{G}} (G_k - G)^T A\, (G_k - G)\, p(G \mid D_k)\, dG \qquad (3.13)$$
where A is a symmetric non-negative definite matrix introduced for similarity to the utility functions of Bayesian experimental design (see Section 3.2.3). In our case, the practical meaning of A depends on the exact
definition of the random vector function G; it can be used to weight each
component of the target concept. In the simplest case A would be a di-
agonal matrix where each element represents the cost of measuring the
corresponding variable. When G is a function to assess the importance of
the variables (i.e. Q = F ), A can be used to embed a cost model inside
the sampling algorithm. Non-diagonal terms can be used to express more
complex cost models, e.g. measuring two features together has a different
cost than sampling them separately. If A is the identity matrix I, the MQE
is the mean squared error (MSE):
$$\mathrm{MSE}_k = \int_{\mathcal{G}} |G_k - G|^2\, p(G \mid D_k)\, dG \qquad (3.14)$$
where |.| is the Euclidean norm.
Let $D^m_k$ be the set of all missing entries in $D_k$. Let $\mathcal{Z}^{k+1} = \{\mathcal{Z}_1, \dots, \mathcal{Z}_M\}$ be an indexed subset of the power set $\mathcal{P}(D^m_k)$ representing the set of different candidates (designs) among which the experimenter has to decide for sampling at step k + 1. The set of constraints defining $\mathcal{Z}$ is problem dependent. Each $\mathcal{Z}^{k+1}_l$ can be represented as a set of pairs (i, f). Each
pair (i, f) represents one missing entry in the current dataset, about the
record di for which the value of Xf is currently unknown.
If we assume that at step k + 1 the subset of missing entries $\mathcal{Z}^{k+1}_l$ of the dataset is measured, obtaining the values $\mathbf{v} = (v_1, \dots, v_{N^{k+1}_l}) \in \mathcal{V}^{k+1}_l \subset (\mathcal{X}_1 \times \dots \times \mathcal{X}_F)$, then the new dataset, denoted as $D_{k+1}$ or $(D_k, \mathcal{Z}^{k+1}_l = \mathbf{v})$, has the values $\mathbf{v}$ in the missing entries described by $\mathcal{Z}^{k+1}_l$. The new MMSE estimate of G, as a function of $(l, \mathbf{v})$, will be $G_{k+1} = G(D_{k+1}) = E_G[G \mid D_k, \mathcal{Z}^{k+1}_l = \mathbf{v}]$. From now on we rename $\mathcal{Z}^{k+1}_l \to \mathcal{Z}_l$ and $\mathcal{V}^{k+1}_l \to \mathcal{V}_l$ for brevity.
The new mean quadratic error, as a function of $(l, \mathbf{v})$, is then:
$$\mathrm{MQE}_{k+1}(l, \mathbf{v}) = \int_{\mathcal{G}} (G_{k+1} - G)^T A\, (G_{k+1} - G)\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, dG \qquad (3.15)$$
Since we do not know in advance what value would be obtained if we
did sample at Zl, we need to average the above quantity over all possible
outcomes in order to estimate the predicted mean quadratic error, denoted $\mathrm{MQE}_{k+1}(l)$:
$$\mathrm{MQE}_{k+1}(l) = \int_{\mathcal{V}_l} \int_{\mathcal{G}} (G_{k+1} - G)^T A\, (G_{k+1} - G)\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, dG\, d\mathbf{v} \qquad (3.16)$$
The set of missing entries $\mathcal{Z}_{l^*}$ that minimizes the quantity above is the one yielding the lowest predicted mean quadratic error, where:
$$l^* = \arg\min_{l \in \{1,\dots,M_{k+1}\}} \mathrm{MQE}_{k+1}(l) \qquad (3.17)$$
This criterion is an application of the theory of Bayesian experimental
design (see Equation 3.11) when the utility function is Equation 3.13 where
θ = G, x = v and the design η is defined by the index l.
Now we illustrate an equivalent formulation of Equation 3.16 that is
more convenient for problems dealing with incremental sampling.
Lemma 3.3.1 In order to minimize the predicted mean quadratic error of
the next sampling at step k + 1 described in Equation 3.16, we can equiv-
alently maximize the quadratic difference B(l) between the Bayes MMSE
estimates of the concept before and after the measure is performed, averaged over the possible outcomes. That is
$$B(l) = \int_{\mathcal{V}_l} (G_{k+1} - G_k)^T A\, (G_{k+1} - G_k)\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, d\mathbf{v}. \qquad (3.18)$$
In order to show the equivalence we perform some manipulation of Equa-
tion 3.16. First we add and subtract Gk = EG[G|Dk] inside the terms of
the bilinear form and expand, so MQEk+1(l) becomes:
$$\begin{aligned}
\mathrm{MQE}_{k+1}(l) = &\int_{\mathcal{V}_l} \int_{\mathcal{G}} (G_{k+1} - G_k)^T A\, (G_{k+1} - G_k)\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, dG\, d\mathbf{v} \\
+\ 2 &\int_{\mathcal{V}_l} \int_{\mathcal{G}} (G_{k+1} - G_k)^T A\, (G_k - G)\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, dG\, d\mathbf{v} \\
+\ &\int_{\mathcal{V}_l} \int_{\mathcal{G}} (G_k - G)^T A\, (G_k - G)\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, dG\, d\mathbf{v}
\end{aligned} \qquad (3.19)$$
Since $G_{k+1}$ and $G_k$ do not have a functional dependence on G, the first term can be rewritten as:
$$\int_{\mathcal{V}_l} (G_{k+1} - G_k)^T A\, (G_{k+1} - G_k) \left( \int_{\mathcal{G}} p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, dG \right) p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, d\mathbf{v} \qquad (3.20)$$
and the second term as:
$$2 \int_{\mathcal{V}_l} (G_{k+1} - G_k)^T p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, A \left( G_k \int_{\mathcal{G}} p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, dG - \int_{\mathcal{G}} G\, p(G \mid D_k, \mathcal{Z}_l = \mathbf{v})\, dG \right) d\mathbf{v}. \qquad (3.21)$$
Of the three integrals inside brackets in the last two expressions, the first two evaluate to one (because some value of the concept must be true) and the last to $E_G[G \mid D_k, \mathcal{Z}_l = \mathbf{v}] = G_{k+1}$, showing that the second term is −2 times the first. About the third term of the sum in Equation 3.19, we note that neither $G_k$ nor G depends on where and what the new sample is (they are independent of l and $\mathbf{v}$), so this last part of the expression is constant.
The estimate of the mean quadratic error can be rewritten as:
$$\mathrm{MQE}_{k+1}(l) = \mathrm{const} - \int_{\mathcal{V}_l} (G_{k+1} - G_k)^T A\, (G_{k+1} - G_k)\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, d\mathbf{v} \qquad (3.22)$$
which shows that in order to minimize it we need to maximize the expected quadratic difference between the current estimate $G_k$ and the one at the next step
Gk+1.
We call B(l) the benefit function, which has to be maximized in order to minimize $\mathrm{MQE}_{k+1}(l)$:
$$B(l) = \int_{\mathcal{V}_l} (G_{k+1} - G_k)^T A\, (G_{k+1} - G_k)\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, d\mathbf{v}. \qquad (3.23)$$
The Bayes-optimal choice among all candidates is
$$l^* = \arg\max_{l \in \{1,\dots,M\}} B(l) \qquad (3.24)$$
when the utility function is Equation 3.13. We call this algorithm the Maximum Average Change (MAC) sampling algorithm.
3.3.1 MAC: Information Gain
Intuitively the optimal strategy to select the best candidate is the one that
selects the set of missing entries Zl that will provide the most information
regarding G. If Zl is sampled, then the information gained about G when
data Dk has already been acquired, is given by
I(Zl;G|Dk) =
∫
Vl
D(p(G|Dk,Zl = v)||p(G|Dk))p(Zl = v|Dk)dv (3.25)
where $D(\cdot\|\cdot)$ is the Kullback-Leibler divergence. Therefore the most informative set of missing entries $\mathcal{Z}_l$ to probe at iteration k + 1 is given by
$$l^* = \arg\max_{l \in \{1,\dots,M\}} I(\mathcal{Z}_l; G \mid D_k) \qquad (3.26)$$
We note that the objective function to be optimized is similar to Equa-
tion 3.12 in Bayesian experimental design. MacKay [35] calls it the total
information gain criterion.
Even when the prior distribution of G is known, this optimization is
often intractable except by expensive Monte-Carlo methods. Since the
Kullback-Leibler divergence is a measure of distance between distributions
of G, we can approximate it by the squared difference between estimates
of the function G from the data, before and after adding the new data
point. Thus to choose the most informative candidate Zl to measure, the
approximated information gain to be maximized is given by:
$$I(\mathcal{Z}_l; G \mid D_k) \simeq \int_{\mathcal{V}_l} (G_{k+1} - G_k)^T (G_{k+1} - G_k)\, p(\mathcal{Z}_l = \mathbf{v} \mid D_k)\, d\mathbf{v}. \qquad (3.27)$$
That is, the best point to sample for new data is where we have the most
expected change in our current estimate. Intuitively, since we do not know
the true value of the variable we are estimating, we can learn the most by
trying to maximize the change from our current estimate. Equation 3.27
is exactly Equation 3.23 when A = I, which means that, in the limit of the
approximations made, the MAC algorithm selects the candidate providing
the highest information at each step.
3.4 Summary
In this Chapter we described the basic notation needed and stated the sam-
pling problem in detail. The necessary background knowledge is briefly
introduced (together with references to the related literature), about a
taxonomy for patterns of missing data, Bayesian estimation theory and
Bayesian experimental design. The proposed solution, called Maximum
Average Change (MAC) sampling algorithm, has been derived from the
Bayesian experimental design framework. The MAC algorithm samples
the missing entry which minimizes the squared loss between the true value of the concept to learn and its current estimate. The minimization is done by maximizing a more manageable benefit function which depends just on
the current estimate and the next one, in an expected sense. An intuitive
interpretation of this result, in terms of information theory, was provided.
Chapter 4
MAC Implementation: Examples
and Applications
In this Chapter we deal with the problem of implementing the Maximum
Average Change (MAC) sampling algorithm, derived in Section 3.3, to
solve practical problems.
The MAC algorithm presents several challenges to implementation. First, it deals with Bayes MMSE estimates of the concept G to learn ($E[G \mid D_k]$), which assume knowledge of the prior distribution of G. Furthermore it is
necessary to compute the conditional probability pV(Zl = v|Dk) of mea-
suring given values in given missing entries of the dataset, at each sampling
step. In both cases the dataset has missing values and proper techniques
to compute probability distributions and estimates need to be applied.
In the following we show some practical uses of the MAC algorithm
as the main contributions of this research. They range from simple examples
to more complex applications. We will provide solutions to the issues
described previously using different techniques and assumptions.
As it will be shown later, the challenges to implement the MAC algo-
rithm depend on the exact target concept to learn and the constraints on
the incremental data filling scheme.
4.1 MAC: More Restrictive Assumptions
In all implementations illustrated in this chapter we restrict the assump-
tions given in the derivation of the MAC algorithm (see Chapter 3). In all
cases, from now on, we assume that there are:
1. Constraints on what can be sampled at each step: at each sampling
step the experimenter is allowed to sample only one missing entry at
a time, so she has to decide which of the remaining missing values most improves the estimate of the concept to be learned.
2. Constraints on cost model: the cost model is assumed to be flat,
meaning that every variable has the same unitary cost.
3. Availability of reasonable approximations of the Bayes MMSE estimate G of the concept to learn, when its prior probability is not available.
Even though these assumptions seem very restrictive, they arise from several needs. They are necessary to compare our algorithm to other algorithms proposed in the literature, they partly reflect the restrictions imposed by the application domains (biomedical and agricultural) and they reduce the complexity of the search space. See Chapter 6 for a description of future work aimed at relaxing some of these assumptions.
The new assumptions lead to a simplified version of the MAC algorithm.
The flat cost model implies that A = I, at least for the examples and
applications of this Chapter. Restricting what the experimenter can sample
at each step to one missing entry allows us to rewrite the set of allowed candidates at each step as
$$\mathcal{Z}^{k+1}_l \in \{(i_l, f_l)\}_{l=1,\dots,M-k} \qquad (4.1)$$
where M is the initial number of missing values, k is the sampling step,
and (i, f) denotes the missing entry of instance i on variable Xf. The number of
candidates is M − k in Equation 4.1 because after k sampling steps there
are M −k missing values remaining. The benefit function of Equation 3.23
then becomes
$$B(i, f) = \int_{\mathcal{X}_f} (G_{k+1} - G_k)^T (G_{k+1} - G_k)\, p(X_{if} = x \mid D_k)\, dx = \sum_{j=1}^{Q} \int_{\mathcal{X}_f} (G_{j,k+1} - G_{j,k})^2\, p(X_{if} = x \mid D_k)\, dx \qquad (4.2)$$
where Gj,k is the j-th component of Gk and Xif : Ω → Xf is the ran-
dom variable describing the outcomes of the missing entry (i, f) of pattern
instance i on variable Xf .
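As a concrete illustration of Equation 4.2, the following Python sketch (not from the thesis) computes the benefit B(i, f) for a single missing entry and selects the entry with the highest benefit. The functions estimate_concept and cond_prob are hypothetical placeholders for the concept estimator and for p(Xif = x | Dk); they are only specified in the concrete applications described below.

    # Sketch of the simplified (single-entry) MAC benefit of Equation 4.2.
    import numpy as np

    def mac_benefit(i, f, D, values_of, estimate_concept, cond_prob):
        """B(i, f) for a dataset D given as an (N x F) NumPy array with np.nan
        for missing entries; estimate_concept(D) returns the current estimate G_k
        as a NumPy vector and cond_prob(i, f, x, D) returns p(X_if = x | D_k)."""
        g_k = estimate_concept(D)
        benefit = 0.0
        for x in values_of[f]:               # sum over the sample space of feature X_f
            D_temp = D.copy()
            D_temp[i, f] = x                 # hypothetically fill the entry with value x
            g_k1 = estimate_concept(D_temp)  # re-estimate the concept on the filled dataset
            benefit += float(np.sum((g_k1 - g_k) ** 2)) * cond_prob(i, f, x, D)
        return benefit

    def mac_select(D, values_of, estimate_concept, cond_prob):
        """Return the missing entry (i, f) with the highest benefit."""
        missing = list(zip(*np.where(np.isnan(D))))
        return max(missing, key=lambda e: mac_benefit(int(e[0]), int(e[1]), D,
                                                      values_of, estimate_concept, cond_prob))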
4.2 Summary of Examples and Applications
Here follows a brief summary of each example and application given in this
chapter.
Examples
1. Guess my number [1]. We apply the MAC algorithm to the game of guessing a secret number, given that we know whether previous attempts were smaller or greater than the unknown value.
2. Conditional probability of two binary variables. Given two binary vari-
ables, we sample them in pairs with the target of estimating the con-
ditional probability of the second given the first. We are allowed to
choose the value of the first variable and measure the corresponding
value of the second one. We apply MAC algorithm to decide which ac-
tual value of the first variable to choose in order to estimate efficiently
the conditional probability.
Applications to Machine Learning Problems. As explained in the motiva-
tions given in Section 1.2, the main practical problem we address within
this research is estimating the relevances of variables in a class-labeled
dataset. Relevance is defined as the contribution a variable gives to pre-
dicting class labels. More detailed definitions will be given in the following
sections. Note that, even though the MAC algorithm does not explicitly
mention class labels in its formulation, it is sufficient to consider them as
another variable describing the pattern instances. The special role played
by class labels is meaningful only when computing the concept to learn G
and its estimates, whose form and formulation are domain dependent.
In this research we present two Machine Learning applications:
1. Relevance estimation of one new variable. Given a class-labelled dataset
(describing a set of monitored instances) and the error rate of a pre-
diction method on it, we want to estimate efficiently the reduction of
the error rate induced by adding one new variable. Initially the values
of this variable are unknown on all instances of the dataset. The prob-
lem is to decide on which instances to measure these missing entries,
assuming that each measurement is costly, only a limited budget is
available and only one measurement at a time is feasible. If we assume
that we have a pool of candidate new variables (instead of just one),
then we repeat the previous relevance assessment on each of them
and, eventually, select the one which improves most the prediction.
An example of this application has been introduced in Section 1.1.1.
2. Concurrent relevance estimation of multiple variables. How to effi-
ciently estimate multiple feature relevances in a labelled dataset when
some (or all) feature values are missing, and sampling is performed
incrementally and concurrently on all features? As stated before, we
assume that only one missing entry at a time can be measured and
decide which missing entry to measure through the MAC algorithm.
4.3 Example 1: Guess My Number, or Learning a
Step Function
Number [1, 61] is an early 1970s text-based computer game where the
player is requested to iteratively guess a secret number selected from a
known integer interval. After each attempt an evaluation is given back to
the player stating whether the guess is bigger or smaller than the number
to guess. The game ends when the guess matches the secret number.
This game describes an active learning problem where the player needs to
estimate an unknown value (the secret number) using incremental sampling
and based on the information of outcomes of previous attempts. In this
section we study an extension of the original game where the set of available
values for the secret number is the real interval (0, 1). We show that the
application of the MAC algorithm in this context allows us to derive the well
known binary search [8] algorithm.
In terms of the active learning problems described in this research the
game Number can be modeled as learning a step function f:
$$y = f(x) = \begin{cases} 0 & \text{if } x < \theta \\ 1 & \text{if } x \ge \theta \end{cases} \qquad (4.3)$$
where the parameter θ is the secret number to guess. We assume the prior
distribution of θ to be uniform on (0, 1): θ ∼ U(0, 1) . We are allowed to
iteratively query (sample) the value of y at any value of x of our choosing to
estimate θ. The set of monitored pattern instances is then the continuous
interval (0, 1).
After k sampling steps let $x_L$ be the highest sample point with y = 0, i.e. $x_L = \max\{x_i : f(x_i) = 0,\ i = 1,\dots,k\}$, and $x_R$ be the lowest sample point with y = 1, i.e. $x_R = \min\{x_i : f(x_i) = 1,\ i = 1,\dots,k\}$. Since the posterior distribution of θ given the data is uniform on $(x_L, x_R)$, i.e. $\theta \sim U(x_L, x_R)$, the Bayes MMSE estimate of θ given this data is
$$\hat{\theta}_{MS} = E[\theta \mid x_L, x_R] = \int_{x_L}^{x_R} \vartheta\, p(\vartheta \mid x_L, x_R)\, d\vartheta = \frac{x_L + x_R}{2} \qquad (4.4)$$
Let us now compute the MAC benefit B(x) of sampling at a point $x \in (x_L, x_R)$ at step k + 1. First, since the posterior distribution of θ is $U(x_L, x_R)$, the probability of obtaining y = 0 when x is sampled is
$$p(y = 0 \mid x) = (x_R - x)/(x_R - x_L) \qquad (4.5)$$
and the new estimate of θ, if y = 0, would be
$$\theta^0_{k+1} = (x + x_R)/2. \qquad (4.6)$$
Similarly
$$p(y = 1 \mid x) = (x - x_L)/(x_R - x_L) \qquad (4.7)$$
and the new estimate, if y = 1, would be
$$\theta^1_{k+1} = (x + x_L)/2. \qquad (4.8)$$
We can rewrite the benefit function of Equation 4.2 as
$$\begin{aligned}
B(x) &= \big(\theta^0_{k+1} - \theta_k\big)^2 p(y = 0 \mid x) + \big(\theta^1_{k+1} - \theta_k\big)^2 p(y = 1 \mid x) \\
&= \left(\frac{x + x_R}{2} - \frac{x_L + x_R}{2}\right)^2 \frac{x_R - x}{x_R - x_L} + \left(\frac{x + x_L}{2} - \frac{x_L + x_R}{2}\right)^2 \frac{x - x_L}{x_R - x_L} \\
&= \frac{(x_R - x)(x - x_L)}{4}
\end{aligned} \qquad (4.9)$$
and we note that the maximum of B(x) is attained at
$$x = \frac{x_L + x_R}{2}. \qquad (4.10)$$
Therefore the value of x that maximizes B(x) is (xL+xR)/2 implying that,
for this problem, the MAC sampling algorithm behaves identically to the
familiar binary search algorithm.
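The following short simulation (not from the thesis; function and variable names are hypothetical) plays the Number game with the MAC choice derived above: since B(x) peaks at the midpoint of (x_L, x_R), the policy coincides with binary search.

    # Minimal simulation of the Number game under the MAC policy: the benefit
    # B(x) = (x_R - x)(x - x_L)/4 is maximized at the midpoint of (x_L, x_R).
    import random

    def mac_number_game(theta, steps=20):
        x_left, x_right = 0.0, 1.0           # current interval known to contain theta
        for _ in range(steps):
            x = (x_left + x_right) / 2.0     # MAC choice: argmax of B(x)
            if x < theta:                    # observed y = f(x) = 0
                x_left = x
            else:                            # observed y = f(x) = 1
                x_right = x
        return (x_left + x_right) / 2.0      # Bayes MMSE estimate of theta

    if __name__ == "__main__":
        secret = random.random()
        estimate = mac_number_game(secret)
        print(abs(secret - estimate))        # error shrinks as 2**(-steps)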
4.4 Example 2: Estimation of Conditional Probabil-
ities
Consider the urn experiment where a set of N balls is present, an unknown fraction being black and the remainder being white. At each step
the experimenter picks one ball from the urn. Assume that the experi-
menter can request to examine one ball of the desired color (if available).
After receiving the ball the experimenter opens it: inside each ball there
is a number: 0 or 1. The goal of this experiment is to estimate the condi-
tional probability of getting 0 (or 1) given the color of the ball, examining
as few balls as possible.
Since at each step the experimenter has to decide which ball to pick
from the urn (black or white), a policy has to be defined. A very inefficient sampling policy could be to pick black balls first and white balls only
later, after the black ones are exhausted. The estimate of the conditional
probability of getting 1 (or 0) given black balls would converge extremely
fast to the true value; but the conditional probabilities given white balls
would converge extremely slowly, since no useful information would be
available before all black balls are picked. A more efficient policy would be
the random one, that picks one ball at random from the urn without taking
into account the color but according to the distribution of colors. Other
policies can be proposed. In the following we derive the MAC sampling
policy for this problem.
4.4.1 Formal Description of the Urn Experiment
Let $X_1$ and $X_2$ be binary random variables taking values in {0, 1} and having an unknown relation. Assume we can sample a pair $\langle X_1, X_2 \rangle$ by deciding which value of $X_1$ we like and then measuring the corresponding $X_2$ value. Table 4.1 shows that the p.m.f. $P(X_2 = x_2 \mid X_1 = x_1)$ depends upon two parameters, a and b.
            X2 = 0     X2 = 1
X1 = 0      a          1 − a
X1 = 1      1 − b      b

Table 4.1: Conditional probabilities P(X2 = x2 | X1 = x1) parametrized by a and b.
Assuming that the marginal distribution of X1 is P (X1 = 0) = c,
P(X1 = 1) = 1 − c, we derive the joint p.m.f. P(X1, X2), as shown in Table 4.2.
            X2 = 0             X2 = 1            marginal
X1 = 0      ca                 c(1 − a)          c
X1 = 1      (1 − c)(1 − b)     (1 − c)b          1 − c

Table 4.2: Joint probability P(X1 = x1, X2 = x2) parametrized by a, b and c; the last column reports the marginal P(X1 = x1).
4.4.2 MAC sampling algorithm
We now compute the benefit Bk+1 of sampling at X1 = 0 or X1 = 1 after k
sampling steps, using Equation 4.2. We denote ak and bk the Bayes MMSE
estimates of parameters a and b computed from data collected until step
k. Then the target concept to learn is G = (a, b) and G0,k = ak, G1,k = bk.
The benefit function when sampling at step k + 1 is then:
$$\begin{aligned}
B_{k+1}(X_1 = x_1) &= \sum_{j=1}^{Q} \int_{\mathcal{X}_f} (G_{j,k+1} - G_{j,k})^2\, p(X_{if} = x \mid D_k)\, dx \\
&= (a_{k+1,s=(x_1,0)} - a_k)^2\, P(X_2 = 0 \mid D_k, X_1 = x_1) \\
&\quad + (b_{k+1,s=(x_1,0)} - b_k)^2\, P(X_2 = 0 \mid D_k, X_1 = x_1) \\
&\quad + (a_{k+1,s=(x_1,1)} - a_k)^2\, P(X_2 = 1 \mid D_k, X_1 = x_1) \\
&\quad + (b_{k+1,s=(x_1,1)} - b_k)^2\, P(X_2 = 1 \mid D_k, X_1 = x_1)
\end{aligned} \qquad (4.11)$$
where ak+1,s=(x1,x2) denotes an estimate of a using current data Dk plus
the sample < X1 = x1, X2 = x2 > and P (X2 = x2|Dk, X1 = x1) is the
probability of obtaining X2 = x2 when sampling at X1 = x1 after observing
data Dk. Note that
$$\begin{aligned}
P(X_2 = 0 \mid D_k, X_1 = 0) &= a_k \\
P(X_2 = 1 \mid D_k, X_1 = 0) &= 1 - a_k \\
P(X_2 = 0 \mid D_k, X_1 = 1) &= 1 - b_k \\
P(X_2 = 1 \mid D_k, X_1 = 1) &= b_k.
\end{aligned} \qquad (4.12)$$
4.4.3 Explicit Benefit Function
Let $n_{ij}$, where $i, j \in \{0, 1\}$, denote the total number of samples observed at step k for which $X_1 = i$ and $X_2 = j$, as shown in Table 4.3.
            X2 = 0     X2 = 1
X1 = 0      n00        n01
X1 = 1      n10        n11

Table 4.3: Binary counts: nij is the number of observations for which X1 = i and X2 = j.
Under the assumption of squared-error loss and a Beta(A, B) prior¹, the Bayes MMSE estimators of the parameters a and b of the conditional distribution are the posterior means of Bernoulli trials under a Beta(A, B) prior (see [3]):
$$a_{MS} = \frac{n_{00} + A}{n_{00} + n_{01} + A + B} \qquad b_{MS} = \frac{n_{11} + A}{n_{10} + n_{11} + A + B}. \qquad (4.13)$$
Since the estimate of a is not affected by samples having $X_1 = 1$ and the estimate of b is not affected by samples having $X_1 = 0$, the benefit of Equation 4.11 can be rewritten as:
$$B_{k+1}(X_1 = 0) = (a_{k+1,s=(0,0)} - a_k)^2\, a_k + (a_{k+1,s=(0,1)} - a_k)^2\, (1 - a_k)$$
and
$$B_{k+1}(X_1 = 1) = (b_{k+1,s=(1,0)} - b_k)^2\, (1 - b_k) + (b_{k+1,s=(1,1)} - b_k)^2\, b_k.$$
Inserting the explicit formulas of the estimators we can derive the final expression for the benefit function at step k + 1:
$$\begin{aligned}
B_{k+1}(X_1 = 0) &= \left(\frac{n_{00} + A + 1}{n_{00} + n_{01} + A + B + 1} - \frac{n_{00} + A}{n_{00} + n_{01} + A + B}\right)^2 \frac{n_{00} + A}{n_{00} + n_{01} + A + B} \\
&\quad + \left(\frac{n_{00} + A}{n_{00} + n_{01} + A + B + 1} - \frac{n_{00} + A}{n_{00} + n_{01} + A + B}\right)^2 \left(1 - \frac{n_{00} + A}{n_{00} + n_{01} + A + B}\right) \\
&= \frac{(n_{00} + A)(n_{01} + B)}{(n_{00} + n_{01} + A + B)^2 (n_{00} + n_{01} + A + B + 1)^2}
\end{aligned} \qquad (4.14)$$
¹ Note that, in principle, parameters A and B of the prior on a are different from those of b. Here we
assume they are equal since we have no prior knowledge to assume a different belief.
when sampling at $X_1 = 0$, and
$$B_{k+1}(X_1 = 1) = \frac{(n_{11} + A)(n_{10} + B)}{(n_{10} + n_{11} + A + B)^2 (n_{10} + n_{11} + A + B + 1)^2} \qquad (4.15)$$
when sampling at $X_1 = 1$.
The previous results show some interesting aspects of the benefit formulas:
• $B_{k+1}(X_1 = 0)$ and $B_{k+1}(X_1 = 1)$ have the same form.
• The benefit of sampling at $X_1 = x_1$ always decreases if sampling is actually done, otherwise it remains constant. The reason is that the denominator of Equations 4.14 and 4.15 grows much faster ($O(n^4)$) than the numerator ($O(n^2)$).
Since the benefit of one candidate decreases when it is actually sampled,
the MAC sampling policy will alternate sampling at X1 = 0 and X1 = 1,
even though not strictly.
When the number of samples with $X_1 = 0$ is the same as the number with $X_1 = 1$, the denominators of $B_{k+1}(X_1 = 0)$ and $B_{k+1}(X_1 = 1)$ are equal. Then, the MAC policy will sample where the numerator is greater, i.e. at $X_1 = 0$ if
$$(n_{00} + A)(n_{01} + B) > (n_{11} + A)(n_{10} + B)$$
otherwise at $X_1 = 1$². If A and B are negligible with respect to $n_{ij}$ then the MAC policy will sample where the product $n_{i0} n_{i1}$ is bigger. Since the maximum of the product is attained when $n_{i0} = n_{i1}$, we claim that the MAC policy will sample where the conditional probability is closer to 1/2.
We note that the MAC policy is not affected by the marginal probability
P (X1). This is different from the random sampling policy which selects the
value of X1 only according to P (X1) and could become very inefficient in
² In case the numerators are equal, the MAC policy will choose with equal probability X1 = 0 or X1 = 1.
estimating both a and b when P (X1 = 0) ≫ P (X1 = 1). See experimental
results in Section 5.1 for a comparison of the MAC policy against the
random sampling policy.
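A minimal sketch (not part of the thesis; the counts and the Beta prior hyperparameters A, B are hypothetical) of the resulting MAC policy for the urn experiment, using the closed-form benefits of Equations 4.14 and 4.15:

    # Closed-form MAC benefits for the urn example; n[i][j] follows Table 4.3.
    def benefit(n_diag, n_off, A=1.0, B=1.0):
        """(n_diag + A)(n_off + B) / [s^2 (s + 1)^2] with s = n_diag + n_off + A + B;
        n_diag is n00 (for X1 = 0) or n11 (for X1 = 1), as in Eqs. 4.14 and 4.15."""
        s = n_diag + n_off + A + B
        return (n_diag + A) * (n_off + B) / (s ** 2 * (s + 1) ** 2)

    def mac_choose_x1(n, A=1.0, B=1.0):
        """Return the value of X1 (0 or 1) whose sampling yields the larger benefit."""
        b0 = benefit(n[0][0], n[0][1], A, B)   # Equation 4.14
        b1 = benefit(n[1][1], n[1][0], A, B)   # Equation 4.15
        return 0 if b0 >= b1 else 1

    if __name__ == "__main__":
        counts = [[3, 1], [2, 2]]              # hypothetical counts n_ij after some steps
        print(mac_choose_x1(counts))           # picks the side whose estimate can change most

With these counts the policy selects X1 = 1, where the observed conditional frequencies are closest to 1/2, consistently with the discussion above.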
4.5 Application 1: Single Feature Relevance Estima-
tion
In this application of the MAC algorithm we study the learning problem
of evaluating new candidate variables describing a set of pattern instances
with respect to given class labels. Assume that the pattern instances are
described by a known set of variables whose values, together with class
labels, are fully known in advance. A given predictor is built over this
known data and its error rate is computed. The final goal of this application
is to add one new variable to those already in use selecting it from a pool
of new candidates whose values are initially completely unknown. The
selected new variable will be the one that, together with the initial data,
will most reduce the error rate of the predictor.
Since measuring new candidate variables is assumed to be costly we
have to carefully select on which instance to perform a measurement at
each step. After collecting the new value the error rate of the prediction is
recomputed.
The whole process can be described as follows:
1. Select one new candidate variable (at random).
2. Iteratively select the most interesting missing entry (using a sampling
policy) and measure it, spending a part of the per-variable budget.
We assume that each new candidate variable has the same budget,
meaning that each one will be measured the same number of times.
3. Update the relevance of the candidate variable.
4. Loop on step 2 until budget for that variable is exhausted.
5. Go back to step 1 and select another candidate variable until all can-
didates are evaluated and the global budget is exhausted.
6. Rank candidate variables according to their estimated relevances.
7. Add the most relevant variable to the set of known variables in a new
predictor and deploy it.
If a given variable were measured on all pattern instances, then we would know the exact change in error rate due to the contribution of that variable on that dataset. The target of this process is then to efficiently estimate the change in error rate of the predictor by measuring the new variable a small number of times on the most informative instances. At the end of the process another candidate variable undergoes the same process and its reduction of the error rate is estimated. The variable that most reduces the initial error rate will be added to the initial
set of variables for building a more efficient predictor.
4.5.1 Motivation
In the following we motivate this problem in the biomedical and agricultural domains. The motivation has already been mentioned in Section 1.2. Here
we add the necessary details in order to be able to elicit an abstract for-
mulation of the underlying process. After formalization we will derive the
specific implementation of the MAC algorithm. Furthermore, we will de-
rive the implementation of a baseline random sampling algorithm and other
algorithms from the literature to enable experimental comparisons.
Cancer Characterization
Current models for cancer characterization, which lead to diagnostic/prognostic models, mainly involve histological parameters (such as grade of the disease, tumor dimensions, lymph node status) and biochemical parame-
ters (such as the estrogen receptor). The diagnostic models used for clinical
cancer care are not yet definitive and the prognostic and therapy response
models do not accurately predict patient outcome and follow up. For ex-
ample, for lung cancer, individuals affected by the same disease and equally
treated demonstrate different treatment responses, evidencing that still un-
known tumor subclasses (different histotypes) exist. This incomplete view
results, at times, in the unnecessary over-treatment of patients, that is
some patients do not benefit from the treatment they undertake. Diagnos-
tic and prognostic models used in clinical cancer care can be improved by
embedding new biomedical knowledge. The ultimate goal is to improve the
diagnostic and prognostic ability of the pathologists and clinicians leading
to better decisions for treatment and care.
Ongoing research in the study and characterization of cancer is aimed
at the refinement of the current diagnostic and prognostic models. As
disease development and progression are governed by gene and protein
behavior, new biomarkers associated with patient diagnosis or prognosis
are investigated. The identification of new potential biomarkers is recently
driven by high throughput technologies, called microarrays. They enable
the identification of genes that provide information with a potential impact
on understanding disease development and progression [21].
Although the initial high throughput discovery techniques are rapid,
they often only provide qualitative data. Promising genes are further ana-
lyzed by using other experimental approaches (focusing on DNA, RNA or
proteins), to test specific hypotheses. Usually a well characterized dataset
of tumor samples from a retrospective population of patients is identified
and the experimental process of analyzing one biomarker (feature) on one
sample at a time is conducted. These analyses are usually based on com-
parisons between 1)non-diseased (i.e., normal) and diseased (i.e., tumors)
biological samples, 2)between diseased samples pharmacologically treated
and untreated at variable time points or 3)between samples of different
diseases. The efficacy of specific biomakers can for example be determined
based on their discriminative power in distinguishing between patients with
poor or good prognosis, meaning patients with short or long overall sur-
vival respectively or cancer recurring or not recurring. This process can be
time consuming, depending on the type of experimental technique which
is adopted.
More importantly, well annotated tissue samples are very precious. Mon-
itoring the status of patients over years, even decades, and storing tissue
samples so as to be useful for studies is not trivial and requires organiza-
tional efforts. It is not uncommon, for example, that patients who received
the treatment at a hospital, will be monitored during the follow up period
in another hospital and even in another country. Therefore keeping track
of their status may become quite difficult. When the biomarker is tested
on a biological sample, a portion of the sample is consumed, implying that
each sample can be used for only a finite number of experiments. This
motivates the need to develop an active sampling approach to conserve the
valuable biological sample resource [59].
Apple Proliferation
Our interest in sampling strategies began in a research project (SMAP)3 in
the domain of agriculture, dealing with the Apple Proliferation disease [51]
3This work was funded by Fondo Progetti PAT, SMAP (Scopazzi del Melo - Apple Proliferation), art.
9, Legge Provinciale 3/2000, DGP n. 1060 dd. 04/05/01.
in apple trees. Biologists monitor a distributed collection of apple trees
affected by the disease with the goal of determining the symptoms that
indicate the presence of the disease causing phytoplasma. A data archive
is arranged with a finite set of records, each describing a single apple tree.
All the instances are labeled as infected or not infected. Each year the
biologists propose new candidate features (e.g., color of leaves, altitude of
the tree, new chemical tests etc.) that could be extracted (or measured)
to extend the archive, so as to arrive at more accurate models. Since the
data collection on the field can be very expensive or time consuming, a
data acquisition plan needs to be developed by selecting a subset of the
most relevant candidate features that are to be acquired on all trees [60].
The selection of a new candidate variable among many for the next,
extensive, data collection campaign, is accomplished through a process
equivalent to the biomarkers case.
4.5.2 Implementation: Error Rate Estimation
In order to decide how relevant each new candidate feature $X_i \in \{X_1, \dots, X_W\}$ is, in the sense introduced so far, we need to evaluate the benefit of adding it to the current set of features $\mathbf{X} = \{X_1, \dots, X_N\}$ when predicting the class label C. Therefore the concept G to be estimated is the classification error rate (denoted by ε) of a given classifier on the feature space
comprising the known features and the candidate new feature.
We assume features and class labels to be categorical and compute the
error rate by summing over the joint feature and class probabilities of all
but the winning class over the entire feature space. Although we do not
make the Naive Bayes assumption of class-conditional feature indepen-
dence, it would be easy to incorporate into the algorithm any independence
information supplied by the domain expert.
We note that our sampling algorithm can be applied with any other
appropriate measure for feature relevance (such as the width of the classi-
fication margin [55]) and estimate of the conditional feature probability.
We now briefly describe how the active sampling algorithm is imple-
mented and provide the equations for the estimation of the probability
distribution and of the classification error rate. All probability distribu-
tions are assumed multivariate categorical whose parameters are estimated
from data using Bayes MMSE estimators under uniform Dirichlet priors.
Due to the difficulty in obtaining the exact Bayes MMSE estimate of the
error rate, we approximate it by the error rate computed from the Bayes
estimate of the distribution p(C,X, X) over C × X1 × . . .XN × X , where
X ∈ X is the new candidate variable.
At a given iteration k of the active sampling process some of the in-
stances have feature value X missing. Moreover because of the active sam-
pling, the missing values are not uniformly distributed. MacKay in [35]
asserts that the biases introduced in our estimates because of non-random
sampling can be avoided by taking into account how we gathered the data.
Therefore to construct the estimator p(C,X, X) over C ×X1 × . . .XN × X
it is necessary to consider the sampling process. Since all the examples
in the database are completely described with respect to C and X, we
already have the density p(C,X). Although at any iteration of the active
sampling algorithm the X values are missing non-uniformly across various
configurations of (C,X), for each pattern instance s = (c,x) the samples
for X are independent and identically distributed. We incorporate this in-
formation in the estimator of the probability density from incomplete data
Dk as follows. We first calculate
$$p(X = x \mid D_k) = p_{D_k}(x \mid c, \mathbf{x}) = \frac{n_{c,\mathbf{x},x} + 1}{\sum_{x' \in \mathcal{X}} n_{c,\mathbf{x},x'} + |\mathcal{X}|} \qquad (4.16)$$
where $n_{c,\mathbf{x},x}$ is the number of instances of the particular combination of $(c, \mathbf{x}, x)$ among all the completely described instances in $D_k$, and $|\mathcal{X}|$ is the
size of the sample space $\mathcal{X}$. The probability density over $C \times \mathcal{X}_1 \times \dots \times \mathcal{X}_N \times \mathcal{X}$ is then calculated as
$$p_{D_k}(c, \mathbf{x}, x) = p_{D_k}(X = x \mid c, \mathbf{x}) \times p(c, \mathbf{x}) \qquad (4.17)$$
Once we have the estimate $p_{D_k}(c, \mathbf{x}, x)$, the error rate $\epsilon(D_k)$ can be estimated as
$$\epsilon(D_k) = 1 - \sum_{\mathcal{X}_1 \times \dots \times \mathcal{X}_N \times \mathcal{X}} \max_{c \in C}\, p_{D_k}(c, \mathbf{x}, x) \qquad (4.18)$$
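The following minimal Python sketch (not from the thesis; data structures and names are assumptions, and it targets only small categorical spaces) puts Equations 4.16, 4.17 and 4.18 together.

    # Plug-in error-rate estimate from a partially observed new feature X.
    from collections import Counter
    from itertools import product

    def error_rate(records, x_values, p_cx):
        """records: list of (c, x_tuple, x) for instances where X is already known.
        x_values: sample space of the new feature X.
        p_cx: dict mapping (c, x_tuple) -> p(c, x), known exactly from the full data.
        Returns the estimate of Equation 4.18."""
        counts = Counter((c, xs, x) for c, xs, x in records)
        totals = Counter((c, xs) for c, xs, _ in records)
        classes = {c for c, _ in p_cx}
        configs = {xs for _, xs in p_cx}

        def p_joint(c, xs, x):
            # Laplace-smoothed conditional (Eq. 4.16) times the known p(c, x) (Eq. 4.17)
            cond = (counts[(c, xs, x)] + 1) / (totals[(c, xs)] + len(x_values))
            return cond * p_cx[(c, xs)]

        # Eq. 4.18: one minus the mass captured by the winning class over the whole space
        return 1.0 - sum(max(p_joint(c, xs, x) for c in classes)
                         for xs, x in product(configs, x_values))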
4.5.3 Sampling Algorithms Compared
We compare the MAC sampling algorithm with others previously proposed
in the literature that can be adapted to this context (See Section 2.3).
These algorithms differ only in the benefit criterion (B) of sampling at a
given instance s = (c,x). In the following we illustrate the exact benefit
functions under comparison for each case, and start with our proposed one:
1. Maximum Average Change (MAC)
Given a candidate feature $X \in \{X_i\}_{i=1,\dots,W}$, we measure its values iteratively. In order to decide which entry to probe, we loop over all $s = (c, \mathbf{x})$ where X is missing and compute the benefit of sampling at s given by Equation 4.2, which can be rewritten as
$$B(s) = \int_{\mathcal{X}} \big(\epsilon(D_k, X_s = x) - \epsilon(D_k)\big)^2\, p(X_s = x \mid D_k)\, dx \qquad (4.19)$$
where Xs = x means that instance s has value x ∈ X on variable X.
We then perform the sampling on s with the maximum value of B(s).
2. Single Feature Lookahead (SFL)
The Single Feature Lookahead active feature sampling heuristic was
proposed in [34]. Their algorithm, which chooses an instance on which
a feature value is queried only by its class label, was developed for bud-
geted learning of Naive Bayes classifiers. We implemented a straight-
forward extension to their algorithm for choosing the combination of
both the class label and the previous feature value. The SFL algorithm
works as follows.
For each s ∈ C × X1 × . . . × XN it is assumed that d samples with
the particular s are to be probed, where d is called max-depth. The
expected error rate of the classifier is computed and the instance s with
the lowest error rate is selected and the feature value of X is collected
once. We refer the reader to [34] for more details on the algorithm.
When the max-depth parameter is set to d = 1, the SFL heuristic
reduces to the Greedy Loss Reduction heuristic, also presented in [34].
The benefit function used for active sampling can be written as
$$B(s) = -E\big[\epsilon(D_k, (X = x)_d)\big] \qquad (4.20)$$
where $(X = x)_d$ indicates that instead of just one, d samples are
assumed to be acquired. The expectation is over all the possible results
of collecting X on d samples with the particular s.
3. Impute & Minimum Error (IME)
This heuristic is identical to the Goal Oriented Data Acquisition (GODA)
algorithm proposed in [65] (albeit using a different estimate of the er-
ror rate). They first impute the value of X for the s in question from
the available data and then compute the error rate of the classifier
built after adding the imputed sample to Dk. The sample s that
yields the lowest error rate (given current data) is chosen for probing.
$$B(s) = -\epsilon(D_k, X = x^*) \qquad (4.21)$$
where
$$x^* = \arg\max_{x \in \mathcal{X}} p(X = x \mid D_k)$$
4. Random Sampling. At sampling step k the random sampling scheme
just chooses a missing entry randomly from Dk to be probed. The
results obtained from the random sampling method serve as a baseline
against which the other sampling heuristics are evaluated.
For results on comparing these different sampling algorithms and the
proposed method over synthetic and benchmark datasets, as well as on
biomarkers and agricultural domains, see Section 5.2.
4.6 Application 2: Concurrent Estimation of Multi-
ple Feature Relevances
In this application of the MAC algorithm we study again the learning prob-
lem of evaluating variables that describe pattern instances with respect to
given class labels, with the final goal of variable selection. Assume that
pattern instances are described by a set of variables and class labels in a
dataset. Assume that initially the actual class labels are fully known for
every instance, but feature values are just partially known (or completely
unknown). The experimenter is allowed to iteratively probe the missing
value of one instance over one variable at a time; measuring new values is
assumed to be costly and only a limited budget is available. The aim of this
application is then to estimate the relevances of all variables that describe
the instances when predicting class labels, while allocating efficiently the
budget for acquiring new values. In contrast to Application 1 (see Section 4.5), missing values are present in more than one variable in the dataset at each sampling step, meaning that deciding where to measure the next value means deciding not only which pattern instance to sample but also which fea-
ture. This application then defines a process where each variable competes
against the others to be selected for sampling for concurrent relevance esti-
mation. As an example of the differences with respect to Application 1, it
could happen that some features will be sampled more times than others.
The schema of this sampling process is shown in Figure 1.2.
We will clarify only later the exact method to compute the variables’
relevances. The aim is to provide a plug-in architecture for the feature
evaluation process where the experimenter can, in principle, decide among
different feature raters to better suit domain needs.
4.6.1 Motivation
The same motivation explained in Section 4.5.1 for the case of Application
1 applies here. In the case of cancer characterization using biomarkers we
refer now to a slightly different scenario where all biomarkers are evaluated
concurrently; then, the dataset comprises class labels, all known variables
and all new candidate variables. Since the missing values are present only
in this last group, the competition is just among new candidate variables.
We refer to a similar scenario for the agricultural domain investigating
Apple Proliferation disease.
4.6.2 Implementation of the Active Sampling Algorithm
We describe the sampling process of this application using Algorithm 4.6.1.
This Algorithm for active feature value acquisition is general in the sense
that it can incorporate any measure for feature relevance for which the
squared-error loss is reasonable. That is, the function EstimateRelevances(D)
in the pseudocode can be any estimate of feature relevance that can be es-
timated from a dataset with missing values.
In the following we illustrate the details of the assumptions made to
derive an implementation of the Equation 4.2. First we present the model
for data generation (i.e., the joint class-and-feature distribution), then we
explain how the conditional probabilities and feature relevances (the two
main ingredients of the MAC algorithm) can be computed given the joint
distribution.
Our model is applicable to problems with categorical-valued features. That is, we assume that every feature $X_f$ takes on a discrete and finite set of values $\mathcal{X}_f = \{1, \dots, V_f\}$.
Algorithm 4.6.1: AcquireOneMissingValue(Dk)

    G(Dk) ← EstimateRelevances(Dk)
    for each (i, f) such that record i has feature value f missing
        B[i, f] ← 0                                   comment: initialize the benefit to zero
        for each x ∈ Xf
            Dtemp ← Dk.FillValue(Xif = x)             comment: hypothetically fill the entry with x
            G(Dtemp) ← EstimateRelevances(Dtemp)
            B[i, f] ← B[i, f] + ComputeBenefit(G(Dtemp), G(Dk)) · p(Xif = x | Dk)
                                                      comment: weight by the conditional probability, as in Equation 4.2
        end
    end
    comment: find the missing entry with the highest benefit
    (i∗, f∗) ← arg max(i,f) B[i, f]
    comment: query the actual value of the selected missing entry
    x∗ ← SampleMissingValue(i∗, f∗)
    comment: fill the missing value
    Dk+1 ← Dk.FillValue(Xi∗f∗ = x∗)
    return Dk+1
4.6.3 Mixture model
A convex combination of probability distributions is called a probability mixture model, or mixture model:
$$P(X = x) = \sum_{m=1}^{M} \alpha_m\, p_m(X = x) \qquad (4.22)$$
where X is a random vector taking values $x \in \mathcal{X}$, $0 \le \alpha_m \le 1$ are the coefficients expressing the contribution of each component $p_m(X = x)$, and $\sum_{m=1}^{M} \alpha_m = 1$. Usually the probability distributions $p_m$ come from a parametric family with unknown parameter $\theta_m$. In this case the probability mixture model is
$$P(X = x) = \sum_{m=1}^{M} \alpha_m\, p(X = x \mid \theta_m). \qquad (4.23)$$
Even though continuous mixture models
$$P(X = x) = \int_{\Theta} \alpha(\theta)\, p(X = x \mid \theta)\, d\theta \qquad (4.24)$$
are considered in the literature, we do not address them in this research. See [37, 36] for a thorough review of mixture models.
(Non-)Identifiability
Due to the summation in Equation 4.22 the components of mixture mod-
els are exchangeable, in the sense that we cannot distinguish a component
from another component when we want to interpret the parameters (αm)
discovered by fitting the model. Moreover, with certain parametric fam-
ilies of probability distributions (e.g. multivariate Bernoulli) completely
different sets of components [23] lead to exactly the same mixture model. The whole problem is known in the literature as identifiability (or
non-identifiability) [37]; the first issue (exchangeability) is known as triv-
ial identifiability and the latter, specific to certain families of probability
distributions, as non-trivial identifiability.
Since we are not concerned with interpretation of the parameters of the
mixture model in this research we do not investigate the techniques to
overcome the identifiability issue. We claim that neither trivial nor non-
trivial identifiability are relevant to the use of mixture models made in this
research.
Further information on techniques to overcome the identifiability issues
can be found in the literature of mixture models, e.g. in the book of
McLachlan and Peel [37] and in Carreira-Perpinan et al. [5].
4.6.4 Class-Conditional Mixture of Product Distributions
We assume that each class-conditional feature distribution is a mixture
of M product distributions over the features. Although for our imple-
mentation it is not necessary that the number of components be constant
across classes, we make this assumption for simplicity. That is, the class-
conditional feature distribution for class c ∈ C is
$$P(X_1 = x_1, \dots, X_F = x_F \mid C = c) = \sum_{m=1}^{M} \alpha_{cm} \prod_{f=1}^{F} \prod_{x=1}^{V_f} \theta_{cmfx}^{\delta(x, x_f)} \qquad (4.25)$$
where αcm is the mixture weight of component m for class c, θcmfx is the
probability that the feature f takes on the value x for component m and
class c, and δ(.) is the Kronecker delta function. Note that if M = 1 our
model is equivalent to the Naive Bayes model.
Therefore the full class-and-feature joint distribution can be written as
$$P(C = c, X_1 = x_1, \dots, X_F = x_F) = p(C = c) \sum_{m=1}^{M} \alpha_{cm} \prod_{f=1}^{F} \prod_{x=1}^{V_f} \theta_{cmfx}^{\delta(x, x_f)} \qquad (4.26)$$
where p(C = c) is the class probability. The class-and-feature joint distribution
is completely specified by the parameters αs, θs and the class probabilities.
Before we describe how the α and θ parameters can be estimated from
a dataset with missing values, we will explain how feature relevances and
the conditional probability p(Xif = x|Dk) are calculated if the parameters
are known.
4.6.5 Calculation of Feature Relevances
We use the mutual information between a feature and the class variable as
our measure of the relevance of that feature. That is
$$G_f = I(X_f; C) = H(X_f) - H(X_f \mid C) \qquad (4.27)$$
Although we are aware of the shortcomings of mutual information as a
feature relevance measure, especially for problems where there are inter-
feature correlations, we chose it because it is easy to interpret and to
compute given the joint class-and-feature distribution. We did not use
approaches such as Relief [26], SIMBA [20] or I-Relief [55], that provide
feature weights (that can be interpreted as relevances), because they do
not easily generalize to data with missing values. See Chapter 6 for future
plans on this topic.
The entropies in Equation 4.27 can be computed as follows.
$$H(X_f) = -\sum_{x=1}^{V_f} p(X_f = x) \log p(X_f = x), \qquad p(X_f = x) = \sum_{c \in C} p(C = c, X_f = x) \qquad (4.28)$$

$$H(X_f \mid C) = -\sum_{c \in C} p(C = c) \sum_{x=1}^{V_f} p(X_f = x \mid C = c) \log p(X_f = x \mid C = c) \qquad (4.29)$$
If the α and θ parameters and p(c) of the model are known, the mutual
information can be computed as follows.
$$H(X_f) = -\sum_{x=1}^{V_f} \left( \sum_{c \in C} p(c) \sum_{m=1}^{M} \alpha_{cm}\,\theta_{cmfx} \right) \log \left( \sum_{c \in C} p(c) \sum_{m=1}^{M} \alpha_{cm}\,\theta_{cmfx} \right) \qquad (4.30)$$

$$H(X_f \mid C) = -\sum_{c \in C} p(c) \sum_{x=1}^{V_f} \left( \sum_{m=1}^{M} \alpha_{cm}\,\theta_{cmfx} \right) \log \left( \sum_{m=1}^{M} \alpha_{cm}\,\theta_{cmfx} \right) \qquad (4.31)$$
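A minimal sketch (array shapes and names are assumptions, not from the thesis, and strictly positive probabilities are assumed so the logarithms stay finite) of Equations 4.27, 4.30 and 4.31, computing the relevance of feature f directly from the mixture parameters:

    # Mutual-information relevance of feature f from the mixture parameters.
    import numpy as np

    def feature_relevance(f, p_c, alpha, theta):
        """p_c: (n_classes,) class probabilities; alpha: (n_classes, M) mixture
        weights; theta: (n_classes, M, F, V) component/feature p.m.f.s theta_cmfx."""
        # p(X_f = x | C = c) = sum_m alpha_cm * theta_cmfx  -> shape (n_classes, V)
        p_x_given_c = np.einsum("cm,cmx->cx", alpha, theta[:, :, f, :])
        p_x = p_c @ p_x_given_c                                   # marginal, Eq. 4.30
        h_x = -np.sum(p_x * np.log(p_x))                          # H(X_f)
        h_x_given_c = -np.sum(p_c[:, None] * p_x_given_c * np.log(p_x_given_c))  # Eq. 4.31
        return h_x - h_x_given_c                                  # Equation 4.27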
4.6.6 Calculation of Conditional Probabilities
For simplicity in the following we omit the subscript k in all notations even
though the values we refer to will change at each sampling step.
Since the instances in the dataset D are assumed to be drawn indepen-
dently, we have
$$p(X_{if} = x \mid D_k) = p(X_{if} = x \mid X_{obs(i)} = x_{obs(i)}, C = c_i) = \frac{p(X_{if} = x, X_{obs(i)} = x_{obs(i)} \mid C = c_i)}{p(X_{obs(i)} = x_{obs(i)} \mid C = c_i)} \qquad (4.32)$$
where obs(i) is the set of indices of the variables/features that are observed for instance i, so $X_{obs(i)}$ are the features that are observed for that instance, which take on values $x_{obs(i)}$; and $c_i$ is the class label for instance i.
Therefore the conditional probability in Equation 4.32 can be written
in terms of the parameters of the joint distribution as
p(Xif = x|Dk) =
∑Mm αcimθcimfx
∏φ∈obs(i) θcimφxiφ∑M
m αcim
∏φ∈obs(i) θcimφxiφ
(4.33)
and Xif is the random variable describing the value of feature Xf on in-
stance i, and x is actual (observed) value.
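A corresponding sketch of Equation 4.33 for a single instance, again with illustrative names, could look as follows.

```python
import numpy as np

def p_missing_value(alpha_c, theta_c, obs, f, V_f):
    """p(X_if = x | D_k) for each of the V_f possible values x (Equation 4.33).

    alpha_c : array (M,)            mixture weights of the instance's class c_i
    theta_c : array (M, F, max_V)   theta_c[m, j, x] = theta_{c_i m j x}
    obs     : dict {feature index j: observed value x_ij} for the observed features
    f       : index of the missing feature under consideration
    """
    # per-component weight: alpha_m times the likelihood of the observed values
    w = alpha_c.copy()
    for j, x_j in obs.items():
        w = w * theta_c[:, j, x_j]
    numerator = np.array([np.sum(w * theta_c[:, f, x]) for x in range(V_f)])
    return numerator / np.sum(w)     # one probability per candidate value x
```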
4.6.7 Parameter Estimation
Since after each sampling step we only have a dataset with missing values
and not the parameters αs, θs and p(c) that describe our model, they
need to be estimated from the data. Once we have the estimates, the
conditional probabilities and feature relevances can be computed by using
the estimates in place of the parameters in Equations 4.33, 4.30 and 4.31.
We will now describe how these parameters are estimated.
Estimation of p(c) : Since we have class labels of all the records in the
dataset, the estimates of the class probabilities are obtained from the
(Laplace smoothed [11, 24, 46]) relative frequencies of the classes in the
data set.
Estimation of αs and θs : We need to estimate the parameters of the class-
conditional mixture distribution for all classes. Since we have class labeled
instances we can perform the estimation separately for each class, consid-
ering only the data from that particular class. We therefore suppress the
subscript c for the parameters corresponding to the class variable in the
following equations.
Expectation-Maximization
The expectation-maximization (EM) algorithm [13, 3] is a technique for
finding maximum likelihood solutions in probability models with latent
variables. A latent variable is a variable not directly observed but inferred
from observed variables through a mathematical model. As an example
related to our context, the αm coefficients of the mixture model are latent
variables.
Assume that D is a dataset without missing data, D = {x_1, \ldots, x_N}, where each record x_i ∈ R^F. We denote the set of latent variables by Z = {z_1, \ldots, z_N}. Assume then that the joint probability distribution over all
F variables of the dataset is parametrized by θ. The log-likelihood of the
joint distribution is then
\ln p(D \mid \theta) = \ln \sum_{Z} p(D, Z \mid \theta)    (4.34)
In order to find the maximum likelihood estimate of the parameters θ we would like to maximize this log-likelihood, but the sum over the latent variables inside the logarithm makes direct maximization difficult. If Z were known we could simply maximize the complete-data likelihood p(D, Z|θ) directly, which we assume to be straightforward. The expectation-maximization algorithm provides a solution following these steps:
1. At t = 0 guess an initial value for the parameters θ^{t=0}.

2. Compute the expected value of the complete log-likelihood l_c given the current θ^t and the data D, called Q(θ, θ^t):

   Q(θ, θ^t) = E_Z[ l_c(θ \mid D, Z) \mid D, θ^t ].    (4.35)

   The expectation is taken over all possible values of the latent variables Z.

3. Find θ^{t+1} such that

   θ^{t+1} = \arg\max_{θ \in Θ} Q(θ, θ^t)    (4.36)

4. Check whether the parameter values or the likelihood have converged; if not, set

   θ^t ← θ^{t+1}    (4.37)

   and return to step 2.
Step 2 is called expectation step or E-step and step 3 is called maximization
step or M-step.
Dempster et al. [13] proved that the EM algorithm does not decrease the observed likelihood at each step, which implies that the algorithm converges to a maximum. This does not guarantee convergence to the global maximum: when the likelihood is multimodal the EM algorithm may converge only to a local maximum. Various heuristics have been proposed to escape local maxima, such as restarting from different random initial guesses.
EM: Mixture of Product Distributions
Let D_c be the part of the dataset corresponding to class c. The data log-likelihood in the case of a mixture of product distributions is given by

l(D_c, θ) = \sum_{i=1}^{N} \log \sum_{m=1}^{M} p(x_i \mid \alpha_m, \theta_m) \, p(\alpha_m)    (4.38)
When the dataset D_c has no missing values the EM update equations for the θs and αs can be shown to be

θ_{mfx}^{t+1} = \frac{\sum_{n=1}^{N} \delta(x, x_{nf}) \, h_{nm}}{\sum_{n=1}^{N} h_{nm}}    (4.39)

α_{m}^{t+1} = \frac{1}{N} \sum_{n=1}^{N} h_{nm}    (4.40)

where

h_{nm} = E[Z_{nm} = 1 \mid x_n, θ^t] = \frac{\alpha_m \prod_{f=1}^{F} \theta_{mfx_{nf}}^{t}}{\sum_{m'=1}^{M} \alpha_{m'} \prod_{f=1}^{F} \theta_{m'fx_{nf}}^{t}}    (4.41)
See Appendix A.1 for the derivation of Equation 4.39.
EM and Missing Data
The EM algorithm has another important application: learning from datasets
with missing data [33, 19].
Assume that D can be divided in the observed part D^o and the missing part D^m, meaning that each record is made of some observed values and some missing values, x_i = (x_i^o, x_i^m), and the missingness pattern is record-dependent. In this case the expectation of the E-step has to take into account both the latent variables and the missing values

Q(θ, θ^t) = E_{Z, D^m}[ l_c(θ \mid D^o, D^m, Z) \mid D^o, θ^t ].    (4.42)
EM: Mixture of Product Distributions with Missing Data
Since in our problem there are missing values, we derived the EM update equations for the mixture of product distributions, obtaining

θ_{mfx}^{t+1} = \frac{\sum_{n=1}^{N} h_{nm}^{o} \left( θ_{mfx}^{t} (1 - I_{nf}) + \delta(x, x_{nf}) \, I_{nf} \right)}{\sum_{n=1}^{N} h_{nm}^{o}}    (4.43)

α_{m}^{t+1} = \frac{1}{N} \sum_{n=1}^{N} h_{nm}^{o}    (4.44)

where

h_{nm}^{o} = E[Z_{nm} \mid x_{obs(n)}] = \frac{\alpha_m \prod_{j \in obs(n)} \theta_{mjx_{nj}}^{t}}{\sum_{m'=1}^{M} \alpha_{m'} \prod_{j \in obs(n)} \theta_{m'jx_{nj}}^{t}}    (4.45)
and where Inf = 1 when the feature f for record n is observed, otherwise
Inf = 0. Note that in the actual implementation of Equation 4.43 we per-
form Laplace smoothing to reduce estimation variance. See Appendix A.2
for the derivation of Equation 4.43.
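As an illustration, a minimal sketch of these update equations for a single class (in Python/NumPy, with illustrative names, without the Laplace smoothing used in the actual implementation, and with a fixed number of iterations instead of a convergence test) could be:

```python
import numpy as np

def em_mixture_missing(X, I, M, V, n_iter=100, seed=0):
    """EM for a mixture of product distributions with missing data for one class
    (a sketch of Equations 4.43-4.45).

    X : int array (N, F)   feature values; entries where I == 0 are ignored
    I : 0/1 array (N, F)   I[n, f] = 1 if feature f is observed for record n
    M : number of mixture components
    V : number of values each feature can take (assumed equal for all features)
    """
    rng = np.random.default_rng(seed)
    N, F = X.shape
    alpha = np.full(M, 1.0 / M)
    theta = rng.dirichlet(np.ones(V), size=(M, F))      # theta[m, f, x]

    for _ in range(n_iter):
        # E-step: responsibilities from the observed part of each record (Eq. 4.45)
        h = np.tile(alpha, (N, 1))                       # (N, M)
        for n in range(N):
            for f in range(F):
                if I[n, f]:
                    h[n] *= theta[:, f, X[n, f]]
        h /= h.sum(axis=1, keepdims=True)

        # M-step: update alpha (Eq. 4.44) and theta (Eq. 4.43)
        alpha = h.mean(axis=0)
        new_theta = np.zeros_like(theta)
        for f in range(F):
            observed = I[:, f].astype(bool)
            for x in range(V):
                num = (h[observed] * (X[observed, f] == x)[:, None]).sum(axis=0) \
                    + (h[~observed] * theta[:, f, x]).sum(axis=0)
                new_theta[:, f, x] = num / h.sum(axis=0)
        theta = new_theta
    return alpha, theta
```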
4.6.8 Comparison with Application 1
Since many assumptions in the underlying probability model and the fea-
ture relevance evaluation algorithm are different between the two appli-
cations presented in this chapter we do not compare their results in this
research. See Chapter 6 for our future plans in making this comparison
more feasible.
4.7 Computational Complexity Issues
As mentioned at the beginning of this chapter the implementation of the
MAC algorithm is the source of many challenges. Some of them are about
the computational complexity of finding the design having the highest ben-
efit. Equation 4.2 is often difficult to maximize directly: Example 1 illus-
trated in Section 4.3 is one of the very few cases for which we were able to
find an explicit form of the design having maximum benefit. In almost all
cases the benefit value has to be computed for each design and the number
of designs could be excessively large. Consider, for example, an extension
of Application 2 where at a given sampling step there are N missing val-
ues: if the target is finding the most effective subset of any size, then the number of designs is 2^N. Often this number can be reduced using domain knowledge. In all practical examples we met in our research and in the literature, the number of designs that are meaningful to evaluate for a given problem is much smaller due to domain constraints. Common examples of
such domain constraints on what is feasible to measure at each sampling
step were mentioned in Section 3.1. We report them here:
• Only one missing entry.
• All missing entries of a given pattern instance.
• A given number of missing entries (a batch).
• All missing entries of one variable.
With the help of these kinds of constraints the number of designs to evaluate becomes more tractable, but even then the computation can require considerable resources.
Another strategy to reduce the number of designs to evaluate is based
on the fact that some designs could be equivalent, i.e. leading to the same
benefit4. In some cases it is possible to know in advance whether two designs are equivalent, avoiding computing the same benefit twice. A simple example, related to Application 1 (see Section 4.5) and Application 2 (see Section 4.6), is when two pattern instances have exactly the same missing variables and identical values of the known variables; due to the independence assumption, sampling one of these pattern instances on a variable whose value is currently missing, or sampling the other pattern instance on the same variable, will lead to exactly the same benefit.
4Note that equivalent here means only that in a given sampling step the expected improvement of the estimates of the target concept is the same, given the data already collected. At a different step, or with different data previously collected, those designs could be not equivalent at all.
A further strategy to reduce the number of designs to evaluate at a given sampling step is to select uniformly at random a subset of the designs and to compute the benefit, and find its maximum, only for this subset. Even though the solution provided by this subsampling strategy is clearly sub-optimal, and the effect of the subset size on the quality of the solution is not yet well understood, we obtained preliminary experimental evidence (see Section 5.3.3) that this strategy is extremely effective. More research on this strategy is part of our future activities.
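As a simple illustration of this strategy, a sketch of the design-selection step with optional uniform subsampling (function and argument names are ours, not part of the MAC formulation) could be:

```python
import random

def select_design(designs, benefit, subset_size=None, rng=random):
    """Return the design with the highest estimated benefit.

    If subset_size is given, the benefit is evaluated only on a uniformly
    random subset of the candidate designs (the subsampling strategy above).
    """
    if subset_size is not None and subset_size < len(designs):
        designs = rng.sample(designs, subset_size)
    return max(designs, key=benefit)
```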
4.8 Summary
In this Chapter we implemented the MAC sampling algorithm on four
different problems. We added further restrictions to the assumptions made
for the general sampling problem described in Chapter 3. These restrictions
reflect common constraints among the four problems examined. These new
restrictive assumptions allowed us to derive practical solutions.
The first problem (Example 1) deals with the simple game (Number)
of guessing a number in a sequence of attempts: the application of the
MAC algorithm yields the binary search algorithm. The second problem
(Example 2) is about estimating the conditional probability of a binary
variable given another binary variable whose sampling we can control. The
third and fourth problems (Application 1 and 2) deal with estimating the
importance of new variables in prediction tasks. In Application 1 the MAC
algorithm, together with other algorithms from literature, is derived for the
target concept of learning the error rate of a MAP classifier while sampling
one new variable. In Application 2 the previous problem is extended to
the case where multiple variables are sampled and evaluated concurrently.
Experiments on the efficacy of sampling with the MAC algorithm with
respect to other policies are deferred to Chapter 5.
Chapter 5
Experiments
In this Chapter we describe the experiments conducted to support the
theoretical results and implementations shown in Chapter 3 and Chapter 4.
We start with experiments on Example 2 of Section 4.4 about estimating
the conditional probability between binary variables. Then we show results
about the two applications described in Section 4.5 and Section 4.6 on
evaluating new variables in a learning task.
We performed experiments using synthetic datasets and real-life datasets. Synthetic datasets allow us to study the behavior of the proposed algorithms in known and controlled settings; on these datasets the results are expected to confirm the theory (within the limits of the approximations made) or are used to investigate boundary behavior. Datasets from real-life problems are necessary to assess how well the method generalizes and performs in realistic cases.
In all experiments we will start with no values acquired on the vari-
ables under investigation. At each sampling step one missing value will
be selected by the sampling policy and then measured to get the actual
value. The new value together with those already collected will be used to
improve the estimate of the target concept to be learned. After measuring one value the number of unknown values is reduced by one.
The experiment ends when all missing values are disclosed. This means
that, for a dataset having initially N missing values, the sampling process
will consist of N sampling steps leading to N + 1 estimates of the con-
cept to learn (the additional estimate is made at the beginning, before any
measurement is made).
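The experimental protocol can be summarized by the following sketch (object and function names are illustrative, not part of our actual implementation):

```python
def run_experiment(dataset, policy, estimate_target):
    """Incremental sampling experiment: N missing values yield N + 1 estimates.

    dataset         : exposes the missing entries and can disclose their hidden values
    policy          : returns the missing entry to measure next
    estimate_target : computes the current estimate of the target concept
    """
    estimates = [estimate_target(dataset)]        # estimate before any measurement
    while dataset.missing_entries():
        entry = policy(dataset)                    # where to sample next
        dataset.disclose(entry)                    # measure and reveal the value
        estimates.append(estimate_target(dataset))
    return estimates
```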
In the following Sections we introduce the details of the experiments: dataset descriptions, experimental protocols and results. Final comments on the results are deferred to Chapter 6.
Note that these experiments cover all the data on which our method was tested. In a few cases only a selection of the resulting plots is shown, for lack of space; the selection is meant to be representative of all the results obtained.
The experiments and algorithms were implemented in Python on top of the open-source packages NumPy [44] and SciPy [25].
Performance Assessment and practical use. Every sampling policy consid-
ered in this research computes the benefit of sampling each missing value
at each sampling step1. Frequently many different missing entries share exactly the same benefit. Since we assume that the experimenter is allowed
to measure just one missing entry at a time, she picks one uniformly at
random. This random choice introduces a non-deterministic step in the
process. Since we are interested in investigating the expected performance
of the sampling policies, we need to repeat each experiment many times to
average out random fluctuations due to this random selection. The repe-
titions needed by the performance assessment task lead to a huge increase
of computation during experiments (usually a factor of 100 or 1000). If we
were interested in using a sampling policy just to make the actual decision on what to sample next, without being interested in average performance assessment, there would be no need to repeat each experiment. This means that the practical use of an active sampling policy is more feasible than doing performance assessment.
1In the case of the random sampling policy we can simply assume that it assigns an equal benefit to all missing values at each step.
5.1 Estimation of Conditional Probabilities
We compare the performance of MAC and random sampling policies when
learning the conditional probabilities of two binary variables X1 and X2,
while sampling data incrementally, as described in Section 4.4.
We generate datasets of N = 100 instances; each dataset is defined by
three parameters:
• a = P (X2 = 0|X1 = 0)
• b = P (X2 = 1|X1 = 1)
• c = P (X1 = 0) = 1 − P (X1 = 1)
We considered 125 instances of the parameters (a, b, c) covering uniformly the parameter space. Each parameter takes 5 values (0.1, 0.3, 0.5, 0.7, 0.9) and every combination of the three parameters is tested, leading to 5^3 = 125 generated datasets and experiments.
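For illustration, one of the 125 datasets could be generated with a sketch like the following (in Python/NumPy; names and the seeding convention are ours):

```python
import numpy as np

def generate_dataset(a, b, c, N=100, seed=0):
    """Generate N records <x1, x2> with a = P(X2=0|X1=0), b = P(X2=1|X1=1), c = P(X1=0)."""
    rng = np.random.default_rng(seed)
    x1 = (rng.random(N) >= c).astype(int)         # X1 = 0 with probability c
    p_x2_one = np.where(x1 == 0, 1.0 - a, b)      # P(X2 = 1 | X1)
    x2 = (rng.random(N) < p_x2_one).astype(int)
    return np.column_stack([x1, x2])
```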
The experiment proceeds as follows. Given a dataset and a sampling
policy we start with all values unknown, i.e., we hide all values and disclose
one at a time according to the requests of the sampling policy. The initial
estimates of conditional probabilities, i.e., a0 and b0, are then defined just
by the Beta(A, B) prior distribution as described in Equation 4.13. In these
experiments we assume that A = B = 1, meaning that a0 = b0 = 1/2. We
compute the root mean square (rms) difference between these estimates
and the true a and b values. At this point the sampling policy decides
whether to sample at X1 = 0 or X1 = 1. The corresponding value of X2 is
disclosed and a full record < x1, x2 > is revealed in the dataset. Using this
new information, new estimates a1 and b1 are computed followed by the
new rms error. After this first step, iteratively, the sampling policy decides
again where to sample next: another new pair < x1, x2 > is disclosed and
new estimates of the conditional probabilities and of the rms error can be
computed. After N sampling steps all records in the dataset are disclosed,
the estimates of a and b converge to the true values2 and the experiment
ends.
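Assuming that Equation 4.13 is the usual posterior-mean estimate under the Beta(A, B) prior (which is consistent with a0 = b0 = 1/2 for A = B = 1), the estimates a_k and b_k at a given sampling step can be sketched as:

```python
def beta_posterior_mean(n_match, n_total, A=1, B=1):
    """Posterior-mean estimate of a conditional probability under a Beta(A, B) prior.

    For a_k: n_match = #{disclosed records with X1 = 0 and X2 = 0},
             n_total = #{disclosed records with X1 = 0}; analogously for b_k.
    """
    return (n_match + A) / (n_total + A + B)
```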
For each experiment conducted we observe how the rms error between
the estimates of the parameters and the true values evolves at each sam-
pling step. See Figures 5.1, 5.2 and 5.3 as an example of these results.
We observe that the 125 results of the experiments can be summarized in six groups, as described in Table 5.1: experiments having similar values of the parameters a, b and c show similar plots of the rms error across the sampling steps.
Group#    a        b        c
1         ≠ 1/2    ≠ 1/2    ≠ 1/2
2         ≠ 1/2    ≠ 1/2    ≃ 1/2
3         ≠ 1/2    ≃ 1/2    > 1/2
4         ≠ 1/2    ≃ 1/2    < 1/2
5         ≃ 1/2    ≃ 1/2    ≠ 1/2
6         ≃ 1/2    ≃ 1/2    ≃ 1/2
Table 5.1: Brief description of the six groups of results. The first column shows the group number. The columns under a, b and c show, for each group, whether the value of the parameter is different from, close to, greater than or less than 1/2.
2Since the target of this experiment is to estimate a and b with a finite number of samples, we call
true values of a and b those obtained when all samples in the dataset are disclosed.
Representative plots of the rms errors at each sampling step for each of
the six groups are shown in Figures 5.1, 5.2 and 5.3.
In the case of the random policy all missing entries are considered equivalent: at each step one record (a pair <X1, X2>) is selected uniformly at random from the list of the unknown records and then disclosed. In the case of the MAC sampling policy the benefit of sampling at X1 = 0 or X1 = 1 has been computed using Equations 4.14 and 4.15. Two benefits are considered equivalent if they differ by less than 10^{-9}, to take into account numerical instabilities.
Note that each experiment was repeated 1000 times to average out the
random fluctuations due to random choices of the sampling policies.
5.1.1 Detailed Results
The following six groups of results describe six different patterns of behavior when comparing the evolution of the rms errors of the random and MAC sampling policies.
Group 1: a ≠ 1/2, b ≠ 1/2, c ≠ 1/2. In this group the true values of a and b differ from the initial estimates a0 = b0 = 1/2. In the case a = 0.1, b = 0.9 and c = 0.9 (see top panel of Figure 5.1) the random policy picks X1 = 0 more frequently since P(X1 = 0) > P(X1 = 1). This quickly reduces the estimation error on a but only extremely slowly on b (see top panel of Figure 5.4). Nearly 40 records have to be sampled in order to reduce the initial estimation error by 90%. The MAC sampling policy outperforms the random policy, requiring fewer than 20 records.
Group 2: a ≠ 1/2, b ≠ 1/2, c ≃ 1/2. The behavior of the MAC sampling policy is identical to the one seen in Group 1 since this policy is not affected by P(X1) (i.e., the value of c). The random policy does not incur the
[Plots for Figure 5.1: r.m.s. error vs. number of records sampled; top panel N=100, a=0.1, b=0.9, c=0.9, 1000 iterations; bottom panel N=100, a=0.1, b=0.9, c=0.5, 1000 iterations.]
Figure 5.1: The plots show the rms difference between the true conditional probabilities
and their estimates as the number of sampled records increases (averaged on 1000 rep-
etitions). The upper plot represents the typical behavior of Group 1 where the true
values of a, b and c are far from 1/2. The lower plot represents the behavior of Group 2
where a and b are far from 1/2 but c ≃ 1/2. MAC policy (•) is compared against random
policy (+).
[Plots for Figure 5.2: r.m.s. error vs. number of records sampled; top panel N=100, a=0.1, b=0.5, c=0.9, 1000 iterations; bottom panel N=100, a=0.1, b=0.5, c=0.1, 1000 iterations.]
Figure 5.2: The plots show the rms difference between the true conditional probabilities and their estimates as the number of sampled records increases (averaged on 1000 repetitions). The upper plot represents the typical behavior of Group 3 where a is far from 1/2, b ≃ 1/2 and c > 1/2. The lower plot represents the behavior of Group 4 where a is far from 1/2, b ≃ 1/2 and c < 1/2. MAC policy (•) is compared against random policy (+).
[Plots for Figure 5.3: r.m.s. error vs. number of records sampled; top panel N=100, a=0.5, b=0.5, c=0.1, 1000 iterations; bottom panel N=100, a=0.5, b=0.5, c=0.5, 1000 iterations.]
Figure 5.3: The plots show the rms difference between the true conditional probabilities
and their estimates as the number of sampled records increases (averaged on 1000 repeti-
tions). The upper plot represents the typical behavior of Group 5 where the true values
of a = b = 1/2 and c is far from 1/2. The lower plot represents the behavior of Group
6 where a = b = c = 1/2. MAC policy (•) is compared against random policy (+).
penalty of sampling at X1 = 0 more often than at X1 = 1, since P(X1 = 0) = P(X1 = 1) = c ≃ 1/2. The gain in performance of the MAC policy becomes almost negligible, showing that it approximates a policy that alternates sampling at X1 = 0 and X1 = 1. See the bottom panels of Figures 5.1 and 5.4.
Group 3: a ≠ 1/2, b ≃ 1/2, c > 1/2. Knowing in advance that the initial estimate of b is good, i.e., b0 ≃ b ≃ 1/2, a cheating sampling policy would sample initially (and for some time) only at X1 = 0 in order to immediately improve just the estimate of a. Since P(X1 = 0) = c > 1/2, the number of records having X1 = 0 is greater than the number having X1 = 1. This unbalanced dataset forces the random policy to behave like the cheating policy just described. This is the reason why in the first sampling steps of Figures 5.2 and 5.5 (top panels, a = 0.1, b = 0.5 and c = 0.9) the random policy is clearly more efficient than the MAC policy. Since the random sampling policy samples X1 = 1 less frequently than the MAC policy, its squared difference (b_k − b)^2 is greater on average after a few sampling steps. For this reason MAC sampling performs better at later sampling steps.
Group 4: a ≠ 1/2, b ≃ 1/2, c < 1/2. The behavior observed in this group of experiments is exactly the opposite of Group 3: the random policy is the opposite of the ideal (cheating) sampling policy. The MAC policy is unaffected by P(X1) and performs the same way as in Group 3. The result is analogous to Group 1, where the MAC policy outperforms the random policy. See Figures 5.2 and 5.5 (bottom panels, a = 0.1, b = 0.5 and c = 0.1).
Group 5: a ≃ b ≃ 1/2, c ≠ 1/2. Since the initial estimates a0, b0 are already good, sampling new values can only increase the rms difference between them and the true values. In the long term the unbalanced dataset drives
the random policy to an average rms estimation error larger than that of the MAC policy (see Figure 5.3, top panel). The overall behavior is similar to that of Group 1, where the MAC policy outperforms the random policy. The difference in the first sampling steps, where the random policy performs better than MAC, is explained by the fact that the random policy decreases more rapidly the squared difference between the estimate and the true value of a (if c > 1/2, otherwise of b), since it samples X1 = 0 more frequently than the MAC policy does (see Figure 5.6, top panel). This aspect is analogous to the behavior seen in Group 3.
Group 6: a ≃ b ≃ c ≃ 1/2. Since the dataset is balanced (P(X1 = 0) = P(X1 = 1) = c = 1/2), the random sampling policy samples X1 = 0 and X1 = 1 equally often (on average). We observe that the difference between the two policies is negligible, showing again that the MAC sampling policy approximates a policy that alternates sampling on the two values of X1.
5.1.2 Summary of Results
As a final result we quantify and characterize in which cases the MAC sampling policy is better than, equivalent to, or (partially) worse than the random policy across the 125 generated datasets. Table 5.2 summarizes the figures.
In no case is the MAC policy consistently worse than the random policy across all sampling steps. In a few cases (Groups 3 and 5) the random policy estimates the conditional probability more efficiently than the MAC policy, but only in the very first steps (always before step 15). After these initial steps, the MAC policy is definitely more efficient than the random policy. These cases are those of Group 3 characterized by b = 0.5, c = 0.1, 0.3 (10 cases), plus those of Group 5 where a = 0.5, b = 0.5, c = 0.1, 0.3, 0.7, 0.9 (4 cases, 2 of which are in common with Group 3).
[Plots for Figure 5.4: average estimates of a and b vs. number of records sampled, with one-standard-deviation error bars; top case N=100, a=0.1, b=0.9, c=0.9; bottom case N=100, a=0.1, b=0.9, c=0.5; 1000 iterations.]
Figure 5.4: Plots of the average estimates of parameters a and b while sampling new
data. MAC algorithm (•) is compared to random algorithm (+). Error bars, representing
one standard deviation, are plotted. Results of the case a = 0.1, b = 0.9 and c = 0.9
(representing Group 1) are plotted on top panel. Results for a = 0.1, b = 0.9 and c = 0.5
(representing Group 2) are on bottom panel. Average and error bars are computed over
1000 repetitions of the experiments.
[Plots for Figure 5.5: average estimates of a and b vs. number of records sampled, with one-standard-deviation error bars; top case N=100, a=0.1, b=0.5, c=0.9; bottom case N=100, a=0.1, b=0.5, c=0.1; 1000 iterations.]
Figure 5.5: Plots of the average estimates of parameters a and b while sampling new
data. MAC algorithm (•) is compared to random algorithm (+). Error bars, representing
one standard deviation, are plotted. Results of the case a = 0.1, b = 0.5 and c = 0.9
(representing Group 3) are plotted on top panel. Results for a = 0.1, b = 0.5 and c = 0.1
(representing Group 4) are on bottom panel. Average and error bars are computed over
1000 repetitions of the experiments.
[Plots for Figure 5.6: average estimates of a and b vs. number of records sampled, with one-standard-deviation error bars; top case N=100, a=0.5, b=0.5, c=0.1; bottom case N=100, a=0.5, b=0.5, c=0.5; 1000 iterations.]
Figure 5.6: Plots of the average estimates of parameters a and b while sampling new
data. MAC algorithm (•) is compared to random algorithm (+). Error bars, representing
one standard deviation, are plotted. Results of the case a = 0.5, b = 0.5 and c = 0.1
(representing Group 5) are plotted on top panel. Results for a = 0.5, b = 0.5 and c = 0.5
(representing Group 6) are on bottom panel. Average and error bars are computed over
1000 repetitions of the experiments.
The MAC algorithm and the random algorithm behave equivalently in Group 2 and Group 6. Group 2 is characterized by c = 0.5 (25 cases); Group 6 by a = 0.5, b = 0.5 and c = 0.5 (1 case, already included in Group 2).
In all other configurations of the experiments (88 cases) the MAC sampling algorithm outperforms random sampling.
MAC performance      # cases   % of 125   description
always worse            0         0%      never
worse, then better     12        10%      b = 0.5 with c = 0.1, 0.3; a = 0.5, b = 0.5 with c ≠ 0.5
equivalent             25        20%      c = 0.5
better                 88        70%      all other cases
Table 5.2: Summary of the results of the comparison of the MAC sampling algorithm vs. the random policy over the 125 generated datasets. The first column indicates the possible outcomes of the comparison: “always worse”, i.e., the MAC algorithm is always less efficient than the random policy across all sampling steps; “worse, then better”, i.e., MAC is less efficient only in the first sampling steps (fewer than 15 steps) and then becomes more efficient; “equivalent”, i.e., the difference between the two policies is always negligible; “better”, i.e., the MAC algorithm is consistently better than the random algorithm across all steps. The second column indicates the number of cases falling in each class, the third the percentage over the 125 cases, and the last column describes each class briefly.
5.2 Single Feature Relevance Estimation
We evaluate the performance of the MAC sampling policy compared to other sampling policies in a learning task. The target is to estimate the contribution of adding a new variable X̃ when building a classifier on a class-labelled dataset. Each instance in the dataset is described by the class label C and some already known variables X. See Section 4.5 for details.
We conducted experiments on many datasets: synthetic data generated
according to the distributions presented in Figure 5.7, on several datasets
from the University of California - Irvine (UCI) repository [42], on data
from agriculture describing Apple Proliferation disease and on data from
biomedical experiments on cancer biomarkers.
We compare five sampling policies described in Section 4.5.3:
1. random
2. MAC
3. IME
4. SFL with max-depth = 1
5. SFL with max-depth = 5.
The Single Feature Lookahead (SFL) algorithm was not investigated at
higher max-depth values because the performance was comparable to the
case with max-depth = 5, but required excessive computation.
The evaluation metric is computed as follows. For each choice of the candidate feature X̃ we calculated the expected error rate ε_full of a maximum a posteriori classifier trained on the entire database (i.e., with all the values of C, X and X̃ known). Then, for a given sample size L, we sampled X̃ values on L samples from the database (by each sampling policy) and calculated the predicted error rate ε_L for each method. We then computed the root mean square difference between ε_L and ε_full over several runs of the sampling scheme. Under the assumption of unit cost for feature value acquisition, the rms difference measures the efficacy of a sampling scheme in estimating the error rate of a classifier trained on both X and X̃ as a function of the number of feature values acquired.
We note that ε_full can be viewed as the true error rate of the classifier that uses the new variable X̃, and ε_L as the estimate of that error rate after sampling X̃ on L samples. Since our goal is to predict the error rate
accurately while minimizing L, we can measure the effectiveness of our sampling algorithm by the rms difference between ε_full and ε_L.
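In code, the evaluation metric amounts to the following sketch (names are illustrative):

```python
import numpy as np

def rms_curve(eps_full, eps_L_runs):
    """Root mean square difference between predicted and true error rate per sample size L.

    eps_full   : error rate of the classifier trained on the fully observed dataset
    eps_L_runs : array (n_runs, n_steps) with eps_L for every run and sample size L
    """
    return np.sqrt(np.mean((eps_L_runs - eps_full) ** 2, axis=0))
```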
In the following we describe each dataset in detail and provide results
of the experiments made.
Figure 5.7: True class-conditional (X, X̃1) and (X, X̃2) variable distributions. The data points before and after measuring the candidate features are also shown.
5.2.1 Synthetic Data
This dataset implements the introductory example of Section 1.1.1. We created two datasets of N = 80 instances (or records), each with two columns corresponding to the class label C and the variable X, plus a third column corresponding to the new unknown variable X̃i, i = 1, 2. The values in the datasets follow the distributions shown in Figure 5.7. This means that the classifier built on (C, X) has a non-null error rate, that X̃1 does not improve the accuracy of the classifier and that X̃2 does improve classification accuracy. X̃1 is called the useless feature and X̃2 the useful feature. The variables were discretized (indicated by dashed lines in Figure 5.7). For each dataset and each sampling policy we performed one experiment as described before.
The plots of the rms difference as a function of the number of samples probed (L) are shown in Figure 5.8. The top plot shows the average rms errors in the error rates estimated from data obtained by the random and active sampling algorithms for feature X̃1 and the bottom plot is for feature X̃2. The averages are computed over the 1000 repetitions of each experiment. In each plot, to compare the different sampling schemes for cost effectiveness, we must compare the number of feature values sampled for a required rms error. The value of the rms error at L = 0 indicates the utility of the new feature X̃, i.e., the total reduction in error rate obtained by adding the new feature.
In both cases shown in Figure 5.8 the MAC policy outperforms all other policies, except IME in the case of the useless variable. Note that for X̃1 (Figure 5.8, top panel) the rms difference climbs from zero to a higher value before converging to zero again because when no samples have been extracted the feature is deemed useless (which it really is). With a few samples, however, the estimate of the relevance is incorrect.
5.2.2 UCI Benchmark Data
We experimented with several UCI machine learning datasets with cate-
gorical features and class labels. The aim of using these datasets is to test the sampling policies on benchmark data frequently used by the Machine Learning community. Here is a list with a brief description of each:
• Solar Flare: data collected counting the number of solar flares of a
certain class that occur in a 24 hour period. Number of instances:
1389. Number of attributes: 10.
• Balance Scale: data generated to model psychological experimental
results. Each example is classified as having the balance scale tip to
the right, tip to the left, or be balanced. Number of instances: 625.
[Plots for Figure 5.8: rms difference vs. sample size (L); top panel "Synthetic Data X_new=1 (Useless)", bottom panel "Synthetic Data X_new=2 (Useful)"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.8: Synthetic Data. The root mean square difference between the estimated and
true error rate after the candidate feature is added as a function of the number of samples
probed for all sampling policies. Rms difference is averaged over 1000 repetitions of the
experiment.
90
CHAPTER 5. EXPERIMENTS 5.2. SINGLE REL.EST.
Number of attributes: 4
• MONK's problems: artificially generated data for the first international comparison of learning algorithms [57]. Number of instances: 432. Number of attributes: 7.
• Breast Cancer Wisconsin: data describing histological information on
benign and malignant breast cancer samples. Number of instances:
699 (with missing values). Number of attributes: 10.
• Mushroom: mushrooms described in terms of physical characteristics
and class labelled as poisonous or edible. Number of instances: 8124.
Number of attributes: 22.
• Zoo: artificial dataset describing 7 classes of animals. Number of
instances: 101. Number of attributes: 17.
For each dataset we performed the sampling experiment for some configurations of randomly chosen pairs of features. From each pair, the first variable was used as the known feature X and the second as the candidate feature X̃3.
3For a few cases, we chose three features and used the first two as already known features and the last one as the candidate feature.
For each configuration and sampling method we plotted the rms difference between the true and estimated error rate for different numbers of acquired samples. Results are shown in Figures 5.9, 5.10, 5.11, 5.12, 5.13 and 5.14. Since the computational complexity of the active sampling algorithms is at least linear in the number of instances, the number of experiments we performed is larger for small datasets and small feature spaces. For the same reason the experiments on larger datasets end before acquiring all samples.
For lack of space we present only a selection of the most representative plots in Figures 5.9, 5.10, 5.11, 5.12, 5.13 and 5.14. As those
figures show, the MAC sampling policy performs better than the random policy in every case and better than any other policy in the majority of the cases. The MAC policy is always better for assessing the relevance of useless variables.
5.2.3 Data from Agriculture Domain
We evaluated the different sampling strategies on a small dataset provided
by the biologists which, after preprocessing, has 520 instances (apple trees)
with one binary class variable: the presence or absence of the phytoplasma
determined by the Enzyme-Linked ImmunoSorbent Assay (ELISA) chem-
ical test [15, 52]. Three binary features were considered:
• Presence of witches' brooms on the trees, i.e., a proliferation of secondary shoots near the apex of the main shoot, in summer.
• Reddening of leaves.
• Presence of enlarged stipulae.
We performed the sampling experiments using the six possible combi-
nations of pairs of features. For each pair the first feature was used as
previously known feature X and the second as new feature X . In Table 5.3
we show the number of values of the candidate feature that have to be
acquired for the rms difference between the estimated and true error rate
to be less than 0.005. For comparison we note that the true error rate is
calculated using all 520 samples.
An example of the results is plotted in Figure 5.15, showing the performance of the sampling policies when the presence of witches' brooms is known and the reddening of leaves is acquired incrementally. In all of these experiments the IME and SFL (d = 5) sampling policies performed worse than the random policy. In all six cases the MAC policy outperforms all other policies.
[Plots for Figure 5.9: rms difference vs. sample size; top panel "UCI Solar Flares dataset : X = 3 - X_new = 4", bottom panel "UCI Solar Flares dataset : X = 7 - X_new = 8"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.9: Solar Flares dataset. The average root mean square difference between the
estimated and true error rate after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms difference is averaged over 100
repetitions of the experiment.
[Plots for Figure 5.10: rms difference vs. sample size; top panel "UCI Balance dataset : X = 1 - X_new = 3", bottom panel "UCI Balance dataset : X = 3 - X_new = 2"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.10: Balance Scale dataset. The average root mean square difference between the
estimated and true error rate after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms difference is averaged over 100
repetitions of the experiment. Due to the large size of the feature space only the first 100
samples are acquired.
[Plots for Figure 5.11: rms difference vs. sample size; top panel "UCI Monks dataset : X = 5 - X_new = 6", bottom panel "UCI Monks dataset : X = 6 - X_new = 2"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.11: Monks dataset. The average root mean square difference between the
estimated and true error rate after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms difference is averaged over 100
repetitions of the experiment.
[Plots for Figure 5.12: rms difference vs. sample size; top panel "UCI Breast Cancer dataset : X = 3 - X_new = 4", bottom panel "UCI Breast Cancer dataset : X = 5 - X_new = 3"; policies: random, MAC, IME, SFL d=1.]
Figure 5.12: Breast Cancer Wisconsin dataset. The average root mean square difference
between the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
100 repetitions of the experiment. Due to the large size of the feature space only the first
100 samples are acquired.
[Plots for Figure 5.13: rms difference vs. sample size; top panel "UCI Mushroom dataset : X = 21 - X_new = 5", bottom panel "UCI Mushroom dataset : X = 15 - X_new = 6"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.13: Mushroom dataset. The average root mean square difference between the
estimated and true error rate after the candidate feature is added as a function of the
number of samples probed for all sampling policies. Rms difference is averaged over 100
repetitions of the experiment. Due to the large size of the feature space only the first 100
samples are acquired.
[Plots for Figure 5.14: rms difference vs. sample size; top panel "UCI Zoo dataset : X = 13 - X_new = 15", bottom panel "UCI Zoo dataset : X = 2,11 - X_new = 1"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.14: Zoo dataset. The average root mean square difference between the estimated
and true error rate after the candidate feature is added as a function of the number of
samples probed for all sampling policies. Rms difference is averaged over 100 repetitions
of the experiment.
[Plot for Figure 5.15: rms difference vs. sample size (L); panel "S.M.A.P. project : X = 1 - X_new = 2"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.15: Apple Proliferation dataset. The average root mean square difference be-
tween the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
100 repetitions of the experiment.
X    X̃     Random   MAC   IME   SFL (d = 1)   SFL (d = 5)
1    2      414      379   470   420           463
1    3      394      308   467   398           470
2    1      444      332   483   444           485
2    3      393      320   462   401           464
3    1      445      333   484   443           486
3    2      414      394   464   420           458
Table 5.3: Agriculture Data. The number of samples required (out of a total of 520)
for the rms difference between the true and estimated error rate to be less than 0.005
for various sampling algorithms. Each row corresponds to one selection of the previous
feature X and the candidate feature X̃. The rms values are computed over 1000 runs.
5.2.4 Data from Biomedical Domain
To provide evidence that our method is effective in reducing costs in the biomedical domain we experimentally evaluated it on a breast cancer Tissue Microarray dataset4. Although the biomarker evaluation problem is not relevant for this particular dataset, we use it to demonstrate the utility of our approach.
4The data used for experimentation was collected by the Department of Histopathology and the Division of Medical Oncology, St. Chiara Hospital, Trento, Italy. Tissue Microarray experiments were conducted at the Department of Histopathology [12].
The dataset was acquired using the recently developed technique of
Tissue Microarray [28] that improves the in-situ experimentation process
by enabling the placement of hundreds of samples on the same glass slide.
Core tissue biopsies are carefully selected in morphologically representative
areas of original samples and then arrayed into a new "recipient" paraffin block, in an ordered array allowing for high-throughput in situ experiments.
For each patient there is a record that describes her clinical, histological and biomarker information. The entire dataset consisted of 400 records defined by 11 features. Each of the clinical features is described by a binary
status value and a time value. Some of the records have missing values.
The data are described by features as in Table 5.4.
Clinical Features
1. the status of the patient (binary, dead/alive) after a certain amount of time
(in months, integer from 1 to 160)
2. the presence/absence of tumor relapse (binary value) after a certain amount
of time (in months, integer from 1 to 160 months)
Histological Features
3. diagnosis of tumor type made by pathologists (nominal, 14 values)
4. pathologist’s evaluation of metastatic lymph nodes (integer valued)
5. pathologist’s evaluation of morphology (called grading, ordinal, 4 values)
Biomarker Features (manually measured by experts in TMA)
6. Percentage of nuclei expressing the ER (estrogen receptor) marker.
7. Percentage of nuclei expressing the PGR (progesterone receptor) marker.
8. Score value (combination of color intensity and percentage of stained area measurements) of the P53 (tumor suppressor protein) marker in cell nuclei.
9. Score value (combination of color intensity and percentage of stained area measurements) of the cerbB marker in cell membranes.
Table 5.4: Features describing biomarkers data.
The learning task defined on this dataset is the prediction of the status
of the patient (dead/alive or relapse) given some previous knowledge (his-
tological information or known biomarkers). The goal is to choose the new biomarker which, used along with the histological features, provides accurate prediction. The experiments address the issue of learning which additional feature has to be sampled.
The dataset was preprocessed as follows. Continuous features were discretized to reduce the level of detail and to narrow the configuration space for the sampling problem. Discretized features were encoded into binary variables according to the convention suggested by experts in the domain.
We designed 10 experiments corresponding to different learning situa-
tions. The experiments differ in the choice of attribute for the class label
(C), the attributes used as the previous features (X) and the feature used as the new candidate feature (X̃). The various configurations are shown in Table 5.5.
Exp. Class Label (C) Known Features (X) New Feature (X̃) Size (#)
I dead/alive all histological information PGR 160
II dead/alive all histological information P53 164
III dead/alive all histological information ER 152
IV dead/alive all histological information cerbB 170
V relapse all histological information PGR 157
VI relapse all histological information P53 161
VII relapse all histological information ER 149
VIII relapse all histological information cerbB 167
IX dead/alive PGR, P53, ER cerbB 196
X relapse PGR, P53, ER cerbB 198
Table 5.5: Configurations of experiments on biomarkers data.
For the empirical evaluation we performed an additional preprocessing
step of removing all the records with missing values for each experiment
separately. For this reason the sizes of datasets used for different experi-
ments are different.
Experiments and Results
For each of the 10 experimental configurations described above, all sampling policies are compared for different numbers of acquired samples.
For each experiment we plotted the average rms value against the number of samples probed; the plots are shown in Figures 5.17, 5.18, 5.19, 5.20 and 5.21. The average is performed over 500 repetitions of each experiment. In each plot, to compare the MAC sampling scheme to the other methods for cost effectiveness, we must compare the number of feature values sampled for
a required rms error.
We observe from the plots that our MAC active sampling algorithm is significantly better than all other sampling policies, in almost all cases, in reducing the number of samples needed for error rate estimation.
In terms of the biomedical problem, this implies that using MAC active sampling we can evaluate a larger number of biomarkers with the same amount of bio-sample resources than with the standard random sampling method or with other methods available in the literature.
5.3 Multiple Feature Relevance Estimation: Experiments
We evaluate the performance of the MAC sampling policy compared to the random sampling policy in a learning task. The target is to assess the
relevance of a set of variables with respect to predicting the class labels
describing a set of instances. The relevance assessment is done on all
variables while sampling them concurrently as described in Section 4.6.
We conducted experiments on synthetic data and on datasets from the
University of California - Irvine (UCI) repository [42]. We plan to conduct
experiments on datasets from apple proliferation disease in agriculture and
from biomarkers in medicine (see Section 4.5.1) only as a future activity.
For a particular dataset, the experimental setup is as follows. We start
with the assumption that the class labels for all the samples are initially
known and all of the feature values are missing. At each sampling step a
single missing entry in the dataset is selected by the sampling policy and
the actual value in the dataset is disclosed. The experiment ends when
all entries of the dataset are sampled and all the original feature values
are fully disclosed. After each sample is disclosed we estimate the feature
relevances from all the data that is currently available, which are compared
[Plots for Figure 5.17: rms difference vs. sample size (L); top panel "Biomarkers exp.1", bottom panel "Biomarkers exp.2"; policies: random, MAC, IME, SFL d=1, SFL d=5.]
Figure 5.17: Biomarkers experiment I and II. The average root mean square difference
between the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
500 repetitions of the experiment.
[Plots for Figure 5.18: rms difference vs. sample size (L); top panel "Biomarkers exp.3", bottom panel "Biomarkers exp.4".]
Figure 5.18: Biomarkers experiment III and IV. The average root mean square difference
between the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
500 repetitions of the experiment.
[Plots for Figure 5.19: rms difference vs. sample size (L); top panel "Biomarkers exp.5", bottom panel "Biomarkers exp.6".]
Figure 5.19: Biomarkers experiment V and VI. The average root mean square difference
between the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
500 repetitions of the experiment.
[Plots for Figure 5.20: rms difference vs. sample size (L); top panel "Biomarkers exp.7", bottom panel "Biomarkers exp.8".]
Figure 5.20: Biomarkers experiment VII and VIII. The average root mean square differ-
ence between the estimated and true error rate after the candidate feature is added as
a function of the number of samples probed for all sampling policies. Rms difference is
averaged over 500 repetitions of the experiment.
[Plots for Figure 5.21: rms difference vs. sample size (L); top panel "Biomarkers exp.9", bottom panel "Biomarkers exp.10".]
Figure 5.21: Biomarkers experiment IX and X. The average root mean square difference
between the estimated and true error rate after the candidate feature is added as a function
of the number of samples probed for all sampling policies. Rms difference is averaged over
500 repetitions of the experiment.
to the true feature relevance values (the feature relevances estimated from
the entire dataset). The comparison measure is the average rms error,
which is plotted as a function of the number of missing entries filled thus
far. The average is computed over 100 sampling runs to reduce fluctuations
introduced by random selection of entries, in case of multiple equivalent
choices occurring at certain steps. The plots show the comparison of our
active sampling algorithm (MAC) to the random sampling algorithm.
Although the models we presented are general, we only experimented with mixture distributions (cf. Section 4.6.4) with one component per class (i.e., a Naïve Bayes model). We did not perform experiments with a higher number of components because of estimation problems during the initial sampling steps and also because of computational issues. In the future we intend to develop methods to adjust the number of components depending on the amount of data available at any sampling step.
5.3.1 Synthetic Data
We now describe how the synthetic dataset was generated. We created a
dataset of size N = 200 samples with binary class labels and three binary
features with exactly 100 records per class (i.e., p(c = 0) = p(c = 1) =
0.5). The features are mutually class-conditionally independent and have different relevances to the class labels.
The feature values are generated randomly according to the following scheme. For feature Xi we generate the feature values according to the probability P(Xi = 0|C = 0) = P(Xi = 1|C = 1) = pi. Clearly, if pi is closer to 0 or 1 the feature is more relevant for classification than if pi is closer to 0.5. For our three features we chose p1 = 0.9, p2 = 0.7 and p3 = 0.5, meaning that the first feature is highly relevant and the third is completely irrelevant for classification. The true feature relevances (mutual information values) are r1 = 0.37, r2 = 0.08 and r3 = 0, respectively.
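The generation scheme and the true relevances can be sketched as follows (Python/NumPy; names are illustrative; mutual information is in natural units, consistent with the values above):

```python
import numpy as np

def make_synthetic(ps=(0.9, 0.7, 0.5), n_per_class=100, seed=0):
    """Dataset with binary class labels and binary features X_i such that
    P(X_i = 0 | C = 0) = P(X_i = 1 | C = 1) = ps[i]."""
    rng = np.random.default_rng(seed)
    c = np.repeat([0, 1], n_per_class)                    # balanced class labels
    X = np.empty((2 * n_per_class, len(ps)), dtype=int)
    for i, p in enumerate(ps):
        agree = rng.random(2 * n_per_class) < p           # feature "agrees" with its rule
        X[:, i] = np.where(c == 0, 1 - agree, agree)      # C=0: X_i=0 w.p. p; C=1: X_i=1 w.p. p
    return X, c

def true_relevance(p):
    """I(X_i; C) in nats: H(X_i) = log 2 because P(X_i = 0) = 1/2, minus H(X_i | C)."""
    if p in (0.0, 1.0):
        return np.log(2)
    return np.log(2) + p * np.log(p) + (1 - p) * np.log(1 - p)
```

For p1 = 0.9, p2 = 0.7 and p3 = 0.5 this yields approximately 0.37, 0.08 and 0, matching the values reported above.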
Since by construction there is no inter-feature dependence given the
class, we conducted experiments using a product distribution for each class
(i.e., a mixture of just one component). The average rms distance between the estimated and true feature relevances is plotted as a function of the number of feature values sampled in Figure 5.22, for both the random and MAC sampling policies.
The graph in Figure 5.22 shows that our proposed active scheme clearly
outperforms the random acquisition policy. For example, note that in order
to reduce the difference between estimated and true relevances to a fourth
of the initial value (when all feature values are missing), the random policy
requires 45 samples, whereas the MAC policy requires only 30.
In Figure 5.23 we show, separately, the estimates of each of the indi-
vidual feature relevances. In Figure 5.24 we show the average number of
times each feature is sampled as a function of the number of samples. We
observe that the frequency with which a feature is sampled is correlated
to its relevance. This is a desirable property because the least relevant
features will eventually be discarded and therefore sampling them would
be wasteful.
5.3.2 UCI Datasets
We performed experiments on the Zoo, Solar Flares, Monks and Cars
datasets from the UCI repository. See Section 5.2.2 for a brief description
of them. These datasets present larger class label spaces (from 2 to 6
classes) and an increased number of features (from 6 to 16). Also some of
the features take on more values (from 2 to 6 values) than our artificial
datasets. Figures 5.25 and 5.26 show the plots of the average rms error
between the estimated and ’true’ feature relevances as a function of the
number of samples acquired for both the MAC and the random sampling
schemes. The error values are normalized such that at step 0 (i.e., when
none of the missing entries has been filled) the error is 1.0.
[Figure 5.22, panel 'Synthetic'; x-axis: sample size, y-axis: rms difference; curves: random, active.]
Figure 5.22: Average rms differences between estimated and true relevances at each sam-
pling step on artificial data for random and MAC policies. Average performed over 100
repetitions of the experiment. Only the first 100 sampling steps are shown to improve
readability. Note that true relevances are those computed from the full dataset, and not
the theoretical ones which are slightly different.
[Figure 5.23, panel 'Synthetic'; x-axis: sample size, y-axis: relevance; curves: feature 1, feature 2, feature 3, each for the random and active policies.]
Figure 5.23: Estimated relevances at each sampling step for every single feature on ar-
tificial data. Random (dashed line) and MAC (solid-dotted line) policies are compared.
Since there are three features and 200 instances the x axis goes to 600.
Figures 5.25 and 5.26 illustrate the advantage of the active sampling
policy over the random scheme in reducing the number of feature samples
necessary for achieving comparable accuracy. We note that in order to
reduce the estimation error of feature relevances to one fourth of the initial
value, the number of samples required is 25% – 75% lower for the MAC
policy than for the random policy. Again, we observed in all datasets
that the most relevant features are sampled more frequently than the less
relevant ones.
5.3.3 Computational Complexity Issues
The computational complexity of the MAC sampling algorithm, due to the
expensive EM estimation (which is repeated for every missing entry, every
possible feature value and every iteration), limits its applicability to large
datasets.
[Figure 5.24, panel 'Synthetic'; x-axis: sample size, y-axis: sampling counts; curves: features 1, 2, 3 - random; feature 1 - active; feature 2 - active; feature 3 - active.]
Figure 5.24: Average cumulative sampling counts at each sampling step for each feature
on artificial data. The more relevant features are sampled more frequently than less
relevant features in case of MAC policy. As a comparison, the random policy samples
features independently of their relevance.
[Figure 5.25, two panels: Zoo and Monks; x-axis: sample size, y-axis: rms difference; curves: random, active.]
Figure 5.25: The normalized difference between final relevances and estimated relevances
at each sampling step is plotted for random and MAC policies on Zoo dataset (top panel)
and the Monks dataset (bottom panel). The value at step 0 (all feature values unknown) is
normalized to 1.0.
[Figure 5.26, two panels: Solar Flares and Cars; x-axis: sample size, y-axis: rms difference; curves: random, active.]
Figure 5.26: The normalized difference between final relevances and estimated relevances
at each sampling step is plotted for random and MAC policies on Solar Flares dataset
(top panel) and Cars dataset (bottom panel). The value at step 0 (all feature values
unknown) is normalized to 1.0.
One way we reduced the computational expense was to memoize5
the calculation of the benefit function for equivalent entries (i.e., entries
having the same non-missing feature values and therefore the same benefit
value). Another strategy to reduce computation is to perform sub-optimal
active sampling by considering only a random subset of the missing entries
at each time step. This latter strategy can be used to trade off sampling
cost against computational cost.
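As an illustration of the memoization idea (the benefit callable, the data layout and the cache key below are hypothetical placeholders, since the real benefit computation depends on the EM-based model of Section 4.6), a minimal sketch could be:

benefit_cache = {}  # maps (feature index, observed feature pattern) -> benefit value

def memoized_benefit(n, f, data, benefit):
    """Return the benefit of probing feature f of record n, reusing cached values.

    data is an N x F list of lists with None marking missing entries; benefit is
    the expensive EM-based scoring function.  Entries with the same observed
    pattern (and the same probed feature) share the same benefit, so the
    computation is done once per distinct key."""
    key = (f, tuple(data[n]))
    if key not in benefit_cache:
        benefit_cache[key] = benefit(n, f, data)
    return benefit_cache[key]

# The cache must be cleared after every actual acquisition, because the model
# parameters (and hence all benefit values) change:
# benefit_cache.clear()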
In Figure 5.27 (upper panel) the MAC and random policies are shown
together with the MAC policy that considers 0.1% and 1% of the missing
entries (randomly selected) at each sampling step; results are based on the
artificial dataset described in Section 5.3.1. We observe that the dominance
of the MAC policy over the random one increases with the sub-sample
size. A similar experiment was performed on the Solar Flares dataset (see
Figure 5.27, bottom panel), where the MAC and random policies are plotted
together with the MAC policy that considers 0.05%, 0.25% and 1% of the
missing entries (randomly selected). Again we observe that running the MAC
policy on a random sub-portion of the candidates (0.25% of the total
number of missing entries at any step) is an effective strategy to obtain
a reduction in the number of samples acquired at reduced computational
cost.
More investigation of this preliminary evidence is needed in order to
confirm and better understand this desirable property.
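A sketch of the sub-optimal variant, under the same hypothetical interface as the previous snippet, could select the next entry as follows; the 1% default mirrors the order of magnitude of the fractions explored in Figure 5.27.

import random

def select_entry_subsampled(missing_entries, data, benefit, fraction=0.01, rng=random):
    """Pick the next entry to probe, scoring only a random subset of the candidates.

    missing_entries is a list of (record, feature) pairs that are still unobserved;
    fraction is the portion of candidates evaluated at this step (0.01 = 1%)."""
    k = max(1, int(fraction * len(missing_entries)))
    candidates = rng.sample(missing_entries, k)
    # Score only the sampled candidates with the (possibly memoized) benefit function.
    return max(candidates, key=lambda nf: benefit(nf[0], nf[1], data))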
5Memoization [40, 43] is a programming technique that stores the input and the output of a deter-
ministic function in a hash table, in order to avoid repeating the exact same computation multiple times
for performance reasons. The next time the same computation is requested, the result is retrieved from
the hash table instead of being computed again.
[Figure 5.27, two panels: Synthetic and Solar Flares; x-axis: sample size, y-axis: rms difference; curves (top): random, active 0.1%, active 1%, active; curves (bottom): random, active 0.05%, active 0.25%, active.]
Figure 5.27: Average squared sum of the differences between estimated and true relevances
at each sampling step on the artificial and UCI Solar Flares datasets. Random and MAC
policies are compared to the active policy that considers only a small random subset of
the missing entries at every sampling step.
5.4 Summary
In this chapter we experimentally evaluated the efficacy of the MAC sam-
pling algorithm with respect to other policies on three problems described
in Chapter 4 (i.e., Example 2, Application 1 and 2). Example 2, estima-
tion of the conditional probability, was tested on 125 synthetic datasets.
The MAC sampling policy outperformed the random sampling policy in most
of the cases; in all other cases they showed an equivalent behavior. In
Application 1, assessing the relevance of a new variable, the comparison
of the MAC algorithm to other methods (random sampling and others from
the literature) was carried out on multiple datasets, from a simple syn-
thetic dataset to more complex cases. The latter are benchmark datasets
(from the UCI repository [42]) and two real life domains (agricultural data and
biomarkers data). In all cases the MAC algorithm outperformed random sam-
pling and proved to be clearly the best among the compared algorithms. In
Application 2, assessing the relevance of multiple variables concurrently, the
comparison of the MAC algorithm and the random algorithm was tested on syn-
thetic and UCI datasets. The MAC algorithm required significantly fewer
data acquisitions to reach the same accuracy as random
sampling. Preliminary evidence of the efficacy of a technique to reduce the
computational costs of the MAC policy was also described.
Chapter 6
Conclusions and Directions for
Future Work
In this Chapter we summarize the main achievements of this research and
illustrate new directions for future work.
6.1 Main Contributions
In Chapter 1 we introduced the problem of active feature sampling using
two examples on feature relevance estimation. A generalization of the
problem was presented as well. We motivated this work on the basis of two
domains of application: studying apple proliferation disease in agriculture
and biomarkers for cancer characterization in medicine.
Maximum Average Change (MAC) Algorithm. In Chapter 3 we introduced
a novel sampling algorithm derived from the theory of Bayesian experimen-
tal design. It rates each sampling candidate according to its expected con-
tribution to estimating a target concept. The algorithm, called Maximum
Average Change (MAC) sampling algorithm, is based on the computation
of a benefit function, which estimates the squared difference between the
current MMSE estimate of the concept and the one at the next sampling
step. In Section 3.3 we prove that by maximizing this quantity we mini-
mize the mean quadratic error of the current estimate with respect to the
true value of the concept. As a result, from the hard problem of comput-
ing expectations over the true value of the target concept, we move to a
simpler problem of averaging over the possible results at the next sampling
step.
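A schematic rendering of this selection rule in Python follows; estimate, posterior_predictive and the filled_with method are hypothetical stand-ins for the model-specific MMSE estimator and predictive distribution, so this is a sketch of the criterion rather than the implementation used in the thesis.

def mac_benefit(candidate, data, outcomes, posterior_predictive, estimate):
    """Expected squared change of the target estimate if `candidate` is probed.

    For each possible outcome y, the benefit accumulates the probability of y
    (under the current posterior predictive distribution) times the squared
    difference between the current estimate and the estimate recomputed with
    the candidate entry filled in by y."""
    current = estimate(data)
    value = 0.0
    for y in outcomes:
        p_y = posterior_predictive(candidate, y, data)       # P(outcome = y | data)
        updated = estimate(data.filled_with(candidate, y))   # estimate at the next step
        value += p_y * (updated - current) ** 2
    return value

def mac_select(candidates, data, outcomes, posterior_predictive, estimate):
    """Choose the candidate with Maximum Average Change."""
    return max(candidates,
               key=lambda c: mac_benefit(c, data, outcomes, posterior_predictive, estimate))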
Information Theoretic Interpretation. In Section 3.3.1 we provided an in-
tuitive interpretation of the MAC sampling algorithm in terms of information
theory. Intuitively, the most interesting missing entry to sample is the one
that will give the most information about the target concept, given the
data collected so far. Under the approximation that maximizing the distance
between two distributions is equivalent to maximizing the distance
between their expected values, we obtain the MAC algorithm.
Example 1: Learning a Step Function. As a first implementation of the
MAC algorithm we derived its application to the game of guessing a num-
ber, iteratively. In Section 4.3 we showed that the familiar binary search
algorithm maximizes the benefit function derived from the problem, an
indication of the MAC algorithm’s effectiveness.
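A minimal sketch of the resulting strategy, assuming an oracle that answers whether the hidden number is greater than or equal to a query (the query interface is our own illustration), is the familiar bisection loop:

def guess_number(lo, hi, is_greater_or_equal):
    """Locate the hidden number in [lo, hi] by always probing the midpoint.

    is_greater_or_equal(q) answers whether the hidden number is >= q; under a
    uniform prior the midpoint is the query with maximum expected change of
    the current estimate, i.e. the MAC choice."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_greater_or_equal(mid + 1):
            lo = mid + 1           # hidden number is in the upper half
        else:
            hi = mid               # hidden number is in the lower half
    return lo

# Example: the hidden number is 37 within [0, 99].
print(guess_number(0, 99, lambda q: 37 >= q))   # -> 37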
Example 2: Estimation of Conditional Probability. A second implementa-
tion of the MAC sampling algorithm was derived for the problem of es-
timating the conditional probability between two binary variables. Full
derivation of the benefit functions was provided together with a brief anal-
ysis of their properties. See Section 4.4.
Application 1: Single Feature Relevance Estimation. The first main appli-
cation of the MAC algorithm, that actually initiated our interest in this
topic, deals with the estimation of the error rate reduction caused by a
new variable when added to a class labelled dataset. We present this ap-
plication with a more detailed introduction to the problem of biomarkers
for cancer characterization and the data collection issues of studying the
apple proliferation phytoplasma. In Section 4.5, efficient assessment of the
error rate reduction, i.e., using as few measurements of the new variable
as possible, is derived using the MAC algorithm and other sampling algo-
rithms adapted from the literature. The latter are Single Feature Looka-
head (SFL) algorithm (for depth equal to 1 and 5) and the Goal Oriented
Data Acquisition (GODA) algorithm. A MAP classifier over categorical
data is postulated to represent the prediction problem and to evaluate the
comparison between MAC and the competing algorithms.
Application 2: Concurrent Estimation of Multiple Feature Relevances. The
second main application of the MAC algorithm, investigated in this re-
search, deals with assessing the relevance of multiple variables concurrently.
This application is presented in Section 4.6. Relevance of a feature is de-
fined as the contribution of a variable to predicting the class labels. In
this application we model the underlying joint class-and-features probabil-
ity distribution with a class-dependent mixture model of product distri-
butions. We infer the parameters from incomplete data via the Expectation-
Maximization algorithm and compute the components of the benefit func-
tion with the inferred parameters. A brief analysis of the (non-)identifiability
problems of the data generation model and the computational issues re-
lated to this application are provided.
Experiments on Simulated and Real Test Data. We conducted several ex-
periments on synthetic and real data. In the experiments on Example 2
(conditional probability of two binary variables) the MAC algorithm out-
performed random sampling in 70% of the cases, performed equivalently
in 20%, and in the remaining 10% MAC was less efficient than random only
in the very beginning of the sampling process (in the first 10-15 steps) and
then outperformed random sampling again.
In Application 1 (single feature relevance estimation) MAC policy al-
most always outperformed the random policy on every tested dataset; on
these datasets it performed significantly better than any other method
found in the literature. Thus, the MAC sampling algorithm demonstrated
its usefulness in both domains of application (agriculture and medicine).
Experiments on Application 2 (concurrent estimation of the relevances
of multiple features) showed that the MAC algorithm outperforms random
sampling both on synthetic data and real data from benchmark datasets.
Experimentally we observe two useful properties of the MAC algorithm: it
samples the most relevant features more often and the irrelevant features less often.
To reduce the computations we can evaluate just a small (1%) random
subset of the candidates (i.e., missing entries) and still obtain most of the
efficiency of the full method. This last property needs more investigation.
6.2 Conclusions
This research introduces a novel strategy for collecting data in problems
where collection is expensive and can be done incrementally. It demon-
strates the effectiveness of this strategy for reducing the number of mea-
surements needed to make inferences in comparison to those required by
other sampling methods.
A drawback of goal-oriented sampling policies is that reusing the data
for different targets requires careful evaluation. The random sam-
pling policy does not suffer from this drawback. Throughout the investigations
made in this research, the random sampling policy has been an enduring
competitor, much more than expected.
The proposed sampling strategy, resulting in the Maximum Average
Change sampling algorithm, was derived in its general form from the theory
of Bayesian experimental design. Even though its practical implementation
can be challenging even in simple cases, we propose its use in all data
collection problems under hard budgets.
6.3 Directions for Future Work
The novelty of our approach and the scarcity of scientific literature on
this topic meant that many unexplored problems and ideas
emerged during this study. In the following we describe some of these
directions which, we believe, are the main topics that should be addressed
from now on.
Relaxing Restrictive Assumptions. In Chapter 4 we introduced some re-
strictive assumptions in order to match the requirements of the problems
presented there and to allow a detailed implementation of the MAC algo-
rithm. These restrictions were about:
• Using categorical data.
• Measuring one missing value of the dataset at a time.
• Knowing class labels in advance (only for class labelled datasets).
• Using a flat cost model.
Each of these items defines a new direction for future research, both from
the point of view of the results of this work and of the current literature.
Reduce Computational Complexity. The computational burden of comput-
ing the MAC benefit for each candidate to be sampled is a major issue for
extending experiments for average performance assessment. Moreover, all of
the proposed new directions, which aim at relaxing some of the assumptions
just described, have a huge impact in terms of increased computational
complexity. This is a major issue for this research and only preliminary
attempts to solve it were provided (i.e., the subsampling strategy).
Stopping Criterion. A question that arises when sampling expensive data
is when to stop sampling. This research does not address this issue and
relies on the domain of application for an answer. But this issue can be
thought of as part of the data collection process. We found evidence that, in the initial
sampling steps, the MAC algorithm could perform worse than random
sampling in a few cases. Knowing when (and until when) this undesirable
effect holds would improve the efficacy of using the MAC algorithm. To
the best of our knowledge only Williams in [62] has provided a solution to
this kind of problem, but his solution is applicable only to his proposed
method and context. A generally applicable criterion to stop sampling is
still unknown.
Besides the new directions for medium and long term research described
so far, we plan short term activities as well. They involve mainly Applica-
tion 2, which is the most recent problem we studied. The activities we plan
to work on are: first, a comparison of Application 2 with Application 1,
motivated by the fact that they provide a solution to the same question
(how to improve a prediction model); second, extending the feature rater
to other algorithms like Relief [26], SIMBA [20] or I-Relief [55]; third,
testing Application 2 on the datasets from agriculture and medicine presented
in this research; fourth, exploring and characterizing the subsampling
strategy introduced in Section 5.3.3; and fifth, extending the experiments
to the case in which the data generation model is a mixture of more than
one component.
Appendix A
EM for a Mixture of Product
Distributions
In the following we derive Equations 4.39 and 4.43 as an application of the
Expectation-Maximization (EM) algorithm (see Section 4.6.7) to the esti-
mation of the class-conditional mixture of product distributions (Equation 4.25)
introduced in Section 4.6.4. The two derivations refer to the cases of com-
plete and missing data.
A.1 Complete Data
The class-conditional mixture of product distributions of Section 4.6.4 is
defined by
P(X_1 = x_1, \dots, X_F = x_F \mid C = c) = \sum_{m=1}^{M} \alpha_{cm} \prod_{f=1}^{F} \prod_{x=1}^{V_f} \theta_{cmfx}^{\delta(x, x_f)}    (A.1)
where the αs and θs are the parameters defining the model. From now on we
focus on one class, without loss of generality:
P(X_1 = x_1, \dots, X_F = x_F) = \sum_{m=1}^{M} \alpha_{m} \prod_{f=1}^{F} \prod_{x=1}^{V_f} \theta_{mfx}^{\delta(x, x_f)}    (A.2)
and call θ the whole set of parameters defining the distribution, i.e.,

\theta = \{\theta_{mfx}\}_{m \in M,\, f \in F,\, x \in V_f} \cup \{\alpha_m\}_{m \in M}.    (A.3)
In the case of complete data (i.e., no missing values) the latent variables
z of the complete model denote which of the components of the mixture
actually generated each record in the dataset. We define Z as the N × M
matrix of the latent variables, where Z_{nm} = 1 if record x_n has been gener-
ated by component m and Z_{nm} = 0 otherwise. Note that \sum_{m=1}^{M} Z_{nm} = 1.
Then

p(x \mid Z_{\bullet m} = 1, \theta) = \prod_{f=1}^{F} \prod_{x=1}^{V_f} \theta_{mfx}^{\delta(x, x_f)}    (A.4)

and

\alpha_m = p(Z_{\bullet m} = 1 \mid \theta) = \frac{1}{N} \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta).    (A.5)
The E-step of the EM algorithm involves the calculation of

Q(\theta, \theta^t) = E_Z[\, l_c(\theta \mid D, Z) \mid D, \theta^t \,],    (A.6)
which is given by

Q(\theta, \theta^t) = \sum_{m=1}^{M} \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t)\, \ln\!\left[ p(x_n \mid Z_{nm} = 1, \theta)\, p(Z_{nm} = 1 \mid \theta) \right]    (A.7)

= \sum_{m=1}^{M} \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t) \left[ \sum_{f=1}^{F} \sum_{x=1}^{V_f} \delta(x, X_{nf}) \ln \theta_{mfx} + \ln p(Z_{nm} = 1 \mid \theta) \right],
and the M-step is

\theta^{t+1} = \arg\max_{\theta \in \Theta} Q(\theta, \theta^t).    (A.8)
In Equation A.7 the last part of Q(\theta, \theta^t) involves two terms,

A = \sum_{f=1}^{F} \sum_{x=1}^{V_f} \delta(x, X_{nf}) \ln \theta_{mfx}
\qquad
B = \ln p(Z_{nm} = 1 \mid \theta),    (A.9)
which can be maximized separately. We call Q_A(\theta, \theta^t) and Q_B(\theta, \theta^t) the
two parts of Q. In both cases it is a constrained maximization and we
use Lagrange multipliers. The respective constraints are

\sum_{x=1}^{V_f} \theta_{mfx} = 1
\qquad
\sum_{m=1}^{M} p(Z_{nm} = 1 \mid \theta) = 1.
Consider the first term:

L_A(\theta) = Q_A(\theta, \theta^t) - \sum_{m=1}^{M} \sum_{f=1}^{F} \lambda_{mf} \sum_{x=1}^{V_f} \theta_{mfx}    (A.10)

is maximized when

\frac{\partial L_A}{\partial \theta} = 0,    (A.11)

which means

\sum_{n=1}^{N} \delta(x, X_{nf}) \frac{\partial \ln \theta_{mfx}}{\partial \theta_{mfx}}\, p(Z_{nm} = 1 \mid x_n, \theta^t) - \lambda_{mf} = 0.    (A.12)
Rewriting, summing over x, and taking the constraint into account we get

\lambda_{mf} = \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t),    (A.13)

which, reinserted in the previous equations, gives the first of the two update
equations:

\theta^{t+1}_{mfx} = \frac{\sum_{n=1}^{N} \delta(x, X_{nf})\, h_{nm}}{\sum_{n=1}^{N} h_{nm}},    (A.14)

where

h_{nm} = p(Z_{nm} = 1 \mid x_n, \theta^t) = \frac{\alpha_m \prod_{f=1}^{F} \theta^t_{mfx_{nf}}}{\sum_{m'=1}^{M} \alpha_{m'} \prod_{f=1}^{F} \theta^t_{m'fx_{nf}}}.    (A.15)
Considering Q_B(\theta, \theta^t), again we maximize it when

\frac{\partial L_B}{\partial \theta} = 0,    (A.16)

which means

\frac{\partial}{\partial\, p(Z_{nm} = 1 \mid \theta)} \left[ Q_B(\theta, \theta^t) - \sum_{m=1}^{M} \lambda_m\, p(Z_{nm} = 1 \mid \theta) \right] = 0.    (A.17)

Expanding the partial derivative we get

\sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t)\, \frac{\partial \ln p(Z_{nm} = 1 \mid \theta)}{\partial\, p(Z_{nm} = 1 \mid \theta)} - \lambda_m = 0.    (A.18)
Rewriting, summing over m and considering the constraint, we obtain

\lambda_m = \sum_{m=1}^{M} \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t),    (A.19)

which, reinserted in the last equation, gives

p(Z_{nm} = 1 \mid \theta) = \frac{\sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t)}{\sum_{m=1}^{M} \sum_{n=1}^{N} p(Z_{nm} = 1 \mid x_n, \theta^t)},    (A.20)
which means

\alpha^{t+1}_m = \frac{1}{N} \sum_{n=1}^{N} h_{nm}.    (A.21)
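As an illustration only (not the thesis code), the following NumPy sketch performs one complete-data EM iteration using the responsibilities of Equation A.15 and the updates A.14 and A.21; for simplicity it assumes that every feature takes values in {0, ..., V-1}.

import numpy as np

def em_step_complete(X, alpha, theta):
    """One EM iteration for a mixture of product distributions with complete data.

    X     : (N, F) integer array of categorical feature values in {0, ..., V-1}
    alpha : (M,) mixing proportions
    theta : (M, F, V) categorical parameters, theta[m, f, x]
    Returns the updated (alpha, theta)."""
    N, F = X.shape
    M, _, V = theta.shape
    # h[n, m] = p(Z_nm = 1 | x_n, theta)                        (Equation A.15)
    lik = np.ones((N, M))
    for f in range(F):
        lik *= theta[:, f, X[:, f]].T        # theta[m, f, x_nf] for every n and m
    h = alpha * lik
    h /= h.sum(axis=1, keepdims=True)
    # delta[n, f, x] = 1 if X[n, f] == x                         (Equation A.14)
    delta = (X[:, :, None] == np.arange(V)).astype(float)
    theta_new = np.einsum('nm,nfx->mfx', h, delta) / h.sum(axis=0)[:, None, None]
    alpha_new = h.mean(axis=0)                                   # Equation A.21
    return alpha_new, theta_new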
A.2 Incomplete Data
When some entries in the dataset are missing, we consider them as new
latent variables. The application of the EM algorithm is identical to the
previous case, except that expectations are taken over the missing data
as well.
The E-step is

Q(\theta, \theta^t) = E[\, l_c(\theta \mid D^o, D^m, Z) \mid D^o, \theta^t \,],    (A.22)
where D^o is the observed part of the dataset and D^m the missing part.
The complete likelihood

l_c(\theta \mid D^o, D^m, Z) = \ln\!\left[ p(D^o, D^m \mid Z, \theta)\, p(Z \mid \theta) \right] = \ln p(D^o, D^m \mid Z, \theta) + \ln p(Z \mid \theta)    (A.23)

is thus made of two parts, A = p(D^o, D^m \mid Z, \theta) and B = p(Z \mid \theta). Q can
then be maximized by considering each of the two terms separately:

Q(\theta, \theta^t) = E[\ln p(D^o, D^m \mid Z, \theta)] + E[\ln p(Z \mid \theta)] = Q_A + Q_B.    (A.24)
We consider the first term and rewrite it by taking into account the
MAR assumption (see Section 3.2.1):

p(D^o, D^m \mid Z, \theta) = p(D^o \mid Z, \theta)\, p(D^m \mid Z, \theta),    (A.25)

so Q_A becomes

Q_A(\theta, \theta^t) = \sum_{n=1}^{N} \sum_{m=1}^{M} E[\, Z_{nm} \ln p(x_n \mid Z_{nm} = 1, \theta) \,].    (A.26)
Each expectation in the double summation of Equation A.26 can be rewrit-
ten as the sum of its observed and missing parts:

E[Z_{nm} \ln p(x_n \mid Z_{nm} = 1, \theta)] = \sum_{f \in obs(n)} \sum_{x=1}^{V_f} E[Z_{nm} \mid D^o, \theta^t]\, \delta(x_{nf}, x)\, \ln \theta_{mfx} + \sum_{f \in mis(n)} \sum_{x=1}^{V_f} E[Z_{nm}\, \delta(x_{nf}, x) \mid D^o, \theta^t]\, \ln \theta_{mfx},    (A.27)
where obs(n) is the set of observed variables for record x_n and mis(n)
the set of missing ones. Note that in the second term x_{nf} is unknown
and cannot be brought outside the expectation. We define h^o_{nm} as the first
expectation, i.e., the expected value of Z_{nm} given the observed data and
\theta^t:

h^o_{nm} = E[Z_{nm} \mid D^o, \theta^t] = \frac{p(x^o_n \mid Z_{nm} = 1, \theta^t)\, p(Z_{nm} = 1 \mid \theta^t)}{\sum_{m'=1}^{M} p(x^o_n \mid Z_{nm'} = 1, \theta^t)\, p(Z_{nm'} = 1 \mid \theta^t)},    (A.28)
which can be computed since we know all the parameters at step t and
because

p(D^o \mid Z, \theta) = \prod_{n=1}^{N} \prod_{f \in obs(n)} \prod_{x=1}^{V_f} \theta_{m(n),f,x}^{\delta(x_{nf}, x)} = \prod_{n=1}^{N} \prod_{m=1}^{M} \prod_{f \in obs(n)} \prod_{x=1}^{V_f} \theta_{mfx}^{\delta(x_{nf}, x)\, Z_{nm}},    (A.29)

where m(n) denotes the component that generated record x_n.
The final formula for h^o_{nm} is then

h^o_{nm} = \frac{\alpha^t_m \prod_{f \in obs(n)} \theta^t_{mfx_{nf}}}{\sum_{m'=1}^{M} \alpha^t_{m'} \prod_{f \in obs(n)} \theta^t_{m'fx_{nf}}}.    (A.30)
For the second expectation, which corresponds to the missing part of the
dataset, we note that

p(Z_{nm}\, \delta(x_{nf}, x) \mid D^o, \theta^t) = p(Z_{nm} \mid D^o, \theta^t)\, p(\delta(x_{nf}, x) \mid Z_{nm}, \theta^t)    (A.31)

by Bayes' formula and the independence of the values of different features inside
each component. Then

E[Z_{nm}\, \delta(x_{nf}, x) \mid D^o, \theta^t] = E[Z_{nm} \mid D^o, \theta^t]\, E[\delta(x_{nf}, x) \mid Z_{nm}, \theta^t] = h^o_{nm}\, p(x_{nf} = x \mid Z_{nm} = 1, \theta^t).    (A.32)
Taking into account that

p(D^m \mid Z, \theta) = \prod_{n=1}^{N} \prod_{f \in mis(n)} \prod_{x=1}^{V_f} \theta_{m(n),f,x}^{\delta(x_{nf}, x)} = \prod_{n=1}^{N} \prod_{m=1}^{M} \prod_{f \in mis(n)} \prod_{x=1}^{V_f} \theta_{mfx}^{\delta(x_{nf}, x)\, Z_{nm}},    (A.33)
and inserting Equations A.28 and A.32 into Equation A.27, we obtain the
explicit form of Q_A:

Q_A(\theta, \theta^t) = \sum_{n=1}^{N} \sum_{m=1}^{M} \sum_{f=1}^{F} \sum_{x=1}^{V_f} \left[ \theta^t_{mfx}\,(1 - I_{nf}) + \delta(x_{nf}, x)\, I_{nf} \right] h^o_{nm}\, \ln \theta_{mfx},
where I is the indicator matrix, i.e., Inf = 1 if the variable f of record n is
observed, otherwise Inf = 0.
In order to maximize Q_A under the constraint that \sum_{x=1}^{V_f} \theta_{mfx} = 1,
we use Lagrange multipliers in the same way we did for the case of
complete data. It is straightforward to obtain

\theta^{t+1}_{mfx} = \frac{\sum_{n=1}^{N} h^o_{nm} \left[ \theta^t_{mfx}\,(1 - I_{nf}) + \delta(x_{nf}, x)\, I_{nf} \right]}{\sum_{n=1}^{N} h^o_{nm}}.    (A.34)
We note that the maximization of Q_B is analogous to the case of com-
plete data, yielding the equivalent update equation for the α parameters:

\alpha^{t+1}_m = \frac{1}{N} \sum_{n=1}^{N} h^o_{nm}.    (A.35)
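Analogously, a NumPy sketch of one EM iteration with missing entries, implementing Equations A.30, A.34 and A.35 under the same simplifying assumptions as before (again our own illustration, not the thesis code), is:

import numpy as np

def em_step_incomplete(X, I, alpha, theta):
    """One EM iteration with missing entries (Equations A.30, A.34 and A.35).

    X : (N, F) integer array; entries where I == 0 are ignored (missing)
    I : (N, F) indicator array, I[n, f] = 1 if feature f of record n is observed
    alpha, theta : current parameters with shapes (M,) and (M, F, V)."""
    N, F = X.shape
    M, _, V = theta.shape
    # h_o[n, m] = p(Z_nm = 1 | observed part of x_n, theta)      (Equation A.30)
    lik = np.ones((N, M))
    for f in range(F):
        obs = I[:, f] == 1
        lik[obs] *= theta[:, f, X[obs, f]].T
    h_o = alpha * lik
    h_o /= h_o.sum(axis=1, keepdims=True)
    # Observed entries contribute delta(x_nf, x), missing ones theta[m, f, x]  (A.34)
    delta = (X[:, :, None] == np.arange(V)).astype(float)        # (N, F, V)
    obs_part = np.einsum('nm,nfx,nf->mfx', h_o, delta, I.astype(float))
    mis_weight = np.einsum('nm,nf->mf', h_o, 1.0 - I)            # (M, F)
    theta_new = (obs_part + theta * mis_weight[:, :, None]) / h_o.sum(axis=0)[:, None, None]
    alpha_new = h_o.mean(axis=0)                                  # Equation A.35
    return alpha_new, theta_new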
Bibliography
[1] David Ahl. What to Do After You Hit Return or P.C.C.’s First Book
of Computer Games. Nowels Publications, Menlo Park, CA, 1975.
[2] Jose M. Bernardo. Expected information as expected utility. The
Annals of Statistics, 7(3):686–690, May 1979.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning
(Information Science and Statistics). Springer, August 2006.
[4] Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training
algorithm for optimal margin classifiers. In Computational Learning
Theory, pages 144–152, 1992.
[5] Miguel A. Carreira-Perpinan and Steve Renals. Practical identifiabil-
ity of finite mixtures of multivariate bernoulli distributions. Neural
Computation, 12(1):141–152, 2000.
[6] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review.
Statistical Science, 10:273–304, 1995.
[7] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with
statistical models. In G. Tesauro, D. Touretzky, and T. Leen, editors,
Advances in Neural Information Processing Systems, volume 7, pages
705–712. MIT Press, 1995.
[8] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. In-
troduction to Algorithms. MIT Press and McGraw-Hill, 1990.
[9] Thomas M. Cover and Joy A. Thomas. Elements of Information The-
ory. Wiley-Interscience, New York, 1991.
[10] Harald Cramér. Mathematical Methods of Statistics. Princeton Uni-
versity Press, Princeton, NJ, 1946.
[11] Pierre-Simon de Laplace. Philosophical essay on probabilities.
Springer-Verlag, New York, 1995. Translated by A.I. Dale from the
fifth French edition, 1825.
[12] F. Demichelis, A. Sboner, M. Barbareschi, and R. Dell’Anna. Tma-
boost: an integrated system for comprehensive management of tissue
microarray data. IEEE Trans Inf Technol Biomed, 10(1):19–27, 2006.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood
estimation from incomplete data via the em algorithm (with discus-
sion). Journal of the Royal Statistical Society Series B, 39:1–38, 1977.
[14] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Clas-
sification. Wiley-Interscience, October 2000.
[15] E. Engvall and P. Perlman. Enzyme-linked immunosorbent assay
(elisa). quantitative assay of immunoglobulin g. Immunochemistry,
8(9):871–874, 1971.
[16] F. Pukelsheim. Optimal Design of Experiments. John Wiley and Sons,
New York, 1993.
[17] V. V. Fedorov. Theory of optimal experiments. Academic Press, New
York, 1972.
[18] I. Ford, C.P. Kitsos, and D.M. Titterington. Recent advances in
nonlinear experimental design. Technometrics, 31(1):49–60, February
1989.
[19] Zoubin Ghahramani and Michael I. Jordan. Supervised learning from
incomplete data via an EM approach. In Jack D. Cowan, Gerald
Tesauro, and Joshua Alspector, editors, Advances in Neural Informa-
tion Processing Systems, volume 6, pages 120–127. Morgan Kaufmann
Publishers, Inc., 1994.
[20] Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. Margin based
feature selection - theory and algorithms. In Proceedings of the
twenty-first international conference on Machine learning (ICML-04),
page 43, New York, NY, USA, 2004. ACM Press.
[21] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P.
Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D.
Bloomfield, and E.S. Lander. Molecular classification of cancer: class
discovery and class prediction by gene expression monitoring. Science,
286(5439):531–537, 1999.
[22] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi Zadeh, edi-
tors. Feature Extraction, Foundations and Applications. Series Studies
in Fuzziness and Soft Computing. Springer, 2006.
[23] Mats Gyllenberg, Timo Koski, Edwin Reilink, and Martin Verlaan.
Non-uniqueness in probabilistic numerical identification of bacteria.
Journal of Applied Probability, 31(2):542–548, June 1994.
[24] Harold Jeffreys. Theory of Probability. Clarendon Press, Oxford, 1948.
[25] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source
scientific tools for Python, 2001–. http://www.scipy.org/.
[26] Kenji Kira and Larry A. Rendell. A practical approach to feature
selection. In Proceedings of the Ninth International Workshop on Ma-
chine Learning (ML-92), pages 249–256, San Francisco, CA, USA,
1992. Morgan Kaufmann Publishers Inc.
[27] R. Kohavi and G. H. John. Wrappers for Feature Subset Selection.
Artificial Intelligence, 97(1-2):273–324, 1997.
[28] J. Kononen, L. Bubendorf, A. Kallioniemi, M. Barlund, P. Schraml,
S. Leighton, J. Torhorst, M. Mihatsch, G. Sauter, and O. P. Kallioniemi.
Tissue microarrays for high-throughput molecular profiling of tumor
specimens. Nature Medicine, 4(7):844–847, 1998.
[29] K. Pelckmans, J. De Brabanter, J. A. K. Suykens, and B. De Moor. Han-
dling missing values in support vector machine classifiers. Neural Net-
works, 18:684–692, 2005.
[30] Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin,
Mario A. T. Figueiredo, and Alexander J. Hartemink. Active learn-
ing of features and labels. In Proceedings of the 22nd International
Conference on Machine Learning (ICML-05), pages 43–50, 2005.
[31] D.V. Lindley. On a measure of the information provided by an experi-
ment. The Annals of Mathematical Statistics, 27(4):986–1005, Decem-
ber 1956.
[32] D.V. Lindley. Bayesian Statistics - A Review. SIAM, Philadelphia,
1972.
[33] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing
Data. John Wiley and Sons, New York, 1987.
[34] D. Lizotte, O. Madani, and R. Greiner. Budgeted learning of naive-
bayes classifiers. In Proceedings of the 19th Annual Conference on
Uncertainty in Artificial Intelligence (UAI-03), pages 378–385, 2003.
[35] D. J. C. MacKay. Information-based objective functions for active
data selection. Neural Computation, 4(4):590–604, 1992.
[36] J.M. Marin, K. Mengersen, and C.P. Robert. Bayesian modelling and
inference on mixtures of distributions, volume 25 of Handbook of Statis-
tics (D. Dey and C.R. Rao eds.). Elsevier-Sciences, November 2005.
[37] Geoffrey McLachlan and David Peel. Finite Mixture Models. Wiley,
1st edition, October 2000.
[38] Prem Melville, Maytal Saar-Tsechansky, Foster Provost, and Ray-
mond Mooney. Active feature-value acquisition for classifier induc-
tion. In Proceedings of the Fourth IEEE International Conference on
Data Mining (ICDM’04), pages 483–486, Washington, DC, USA, 2004.
IEEE Computer Society.
[39] Prem Melville, Maytal Saar-Tsechansky, Foster Provost, and Ray-
mond Mooney. Economical active feature-value acquisition through
expected utility estimation. In Proceedings of the First International
Workshop on Utility-Based Data Mining (KDD 2005), pages 10–16,
Chicago (IL), USA, August 2005.
[40] Donald Michie. Memo functions and machine learning. Nature,
218:19–22, 1968.
[41] Todd K. Moon and Wynn C. Stirling. Mathematical Methods and
Algorithms for Signal Processing. Prentice Hall, Inc., 2000.
[42] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI reposi-
tory of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[43] Peter Norvig. Techniques for automatic memoization with applica-
tions to context-free parsing. Computational Linguistics, 17(1):91–98,
March 1991.
[44] T. E. Oliphant. Python for scientific computing. Computing in Science
& Engineering, 9(3):10–20, 2007.
[45] Emanuele Olivetti, Sriharsha Veeramachaneni, and Paolo Avesani.
Computational Methods of Feature Selection, chapter 5, pages 89–107.
Chapman & Hall / CRC, 2008.
[46] Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang. Always
good turing: Asymptotically optimal probability estimation. Science,
302(5644):427–431, October 2003.
[47] P. Melville and R. Mooney. Diverse ensembles for active learning. In
Proceedings of the 21st International Conference on Machine Learning
(ICML-2004), pages 584–591, 2004.
[48] J.L. Schafer. Analysis of Incomplete Multivariate Data. Number 72
in Monographs on Statistics and Applied Probability. Chapman &
Hall/CRC, 1997.
[49] B. Scholkopf and A.J. Smola. Learning with Kernels. MIT Press,
Cambridge, MA, USA, 2002.
[50] P. Sebastiani and H. P. Wynn. Maximum entropy sampling and opti-
mal Bayesian experimental design. Journal of Royal Statistical Society,
pages 145–157, 2000.
[51] E. Seemuller. Apple proliferation. In Compendium of apple and pear
diseases, pages 67–68. American Phytopathological Society, St. Paul,
Minnesota, USA, 1990.
[52] E. Seemuller and B.C. Kirkpatrick. Detection of phytoplasma infec-
tions in plants. Molecular and Diagnostic Procedures in Mycoplasmol-
ogy, 2:299–311, 1996.
[53] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee.
In Proceedings of the fifth annual workshop on Computational learning
theory (COLT-92), pages 287–294, 1992.
[54] Victor S. Sheng and Charles X. Ling. Feature value acquisition in
testing: A sequential batch test algorithm. In Proceedings of the 23rd
international conference on Machine learning (ICML-06), pages 809–
816, New York, NY, USA, 2006. ACM Press.
[55] Yijun Sun. Iterative relief for feature weighting: Algorithms, theo-
ries, and applications. IEEE Trans. on Pattern Analysis and Machine
Intelligence (TPAMI), 29(6):1035–1051, June 2007.
[56] Kah Kay Sung and Partha Niyogi. Active learning for function approx-
imation. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances
in Neural Information Processing Systems, volume 7, pages 593–600.
The MIT Press, 1995.
[57] S.B. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cestnik, J. Cheng,
K. De Jong, S. Dzeroski, S.E. Fahlman, D. Fisher, R. Hamann,
K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R.S. Michalski,
T. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de Velde,
W. Wenzel, J. Wnek, and J. Zhang. The monk’s problems - a perfor-
mance comparison of different learning algorithms. Technical Report
CS-CMU-91-197, Carnegie Mellon University, December 1991.
[58] S. Tong and D. Koller. Support vector machine active learning with
applications to text classification. In Proceedings of the Seventeenth In-
ternational Conference on Machine Learning (ICML-00), pages 999–
1006, 2000.
[59] S. Veeramachaneni, F. Demichelis, E. Olivetti, and P. Avesani. Active
sampling for knowledge discovery from biomedical data. In Proceeding
of the 9th European Conference on Principles and Practice of Knowl-
edge Discovery in databeses (PKDD-05), pages 343–354, 2005.
[60] Sriharsha Veeramachaneni, Emanuele Olivetti, and Paolo Avesani. Ac-
tive sampling for detecting irrelevant features. In Proceedings of the
23rd international conference on Machine learning (ICML-06), pages
961–968, New York, NY, USA, 2006. ACM Press.
[61] Wikipedia. Number (game), 2007. http://en.wikipedia.org/wiki/
Number.
[62] David Williams. Classification and Data Acquisition with Incomplete
Data. PhD thesis, Duke University, Department of Electrical and
Computer Engineering, May 2006.
[63] David Williams, Xuejun Liao, Balaji Krishnapuram, Ya Xue, and
Lawrence Carin. On incomplete-data classification. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 29(3):427–436,
March 2007.
[64] Bianca Zadrozny. Learning and evaluating classifiers under sample
selection bias. In C.E.Brodley, editor, Proceedings of the 21st Interna-
tional Conference on Machine Learning (ICML-2004), pages 114–121,
2004.
[65] Z. Zheng and B. Padmanabhan. On active learning for data acquisi-
tion. In Proceedings of the International Conference on Datamining
(ICDM-02), pages 562–570, 2002.
[66] Xingquan Zhu and Xindong Wu. Data acquisition with active and
impact-sensitive instance selection. In Proceedings of the 16th Inter-
national Conference on Tools with Artificial Intelligence (ICTAI-04),
pages 721–726, 2004.