
A Probabilistic Framework to Learn from Multiple Annotators with Time-Varying Accuracy

Pinar Donmez, Language Technologies Institute, Carnegie Mellon University, [email protected]
Jaime Carbonell, Language Technologies Institute, Carnegie Mellon University, [email protected]
Jeff Schneider, Robotics Institute, Carnegie Mellon University, [email protected]

Abstract

This paper addresses the challenging problem of learning from multiple annotators whose labeling accuracy (reliability) differs and varies over time. We propose a framework based on Sequential Bayesian Estimation to learn the expected accuracy at each time step while simultaneously deciding which annotators to query for a label in an incremental learning framework. We develop a variant of the particle filtering method that estimates the expected accuracy at every time step by sets of weighted samples and performs sequential Bayes updates. The estimated expected accuracies are then used to decide which annotators should be queried at the next time step. The empirical analysis shows that the proposed method is very effective at predicting the true label using only moderate labeling effort, resulting in cleaner labels for training classifiers. The proposed method significantly outperforms a repeated labeling baseline that queries all labelers per example and takes the majority vote to predict the true label. Moreover, our method is able to track the true accuracy of an annotator quite well in the absence of gold standard labels. These results demonstrate the strength of the proposed method in estimating the time-varying reliability of multiple annotators and producing cleaner, better quality labels without extensive label queries.

1 Introduction

In many data mining applications, obtaining expert labels is a time-consuming and costly process. More importantly, obtaining noise-free or low-noise labels is crucial to building reliable systems and models. Recent advances in crowdsourcing tools (e.g. Amazon Mechanical Turk) provide cheaper, faster ways of acquiring labels for large data collections. Although crowdsourcing does not provide very reliable individual labels, for mostly simple tasks these tools have been shown to produce quality labels on average [12]. Nevertheless, it is still unclear to what extent such tools can be trusted for tasks that require expertise beyond what unvetted average annotators can handle. Furthermore, using human labelers (annotators) is not the only way to acquire labels. For instance, expensive machinery and wet lab experiments are often applied in protein structure prediction. In medical diagnosis, blood or urine tests, certain types of biopsy, etc. are used to diagnose a patient (i.e. to produce diagnosis labels). These procedures range from cheap and partially reliable to more costly and more reliable. Hence, it is essential to use multiple sources to increase confidence while maintaining a trade-off between quality and cost. This requires estimating the accuracy (reliability) of multiple prediction sources and identifying the highly accurate ones to rely on.

A dimension not yet addressed is that in many application areas, providers of information may exhibit varying accuracy over time, and it is therefore useful to track such variation in order to know when and how much to rely upon their answers. For instance, scientific equipment may lose calibration over time, and thus increase measurement error until upgraded, returning to peak performance or better. Text topic labelers may become careless over time due to fatigue effects, but may be refreshed for future labeling sessions or hone their skills to produce increasingly reliable results.

In this paper, we tackle the problem of learning from multiple sources whose accuracy may vary over time. To the best of our knowledge, this is the first attempt in the literature to address this phenomenon. Throughout the paper, we use the terms accuracy, reliability, and quality interchangeably, and likewise the terms labeler, annotator, predictor, and source. We present a new algorithm based on sequential Bayesian estimation that continuously and selectively tracks the accuracy of each labeler and selects the top quality one(s) at each time step. We assume each time step corresponds to a single example to be labeled. This framework also allows aggregating the observed labels to predict the true label for the given example. The experimental analysis shows the effectiveness of our approach for 1) improving the quality (reducing noise) of the labeled data, 2) tracking the accuracy drift of multiple sources in the absence of ground truth, and 3) consistently identifying the low-quality labelers and reducing their effect on learning. We note that we assume the labeler accuracy changes gradually over time without abrupt large shifts. However, we analyze increased rates of change and report their effect on performance. We conclude with high confidence that our method is relatively robust to higher variations.

We organize the rest of the paper as follows. The next section reviews related work in the literature. Section 3 explains in detail the proposed framework, followed by the empirical evaluation in Section 4. Finally, the last section offers our conclusions and future directions.

2 Related Work

Collecting multiple annotations in the absence of gold standard labels is becoming a more common practice in the literature. Raykar et al. [9] propose an EM-based algorithm to estimate the error rate of multiple annotators, assuming conditional independence of the annotator judgments given the true label. Their method iteratively estimates the gold standard, measures the performance of multiple annotators, and updates the gold standard based on the performance measures. Dekel and Shamir [4] offer a solution to identify low-quality or malicious annotators, but their framework is rather limited in the sense that it is based only on Support Vector Machines (SVMs). More importantly, they assume that each annotator is either good or bad, not on a continuous spectrum and not time-varying. Good annotators assign labels based on the marginal distribution of the true label conditioned on the instance, whereas bad annotators provide malicious answers.

There are earlier attempts that address multiple imperfect annotators, but they do not infer the performance of each individual annotator. Sheng et al. [11] and Snow et al. [12] show that it can be effective to use multiple, potentially noisy labels in the absence of gold standard. Sheng et al. [11] take the majority vote to infer a single integrated label. Furthermore, their experimental analysis relies on the assumption that all labelers have the same labeling quality, which is very unlikely to hold in the real world. Snow et al. [12] collected labeled data through Amazon Mechanical Turk for simple natural language understanding tasks. They empirically show high agreement between the existing gold-standard labels and non-expert annotations. However, their analysis is carried out on fairly straightforward tasks and collects a reasonably large number of non-expert labels; it is unclear whether their conclusions would generalize to other tasks that require better-than-average expertise, or to settings with high label acquisition cost that restricts the total number of labelings one can afford.

Our framework differs from this previous body of work in a variety of ways. The previous works all assume that the accuracy (or labeling quality) of the annotators is fixed. Our framework explicitly deals with non-stationary labeling accuracy. We model the accuracy of each annotator as a time-varying unobserved state sequence without directional bias. For instance, an annotator might learn the task over time and get better at labeling. She might also occasionally get tired so that her performance drops, but it increases again after enough rest, and so on. Our framework provides the necessary tools to track this change under the assumption that the maximal degree of the change is known. Another example is the case where the annotators are trained classifiers. As one can imagine, the performance of a classifier improves with more training data, but it may decrease due to noise in the labels or over-fitting. Hence, such classifiers are good examples of annotators with time-varying accuracies.

Another point where the proposed work differs is the identification and selection of the most reliable annotators for each instance to be labeled. Donmez et al. [5] propose a solution that balances the exploration vs. exploitation trade-off, but it is only applicable when the accuracies are fixed over time. The proposed model, in contrast, constantly monitors potential changes in little- or non-explored annotators. It may select previously rejected annotators if there is some chance that their accuracies now exceed those of the already exploited ones. This leads to reliable estimates of changing annotator quality at very modest labeling cost.

3 Estimation Framework

In this section, we describe our estimation and selection framework in detail. We start with the underlying particle filtering algorithm and the specific probabilistic distributions we model. Then, we discuss how to modify this model to estimate the varying accuracy of each labeler and simultaneously select the most accurate ones. Our framework supports a balance between exploration and exploitation to achieve estimation accuracy without extensively exploiting the labelers and incurring the associated costs.

3.1 Sequential Bayesian Estimation Particle filtering is a special case of a more general family of methods called sequential Bayesian estimation. The Bayesian approach to estimating a dynamically changing system is to construct the posterior probability density function (pdf) of the state of the system based on all information observed up to that point. Inference in such a system requires at least two models: a model describing the evolution of the state with time, and a model governing the generation of the noisy observations. Once the state space is probabilistically modeled and the information is updated based on new observations, we have a general Bayesian framework for modeling the dynamics of a changing system.

The problem of estimating the time-varying accuracy of a labeler can be cast into this sequential estimation framework. The states of the system correspond to the unknown time-varying accuracy of the labeler, where φ_t represents the accuracy of the labeler at time t. The observations are the noisy labels z_t output by the labeler according to some probability distribution governed by the corresponding labeling accuracy φ_t. In this problem, it is important to estimate the accuracy every time an observation is obtained. Hence, it is crucial to have an estimate of the accuracy of each labeler to infer a more accurate prediction; equivalently, the estimation update is required at each step to decide which labelers to query next. For problems where an estimate is required every time an observation is made, the sequential filtering approach offers a convenient solution. Such filters alternate between prediction and update stages. The prediction stage predicts the next state given all the past observations. The update stage modifies the predicted prior from the previous time step to obtain the posterior probability density of the state at the current time.

In this paper, we design this dynamic system with the following probabilistic models. Let φ_t denote the labeling accuracy, which is assumed to change according to the following model:

$$\phi_t = f_t(\phi_{t-1}, \Delta_{t-1}) = \phi_{t-1} + \Delta_{t-1} \qquad (3.1)$$

where Δ_t is a zero-mean, σ²-variance Gaussian random variable. f_t encodes that the accuracy at the current time step differs from the previous accuracy by a small amount drawn from a Gaussian distribution. The mean of the Gaussian is assumed to be zero so as not to introduce any bias towards an increase or decrease in accuracy. We could easily add a bias, e.g. a positive mean μ for labelers who are expected to improve with experience; however, we prefer fewer parameters, and we can still track improving sources (more details in Section 4). We assume σ, or at least an upper bound on it, is known. We realize that the exact value of σ can be difficult to obtain in many situations. On the other hand, it is often reasonable to assume that the maximum rate of change of the labeling accuracy is known. We include an analysis testing the sensitivity to the exact σ in Section 4. For notational simplicity and concreteness, we focus on binary classification here; extensions to the multi-category case are straightforward generalizations of this scheme. Furthermore, the accuracy at any time t is assumed to lie between 0.5 and 1, i.e. any labeler is a weak learner, which is a standard assumption. Thus, the transition probability from one state to the next follows a truncated Gaussian distribution:

$$p(\phi_t \mid \phi_{t-1}, \sigma, 0.5, 1) = \frac{\frac{1}{\sigma}\,\beta\!\left(\frac{\phi_t - \phi_{t-1}}{\sigma}\right)}{\Phi\!\left(\frac{1 - \phi_{t-1}}{\sigma}\right) - \Phi\!\left(\frac{0.5 - \phi_{t-1}}{\sigma}\right)} \qquad (3.2)$$

where β and Φ are the pdf and cdf of the standard Gaussian distribution, respectively.
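To make the transition model concrete, here is a minimal Python sketch of (3.1)-(3.2) using SciPy's truncnorm; the function names are ours, not part of the paper:

```python
from scipy.stats import truncnorm

def transition_sample(phi_prev, sigma, rng=None):
    """Draw phi_t ~ p(phi_t | phi_{t-1}) from the bounded random walk of
    Eqs. (3.1)-(3.2): a Gaussian step truncated to the interval (0.5, 1)."""
    # truncnorm expects the bounds standardized by loc and scale
    a = (0.5 - phi_prev) / sigma
    b = (1.0 - phi_prev) / sigma
    return truncnorm.rvs(a, b, loc=phi_prev, scale=sigma, random_state=rng)

def transition_pdf(phi_t, phi_prev, sigma):
    """Evaluate the truncated-Gaussian transition density of Eq. (3.2)."""
    a = (0.5 - phi_prev) / sigma
    b = (1.0 - phi_prev) / sigma
    return truncnorm.pdf(phi_t, a, b, loc=phi_prev, scale=sigma)
```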

The observed variables z^j_{1:t} are the sequentially arriving noisy labels generated by labeler j at the corresponding time steps. We model the noisy label generation with a Bernoulli distribution given the true label y. In other words, the noisy label z_t is a random variable whose probability conditioned on the accuracy φ_t at time t and the true label y_t is

$$p(z^j_t \mid \phi^j_t, y_t) = (\phi^j_t)^{I(z^j_t = y_t)}\,(1 - \phi^j_t)^{I(z^j_t \neq y_t)}. \qquad (3.3)$$

Equation (3.3) requires y_t, which is unknown. We use the wisdom-of-the-crowds trick to predict y_t. Specifically, we estimate the probability of the true label y_t conditioned on the noisy labels observed from all the other labelers:

$$p(z^j_t \mid \phi^j_t, z^{J(t)}_t) = \sum_{y \in Y} p(z^j_t \mid \phi^j_t, y_t = y)\, P(y_t = y \mid z^{J(t)}_t) \qquad (3.4)$$

Here J(t) is the set of selected labelers at time t such that j ∉ J(t); hence z^{J(t)}_t = {z^s_t | s ∈ J(t)}. We compute P(y_t = y | z^{J(t)}_t) using the estimated expected accuracy of the selected labelers and the estimated P_{t−1}(y) from the previous time step, assuming conditional independence:

$$P(y_t = y \mid z^{J(t)}_t) = \frac{P_{t-1}(y) \prod_{s \in J(t)} p_{E[\phi^s_{t-1}]}(z^s_t \mid y)}{\sum_{y' \in Y} P_{t-1}(y') \prod_{s \in J(t)} p_{E[\phi^s_{t-1}]}(z^s_t \mid y')} \qquad (3.5)$$
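A minimal sketch of the computations in (3.4) and (3.5) for binary labels follows; the function names and data layout are our own illustration:

```python
import numpy as np

def label_posterior(z_committee, exp_acc, prior_y1):
    """Posterior P(y_t = y | z^{J(t)}_t) of Eq. (3.5) for binary y in {0, 1}.
    z_committee: observed labels from the selected labelers J(t)
    exp_acc:     their estimated expected accuracies E[phi^s_{t-1}]
    prior_y1:    running estimate P_{t-1}(y = 1)"""
    post = np.array([1.0 - prior_y1, prior_y1])
    for z, acc in zip(z_committee, exp_acc):
        # Bernoulli likelihood of Eq. (3.3): acc if the label agrees with y
        post *= np.array([acc if z == y else 1.0 - acc for y in (0, 1)])
    return post / post.sum()

def obs_likelihood(z_j, phi_j, z_committee, exp_acc, prior_y1):
    """Mixture likelihood p(z^j_t | phi^j_t, z^{J(t)}_t) of Eq. (3.4)."""
    post = label_posterior(z_committee, exp_acc, prior_y1)
    return sum((phi_j if z_j == y else 1.0 - phi_j) * post[y] for y in (0, 1))
```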

Figure 1: The Hidden Markov Model structure for the time-varying accuracy of a labeler (a chain ... → φ_{t−1} → φ_t → φ_{t+1} → ... with emissions z_{t−1}, z_t, z_{t+1}). The state sequence φ_t is an unobserved first-order Markov process. The observation z_t depends only on the current state φ_t, for all t = 1, 2, . . .

We describe how to estimate E[φ^s_{t−1}] and P(y) in the next section. Here we focus on the state space model and how it leads to an estimate of the posterior state distribution p(φ_t | z_{1:t}) at any given time t.

The true state sequence φ_t is modeled as an unobserved Markov process that follows a first-order Markov assumption. In other words, the current state is independent of all earlier states given the immediately previous state:

$$p(\phi_t \mid \phi_0, \ldots, \phi_{t-1}) = p(\phi_t \mid \phi_{t-1}) \qquad (3.6)$$

Similarly, the observation z_t depends only on the current state φ_t and is conditionally independent of all the previous states given the current state:

$$p(z_t \mid \phi_0, \ldots, \phi_t) = p(z_t \mid \phi_t) \qquad (3.7)$$

Figure 1 shows the graphical structure of this model. As noted earlier, we are interested in constructing p(φ_t | z_{1:t}). It is generally assumed that the initial state distribution p(φ_0) is given. However, in this paper we do not make this assumption and simply adopt a uniform (noninformative) prior for p(φ_0) on 0.5 < φ_0 < 1.¹ Suppose that the state distribution p(φ_{t−1} | z_{1:t−1}) at time t − 1 is available. The prediction stage obtains the pdf of the state at time step t via the Chapman-Kolmogorov equation [1]:

$$p(\phi_t \mid z_{1:t-1}) = \int p(\phi_t \mid \phi_{t-1})\, p(\phi_{t-1} \mid z_{1:t-1})\, d\phi_{t-1} \qquad (3.8)$$

¹The results show that the estimation is very effective despite the uninformative prior (see Section 4 for more details). However, in cases where such information is available, we anticipate that the results could be improved even further.

where p(φ_t | φ_{t−1}) is substituted with (3.2). Once a new observation z_t is made, (3.8) is used as a prior to obtain the posterior p(φ_t | z_{1:t}) at the update stage via Bayes' rule:

$$p(\phi_t \mid z_{1:t}) = \frac{p(z_t \mid \phi_t)\, p(\phi_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})} \qquad (3.9)$$

where p(z_t | φ_t) is substituted with (3.4) and p(z_t | z_{1:t−1}) = ∫ p(z_t | φ_t) p(φ_t | z_{1:t−1}) dφ_t. When the posterior (3.9) at every time step t is assumed to be Gaussian, Kalman filters provide the optimal solution to the state density estimation. In cases like ours, where the posterior density is not Gaussian, sequential particle filtering methods are appropriate for approximating the optimal solution. Next, we review the particle filtering algorithm and describe how it is modified to tackle our problem.

3.2 Particle Filtering for Estimating Time-Varying Labeler Accuracy The particle filtering algorithm is a technique for implementing sequential Bayesian estimation by Monte Carlo simulation. The underlying idea is to approximate the required posterior density function with a discrete approximation using a set of random samples (particles) and associated weights:

$$p(\phi_t \mid z_{1:t}) \approx \sum_{i=1}^{N} w^i_t\, \delta(\phi_t - \phi^i_t) \qquad (3.10)$$

where N is the number of particles. As N increases, it can be shown that this approximation approaches the true posterior density [7]. The w^i_t are called the normalized importance weights, and δ is the Dirac delta function.² The particle filtering algorithm estimates these weights in a sequential manner. The weights in (3.10) are approximated as

$$\tilde{w}^i_t \approx \frac{p(\phi^i_{1:t} \mid z_{1:t})}{\pi(\phi^i_{1:t} \mid z_{1:t})}, \qquad w^i_t = \frac{\tilde{w}^i_t}{\sum_{j=1}^{N} \tilde{w}^j_t} \qquad (3.11)$$

where π is called the importance density, from which the samples φ^i are generated. We use an importance density because in general we cannot sample directly from the posterior p(φ_{1:t} | z_{1:t}); instead we sample from a density that is easy to draw from.

²δ(x) = 0 for x ≠ 0, and ∫_{−∞}^{∞} δ(x) dx = 1.

Figure 2: Pseudo-code for the basic filtering algorithm.

1. Sample from the initial distribution φ^i_0 ∼ p(φ_0) for i = 1, . . . , N and assign weights w^i_0 = 1/N.

2. For t > 0 and i = 1, . . . , N:
   - Draw φ^i_t ∼ p(φ_t | φ_{t−1}) using (3.2) and update the weights w̃^i_t = p(z_t | φ^i_t) w^i_{t−1}.
   - Normalize the weights w^i_t = w̃^i_t / Σ_{j=1}^N w̃^j_t and compute N_e = 1 / Σ_{i=1}^N (w^i_t)².
   - If N_e < T, resample φ^i_t ∼ pmf[{w_t}] and reassign w^i_t = 1/N.
   - Update the posterior state density p(φ_t | z_{1:t}) = Σ_{i=1}^N w^i_t δ(φ_t − φ^i_t).

3. Update t = t + 1 and go to step 2.
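A compact Python rendering of Figure 2, reusing transition_sample from the earlier sketch; the helper names and the effective-sample-size handling are our own choices, not the authors' code:

```python
import numpy as np

def init_particles(n, rng):
    """Step 1: uniform (noninformative) prior on (0.5, 1), equal weights."""
    return rng.uniform(0.5, 1.0, size=n), np.full(n, 1.0 / n)

def filter_step(particles, weights, z_t, obs_lik, sigma, rng, ess_threshold=None):
    """Step 2 of Figure 2: predict, reweight, and (rarely) resample.
    obs_lik(z, phi) should return p(z_t | phi_t), e.g. obs_likelihood above
    with the committee arguments bound via functools.partial."""
    n = len(particles)
    # Prediction: propagate each particle through the transition (3.2)
    particles = np.array([transition_sample(p, sigma, rng) for p in particles])
    # Update: reweight by the observation likelihood, then normalize
    weights = weights * np.array([obs_lik(z_t, p) for p in particles])
    weights = weights / weights.sum()
    # Resample only if the effective sample size N_e collapses
    if ess_threshold is not None and 1.0 / np.sum(weights ** 2) < ess_threshold:
        particles = particles[rng.choice(n, size=n, p=weights)]
        weights = np.full(n, 1.0 / n)
    return particles, weights
```

For example, particles, weights = init_particles(1000, np.random.default_rng(0)) initializes the filter, after which filter_step is applied once per observation.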

The algorithm sequentially samples φ^i_t, i = 1, . . . , N, from the importance density and updates the weights w^i_t to approximate the posterior state distribution. The choice of the importance density is important; one commonly used and convenient choice is the state transition density p(φ_t | φ_{t−1}) (in our case given in (3.2)), i.e. π(φ_t | φ^i_{1:t−1}, z_{1:t}) = p(φ_t | φ^i_{t−1}). Then the weight update equation simplifies to

$$\begin{aligned}
w^i_t &\approx \frac{p(z_t \mid \phi^i_t)\, p(\phi^i_t \mid \phi^i_{t-1})\, p(\phi^i_{1:t-1} \mid z_{1:t-1})}{\pi(\phi^i_t \mid \phi^i_{1:t-1}, z_{1:t})\, \pi(\phi^i_{1:t-1} \mid z_{1:t-1})} \\
&= \frac{p(z_t \mid \phi^i_t)\, p(\phi^i_t \mid \phi^i_{t-1})}{\pi(\phi^i_t \mid \phi^i_{1:t-1}, z_{1:t})}\, w^i_{t-1} \\
&= p(z_t \mid \phi^i_t)\, w^i_{t-1} \qquad \text{for } t > 1 \qquad (3.12)
\end{aligned}$$

where p(z_t | φ^i_t) is substituted with (3.4). The pseudo-code of the basic filtering algorithm is given in Figure 2. The resampling step in Figure 2 happens only if the vast majority of the samples have negligible weights, in other words, if the estimated effective sample size N_e falls below some predefined threshold. The exact value of the threshold is not crucial; the idea is to detect significantly small weights. This is called the degeneracy problem in particle filters, and it is resolved by resampling using a probability mass function (pmf) over the weights. The goal of resampling is to eliminate particles with small weights and concentrate on those with large ones. It has been shown that this resampling procedure can be implemented in O(N) time using order statistics [10, 3]. We do not provide more details on this since it is outside the scope of the paper, but we note that in our simulations the algorithm hardly ever needs to resample, largely because the importance density we adopted (3.2) represents the state transitions well.
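For illustration, one common linear-time scheme is systematic resampling; the sketch below is in that spirit and is not necessarily the exact procedure of [10, 3]:

```python
import numpy as np

def systematic_resample(weights, rng):
    """O(N) systematic resampling: returns indices of the surviving particles.
    One uniform draw is stratified across n evenly spaced positions."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    cum = np.cumsum(weights)
    cum[-1] = 1.0  # guard against floating-point drift
    return np.searchsorted(cum, positions)
```

Usage: particles = particles[systematic_resample(weights, rng)], after which the weights are reset to 1/N.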

3.3 Particle Filtering for Labeler Selection So far, we have focused on estimating the posterior state distribution at any given time t. Our problem, however, involves multiple labelers (annotators), where each labeler's time-varying accuracy is modeled as a separate state sequence. Hence, at any time t we may have multiple noisy observations (labels) from multiple labelers. In this section, we describe how to select which labelers to query for labels, and how to utilize these noisy labels to predict the true label and hence estimate the marginal label distribution P(y).

Our aim is to select the potentially most accurate labelers at each time step. The sequential estimation framework provides us with an effective tool to achieve this goal. First, we approximate the prior pdf of φ^j_t (3.8) for every j by its discrete version

$$p(\phi^j_t \mid Z^j_{t-h(j)}) = \sum_{i=1}^{N} p(\phi^j_t \mid \phi^{j,i}_{t-h(j)})\, p(\phi^{j,i}_{t-h(j)} \mid Z^j_{t-h(j)}) \qquad (3.13)$$

where t − h(j) denotes the last time labeler j was selected, e.g. t − h(j) = t − 1 if the labeler was queried at the immediately previous time step. Z^j_{t−h(j)} denotes all the observations obtained from labeler j up to time t − h(j), and {φ^{j,i}_{t−h(j)}}_{i=1}^N denotes the sampled accuracy values for the j-th labeler at time t − h(j). Furthermore, we formulate the change in labeling accuracy as a Gaussian random walk bounded between 0.5 and 1. Importantly, the true accuracy of a labeler keeps evolving according to this random walk regardless of whether it is selected by our method. For computational expediency, we approximate the state transition probability for an unexplored labeler by truncating the Gaussian distribution once after h(j) steps instead of after every single step. More formally, recall (3.1), where the step size Δ_t is drawn from a Gaussian Δ_t ∼ N(0, σ²). Hence,

$$\phi^j_t = \phi^j_{t-1} + \Delta_{t-1} = \phi^j_{t-2} + \Delta_{t-2} + \Delta_{t-1} = \cdots = \phi^j_{t-h(j)} + \sum_{r=1}^{h(j)} \Delta_{t-r}$$

$$\phi^j_t \sim \mathcal{N}\big(\phi^j_t \mid \phi^j_{t-h(j)},\, h(j)\,\sigma^2,\, 0.5 < \phi^j_t < 1\big) \qquad (3.14)$$

Equation (3.14) captures our intuition that the longer a labeler j goes unexplored (h(j) increases), the further our belief about its accuracy diverges from the last time it was estimated (the variance h(j)σ² increases).
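A sketch of this gap-aware prediction step, again via SciPy's truncnorm (names are ours); note that the paper approximates the h-step walk by a single truncated Gaussian with variance h·σ²:

```python
import numpy as np
from scipy.stats import truncnorm

def predict_after_gap(phi_last, h, sigma, rng, size=1):
    """Sample phi^j_t via Eq. (3.14): after h(j) unobserved steps, the bounded
    random walk is approximated by one truncated Gaussian of variance h*sigma^2."""
    scale = np.sqrt(h) * sigma
    a = (0.5 - phi_last) / scale
    b = (1.0 - phi_last) / scale
    return truncnorm.rvs(a, b, loc=phi_last, scale=scale,
                         size=size, random_state=rng)
```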

Next, we draw samples {φ^{j,i}_t}_{i=1}^N from the distribution given above in (3.14). We weight these samples by their corresponding predicted probability (shown in (3.13)) given what has been observed up to time t:

$$b^{j,i}_t = p(\phi^{j,i}_t \mid Z^j_{t-h(j)})\, \phi^{j,i}_t \qquad (3.15)$$

For each j, we sort the weighted samples b^{j,i}_t in ascending order and find their 95th percentile. This represents the upper bound of the 95% confidence interval for labeler j. We then apply the IEThresh method from [5] by selecting all labelers whose 95th percentile is higher than a predefined threshold T. We denote the set of selected labelers at time t by J(t). This selection takes the cumulative variance into account and selects the labelers with potentially high accuracies. We tuned the parameter T on a separate dataset not reported in this paper. It can be further adjusted according to the budget allocated to label acquisition, or to prevent unnecessary sampling, especially when the number of labelers is large.
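A minimal sketch of this selection rule (the data layout is our own illustration):

```python
import numpy as np

def select_labelers(weighted_samples, threshold):
    """Form the committee J(t): labelers whose 95th percentile of the
    weighted accuracy samples b^{j,i}_t (Eq. 3.15) exceeds the threshold T,
    following the IEThresh-style rule described above.
    weighted_samples: dict mapping labeler id -> array of b^{j,i}_t values."""
    return [j for j, b in weighted_samples.items()
            if np.percentile(b, 95) > threshold]
```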

After the committee J(t) of selected labelers is identified, we observe their outputs z^j_t, update the weights w^{j,i}_t using (3.12), and update the posterior p(φ^j_t | Z^j_t) using (3.10). Relying on the posterior accuracy distribution, we compute the estimated expected accuracy as follows:

$$E_{p(\phi^j_t \mid Z^j_t)}[\phi^j_t] = \sum_{i=1}^{N} p(\phi^{j,i}_t \mid Z^j_t)\, \phi^{j,i}_t \qquad (3.16)$$
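In code, this expectation is simply a weighted average over the particle set (a one-line sketch with our own names):

```python
import numpy as np

def expected_accuracy(particles, weights):
    """Posterior mean E[phi^j_t] of Eq. (3.16) over the weighted particle set."""
    return float(np.dot(weights, particles))
```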

Finally, we integrate the observed noisy labels from the selected committee to predict the true label using a MAP estimate under the estimated posterior distribution:

$$\begin{aligned}
y^{map}_t &= \arg\max_{y \in Y} P(y \mid Z^{J(t)}_t) \\
&= \arg\max_{y \in Y} P_{t-1}(y) \prod_{j \in J(t)} p_{E[\phi^j_t]}(z^j_t \mid y) \\
&= \arg\max_{y \in Y} P_{t-1}(y) \prod_{j \in J(t)} E[\phi^j_t]^{I(z^j_t = y)}\, (1 - E[\phi^j_t])^{I(z^j_t \neq y)} \qquad (3.17)
\end{aligned}$$

We then use the predicted labels to update the marginal label distribution:

$$P_t(y = 1) = \frac{1}{t} \sum_{s=1}^{t} y^{map}_s \qquad (3.18)$$

Figure 3: Outline of the SFilter algorithm.

1. Sample from the initial distribution φ^i_0 ∼ p(φ_0) for i = 1, . . . , N and assign weights w^i_0 = 1/N.

2. For t = 1, initialize P_1(y) = 0.5 and run a single iteration of the basic filtering algorithm (Figure 2) for j = 1, . . . , K.

3. Initialize h(j) = 1 and Z^j_1 = {z^j_1} for all j = 1, . . . , K.

4. For t > 1 and i = 1, . . . , N:
   - Compute P_t(y) according to (3.18).
   - Draw samples φ^{j,i}_t using (3.13) and estimate the expected accuracy via (3.16) for j = 1, . . . , K.
   - Select the labelers J(t) to be queried.
   - For j ∈ J(t):
     – Update the weights w̃^{j,i}_t = p(z^j_t | φ^{j,i}_t) w^{j,i}_{t−h(j)}.
     – Normalize the weights w^{j,i}_t = w̃^{j,i}_t / Σ_{l=1}^N w̃^{j,l}_t and compute N^j_e = 1 / Σ_{i=1}^N (w^{j,i}_t)².
     – If N^j_e is too small, resample φ^{j,i}_t ∼ pmf[{w^j_t}] and reassign w^{j,i}_t = 1/N.
     – Record that labeler j was last selected at time t (so the next gap h(j) is measured from t), and update Z^j_t = Z^j_{t−h(j)} ∪ {z^j_t}.
     – Update the posterior state density p(φ^j_t | Z^j_t) = Σ_{i=1}^N w^{j,i}_t δ(φ^j_t − φ^{j,i}_t).

5. Update t = t + 1 and go to step 4.

where y^{map}_s ∈ {0, 1}. Initially, we start with a uniform marginal label distribution.

See Figure 3 for an outline of our algorithm, which we name SFilter.
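Tying the pieces together, here is a minimal sketch of the label-integration step (3.17) and the marginal update (3.18), reusing label_posterior from the sketch in Section 3.1; names are ours:

```python
import numpy as np

def integrate_labels(z_committee, exp_acc, prior_y1):
    """MAP label y_t^map of Eq. (3.17), via the posterior of Eq. (3.5)."""
    return int(np.argmax(label_posterior(z_committee, exp_acc, prior_y1)))

def update_marginal(y_map_history):
    """Running estimate P_t(y = 1) of Eq. (3.18): the mean of past MAP labels."""
    return float(np.mean(y_map_history))
```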

4 Evaluation

In this section, we describe the experimental evaluation and offer our observations. We have tested SFilter from many different perspectives. The labelers are simulated in such a way that they have different initial labeling accuracies but share the same σ. Furthermore, each instance that we seek to label corresponds to a single time step; hence, the accuracy transitions to the next state at every instance. These conditions hold for all experiments unless otherwise noted.

Table 1: Performance of SFilter w.r.t. increasing σ values.

    σ      %-correct   # queries   # instances
    0.02   0.858       975         500
    0.04   0.846       1152        500
    0.06   0.834       1368        500

Table 2: Robustness analysis of SFilter against small perturbations in the estimated vs. true σ, when the true σ equals 0.02. The analysis is performed over 500 instances drawn from P(y = 1) = 0.75.

    σ       %-correct   # queries
    0.02    0.858       975
    0.022   0.854       1015
    0.018   0.854       1027
    0.016   0.845       1031
    0.024   0.848       1082

First, we conducted an evaluation to test how the proposed model behaves with respect to various degrees of change in accuracy. The degree of this change depends on the variance of the Gaussian distribution

that controls the width of the step size in (3.1). We analyzed how σ affects the estimation process in terms of the overall quality of the integrated labels and the total labeling effort. We simulated 10 labelers with varying initial accuracies but the same σ, over 500 instances. The accuracy of each labeler evolves according to (3.2), and every selected labeler j ∈ J(t) outputs a noisy label z^j_t. We integrate these labels to predict the true label via (3.17). Table 1 shows the ratio of correct labels predicted and the number of total labeler queries for each σ. %-correct represents the ratio of correct labelings among all m = 500 instances, i.e.

$$\%\text{-correct} = \frac{\sum_{t=1}^{m} I(y^{map}_t = y^{true}_t)}{m}$$

where y^{true}_t is drawn from P(y = 1) = 0.75. Among the 10 labelers the highest average accuracy is 0.81, whereas the percentage of correct labels integrated by SFilter is significantly higher, i.e. 85%. As the variance increases, the total number of queries also increases and there is a slight decay in the ratio of correct labelings. The relatively rapid change of labeler qualities leads to more extensive exploration to catch any drifts, without compromising overall performance too much. We also tested the robustness of the proposed algorithm against small perturbations in σ. We repeated the above analysis using σ = 0.02 to generate the accuracy drift, but executed SFilter with slightly different σ values in order to test perturbational robustness. Table 2 indicates that SFilter is robust to small noise in σ. The %-correct remains relatively stable as we add noise to the true σ, while the total amount of labeling is only slightly larger.

Table 3: Performance of SFilter for different labeler quality distributions. Uniform denotes that the labelers' initial accuracies are uniformly distributed. Skewed denotes that there are only a few good labelers while the majority are highly unreliable.

    distribution   %-correct   # queries   P(y = 1)
    uniform        0.858       975         0.768
    skewed         0.855       918         0.773

Next, we analyzed how variations among the individual labeler qualities affect the estimation and selection. The first experiment uses the same committee of labelers (k = 10) as above, whose qualities are uniformly distributed between 0.5 and 1. The second experiment uses a committee with only a few good labelers (only 3 labelers with initial accuracy above 0.8 and the rest below 0.65). We used a fixed σ = 0.02 for both cases. Table 3 shows the effectiveness of SFilter in making correct label predictions with moderate labeling effort in both cases. The last column in Table 3 shows the average estimated marginal label distribution P(y = 1), where the true marginal is P(y = 1) = 0.75. Our algorithm is very effective at estimating the unknown label distribution and at exploiting the labelers only when they are highly reliable, as shown in Figure 4. Figure 4 displays the true source accuracy over time, marking the times the source is selected by SFilter (red circles) and the times it is not (blue circles). We only report a representative subset of the 10 total sources. Note that SFilter selects the highly reliable sources in this skewed set at the beginning. When these sources become less reliable, SFilter finds other sources that have become more favorable over time (i.e. the lower left panel of Figure 4). Moreover, sources that have relatively low quality throughout the entire time are hardly ever selected by SFilter, except for a few exploration attempts. This is a desirable behaviour of the algorithm and certainly reduces the labeling effort without hurting overall performance, even with skewed distributions of source accuracies.

We further challenged SFilter with a pathological example to test its robustness to how the labelers' accuracies change. We simulated k = 10 labelers with initial accuracies ranging from 0.51 to 0.94. All the labelers except the best (0.94 accuracy) and the worst (0.51 accuracy) follow a random walk with the step size drawn from N(0, σ²), where σ = 0.02. The best labeler consistently gets worse and the worst consistently improves, both staying within the 0.5 to 1 range. As before, P(y = 1) = 0.75. Figure 5 displays the accuracy change of these two labelers over time.

Figure 4: True source accuracy over time (four representative sources; red circles mark time steps where the source is selected, blue circles where it is not). SFilter selects the sources when they are relatively highly accurate, even in the skewed case where there are only a few good sources to begin with. SFilter also detects when the initially bad sources become more favorable later and begins exploiting them.

The red dots indicate the times the corresponding labeler is selected, and the blue dots indicate otherwise. Clearly, SFilter is able to detect the consistent change in the accuracies. We also note that the most frequently chosen labelers are the ones with relatively high average accuracies. There are occasional exploration attempts at low accuracies, but the algorithm detects this and is able to recover quickly. As a result, the algorithm chooses any undesirable labeler(s) with very low frequency. Note that there are times when none of the plotted labelers is selected; this corresponds to situations where the remaining labelers are selected instead.

Additionally, we tested the ability of SFilter to track the true accuracy. We ran SFilter using 4 labelers with rate of change σ = 0.01 on 1000 instances drawn from a true label distribution of P(y = 1) = 0.8. We used 5000 particles, and we constrained SFilter to choose all 4 labelers per instance (only for this experiment) so as to monitor the true and the estimated accuracy of each labeler at every time step. Figure 6 demonstrates how the estimate of the marginal label distribution P(y = 1) improves over time. Figure 7 compares the true and the estimated accuracy of each labeler.

Figure 6: SFilter's estimate of the true marginal label distribution P(y = 1) = 0.8 improves over time; in fact, it converges to the true distribution with more samples.

The dotted blue line in each graph corresponds to the estimated expected accuracy, while the solid red line corresponds to the true accuracy.

Figure 5: SFilter is able to detect the consistent increase or decrease in any labeler's quality and adapts quickly, choosing the labelers when they are reliable, though with occasional exploration attempts.

The black line in each graph shows the result of using maximum likelihood estimation to infer the accuracy under the assumption that it is stationary. This inference technique, proposed in [6], shows very promising results in various test scenarios; however, it is designed to estimate stationary source accuracies. SFilter, on the other hand, manages to track the trends in the true accuracy quite well, sometimes with a lag. This is remarkable since neither the initial qualities nor the true label distribution are known in advance.

Next, we tested SFilter's effectiveness at creating high quality labeled data for training a classifier. We took 4 benchmark datasets from the UCI repository [2]. We simulated k = 10 different labelers, executed SFilter on each dataset with these labelers, and integrated their output into a final labeling. We then trained a logistic regression classifier on this labeled set and tested the resulting classifier on separate held-out data. The gold standard labels for the held-out set are known but are used solely to test the classifier, not to adjust the estimation model in any way. We compared the proposed approach with two strong baselines. One is majority voting (denoted Majority), which queries all labelers per instance and assigns the majority vote as the integrated label. The other is a technique proposed in [5] (denoted IEThresh), which selects a subset of labelers based on an interval estimation procedure and takes the majority vote of only the selected labelers. We report the accuracy on the held-out set of the classifier trained on the data labeled by each method, as a function of the number of queries. The corresponding number of labeled examples differs across methods since each method uses a different amount of labeling per instance; e.g. majority voting uses all annotators to label each instance. The total training size is fixed at 500 examples for each dataset. The other properties of the datasets are given in Table 4.³

Table 4: Properties of the UCI datasets. All are binary classification problems.

    dataset    test size   dimension   +/- ratio
    phoneme    3782        5           0.41
    spambase   3221        57          0.65
    ada        4562        48          0.33
    image      1617        18          1.33

Figure 8 displays the performance of our algorithm

SFilter and the two baselines on the four UCI datasets. The initial labeler accuracies range from as low as 0.53 to as high as 0.82. SFilter is significantly superior over the entire operating range, i.e. from small to large numbers of labeler queries. The differences are statistically significant based on a two-sided paired t-test (p < 0.001). The horizontal black line in each plot represents the test accuracy of a classifier trained with the gold standard labels. SFilter eventually reaches the gold standard level, suggesting that it generates clean training data with minimal noise. Majority voting is the worst performer since it uses all labelers for every single instance and thus makes a larger number of labeling requests even for a small number of instances. IEThresh is the second best performer after SFilter. However, neither baseline takes the time-varying accuracy into account, and both waste labeling effort on low quality labelers. Our method adapts to the change in labeler accuracies and filters out the unreliable ones, leading to more effective estimation.

³'ada' is a different version of the 'adult' dataset in the UCI repository. This version was generated for the agnostic learning workshop at IJCNN '07, available at http://clopinet.com/isabelle/Projects/agnostic/index.html

Figure 7: How SFilter tracks the true labeler accuracy. Each panel corresponds to a different labeler (predictors 1-4). The solid red line indicates the true accuracy of the labeler, the dotted blue line shows the expected accuracy estimated by SFilter, and the black line shows the stationary MLE baseline. This is the result of a single run, though highly typical. SFilter is able to track the tendency of the true accuracy quite well, with occasional temporal lags.

Finally, we are interested in testing the proposed framework in a different setting. It is often problematic to estimate the error rate (or accuracy) of a classifier in the absence of gold standard labels. Consider classifiers that are re-trained (or modified) as additional instances become available. This situation arises in a number of real-life scenarios. Spam filters need to be re-trained in an online fashion as new emails arrive. Web pages and blogs are other good examples of constantly growing databases: any web page classifier needs to be modified to accommodate newly created web pages. Another example is autonomous agents that learn continuously from their environments, whose performance changes with new discoveries about their surroundings. However, such learners might not always improve with additional data. The new stream of data might be noisy, generating confusion for the learners and causing a performance drop. Hence, capturing the time-varying quality of a learner, while maintaining flexibility in terms of the direction of the change, becomes crucial for adapting to this non-stationary environment.

To study this phenomenon, we evaluated the performance of SFilter on a face recognition dataset introduced in [8]. The total size of the face data is 2500 instances with 400 dimensions and P(y = face) = 0.66. We randomly divided the whole dataset into 3 mutually exclusive subsets. One contains 100 labeled images to train the classifiers. Another contains 500 unlabeled images to test the classifiers. The last subset is held out for validation. We trained 5 linear SVM classifiers as follows. For each classifier, we used a separate randomly chosen subset of the 100 training images as the initial training set, and we systematically increased each training set size to 100, adding random label noise so that the classifiers do not necessarily improve with more data. The average accuracies of the classifiers are {0.70, 0.73, 0.78, 0.83, 0.84}.

Figure 8: Measuring predicted label quality by training and testing a classifier on the 4 UCI datasets (ada, image, phoneme, spambase) for all three methods. The y-axis indicates the accuracy of the classifier on the test set, and the x-axis indicates the number of labeler queries. The horizontal line shows the performance of a classifier trained on the gold standard labels.

We executed SFilter, IEThresh, and Majority voting on the 500 test images to predict their labels. We then trained another classifier on the labels predicted by each method. Figure 9 compares the performance of the three methods. SFilter significantly outperforms the other baselines on this task and reaches a level of performance close to that of using gold standard labels. Again, the gold labels are used solely to measure the performance of the classifiers, not for learning.

5 Discussion and Future Work

This paper addresses the challenging problem of estimating the time-varying accuracy of multiple sources and selecting the most reliable ones to query. The proposed framework constantly monitors and estimates the expected behaviour of each predictor while simultaneously choosing the best possible predictors for labeling each instance. The framework relies on a Hidden Markov Model (HMM) structure to represent the unknown sequence of changing accuracies and the corresponding observed predictor outputs. We adopt Bayesian particle filtering to perform inference in this dynamic system. Particle filters estimate the state distribution by sets of weighted samples and perform Bayes filter updates according to a sequential sampling procedure. The key advantage of particle filters is their capacity to represent arbitrary probability densities, whereas other Bayesian filters such as Kalman filters can only handle unimodal distributions and hence are not well suited to our problem. Furthermore, we have proposed a variant of the basic filtering method to infer the expected accuracy of each predictor and use it as a guide to decide which ones to query at each time step. The proposed approach, SFilter, chooses the potentially good predictors based on their expected accuracies at each time step. In general, high quality labelers are queried more frequently than low-quality labelers. Moreover, SFilter reliably tracks the true labeling accuracy over time.

Figure 9: Results using 5 actual classifiers as labelers. The performance of each classifier varies over time. Their outputs on a test set are used to predict the true labels. A meta-classifier is trained using these labels and tested on a held-out set for performance evaluation. The predicted labels obtained by SFilter are significantly more effective at improving the data quality.

The effectiveness of the proposed framework is demonstrated by thorough evaluation. For instance, we have shown that our data labeling process is robust to the choice of σ, which governs the rate of accuracy change. We have further shown that it is robust to the distribution of the initial predictor accuracies: the algorithm is able to detect the most reliable predictors regardless of the reliability distribution. Additionally, our analysis demonstrates the ability of SFilter to generate high quality labels for training data for which only multiple noisy annotations exist. This result is powerful in the sense that even though SFilter is provided with noisy labels generated from dynamically changing sources, it can label the data with high reliability. This is particularly useful for the many data mining applications where labeled data with minimal noise is essential to build reliable software and services.

There are a number of potential future directions that open interesting new challenges. In this work, we assumed the rate of change σ of the state (or at least a bound on it) is known and is the same for all predictors. This assumption may not hold in some cases. Then, one should also estimate σ for each predictor in order to estimate the expected state at each time step. This is a harder problem, but certainly a necessary one to solve to make the method applicable to a wider range of data mining scenarios. Another point is that in this work we deal with predictors whose accuracies change based on external factors such as fatigue, learning, etc. The existing framework could be formulated to allow the accuracy to shift based on the instance, e.g. its difficulty, ambiguity, or being outside a labeler's expertise. This introduces further challenges in formulating the state transition density and the observation likelihood. We plan to investigate these extensions in the future.

References

[1] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.

[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.

[3] J. Carpenter, P. Clifford, and P. Fearnhead. Improved particle filter for nonlinear problems. IEE Proceedings - Radar, Sonar and Navigation, 146(1):2–7, 1999.

[4] O. Dekel and O. Shamir. Good learners for evil teachers. In Proceedings of the 26th International Conference on Machine Learning (ICML), June 2009.

[5] P. Donmez, J. G. Carbonell, and J. Schneider. Efficiently learning the accuracy of labeling sources for selective sampling. In Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '09), pages 259–268, 2009.

[6] P. Donmez, G. Lebanon, and K. Balasubramanian. Unsupervised estimation of classification and regression error rates. Technical Report CMU-LTI-09-15, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, 2009.

[7] D. Crisan and A. Doucet. A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736–746, 2002.

[8] T. V. Pham, M. Worring, and A. W. M. Smeulders. Face detection by aggregated Bayesian network classifiers. Pattern Recognition Letters, 23(4):451–461, 2002.

[9] V. C. Raykar, S. Yu, L. Zhao, A. Jerebko, C. Florin, G. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 889–896, June 2009.

[10] B. Ripley. Stochastic Simulation. Wiley, New York, 1987.

[11] V. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pages 614–622, 2008.

[12] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008.