
Budget-Optimal Task Allocation

for Reliable Crowdsourcing Systems

David R. Karger, Sewoong Oh, and Devavrat Shah
Department of EECS, Massachusetts Institute of Technology

Email: {karger, swoh, devavrat}@mit.edu

July 30, 2019

Abstract

Crowdsourcing systems, in which numerous tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for human-powered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting.

In this paper, we consider a general model of such crowdsourcing tasks and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm, inspired by belief propagation and low-rank matrix approximation, significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. We consider both a one-shot scenario in which all questions are asked and answered simultaneously and an iterative scenario in which one may choose to gather additional responses to certain questions adaptively based on the responses collected thus far. Perhaps surprisingly, we show that the minimum price that must be paid in order to achieve a certain reliability under both scenarios scales in the same manner, which implies that there is no significant gain in asking questions adaptively.


arXiv:1110.3564v2 [cs.LG] 8 Nov 2011


1 Introduction

Background. Crowdsourcing systems have emerged as an effective paradigm for human-powered problem solving and are now in widespread use for large-scale data-processing tasks such as image classification, video annotation, form data entry, optical character recognition, translation, recommendation, and proofreading. Crowdsourcing systems such as Amazon Mechanical Turk¹ provide a market where a “taskmaster” can submit batches of small tasks to be completed for a small fee by any worker choosing to pick them up. For example, a worker may be able to earn a few cents by indicating which images from a set of 30 are suitable for children (one of the benefits of crowdsourcing is its applicability to such highly subjective questions).

Because the tasks are tedious and the pay is low, errors are common even among workers who make an effort. At the extreme, some workers are “spammers”, submitting arbitrary answers independent of the question in order to collect their fee. Thus, all crowdsourcers need strategies to ensure the reliability of answers. Because the worker crowd is large, anonymous, and transient, it is generally difficult to build up a trust relationship with particular workers.² It is also difficult to condition payment on correct answers, as the correct answer may never truly be known and delaying payment can annoy workers and make it harder to recruit them to your task. Instead, most crowdsourcers resort to redundancy, giving each task to multiple workers, paying them all irrespective of their answers, and aggregating the results by some method such as majority voting.

For such systems there is a natural core optimization problem to be solved. Assuming the taskmaster wishes to achieve a certain reliability in their answers, how can they do so at minimum cost (which is equivalent to asking how they can do so while asking the fewest possible questions)?

Several characteristics of crowdsourcing systems make this problem interesting. Workers are neither persistent nor identifiable; each batch of tasks will be solved by a worker who may be completely new and whom you may never see again. Thus one cannot identify and reuse particularly reliable workers. Nonetheless, by comparing one worker’s answer to others’ on the same question, it is possible to draw conclusions about a worker’s reliability, which can be used to weight their answers to other questions in their batch. However, batches must be of manageable size, obeying limits on the number of tasks that can be given to a single worker.

Another interesting aspect of this problem is the choice of task assignments. Unlike many inference tasks, which make inferences based on a fixed set of signals, our algorithm can choose which signals to measure by deciding which questions to include in which batches. In addition, there are several plausible options: for example, we might choose to ask a few “pilot questions” of each worker (just like a qualifying exam) to decide on the reliability of the worker. However, this only biases the prior distribution of the workers’ reliability. Another possibility is to first ask a few questions and, based on the answers, decide whether or not to ask more questions. We would like to understand the role of all such variations in the overall optimization of budget for reliable task processing.

In the remainder of this introduction, we will define a formal model that captures these aspects of the problem. We consider both a one-shot scenario in which all questions are asked and answered simultaneously and then the correct answers are inferred based on all the responses, and an iterative scenario in which one may choose to gather additional responses to certain questions adaptively based on the responses collected thus far. We, somewhat surprisingly, show that the optimal costs

¹ http://www.mturk.com
² For certain high-value tasks, crowdsourcers can use entrance exams to “prequalify” workers and block spammers, but this increases the cost of the task and still provides no guarantee that the workers will try hard after qualification.


under both scenarios scale in the same manner. That is, asking questions iteratively does not help! We design a non-adaptive algorithm which operates under the one-shot scenario, and show that this approach is optimal for both scenarios. Hence, there is no gain in switching to an adaptive algorithm. We provide an optimal task allocation scheme and an optimal algorithm for inference on that task allocation. We will then show that our algorithm is asymptotically order-optimal under both scenarios: for a given target error rate, it spends only a constant factor times the minimum necessary to achieve that error rate. The optimality is established through comparisons to an oracle that knows the reliability of every worker and can thus make optimal decisions. In particular, we derive a parameter q that characterizes the ‘collective’ reliability of the crowd, and show that to achieve target reliability ε it is both necessary and sufficient to replicate each task Θ((1/q) log(1/ε)) times.

Setup. We model a set of m tasks {t_i}_{i∈[m]} as each being associated with an unobserved ‘correct’ solution s_i ∈ {±1}. Here and after, we use [N] to denote the set of the first N integers. In the image categorization example stated earlier, tasks correspond to labeling m images as suitable for children (+1) or not (−1). These tasks are assigned to n anonymous workers from the crowd, which we denote by {w_j}_{j∈[n]}.

When a task is assigned to a worker, we get a possibly inaccurate response. We use A_ij ∈ {0, ±1} to denote this response on task t_i from worker w_j. To simplify notation, we use 0 to denote that the particular task is not assigned to that worker. Some workers are more diligent or have more expertise than others, while some other workers might be spammers. We choose a simple model to capture this diversity in workers’ reliability: we assume that each worker w_j is characterized by a reliability p_j ∈ [0, 1], and that they make errors randomly on each question they answer. Precisely, if task t_i is assigned to worker w_j then

$$
A_{ij} = \begin{cases} s_i & \text{with probability } p_j\,, \\ -s_i & \text{with probability } 1-p_j\,, \end{cases}
$$
and A_ij = 0 if t_i is not assigned to w_j. The random variable A_ij is independent of any other event given p_j. (Throughout this paper, we use boldface characters to denote random variables and random matrices unless it is clear from the context.) The underlying assumption here is that the error probability of a worker does not depend on the particular task and all the tasks share an equal level of difficulty. Hence, each worker’s performance is consistent across different tasks. We discuss a possible generalization of this model in Section 2.5.
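As a concrete illustration, here is a minimal simulation of this response model (a sketch in Python with numpy; the function and variable names are ours, not from the paper):

```python
import numpy as np

def simulate_responses(s, p, assignment, rng=None):
    """Simulate the one-coin response model: worker j answers task i correctly
    (A_ij = s_i) with probability p_j and incorrectly (-s_i) otherwise;
    A_ij = 0 wherever task i is not assigned to worker j."""
    rng = np.random.default_rng(rng)
    correct = rng.random(assignment.shape) < p[None, :]    # Bernoulli(p_j) per entry
    A = np.where(correct, s[:, None], -s[:, None])          # s_i or -s_i
    return A * assignment                                   # zero out non-edges

# Example: 5 tasks, 4 workers, every task assigned to every worker.
s = np.array([1, -1, 1, 1, -1])
p = np.array([0.9, 0.5, 0.7, 1.0])      # one spammer (p = 0.5), one hammer (p = 1)
A = simulate_responses(s, p, np.ones((5, 4), dtype=int), rng=0)
```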

To distribute tasks to anonymous workers from the crowd, the taskmaster creates batches of tasks, which are distributed through an open call on a crowdsourcing platform to an unidentified pool of workers. Each batch, consisting of a small set of tasks, is picked up by any worker choosing to complete it. Although we have no control over who gets to work on which batch, we have complete control over which tasks are included in which batch, which we refer to as the choice of task assignment. Assigning tasks to the batches amounts to choosing a bipartite graph G with m task nodes and n batch nodes (or worker nodes), where an edge indicates that the task is assigned to that particular batch. Equivalently, we can think of the graph as connecting tasks to workers, since each batch is to be completed by a single worker (who is not identified at the time of graph generation). Since we do not have any control over who picks up the batches, we assume that each batch is completed by a worker who is drawn from an i.i.d. distribution and that we do not have any a priori information on how reliable that worker is.


Formally, we assume that the reliabilities of workers {p_j}_{j∈[n]} are independent and identically distributed random variables with a given distribution on [0, 1]. One example is the spammer-hammer model, where each worker is either a ‘hammer’ with probability q or a ‘spammer’ with probability 1 − q. A hammer answers all questions correctly, in which case p_j = 1, and a spammer gives random answers, in which case p_j = 1/2. Another example is the beta distribution with some parameters α > 0 and β > 0 (f(p) = p^{α−1}(1 − p)^{β−1}/B(α, β) for a proper normalization B(α, β)). Given this random variable p_j, we define an important parameter q ∈ [0, 1], which captures the ‘collective quality’ of the crowd:
$$
q \;\equiv\; \mathbb{E}\big[(2p_j - 1)^2\big]\,.
$$
A value of q close to one indicates that a large proportion of the workers are diligent, whereas q close to zero indicates that there are many spammers in the crowd. The definition of q is consistent with the use of q in the spammer-hammer model, and in the case of the beta distribution, q = 1 − 4αβ/((α + β)(α + β + 1)). We will see later that our bound on the error rate of our inference algorithm holds for any distribution of p_j but depends on the distribution only through this parameter q.
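The two examples above give q in closed form; the short sketch below (numpy assumed, names ours) evaluates both and cross-checks the beta case by Monte Carlo:

```python
import numpy as np

def q_spammer_hammer(prob_hammer):
    # hammer: p = 1 gives (2p-1)^2 = 1; spammer: p = 1/2 gives (2p-1)^2 = 0
    return prob_hammer

def q_beta(alpha, beta):
    # closed form quoted in the text: q = 1 - 4*alpha*beta / ((alpha+beta)*(alpha+beta+1))
    return 1.0 - 4.0 * alpha * beta / ((alpha + beta) * (alpha + beta + 1.0))

def q_empirical(p_samples):
    # Monte Carlo estimate of q = E[(2p - 1)^2]
    p_samples = np.asarray(p_samples)
    return np.mean((2.0 * p_samples - 1.0) ** 2)

rng = np.random.default_rng(0)
print(q_beta(2.0, 1.0), q_empirical(rng.beta(2.0, 1.0, size=200_000)))  # both close to 1/3
```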

It is quite realistic to assume the existence of a prior distribution for p_j. The model is therefore quite general: in particular, it is met if we simply randomize the order in which we upload our task batches, since this will have the effect of randomizing which workers perform which batches, yielding a distribution that meets our requirements. On the other hand, it is not realistic to assume that we know what the prior is. To execute our inference algorithm for a given number of iterations, we do not require any knowledge of the distribution of the reliability. However, q is necessary in order to determine how many times a task should be replicated and how many iterations we need to run to achieve a certain reliability. We discuss a simple way to overcome this limitation in Section 2.5.

We first consider a one-shot scenario, in which all questions are asked simultaneously and then an estimation is performed after all the answers are obtained. In particular, we do not allow allocating tasks adaptively based on the answers received thus far (e.g. which workers are more reliable or which tasks we have less confidence in). In practice, latency is a crucial factor, and to be able to deliver a solution in time, all the tasks are typically assigned at once so that all the batches are processed in parallel.

However, it is also of great practical interest to ask how much we can gain by using an adaptive task allocation scheme. Hence, we next compare our approach to a more general class of schemes that operate under an iterative scenario, in which one can adaptively assign tasks. In this case, one might be tempted to first identify which workers are more reliable and then assign all the tasks to those workers in an explore/exploit manner. However, in crowdsourcing, it is unrealistic to assume that we can identify and reuse any particular worker, including the reliable ones, since typical workers are neither persistent nor identifiable and batches are distributed through an open call. Hence, under an iterative scenario in crowdsourcing, although we cannot reuse any of the workers, we can adaptively assign more workers to those tasks that we have less confidence in, by assigning those tasks to the new batches that we create adaptively. These new batches are then distributed to unidentified workers through an open call using a crowdsourcing platform.

Prior Work. Previously, crowdsourcing system designs have focused on developing inference algorithms assuming that the graph is fixed and the workers’ responses are already given. None of the prior work on crowdsourcing provides any systematic treatment of the task allocation. To the best of our knowledge, we are the first to study both aspects of crowdsourcing together and, more importantly, establish optimality.

A naive approach to solve the inference problem, which is widely used in practice, is majority voting. Majority voting simply follows what the majority of workers agree on. When we have many spammers in the crowd, majority voting is error-prone since it gives the same weight to all the responses, regardless of whether they are from a spammer or a diligent worker. We will show in Section 2.3 that majority voting is provably sub-optimal and can be significantly improved upon.

If we know how reliable each worker is, then it is straightforward to find the optimal estimates: compute the weighted sum of the responses weighted by the log-likelihood. Although, in reality, we do not have this information, it is possible to learn about a worker’s reliability by comparing one worker’s answer to those of others. This idea was first proposed by Dawid and Skene, who introduced an iterative algorithm based on expectation maximization (EM) [DS79]. This heuristic algorithm iterates the following two steps. In the M-step, the algorithm estimates the error probabilities of the workers that maximize the likelihood using the current estimates of the answers. In the E-step, the algorithm estimates the likelihood of the answers using the current estimates of the error probabilities.
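A minimal sketch of this EM heuristic, specialized to the one-coin worker model used in this paper, is given below (a full Dawid–Skene implementation would track a confusion matrix per worker; the initialization and names are ours):

```python
import numpy as np

def em_one_coin(A, n_iter=20):
    """EM sketch for the one-coin model: alternately estimate each worker's
    accuracy p_j (M-step) and the posterior probability that each task's answer
    is +1 (E-step). A is m x n with entries in {0, +1, -1}."""
    mask = (A != 0)
    mu = 0.5 * (1.0 + np.sign(A.sum(axis=1)))           # init posterior from majority vote
    for _ in range(n_iter):
        # M-step: p_j = expected fraction of assigned tasks worker j labels correctly.
        agree = (A == 1) * mu[:, None] + (A == -1) * (1.0 - mu[:, None])
        p = np.clip(agree.sum(axis=0) / np.maximum(mask.sum(axis=0), 1), 1e-6, 1 - 1e-6)
        # E-step: posterior log-odds of s_i = +1 given the current worker accuracies.
        llr = (A == 1) @ np.log(p / (1 - p)) + (A == -1) @ np.log((1 - p) / p)
        mu = 1.0 / (1.0 + np.exp(-llr))
    return np.where(mu >= 0.5, 1, -1), p
```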

More recently, a number of algorithms followed this EM approach based on a variety of probabilistic models [SFB+95, WRW+09, RYZ+10]. The EM algorithm has also been widely applied in classification problems, where a set of labels from low-cost noisy labelers is used to find a good classifier [JG03, RYZ+10]. Given a fixed budget, there is a trade-off between acquiring a larger training dataset or acquiring a smaller dataset but with more labels per data point. Through extensive experiments, Sheng, Provost and Ipeirotis [SPI08] show that getting repeated labeling can give a considerable advantage.

Despite the popularity of the EM algorithms, the performance of these approaches is only empirically evaluated and there is no analysis that gives performance guarantees. In particular, EM algorithms are highly sensitive to the initialization used, making it difficult to predict the quality of the resulting estimate. Further, the role of the graph structure G is not at all understood with the EM algorithm (or, for that matter, any other algorithm).

Contributions. In this work, we provide the first rigorous treatment of both aspects of designing a reliable and cost-efficient crowdsourcing system: task allocation and inference. Our approach aims to minimize the budget to achieve completion of a set of tasks to meet a certain target reliability. We provide both an asymptotically optimal graph construction (based on random graphs) and an asymptotically optimal algorithm for inference (based on low-rank approximation and belief propagation) on that graph. As the main result, we show that our algorithm performs as well as an oracle estimator. The surprise lies in the fact that we establish this result by comparing our algorithm with an oracle estimator which is free to choose any graph and performs optimal estimation on that graph based on the information provided by an oracle about the reliability of every worker.

We further show that the budget necessary to achieve a certain reliability in the answers using an oracle estimator scales in the same manner for both one-shot and iterative scenarios. Our non-adaptive approach still performs as well as the best possible algorithm, even if we allow this best algorithm to assign tasks adaptively. Hence, there is no significant gain in using an adaptive strategy.

Another novel contribution of our work is the analysis technique. The iterative inference algorithm we introduce operates on real-valued messages whose distribution is a priori difficult to analyze using the standard technique. To overcome this challenge, we develop a novel technique of establishing that these messages are sub-Gaussian and compute the parameters recursively in a closed form. This allows us to prove the sharp result on the error rate. This technique could be of independent interest in analyzing a more general class of message-passing algorithms.

2 Main Result

To achieve a certain reliability in our answers with minimum cost, we want to design a reliable and cost-efficient crowdsourcing system. In what follows, we propose using random regular graphs for task assignment and introduce a novel iterative algorithm to infer the correct answers. We prove a bound on the resulting error and show that our approach is asymptotically order-optimal under both the one-shot and the iterative scenarios: it requires only a constant factor times the minimum necessary budget to achieve a target error rate.

2.1 Algorithm

Task allocation. Assigning tasks amounts to designing a bipartite graph G({t_i}_{i∈[m]} ∪ {w_j}_{j∈[n]}, E), where each edge corresponds to a task-worker assignment. We propose using a regular bipartite graph chosen uniformly at random with bounded degrees. Sparse random graphs have several good properties. In the large system limit, sparse random graphs converge locally to a tree, which allows us to do sharp analysis of our algorithm. But more importantly, random graphs are excellent expanders with large spectral gaps, which enables us to reliably separate the low-rank structure from the data which is perturbed by random noise.

As a taskmaster, one first makes a choice of how many workers to assign to each task (the left degree l) and how many tasks to assign to each worker (the right degree r). Since the total number of edges has to be consistent, the number of workers n directly follows from ml = nr. To generate an (l, r)-regular bipartite graph we use a random graph generation scheme known as the configuration model in the random graph literature [RU08, Bol01].
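A minimal sketch of the configuration-model construction follows (numpy assumed; for simplicity the sketch collapses the parallel edges that the configuration model can occasionally create, so a few task degrees may fall slightly below l):

```python
import numpy as np

def regular_bipartite_assignment(m, l, r, rng=None):
    """Configuration model: give each of the m task nodes l half-edges and each of
    the n = m*l/r worker (batch) nodes r half-edges, then match the two sides by a
    uniformly random permutation. Returns an m x n 0/1 assignment matrix."""
    assert (m * l) % r == 0, "need m*l divisible by r"
    n = m * l // r
    rng = np.random.default_rng(rng)
    task_stubs = np.repeat(np.arange(m), l)        # m*l half-edges on the task side
    worker_stubs = np.repeat(np.arange(n), r)      # m*l half-edges on the worker side
    rng.shuffle(worker_stubs)                      # uniformly random matching of stubs
    G = np.zeros((m, n), dtype=int)
    G[task_stubs, worker_stubs] = 1                # parallel edges are collapsed
    return G

G = regular_bipartite_assignment(m=1000, l=10, r=10, rng=0)
```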

Inference algorithm. To solve the inference problem, we introduce a novel iterative algorithm which is inspired by the celebrated belief propagation algorithm and low-rank matrix approximation. This algorithm operates on real-valued task messages {x_{i→j}}_{(i,j)∈E} and worker messages {y_{j→i}}_{(i,j)∈E}. It starts with the worker messages initialized as independent Gaussian random variables. The algorithm is not sensitive to a specific initialization as long as it has a strictly positive mean. At each iteration k ∈ {1, 2, . . .}, the messages are updated according to

$$
x^{(k)}_{i\to j} = \sum_{j' \in \partial i \setminus j} A_{ij'}\, y^{(k-1)}_{j'\to i} \quad \text{for all } (i,j) \in E\,, \qquad
y^{(k)}_{j\to i} = \sum_{i' \in \partial j \setminus i} A_{i'j}\, x^{(k)}_{i'\to j} \quad \text{for all } (i,j) \in E\,,
$$

where ∂i is the neighborhood of t_i and ∂j is the neighborhood of w_j. Intuitively, a worker message y_{j→i} represents our belief on how ‘reliable’ the worker j is, such that our final estimate is a weighted sum of the answers weighted by each worker’s reliability:
$$
\hat{s}_i = \mathrm{sign}\Big(\sum_{j\in\partial i} A_{ij}\, y_{j\to i}\Big)\,.
$$

It is understood that when there is a tie we flip a fair coin to make a decision.

Iterative Algorithm
Input: E, {A_ij}_{(i,j)∈E}, k_max
Output: Estimate ŝ({A_ij})
1: For all (i, j) ∈ E do
       Initialize y^{(0)}_{j→i} with random Z_ij ∼ N(1, 1);
2: For k = 1, . . . , k_max do
       For all (i, j) ∈ E do x^{(k)}_{i→j} ← Σ_{j′∈∂i\j} A_{ij′} y^{(k−1)}_{j′→i};
       For all (i, j) ∈ E do y^{(k)}_{j→i} ← Σ_{i′∈∂j\i} A_{i′j} x^{(k)}_{i′→j};
3: For all i ∈ [m] do x_i ← Σ_{j∈∂i} A_{ij} y^{(k_max−1)}_{j→i};
4: Output estimate vector ŝ({A_ij}) = [sign(x_i)].
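The following is a compact dense-matrix sketch of the algorithm above (numpy assumed; a production implementation would operate on a sparse edge list, and all names are ours):

```python
import numpy as np

def iterative_estimate(A, k_max=20, rng=None):
    """BP-inspired iterative algorithm on a dense response matrix A (m x n,
    entries in {0, +1, -1}; 0 marks a missing assignment).
    Y[i, j] stores the worker message y_{j->i} on edge (i, j)."""
    rng = np.random.default_rng(rng)
    mask = (A != 0)
    Y = rng.normal(1.0, 1.0, size=A.shape) * mask       # y^{(0)} ~ N(1, 1) on edges
    for _ in range(k_max - 1):                           # leaves Y = y^{(k_max - 1)}
        AY = A * Y
        # x_{i->j} = sum over j' in ∂i \ {j} of A_{ij'} y_{j'->i}
        X = mask * (AY.sum(axis=1, keepdims=True) - AY)
        AX = A * X
        # y_{j->i} = sum over i' in ∂j \ {i} of A_{i'j} x_{i'->j}
        Y = mask * (AX.sum(axis=0, keepdims=True) - AX)
    x = (A * Y).sum(axis=1)                              # decision variable x_i
    s_hat = np.sign(x)
    ties = (s_hat == 0)
    s_hat[ties] = rng.choice([-1.0, 1.0], size=ties.sum())   # break ties by a fair coin
    return s_hat.astype(int)
```

Combining the earlier sketches, something like `iterative_estimate(simulate_responses(s, p, regular_bipartite_assignment(len(s), l, r)))` reproduces the full allocate-then-infer pipeline in simulation.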

We emphasize here that our inference algorithm requires no information about the prior distribution of the workers’ quality p_j. Our algorithm is inspired by power iteration used to compute the leading singular vectors of a matrix, and we discuss the connection in detail in Section 2.6.

While our algorithm is also inspired by the standard Belief Propagation (BP) algorithm for approximating max-marginals [Pea88, YFW03], our algorithm is original and overcomes a few critical limitations of the standard BP. First, the iterative algorithm does not require any knowledge of the prior distribution of p_j, whereas the standard BP requires the knowledge of the distribution. Second, there is no efficient way to implement standard BP, since we need to pass sufficient statistics (or messages) which under our general model are distributions over the reals. On the other hand, the iterative algorithm only passes messages that are real numbers regardless of the prior distribution of p_j, which makes it easy to implement. Third, the iterative algorithm is provably asymptotically order-optimal. Density evolution is a standard technique to analyze the performance of BP. Although we can write down the density evolution for the standard BP, we cannot describe or compute the densities, analytically or numerically. It is also very simple to write down the density evolution equations (cf. (8)) for our algorithm, but it is not a priori clear how one can analyze the densities in this case either. We develop a novel technique to analyze the densities for our iterative algorithm and prove optimality. This technique could be of independent interest in analyzing a broader class of message-passing algorithms.

2.2 Performance guarantee

We state the main analytical result of this paper: for random (l, r)-regular bipartite graph based task assignments with our iterative inference algorithm, the probability of error decays exponentially in lq, up to a universal constant and for a broad range of the parameters l, r and q. With a reasonable choice of l = r and both scaling like (1/q) log(1/ε), the proposed algorithm is guaranteed to achieve error less than ε for any ε ∈ (0, 1/2). Further, an algorithm-independent lower bound that we establish suggests that such an error dependence on lq is unavoidable. Hence, in terms of the total budget, our algorithm is order-optimal. The precise statements follow next.

Define a parameter µ ≡ E[2p_j − 1] and recall that q = E[(2p_j − 1)²]. To lighten the notation, let l̂ ≡ l − 1 and r̂ ≡ r − 1. Define
$$
\rho_k^2 \;\equiv\; \frac{2q}{\mu^2 (q^2 \hat{l}\hat{r})^{k-1}} \;+\; \Big(3 + \frac{1}{q\hat{r}}\Big)\, \frac{1 - \big(1/(q^2 \hat{l}\hat{r})\big)^{k-1}}{1 - 1/(q^2 \hat{l}\hat{r})}\,.
$$
For q²l̂r̂ > 1, let ρ²_∞ ≡ lim_{k→∞} ρ²_k, so that
$$
\rho_\infty^2 \;=\; \Big(3 + \frac{1}{q\hat{r}}\Big)\, \frac{q^2 \hat{l}\hat{r}}{q^2 \hat{l}\hat{r} - 1}\,.
$$

Then we can show the following bound on the probability of making an error.

Theorem 2.1. For fixed l > 1 and r > 1, assume that m tasks are assigned to n = ml/r workers according to a random (l, r)-regular graph drawn from the configuration model. If the distribution of the worker reliability satisfies µ ≡ E[2p_j − 1] > 0 and q² > 1/(l̂r̂), then for any s ∈ {±1}^m, the estimates from k iterations of the iterative algorithm achieve
$$
\lim_{m\to\infty} \frac{1}{m} \sum_{i=1}^{m} \mathbb{P}\big(s_i \neq \hat{s}_i(\{A_{ij}\}_{(i,j)\in E})\big) \;\le\; e^{-lq/(2\rho_k^2)}\,. \qquad (1)
$$

As we increase k, the above bound converges to a non-trivial limit.

Corollary 2.2. Under the hypotheses of Theorem 2.1,
$$
\lim_{k\to\infty} \lim_{m\to\infty} \frac{1}{m} \sum_{i=1}^{m} \mathbb{P}\big(s_i \neq \hat{s}_i(\{A_{ij}\}_{(i,j)\in E})\big) \;\le\; e^{-lq/(2\rho_\infty^2)}\,. \qquad (2)
$$
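To get a feel for the bound, the following sketch evaluates ρ_k² and the right-hand side of (1) for a few parameter choices (the numerical values are ours, chosen only for illustration):

```python
import numpy as np

def rho_sq(k, l, r, q, mu):
    """Evaluate rho_k^2 from the definition before Theorem 2.1 (l_hat = l-1, r_hat = r-1)."""
    lh, rh = l - 1, r - 1
    a = q * q * lh * rh                     # the bound is meaningful only for a > 1
    geom = (1.0 - (1.0 / a) ** (k - 1)) / (1.0 - 1.0 / a)
    return 2.0 * q / (mu ** 2 * a ** (k - 1)) + (3.0 + 1.0 / (q * rh)) * geom

def error_bound(k, l, r, q, mu):
    """Right-hand side of (1): exp(-l*q / (2*rho_k^2))."""
    return np.exp(-l * q / (2.0 * rho_sq(k, l, r, q, mu)))

# Example: l = r = 25, q = 0.3, spammer-hammer crowd (so mu = q).
for k in (2, 5, 20):
    print(k, error_bound(k, 25, 25, 0.3, 0.3))
```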

Even if we fix the value of q = E[(2p_j − 1)²], different distributions of p_j can have different values of µ in the range [q, √q]. Surprisingly, the asymptotic bound on the error rate does not depend on µ. Instead, as long as q is fixed, µ only affects how fast the algorithm converges (cf. Lemma 2.3).

Next, we make a few remarks on the performance guarantee.

First, the iterative algorithm is efficient, with run-time comparable to that of majority voting, which requires O(ml) operations. Each iteration of the iterative algorithm requires O(ml) operations, and we need O(log(q/µ²)/log(q²l̂r̂)) iterations to ensure an error bound which scales as (2).

Lemma 2.3. Under the hypotheses of Theorem 2.1, the total computational cost sufficient to achieve the bound in Corollary 2.2 up to any constant factor in the exponent is O(ml log(q/µ²)/log(q²l̂r̂)).

By definition, we have q ≤ µ ≤ √q. The runtime is the worst when µ = q, which happens under the spammer-hammer model, and it is the best when µ = √q, which happens if p_j = (1 + √q)/2 deterministically. There exists a (non-iterative) polynomial-time algorithm with runtime independent of q for computing the estimate which achieves (2), but in practice we expect that the number of iterations needed is small enough that the iterative algorithm will outperform this non-iterative algorithm.

Figure 1: The iterative algorithm improves over majority voting and the EM algorithm [SPI08]. (The plot shows the probability of error against the number of assignments per task l for majority voting, expectation maximization, the iterative algorithm, and the oracle estimator.)

Second, the assumption that µ > 0 is necessary. If there is no assumption on µ, then we cannot distinguish if the responses came from tasks with {s_i}_{i∈[m]} and workers with {p_j}_{j∈[n]} or tasks with {−s_i}_{i∈[m]} and workers with {1 − p_j}_{j∈[n]}. Statistically, both of them give the same output. The hypothesis on µ allows us to distinguish which of the two is the correct solution. In the case when we know that µ < 0, we can use the same algorithm changing the sign of the final output and get the same performance guarantee.

Third, our algorithm does not require any information on the distribution of p_j. Further, unlike previous approaches based on Expectation Maximization (EM), the iterative algorithm is not sensitive to initialization and converges to a unique solution from a random initialization with high probability. This follows from the fact that the algorithm is essentially computing a leading eigenvector of a particular linear operator.

Finally, we observe a phase transition at l̂r̂q² = 1. Above this phase transition, when l̂r̂q² > 1, we will show that our algorithm is order-optimal and the probability of error is significantly smaller than majority voting. However, perhaps surprisingly, when we are below the threshold, when l̂r̂q² < 1, we empirically observe that our algorithm exhibits a fundamentally different behavior. The error we get after k iterations of our algorithm increases with k. In this regime, we are better off stopping the algorithm after 1 iteration, in which case the estimate we get is essentially the same as simple majority voting, and we cannot do better than majority voting. This phase transition is universal and we observe similar behavior with other inference algorithms including expectation maximization approaches.

This is illustrated in Figure 1. We ran 10 iterations of expectation maximization and our iterative algorithm, and compared the performance to majority voting and the oracle estimator. For this numerical simulation, we chose l = r and a distribution of the workers such that q = 0.3. Hence, we observe the phase transition around l = 1 + 1/0.3 ≈ 4.33.

We also ran experiments with a real crowd using Amazon Mechanical Turk. In our experiment with colors, we created tasks for comparing colors, each task showing three colors, one on the top and two on the bottom. We asked the crowd to indicate “if the color on the top is more similar to the color on the left or on the right”. We created 50 such tasks and recruited 28 workers to answer all the questions. The ground truth, in this case, is chosen based on the distances in the Lab color space between the two pairs of colors, which is a good measure of the perceived distance between a pair of colors [WS67]. Once we have this data, we can subsample the data to simulate what would have happened if we had collected a smaller number of responses per task, while keeping the number of tasks and number of workers fixed. The resulting average probability of error is illustrated in Figure 2. For this crowd, we can estimate the collective quality from the data, which is about q ≃ 0.175. Theoretically, this indicates that the phase transition should happen when (l − 1)((50/28)l − 1)q² = 1, since we set r = (50/28)l. With this, we expect the phase transition to happen around l ≃ 5. In Figure 2, we see that the phase transition happens around l = 8.

Figure 2: Real experimental results on color comparison using Amazon Mechanical Turk. (The plot shows the probability of error against the number of assignments per task l for majority voting, expectation maximization, and the iterative algorithm.)
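As a sanity check on the quoted threshold, the condition (l − 1)((50/28)l − 1)q² = 1 can be solved directly for l (a small sketch; the numbers are those reported above):

```python
import numpy as np

# Solve (l - 1) * ((50/28) * l - 1) * q^2 = 1 for l, with q estimated as 0.175.
q, ratio = 0.175, 50.0 / 28.0
# Expand to a quadratic in l: ratio*l^2 - (1 + ratio)*l + (1 - 1/q^2) = 0.
a, b, c = ratio, -(1.0 + ratio), 1.0 - 1.0 / q ** 2
l_threshold = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
print(l_threshold)   # roughly 5, matching the estimate in the text
```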

2.3 Optimality under the one-shot scenario

As a taskmaster, the natural core optimization problem of our concern is how to achieve a certain reliability in our answers with minimum cost. The cost is proportional to the total number of assignments, which is the number of edges of the graph G. We show that our algorithm is asymptotically order-optimal for a broad range of the problem parameters l, r and q. For a given target error rate ε, the total budget sufficient to achieve this target error rate using our algorithm is within a constant factor from what is necessary using any graph and the oracle estimator. In this section, we compare our approach to an oracle estimator that operates under the one-shot scenario. The optimality of our approach under the iterative scenario is discussed in the next section.

Formally, consider a scenario where there are m tasks to complete and a target accuracy ε ∈ (0, 1/2). To measure accuracy, we use the average probability of error per task denoted by
$$
d_m(s, \hat{s}) \;\equiv\; \frac{1}{m} \sum_{i=1}^{m} \mathbb{P}(s_i \neq \hat{s}_i)\,.
$$

Here the probability is taken over all realizations of the random graph (if using a random graph), instances of worker responses, and realizations of worker reliability. We will show that Ω((1/q) log(1/ε)) assignments per task is necessary and sufficient to achieve the target error rate: d_m(s, ŝ) ≤ ε.

We first prove the following minimax bound on the error rate. Consider the case where nature chooses a set of correct answers s ∈ {±1}^m and a distribution f of the worker reliability p_j. The distribution f is chosen from the set of all distributions on [0, 1] which satisfy E_f[(2p_j − 1)²] = q. We use F(q) to denote this set of distributions. Let G(m, l) denote the set of all bipartite graphs, including irregular graphs, that have m task nodes and ml total edges.

Lemma 2.4. The minimax error rate achieved by the best possible graph G ∈ G(m, l) using the best possible inference algorithm is at least
$$
\inf_{\text{Algo},\, G\in\mathcal{G}(m,l)} \;\sup_{s,\, f\in\mathcal{F}(q)} \; d_m\big(s, \hat{s}^{G,\text{Algo}}\big) \;\ge\; \frac{1}{2}\,(1-q)^{l}\,,
$$
where ŝ^{G,Algo} denotes the estimate we get using graph G for task allocation and algorithm Algo for inference.

This minimax bound is established by computing the error rate of an oracle estimator that makes an optimal decision given the reliability of every worker. When q is equal to one, the inference is trivial and we get a trivial lower bound. The inference problem becomes more challenging when q ≤ C_1 for some numerical constant C_1 < 1. In this case, the above lemma implies

$$
\inf_{\text{Algo},\, G\in\mathcal{G}(m,l)} \;\sup_{s,\, f\in\mathcal{F}(q)} \; d_m\big(s, \hat{s}^{G,\text{Algo}}\big) \;\ge\; \frac{1}{2}\, e^{-(lq + C_2 l q^2)}\,, \qquad (3)
$$

for some numerical constant C_2. Let ∆_LB be the minimum cost per task necessary to achieve a target accuracy ε using any graph and the best possible algorithm on that graph. Then, in the case of the minimax scenario where nature chooses the worst distribution f,
$$
\Delta_{\mathrm{LB}} \;=\; \Theta\Big(\frac{1}{q}\log\Big(\frac{1}{\varepsilon}\Big)\Big)\,. \qquad (4)
$$

Next, we show that the error rate of majority voting decays significantly slower. Let ŝ^{G,Majority} be the estimate produced by majority voting on graph G.

Lemma 2.5. In the regime where q ≤ C_2 < 1, there exists a numerical constant C_3 such that
$$
\inf_{G\in\mathcal{G}(m,l)} \;\sup_{s,\, f\in\mathcal{F}(q)} \; d_m\big(s, \hat{s}^{G,\text{Majority}}\big) \;\ge\; e^{-C_3(lq^2 + 1)}\,.
$$


Let ∆_Majority be the minimum cost per task necessary to achieve a target accuracy ε using the majority voting scheme on any graph. Then, in the case where nature chooses the worst distribution f,
$$
\Delta_{\mathrm{Majority}} \;=\; \Omega\Big(\frac{1}{q^2}\log\Big(\frac{1}{\varepsilon}\Big)\Big)\,. \qquad (5)
$$

The lower bound in (3) holds regardless of how many tasks are assigned to each worker. However, the error rate of our algorithm depends on the value of r. We show that for a broad range of parameters l, r, and q, our algorithm achieves optimality. Let ŝ^{Iter} be the estimate given by random regular graphs and the iterative algorithm. For lq ≥ C_4, rq ≥ C_5 and C_4 C_5 > 1, Corollary 2.2 gives
$$
\lim_{m\to\infty} \;\sup_{s,\, f\in\mathcal{F}(q)} \; d_m\big(s, \hat{s}^{\mathrm{Iter}}\big) \;\le\; e^{-C_6 l q}\,.
$$

Let ∆_Iter be the minimum cost per task sufficient to achieve a target accuracy ε using our proposed algorithm. Then we get
$$
\Delta_{\mathrm{Iter}} \;=\; \Theta\Big(\frac{1}{q}\log\Big(\frac{1}{\varepsilon}\Big)\Big)\,.
$$

Comparing this to the necessary budget in (4) establishes the order-optimality of our algorithm. Further, from (5) we see that majority voting is significantly more costly than the optimal scaling of (1/q) log(1/ε) of our algorithm. Finally, we emphasize that the low-rank approximation algorithm is quite efficient. Simple majority voting requires O((1/q²) log(1/ε)) operations per task to achieve target error rate ε, and our approach requires O((1/q) log(m) log(1/ε)) operations per task.
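Ignoring the unspecified constant factors, these scalings translate into concrete order-of-magnitude budgets; a toy comparison (illustrative only):

```python
import numpy as np

def budget_iterative(q, eps):
    # Order of the per-task budget for the proposed scheme: (1/q) * log(1/eps)
    return (1.0 / q) * np.log(1.0 / eps)

def budget_majority(q, eps):
    # Order of the per-task budget for majority voting: (1/q^2) * log(1/eps)
    return (1.0 / q ** 2) * np.log(1.0 / eps)

q, eps = 0.1, 1e-3
print(budget_iterative(q, eps), budget_majority(q, eps))   # ~69 vs ~691 assignments per task
```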

2.4 Optimality under the iterative scenario

We show in this section that, surprisingly, there is no significant gain in switching from our approach to an adaptive strategy, which can assign tasks adaptively based on the responses from the workers. The order-optimality of our approach, in terms of budget required to achieve a given target accuracy, still holds even if we include a more general class of algorithms that can adaptively choose how to assign tasks to new batches.

Under the one-shot scenario, all task allocation has to be done simultaneously. In particular, we cannot create new batches adaptively based on what we learned from the responses thus far (e.g. which workers are more reliable or which tasks we have less confidence in). Alternatively, we can consider a more general class of algorithms which operate under an iterative scenario, where one can adaptively assign tasks. One might be tempted to first identify which workers are more reliable and then assign all the tasks to those workers in an explore/exploit manner. However, with crowdsourcing, it is unrealistic to assume that we can identify and reuse any particular worker, including the reliable ones, since typical workers are neither persistent nor identifiable and batches are distributed through an open call. Hence, under an iterative scenario in crowdsourcing, although we cannot reuse any of the workers, we can adaptively assign more workers to those tasks that we have less confidence in, by assigning those tasks to the new batches that we create adaptively. These new batches are then distributed to unidentified workers through an open call using a crowdsourcing platform.

For example, in the first round we can choose a graph and run inference as in the one-shot scenario. Then, in the second round, we might choose to assign additional workers to those tasks that are ‘less reliable’ in order to increase reliability. Can we reduce the budget and still ensure the same probability of error by exploiting this adaptive process? Surprisingly, we show that in terms of the average necessary budget, there is no significant gain in using an adaptive strategy. Hence, our approach of using a random graph with our iterative inference algorithm achieves order-optimal performance even if we include all the adaptive strategies. This is a stronger result on order-optimality of our approach, since our algorithm is a non-adaptive one that operates under the one-shot scenario, which is a special case of the iterative scenario. Precisely, we show that in order to achieve an average probability of error less than ε using any possible strategy that can adaptively assign tasks under the iterative scenario with the best possible inference algorithm, it is still necessary to have a budget that scales like (1/q) log(1/ε).

This result is established by considering an oracle estimator who knows the reliability of all the workers and can use any adaptive task assignment scheme. To compute the accuracy of this oracle estimator, we need to assume in this case that the s_i’s are chosen independently and there is a non-vanishing probability of s_i being either positive or negative.

Lemma 2.6. Assume that the correct answer to each task is chosen independently, and P(s_i = +1) > 0 and P(s_i = −1) > 0 independent of q, m, and ε. Given a target accuracy ε ∈ (0, 1/2), let ∆_AdaptiveLB(ε, Algo, f) be the budget per task necessary to achieve (1/m) Σ_{i=1}^m P(s_i ≠ ŝ_i) ≤ ε using algorithm Algo for both inference and task assignment, when the worker quality p_j is distributed according to a distribution f on [0, 1]. Then for any algorithm that operates under the iterative scenario and for q ∈ (0, 0.64], the following holds:
$$
\inf_{\text{Algo}} \;\sup_{f\in\mathcal{F}(q)} \; \mathbb{E}\big[\Delta_{\mathrm{AdaptiveLB}}(\varepsilon, \text{Algo}, f)\big] \;=\; \Omega\Big(\frac{1}{q}\log\Big(\frac{1}{\varepsilon}\Big)\Big)\,,
$$
where F(q) denotes the set of all distributions on [0, 1] which satisfy E[(2p_j − 1)²] = q.

This proves that there is no significant gain in using an adaptive scheme, and our approach achieves the order-optimal performance with a simple non-adaptive scheme.

We identify that there are two regimes of the worker quality distribution f, where the behavior of the adaptive strategies differs significantly. This is captured in a quantity W ≡ E_f[log(p_j/(1 − p_j))]. In particular, to prove the above minimax lower bound, we consider the case when nature chooses a worst-case worker distribution f, and it is crucial that we choose f such that W is finite.

On the other hand, when there is a positive probability that a worker is either a perfect worker (p_j = 1) or a perfect adversary (p_j = 0), then W = ∞ or W = −∞, respectively. If we limit ourselves to a special case of f in this regime where W = ±∞, then there are examples where there is a significant gain in switching to an adaptive scheme. In particular, we can show that there are examples of the worker quality distribution f where it is both necessary and sufficient to have budget scaling as 1/q using adaptive schemes, as opposed to (1/q) log(1/ε) in the non-adaptive case. However, in the general minimax scenario, where nature chooses the worst-case f as in Lemma 2.6, there is no gain in using an adaptive scheme, and our approach achieves the order-optimal performance with a simple non-adaptive scheme. Further, in practice, it is unlikely that there will be perfect workers or perfect adversaries, hence it is expected that W is finite.

To be concrete, let us consider the spammer-hammer model, where a worker is either a spammer with p_j = 1/2 or a hammer with p_j = 1. A worker quality is randomly chosen such that she is a hammer with probability q. Notice that, in this case, we are in the regime where W = ∞. We know that under this spammer-hammer model, if we only use non-adaptive schemes, then it is necessary and sufficient to have budget that scales like (1/q) log(1/ε) (cf. proof of Lemma 2.4). However, in the following, we will show that an adaptive strategy can achieve a target accuracy ε with budget that scales as 1/q, achieving a significant gain over non-adaptive schemes. Further, we provide a matching lower bound using an oracle estimator.

Consider an algorithm which partitions the m tasks into sets of size r. At each round, for a particular set of r tasks, the algorithm recruits a worker and assigns all r tasks to that new worker. The algorithm stops when there are two workers who completely agree on all r tasks, and then moves on to the next set of tasks. This algorithm will surely stop if we have two hammers, in which case the expected number of workers we need to recruit is at most 2/q. Further, if we choose r to be Ω(log(1/ε) + log(1/q)), then, when this algorithm stops, we can ensure that the probability of making an error is at most ε. Hence, with this algorithm, it is sufficient to achieve any target accuracy ε with budget scaling like 1/q. Notice that in order to prove this sufficient condition, it is necessary to have the worker degree r scale as Ω(log(1/ε) + log(1/q)). There might be ways to improve this condition on r by using a more sophisticated algorithm, which might be an interesting future research direction.
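A small simulation of this adaptive scheme under the spammer-hammer model (a sketch with our own names; correct answers are taken to be +1 without loss of generality):

```python
import numpy as np

def adaptive_spammer_hammer(m, r, q, rng=None):
    """Simulate the adaptive scheme described above: for each group of r tasks,
    keep recruiting fresh workers (hammer w.p. q, spammer otherwise) until two
    recruited workers agree on all r tasks. Returns the total number of recruitments."""
    rng = np.random.default_rng(rng)
    recruited = 0
    for _ in range(m // r):                       # one group of r tasks at a time
        answers = []                              # response vectors seen so far
        while True:
            recruited += 1
            if rng.random() < q:                  # hammer: answers all r tasks correctly
                new = np.ones(r, dtype=int)
            else:                                 # spammer: uniformly random answers
                new = rng.choice([-1, 1], size=r)
            if any(np.array_equal(new, old) for old in answers):
                break                             # two workers agree on all r tasks
            answers.append(new)
    return recruited

# Average number of recruited workers per group stays at most about 2/q = 10 here.
print(adaptive_spammer_hammer(m=2000, r=20, q=0.2, rng=0) / (2000 // 20))
```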

Next, consider an oracle estimator which knows who is a spammer and who is a hammer. Then, under the iterative scenario, the oracle estimator can assign workers to a task t_i until a hammer is assigned. This requires recruiting 1/q workers in expectation, and the probability of error is zero. Hence, if we let ∆_AdaptiveSH be the budget necessary to achieve a given accuracy ε under the spammer-hammer model, then this implies that E[∆_AdaptiveSH] ≥ 1/q.

2.5 Discussion

Here we discuss several implications of our result and possible future research directions in generalizing the model studied in this paper.

First, we provide a performance bound in the below-threshold regime l̂r̂q² < 1. Notice that the bound in (2) is only meaningful when it is less than a half, whence l̂r̂q² > 1 and lq > 6 log(2) > 4. While as a taskmaster the case of l̂r̂q² < 1 may not be of interest, for the purpose of completeness we comment on the performance of our algorithm in this regime. Specifically, we empirically observe that the error rate increases as the number of iterations k increases. Therefore, it makes sense to use k = 1, in which case the algorithm essentially boils down to the majority rule. We can prove the following error bound.

Lemma 2.7. For any value of l, r and µ, the estimates we get after the first step of our algorithm achieve
$$
\lim_{m\to\infty} \frac{1}{m} \sum_{i=1}^{m} \mathbb{P}\big(s_i \neq \hat{s}_i(\{A_{ij}\}_{(i,j)\in E})\big) \;\le\; e^{-l\mu^2/4}\,. \qquad (6)
$$

Next, consider the variation where we ask questions to workers whose answers are already known in order to assess the quality of the workers (also known as ‘gold standard units’). There are two ways we can use this information. First, we can place ‘seed gold units’ along with the standard tasks, and use these ‘seed gold units’ in turn to perform more informed inference. However, we can show that there is no gain in using such ‘seed gold units’. The optimal lower bound of (1/q) log(1/ε) essentially utilizes the existence of an oracle that can identify the reliability of every worker exactly, i.e. the oracle has a lot more information than what can be gained by such qualifying questions. Therefore, clearly ‘seed gold units’ do not help the oracle estimator, and hence the order-optimality of our approach still holds even if we include all the strategies that can utilize these ‘seed gold units’.

Second, we can use ‘pilot gold units’ as qualifying or pilot questions that the workers must complete to qualify to participate. Typically a taskmaster does not have to pay for these qualifying questions, and this provides an effective way to increase the quality of the participating workers. Our approach can benefit from such ‘pilot gold units’, which have the effect of increasing the effective collective quality of the crowd q. Further, if we can ‘measure’ how the distribution of workers changes when using pilot questions, then our main result fully describes how much we can gain by such pilot questions. In any case, pilot questions only change the distribution of participating workers, and the order-optimality of our approach still holds even if we compare all the schemes that use the same pilot questions.

Next, we consider workers with different reliabilities and their prices or payments. Specifically, suppose there are K classes of worker: workers of class k, 1 ≤ k ≤ K, have reliability distribution parameter q_k and each of them requires a payment of c_k to perform a task. Now our optimality result suggests that the per-task cost scales as (c_k/q_k) log(1/ε) if we only used workers of class k. More generally, if we use a mix of these workers, say a fraction α_k of workers from class k, with Σ_k α_k = 1, then the effective parameter is q = Σ_k α_k q_k. Subject to this, the optimal per-task cost scales as (Σ_k α_k c_k)/(Σ_k α_k q_k) · log(1/ε). This immediately suggests that the optimal choice of fractions α_k must be such that α_k > 0 only if c_k/q_k = min_ℓ c_ℓ/q_ℓ. That is, the optimal choice is to select workers only from the classes that have the minimal ratio c_k/q_k over 1 ≤ k ≤ K.

We now discuss the assumed knowledge of q in selecting the degree l = Θ((1/q) log(1/ε)) in the design of the regular bipartite graph that achieves a given target error rate. Here is a simple way to overcome this limitation at the loss of only an additional constant factor, i.e. the scaling of cost per task still remains Θ((1/q) log(1/ε)). To that end, consider an incremental design in which at step a the system is designed assuming q = 2^{−a} for a ≥ 1. At step a, we design two replicas of the task allocation for q = 2^{−a}. Now compare the estimates obtained by these two replicas for all m tasks. If they agree on at least m(1 − 2ε) tasks, then we stop and declare that as the final answer. Otherwise, we increase a to a + 1 and repeat. Note that by our optimality result, it follows that if 2^{−a} is less than the actual q then the iteration must stop with high probability. Therefore, the total cost paid is Θ((1/q) log(1/ε)) with high probability. Thus, even lack of knowledge of q does not affect the optimality of our algorithm.
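A sketch of this doubling procedure is below; the `run_replica` callback, which should build a fresh allocation sized for the guessed q and run the inference algorithm on fresh workers, is a hypothetical placeholder supplied by the caller:

```python
import numpy as np

def estimate_without_q(run_replica, m, eps, a_max=20):
    """Doubling trick for unknown q: at step a, guess q = 2**(-a), run two
    independent replicas of (allocation + inference) sized for that guess, and
    stop when the two replicas agree on at least m*(1 - 2*eps) tasks.
    run_replica(q_guess) must return a length-m vector of +/-1 estimates."""
    for a in range(1, a_max + 1):
        q_guess = 2.0 ** (-a)
        s1, s2 = run_replica(q_guess), run_replica(q_guess)
        if np.sum(s1 == s2) >= m * (1.0 - 2.0 * eps):
            return s1, q_guess          # accept; total cost is O((1/q) log(1/eps)) w.h.p.
    return s1, q_guess                  # fall back to the last attempt
```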

Finally, we consider possible generalizations of our model. The model assumed in this paper does not capture several factors: tasks with different levels of difficulty, or workers who always answer positive or negative. It is desirable to characterize how our algorithm works under such a more general model.

In general, the response of a worker j to a binary question i may depend on several factors: (i) the correct answer to the task; (ii) the difficulty of the task; (iii) the expertise or the reliability of the worker; (iv) the bias of the worker towards positive or negative answers. Let s_i ∈ {+1, −1} represent the correct answer and r_i ∈ [0, ∞] the level of difficulty of task i. Here r_i = 0 means that the task is so easy that any worker can find the correct answer, and r_i = ∞ means that the task is so difficult that even the most diligent worker cannot resolve which is the correct answer. Let α_j ∈ [−∞, ∞] represent the reliability and β_j ∈ (−∞, ∞) the bias of worker j. Here, α_j = ∞ means that the worker answers all tasks correctly, α_j = −∞ means that the worker answers all tasks incorrectly, and α_j = 0 means that the worker gives random answers independent of what the correct answer is. Further, β_j = ∞ or −∞ means that the worker always answers positive or negative, respectively, and β_j = 0 means that the worker is unbiased and makes errors equally likely whether the correct answer is positive or negative. In formulas, a worker j’s response to a binary task i can be modeled as

$$
A_{ij} = \mathrm{sign}(Z_{i,j})\,,
$$
where Z_{i,j} is a Gaussian random variable distributed as Z_{i,j} ∼ N(α_j s_i + β_j, r_i), and it is understood that sign(Z) is an unbiased binary random variable for Z ∼ N(1, ∞), and sign(Z) = 1 almost surely for Z ∼ N(∞, 1). Except for the fact that we consider a random variable Z_{i,j} instead of a random vector, this is equivalent to the model studied in [WBBP10].

Most of the models studied in the crowdsourcing literature can be reduced to a special case of this model. For example, the early crowdsourcing model introduced by Dawid and Skene [DS79] is equivalent to the above Gaussian model with r_i = 1. More recently, Whitehill et al. [WRW+09] introduced another model where P(A_ij = s_i | a_i, b_j) = 1/(1 + e^{−a_i b_j}), with worker reliability a_i and task difficulty b_j. This is again a special case of the above Gaussian model if we set β_j = 0. The model we study in this paper has an underlying assumption that all the tasks share an equal level of difficulty and the workers are unbiased. It is equivalent to the above Gaussian model with β_j = 0 and r_i = 1. In this case, there is a one-to-one relation between the worker reliability p_j and α_j: p_j = Q(α_j), where Q(·) is the tail probability of the standard Gaussian distribution.
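A minimal simulation of this generalized model (a sketch; the degenerate limits r_i = ∞ and α_j = ±∞ discussed above are not handled, and the names are ours):

```python
import numpy as np

def simulate_general_model(s, alpha, beta, r, assignment, rng=None):
    """Simulate A_ij = sign(Z_ij) with Z_ij ~ N(alpha_j * s_i + beta_j, r_i);
    entries outside the assignment are set to 0. Note numpy's normal() takes a
    standard deviation, hence sqrt(r_i) for variance r_i."""
    rng = np.random.default_rng(rng)
    mean = alpha[None, :] * s[:, None] + beta[None, :]
    Z = rng.normal(mean, np.sqrt(r)[:, None])
    return np.sign(Z).astype(int) * assignment
```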

The role of the bias of the workers is also important in correctly identifying the quality of the workers, in order to selectively pay the workers based on their performance. Ipeirotis, Provost, and Wang studied how to separate the true error rate from the biases that some workers exhibit in order to obtain a better evaluation of the workers’ quality in [IPW10].

2.6 Relations to low-rank matrix approximation

The leading singular vectors are often used to capture the important aspects of datasets in matrix form. In our case, the leading left singular vector of A can be used to estimate the correct answers, where A ∈ {0, ±1}^{m×n} is the m × n adjacency matrix of the graph G weighted by the submitted answers. One way to compute the leading singular vector is through power iteration: for two vectors u ∈ R^m and v ∈ R^n, starting with a randomly initialized v, power iteration iteratively updates u and v according to
$$
u_i = \sum_{j\in\partial i} A_{ij}\, v_j \;\;\text{ for all } i\,, \qquad v_j = \sum_{i\in\partial j} A_{ij}\, u_i \;\;\text{ for all } j\,.
$$

It is known that normalized u (and v) converges linearly to the leading left (and right) singular vector. This update rule is very similar to that of our iterative algorithm. But there is one difference that is crucial in the analysis: in our algorithm we follow the framework of the celebrated belief propagation algorithm [Pea88, YFW03] and exclude the incoming message from node j when computing an outgoing message to j. This extrinsic nature of our algorithm and the locally tree-like structure of sparse random graphs [RU08, MM09] allow us to perform sharp analysis of the average error rate. In particular, if we use the leading singular vector of A to estimate s, such that ŝ_i = sign(u_i), then existing analysis techniques from random matrix theory do not give the strong performance guarantee we have (cf. [KOS11, Theorem II.1]). These techniques typically focus on understanding how the subspace spanned by the top singular vector behaves. To get the sharp bound in Theorem 2.1, we need to analyze how each entry of the leading singular vector is distributed. We introduce the iterative algorithm in order to precisely characterize how each of the decision variables x_i is distributed. Since the iterative algorithm introduced in this paper is quite similar to power iteration used to compute the leading singular vectors, this suggests that our analysis may shed light on how to analyze the top singular vectors of a sparse random matrix.
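For comparison, here is a sketch of the singular-vector estimator computed by plain power iteration (numpy assumed; numpy.linalg.svd would serve equally well, and the sign-fixing heuristic is our own reading of the µ > 0 assumption):

```python
import numpy as np

def singular_vector_estimate(A, n_iter=30, rng=None):
    """Estimate s from the leading left singular vector of the weighted adjacency
    matrix A via plain power iteration (no exclusion of the incoming message)."""
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=float)
    v = rng.normal(size=A.shape[1])
    for _ in range(n_iter):
        u = A @ v
        u /= np.linalg.norm(u)
        v = A.T @ u
        v /= np.linalg.norm(v)
    # Resolve the global sign ambiguity using mu = E[2p - 1] > 0: the right singular
    # vector aligns with the worker reliabilities (2p_j - 1), whose mean is positive.
    if v.sum() < 0:
        v = -v
    return np.where(A @ v >= 0, 1, -1)
```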

3 Proof of main results

In this section, we provide proofs of the main results.

3.1 Proof of Theorem 2.1

By symmetry, we can assume all s_i’s are +1. If I is a random integer drawn uniformly in [m], then
$$
\frac{1}{m} \sum_{i\in[m]} \mathbb{P}(s_i \neq \hat{s}_i) \;=\; \mathbb{P}(s_I \neq \hat{s}_I) \;=\; \mathbb{P}\big(x^{(k)}_I < 0\big) + \frac{1}{2}\,\mathbb{P}\big(x^{(k)}_I = 0\big) \;\le\; \mathbb{P}\big(x^{(k)}_I \le 0\big)\,,
$$

where $x_i^{(k)}$ denotes the decision variable for task $i$ after $k$ iterations of the iterative algorithm. Asymptotically, for fixed $k$, $l$, and $r$, the local neighborhood of $x_I^{(k)}$ converges to a regular tree. To analyze $\lim_{m\to\infty} P(x_I^{(k)} \leq 0)$, we use a standard probabilistic analysis technique known as 'density evolution' in coding theory or 'recursive distributional equations' in probabilistic combinatorics [RU08, MM09]. Precisely, in the large system limit we have

\[
\lim_{m\to\infty} P\big(x_I^{(k)} \leq 0\big) \;=\; P\big(\hat{x}^{(k)} \leq 0\big) \,, \qquad (7)
\]
where $\hat{x}^{(k)}$ is defined through the density evolution equations (8) and (9) in the following.

Density evolution. In the large system limit as $m \to \infty$, the $(l, r)$-regular random graph converges locally in distribution to an $(l, r)$-regular tree. Therefore, for a randomly chosen edge $(i, j)$, the messages $x_{i\to j}$ and $y_{j\to i}$ converge in distribution to $x$ and $y_{p_j}$ defined in the density evolution equations (8) below. Here and after, we drop the superscript $k$ denoting the iteration number whenever it is clear from the context. We initialize $y_p$ with a Gaussian distribution independent of $p$: $y_p^{(0)} \sim \mathcal{N}(1, 1)$. Let $\stackrel{d}{=}$ denote equality in distribution. Then, for $k \in \{1, 2, \ldots\}$,

\[
x^{(k)} \;\stackrel{d}{=}\; \sum_{i\in[l-1]} z_{p_i,i}\, y_{p_i,i}^{(k-1)} \,, \qquad\quad
y_p^{(k)} \;\stackrel{d}{=}\; \sum_{j\in[r-1]} z_{p,j}\, x_j^{(k)} \,, \qquad (8)
\]

where the $x_j$'s, $p_i$'s, and $y_{p,i}$'s are independent copies of $x$, $p$, and $y_p$, respectively. Also, the $z_{p,i}$'s and $z_{p,j}$'s are independent copies of $z_p$. Here $p \in [0, 1]$ is a random variable distributed according to the distribution of the worker quality. The $z_{p,j}$'s and $x_j$'s are independent, and the $z_{p_i,i}$'s and $y_{p_i,i}$'s are conditionally independent given $p_i$. Finally, $z_p$ is a random variable distributed as
\[
z_p \;=\;
\begin{cases}
+1 & \text{with probability } p \,, \\
-1 & \text{with probability } 1 - p \,.
\end{cases}
\]

Then, for a randomly chosen $I$, the decision variable $x_I^{(k)}$ converges in distribution to
\[
\hat{x}^{(k)} \;\stackrel{d}{=}\; \sum_{i\in[l]} z_{p_i,i}\, y_{p_i,i}^{(k-1)} \,. \qquad (9)
\]

Numerically or analytically computing the densities in (8) exactly is not computationally feasible when the messages take continuous values, as is the case for our algorithm. Typically, heuristics are used to approximate the densities, such as quantizing the messages, approximating the density with simple functions, or sampling from the density with Monte Carlo methods. A novel contribution of our analysis is that we prove, by recursion, that the messages are sub-Gaussian, and we provide an upper bound on the sub-Gaussian parameters in closed form. This allows us to prove a sharp error bound that decays exponentially.
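For instance, the Monte Carlo route can be sketched as follows; the spammer-hammer distribution for $p$ and the parameter values are our own illustrative choices, not part of the analysis:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_p(shape, q=0.3):
        """Example worker-quality distribution (spammer-hammer): p = 1 w.p. q, p = 1/2 w.p. 1 - q."""
        return np.where(rng.random(shape) < q, 1.0, 0.5)

    def sample_decision_variable(l, r, k, n=100_000):
        """Monte Carlo samples of the decision variable in (9), using the recursion (8)."""
        p = sample_p(n)
        y = rng.normal(1.0, 1.0, n)                        # (p, y_p^(0)) pairs; y^(0) independent of p
        for _ in range(k - 1):
            # x^(t): sum of l-1 terms z_{p_i} y_{p_i}^(t-1), with z and y sharing the same p_i
            idx = rng.integers(0, n, size=(n, l - 1))
            z = np.where(rng.random((n, l - 1)) < p[idx], 1.0, -1.0)
            x = (z * y[idx]).sum(axis=1)
            # y_p^(t): sum of r-1 independent terms z_p x_j, for a fresh worker quality p
            p = sample_p(n)
            idx = rng.integers(0, n, size=(n, r - 1))
            z = np.where(rng.random((n, r - 1)) < p[:, None], 1.0, -1.0)
            y = (z * x[idx]).sum(axis=1)
        idx = rng.integers(0, n, size=(n, l))              # decision variable: l terms, not l-1
        z = np.where(rng.random((n, l)) < p[idx], 1.0, -1.0)
        return (z * y[idx]).sum(axis=1)

    x_hat = sample_decision_variable(l=15, r=15, k=3)
    print("estimated error rate:", (x_hat <= 0).mean())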

Mean and variance computation. To give intuition on how the messages behave, we describe the evolution of the mean and the variance of the random variables in (8). Write $\hat{l} \equiv l - 1$ and $\hat{r} \equiv r - 1$, and define $m^{(k)} \equiv E[x^{(k)}]$, $m_p^{(k)} \equiv E[y_p^{(k)} \mid p]$, $v^{(k)} \equiv \mathrm{Var}(x^{(k)})$, and $v_p^{(k)} \equiv \mathrm{Var}(y_p^{(k)} \mid p)$, where $p$ is a random variable distributed according to the worker quality distribution. Then, from (8) we get that

\[
\begin{aligned}
m^{(k)} &= \hat{l}\, E_p\big[(2p-1)\, m_p^{(k-1)}\big] \,, \\
m_p^{(k)} &= \hat{r}\, (2p-1)\, m^{(k)} \,, \\
v^{(k)} &= \hat{l}\, \Big( E_p\big[\, v_p^{(k-1)} + \big(m_p^{(k-1)}\big)^2 \big] - \big( E_p\big[(2p-1)\, m_p^{(k-1)}\big] \big)^2 \Big) \,, \\
v_p^{(k)} &= \hat{r}\, \Big( v^{(k)} + \big(m^{(k)}\big)^2 - \big((2p-1)\, m^{(k)}\big)^2 \Big) \,.
\end{aligned}
\]

Recall that $\mu = E[2p-1]$ and $q = E[(2p-1)^2]$. Substituting $m_p$ and $v_p$, we get the following evolution of the first and second moments of the random variable $x^{(k)}$:
\[
m^{(k+1)} \;=\; \hat{l}\hat{r} q\, m^{(k)} \,, \qquad
v^{(k+1)} \;=\; \hat{l}\hat{r}\, v^{(k)} + \hat{l}\hat{r}\, \big(m^{(k)}\big)^2 (1-q)(1+\hat{r}q) \,.
\]

Since $m_p^{(0)} = 1$ and $v^{(0)} = 1$ as per our assumption, we have $m^{(1)} = \mu\hat{l}$ and $v^{(1)} = \hat{l}(4-\mu^2)$. This implies that $m^{(k)} = \mu\hat{l}(\hat{l}\hat{r}q)^{k-1}$ and $v^{(k)} = a\, v^{(k-1)} + b\, c^{k-2}$, with $a = \hat{l}\hat{r}$, $b = \mu^2 \hat{l}^3 \hat{r}\,(1-q)(1+\hat{r}q)$, and $c = (\hat{l}\hat{r}q)^2$. After some algebra, it follows that $v^{(k)} = v^{(1)} a^{k-1} + b\, c^{k-2} \sum_{\ell=0}^{k-2} (a/c)^{\ell}$. For $\hat{l}\hat{r}q^2 > 1$, we have $a/c < 1$ and
\[
v^{(k)} \;=\; \hat{l}(4-\mu^2)(\hat{l}\hat{r})^{k-1} \;+\; (1-q)(1+\hat{r}q)\,\mu^2 \hat{l}^2 (\hat{l}\hat{r}q)^{2k-2}\; \frac{1 - 1/(\hat{l}\hat{r}q^2)^{k-1}}{\hat{l}\hat{r}q^2 - 1} \,.
\]

The first and second moments of the decision variable $\hat{x}^{(k)}$ in (9) can be computed by a similar analysis: $E[\hat{x}^{(k)}] = (l/\hat{l})\, m^{(k)}$ and $\mathrm{Var}(\hat{x}^{(k)}) = (l/\hat{l})\, v^{(k)}$. In particular, we have
\[
\frac{\mathrm{Var}(\hat{x}^{(k)})}{E[\hat{x}^{(k)}]^2}
\;=\; \frac{\hat{l}(4-\mu^2)}{l\,\hat{l}\,\mu^2 (\hat{l}\hat{r}q^2)^{k-1}}
\;+\; \frac{\hat{l}(1-q)(1+\hat{r}q)}{l\,(\hat{l}\hat{r}q^2 - 1)}
\left(1 - \frac{1}{(\hat{l}\hat{r}q^2)^{k-1}}\right) .
\]
Applying Chebyshev's inequality, it immediately follows that $P(\hat{x}^{(k)} < 0)$ is bounded by the right-hand side of the above equality. This bound is weaker than the bound in Theorem 2.1.
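For concreteness, the following snippet (with illustrative parameter values, not from the paper) iterates the moment recursion above and evaluates the resulting Chebyshev bound:

    def chebyshev_bound(l, r, q, mu, k):
        """Iterate the moment recursion above and return Var(xhat^(k)) / E[xhat^(k)]^2,
        which upper bounds P(xhat^(k) < 0) by Chebyshev's inequality."""
        lh, rh = l - 1, r - 1
        m, v = mu * lh, lh * (4 - mu**2)                  # m^(1) and v^(1)
        for _ in range(k - 1):
            m, v = lh * rh * q * m, lh * rh * v + lh * rh * (1 - q) * (1 + rh * q) * m**2
        return ((l / lh) * v) / ((l / lh) * m) ** 2

    # e.g. a spammer-hammer crowd with 30% hammers (q = mu = 0.3) and l = r = 15
    print(chebyshev_bound(l=15, r=15, q=0.3, mu=0.3, k=5))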

In the following, we prove the stronger result using the sub-Gaussianity of the $x^{(k)}$'s. But first, let us analyze what this weaker bound gives in different regimes of $l$, $r$, and $q$; this indicates that the messages exhibit fundamentally different behavior in regimes separated by a phase transition at $\hat{l}\hat{r}q^2 = 1$.

In a 'good' regime where $\hat{l}\hat{r}q^2 > 1$, the bound converges to a finite limit as the number of iterations $k$ grows. Namely,
\[
\lim_{k\to\infty} P\big(\hat{x}^{(k)} < 0\big) \;\leq\; \frac{\hat{l}\,(1-q)(1+\hat{r}q)}{l\,(\hat{l}\hat{r}q^2 - 1)} \,.
\]

In the case when $\hat{l}\hat{r}q^2 < 1$, the same analysis gives
\[
\frac{\mathrm{Var}(\hat{x}^{(k)})}{E[\hat{x}^{(k)}]^2} \;=\; e^{\Theta(k)} \,.
\]
Finally, when $\hat{l}\hat{r}q^2 = 1$, we have $a = c$, whence $v^{(k)} = v^{(1)} a^{k-1} + b\, c^{k-2}(k-1)$, which implies
\[
\frac{\mathrm{Var}(\hat{x}^{(k)})}{E[\hat{x}^{(k)}]^2} \;=\; \Theta(k) \,.
\]

Analyzing the density. Our strategy for providing an upper bound on $P(\hat{x}^{(k)} \leq 0)$ is to show that $\hat{x}^{(k)}$ is sub-Gaussian with appropriate parameters and then apply the Chernoff bound. A random variable $z$ with mean $m$ is said to be sub-Gaussian with parameter $\sigma$ if for all $\lambda \in \mathbb{R}$ the following inequality holds:
\[
E\big[e^{\lambda z}\big] \;\leq\; e^{m\lambda + \frac{1}{2}\sigma^2\lambda^2} \,.
\]
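For later use, recall how such a bound on the moment generating function yields a tail estimate via the Chernoff argument (a standard step, spelled out here only for completeness): for any $\lambda < 0$,
\[
P(z \leq 0) \;=\; P\big(e^{\lambda z} \geq 1\big) \;\leq\; E\big[e^{\lambda z}\big] \;\leq\; e^{m\lambda + \frac{1}{2}\sigma^2\lambda^2} \,,
\]
and, when $m > 0$, choosing $\lambda = -m/\sigma^2$ gives $P(z \leq 0) \leq e^{-m^2/(2\sigma^2)}$. This is the step used to obtain (11) below, apart from the restriction on the range of $\lambda$.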

Define
\[
\sigma_k^2 \;\equiv\; 2\hat{l}\,(\hat{l}\hat{r})^{k-1} \;+\; \mu^2 \hat{l}^3 \hat{r}\,(3q\hat{r}+1)\,(q\hat{l}\hat{r})^{2k-4}\; \frac{1 - \big(1/(q^2\hat{l}\hat{r})\big)^{k-1}}{1 - 1/(q^2\hat{l}\hat{r})} \,,
\]
and $m_k \equiv \mu\hat{l}(q\hat{l}\hat{r})^{k-1}$ for $k \in \mathbb{Z}$. We will first show that $x^{(k)}$ is sub-Gaussian with mean $m_k$ and parameter $\sigma_k^2$ for the regime of $\lambda$ we are interested in. Precisely, we will show that, for $|\lambda| \leq 1/(2 m_{k-1}\hat{r})$,
\[
E\big[e^{\lambda x^{(k)}}\big] \;\leq\; e^{m_k\lambda + \frac{1}{2}\sigma_k^2\lambda^2} \,. \qquad (10)
\]

By definition, due to distributional independence, we have $E\big[e^{\lambda \hat{x}^{(k)}}\big] = \big(E\big[e^{\lambda x^{(k)}}\big]\big)^{l/\hat{l}}$. Therefore, it follows from (10) that $\hat{x}^{(k)}$ satisfies $E\big[e^{\lambda \hat{x}^{(k)}}\big] \leq e^{(l/\hat{l}) m_k \lambda + (l/(2\hat{l}))\sigma_k^2 \lambda^2}$. Applying the Chernoff bound with $\lambda = -m_k/\sigma_k^2$, we get
\[
P\big(\hat{x}^{(k)} \leq 0\big) \;\leq\; E\big[e^{\lambda \hat{x}^{(k)}}\big] \;\leq\; e^{-l\, m_k^2/(2\hat{l}\,\sigma_k^2)} \,. \qquad (11)
\]
Since $m_k m_{k-1}/\sigma_k^2 \leq \mu^2\hat{l}^2 (q\hat{l}\hat{r})^{2k-3} / \big(3\mu^2 q \hat{l}^3\hat{r}^2 (q\hat{l}\hat{r})^{2k-4}\big) = 1/(3\hat{r})$, it is easy to check that $|\lambda| \leq 1/(2 m_{k-1}\hat{r})$. Substituting (11) in (7) finishes the proof of Theorem 2.1.
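As a numerical sanity check (with illustrative parameters of our choosing), one can iterate the recursion for $m_k$ and $\sigma_k^2$ (given in closed form above and derived below) and evaluate the bound (11):

    import math

    def theorem_2_1_style_bound(l, r, q, mu, k):
        """Iterate m_{k+1} = q*lh*rh*m_k and s2_{k+1} = (3*q*lh*rh**2 + lh*rh)*m_k**2 + lh*rh*s2_k,
        starting from m_1 = mu*lh and s2_1 = 2*lh, and return exp(-l*m_k**2 / (2*lh*s2_k))."""
        lh, rh = l - 1, r - 1
        m, s2 = mu * lh, 2 * lh
        for _ in range(k - 1):
            m, s2 = q * lh * rh * m, (3 * q * lh * rh**2 + lh * rh) * m**2 + lh * rh * s2
        return math.exp(-l * m**2 / (2 * lh * s2))

    print(theorem_2_1_style_bound(l=25, r=25, q=0.5, mu=0.5, k=10))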

Now we are left to prove that $x^{(k)}$ is sub-Gaussian with appropriate parameters. We can write down a recursive formula for the evolution of the moment generating functions of $x$ and $y_p$ as
\[
E\big[e^{\lambda x^{(k)}}\big] \;=\; \Big( E_p\big[\, p\, E\big[e^{\lambda y_p^{(k-1)}} \,\big|\, p\big] + \bar{p}\, E\big[e^{-\lambda y_p^{(k-1)}} \,\big|\, p\big] \,\big] \Big)^{\hat{l}} \,, \qquad (12)
\]
\[
E\big[e^{\lambda y_p^{(k)}}\big] \;=\; \Big( p\, E\big[e^{\lambda x^{(k)}}\big] + \bar{p}\, E\big[e^{-\lambda x^{(k)}}\big] \Big)^{\hat{r}} \,, \qquad (13)
\]

where $\bar{p} = 1 - p$. We can prove that these are sub-Gaussian using induction.

First, for $k = 1$, we show that $x^{(1)}$ is sub-Gaussian with mean $m_1 = \mu\hat{l}$ and parameter $\sigma_1^2 = 2\hat{l}$, where $\mu \equiv E[2p-1]$. Since $y_p^{(0)}$ is initialized as Gaussian with unit mean and variance, we have $E\big[e^{\lambda y_p^{(0)}}\big] = e^{\lambda + \frac{1}{2}\lambda^2}$ regardless of $p$. Substituting this into (12), we get, for any $\lambda$,
\[
E\big[e^{\lambda x^{(1)}}\big] \;=\; \Big( E[p]\, e^{\lambda} + \big(1 - E[p]\big)\, e^{-\lambda} \Big)^{\hat{l}} e^{\frac{1}{2}\hat{l}\lambda^2} \;\leq\; e^{\hat{l}\mu\lambda + \hat{l}\lambda^2} \,, \qquad (14)
\]
where the inequality follows from the fact that $a e^{z} + (1-a) e^{-z} \leq e^{(2a-1)z + \frac{1}{2}z^2}$ for any $z \in \mathbb{R}$ and $a \in [0,1]$ (cf. [AS08, Lemma A.1.5]).

Next, assuming $E\big[e^{\lambda x^{(k)}}\big] \leq e^{m_k\lambda + \frac{1}{2}\sigma_k^2\lambda^2}$ for $|\lambda| \leq 1/(2m_{k-1}\hat{r})$, we show that $E\big[e^{\lambda x^{(k+1)}}\big] \leq e^{m_{k+1}\lambda + \frac{1}{2}\sigma_{k+1}^2\lambda^2}$ for $|\lambda| \leq 1/(2m_k\hat{r})$, and compute the appropriate $m_{k+1}$ and $\sigma_{k+1}^2$. Substituting the bound $E\big[e^{\lambda x^{(k)}}\big] \leq e^{m_k\lambda + \frac{1}{2}\sigma_k^2\lambda^2}$ in (13), we get $E\big[e^{\lambda y_p^{(k)}}\big] \leq \big(p\, e^{m_k\lambda} + \bar{p}\, e^{-m_k\lambda}\big)^{\hat{r}} e^{\frac{1}{2}\hat{r}\sigma_k^2\lambda^2}$. Further applying this bound in (12), we get
\[
E\big[e^{\lambda x^{(k+1)}}\big] \;\leq\; \Big( E_p\big[\, p\,\big(p\, e^{m_k\lambda} + \bar{p}\, e^{-m_k\lambda}\big)^{\hat{r}} + \bar{p}\,\big(p\, e^{-m_k\lambda} + \bar{p}\, e^{m_k\lambda}\big)^{\hat{r}} \,\big] \Big)^{\hat{l}}\, e^{\frac{1}{2}\hat{l}\hat{r}\sigma_k^2\lambda^2} \,. \qquad (15)
\]
To bound the first term on the right-hand side, we use the following key lemma.

Lemma 3.1. For any $|z| \leq 1/(2r)$ and any random variable $p \in [0,1]$ such that $q = E[(2p-1)^2]$, we have
\[
E_p\Big[\, p\,\big(p\, e^{z} + \bar{p}\, e^{-z}\big)^{r} + \bar{p}\,\big(p\, e^{-z} + \bar{p}\, e^{z}\big)^{r} \,\Big] \;\leq\; e^{qrz + \frac{1}{2}(3qr^2 + r) z^2} \,.
\]

Applying this inequality to (15), with $z = m_k\lambda$ and $r$ replaced by $\hat{r}$, gives
\[
E\big[e^{\lambda x^{(k+1)}}\big] \;\leq\; e^{q\hat{l}\hat{r}\, m_k \lambda + \frac{1}{2}\big((3q\hat{l}\hat{r}^2 + \hat{l}\hat{r})\, m_k^2 + \hat{l}\hat{r}\,\sigma_k^2\big)\lambda^2}
\]
for $|\lambda| \leq 1/(2 m_k \hat{r})$. In the regime where $q\hat{l}\hat{r} \geq 1$, as per our assumption, $m_k$ is non-decreasing in $k$. At iteration $k$, the above recursion holds for $|\lambda| \leq \min\{1/(2m_1\hat{r}), \ldots, 1/(2m_{k-1}\hat{r})\} = 1/(2m_{k-1}\hat{r})$. Hence, we get the following recursion for $m_k$ and $\sigma_k$ such that (10) holds for $|\lambda| \leq 1/(2m_{k-1}\hat{r})$:

\[
m_{k+1} \;=\; q\hat{l}\hat{r}\, m_k \,, \qquad
\sigma_{k+1}^2 \;=\; (3q\hat{l}\hat{r}^2 + \hat{l}\hat{r})\, m_k^2 + \hat{l}\hat{r}\, \sigma_k^2 \,.
\]
With the initialization $m_1 = \mu\hat{l}$ and $\sigma_1^2 = 2\hat{l}$, we have $m_k = \mu\hat{l}(q\hat{l}\hat{r})^{k-1}$ for $k \in \{1, 2, \ldots\}$ and $\sigma_k^2 = a\,\sigma_{k-1}^2 + b\, c^{k-2}$ for $k \in \{2, 3, \ldots\}$, with $a = \hat{l}\hat{r}$, $b = \mu^2\hat{l}^2(3q\hat{l}\hat{r}^2 + \hat{l}\hat{r})$, and $c = (q\hat{l}\hat{r})^2$. After some algebra, it follows that $\sigma_k^2 = \sigma_1^2 a^{k-1} + b\, c^{k-2} \sum_{\ell=0}^{k-2}(a/c)^{\ell}$. For $\hat{l}\hat{r}q^2 \neq 1$, we have $a/c \neq 1$, whence $\sigma_k^2 = \sigma_1^2 a^{k-1} + b\, c^{k-2}\big(1 - (a/c)^{k-1}\big)/(1 - a/c)$. This finishes the proof of (10).

3.2 Proof of Lemma 3.1

By the fact that $a e^{b} + (1-a) e^{-b} \leq e^{(2a-1)b + \frac{1}{2}b^2}$ for any $b \in \mathbb{R}$ and $a \in [0,1]$, we have $p\, e^{z} + \bar{p}\, e^{-z} \leq e^{(2p-1)z + \frac{1}{2}z^2}$ almost surely. Applying this inequality once again, we get
\[
E\Big[\, p\,\big(p\, e^{z} + \bar{p}\, e^{-z}\big)^{r} + \bar{p}\,\big(p\, e^{-z} + \bar{p}\, e^{z}\big)^{r} \,\Big]
\;\leq\; E\Big[ e^{(2p-1)^2 r z + \frac{1}{2}(2p-1)^2 r^2 z^2} \Big]\, e^{\frac{1}{2} r z^2} \,.
\]

Using the fact that $e^{a} \leq 1 + a + 0.63 a^2$ for $|a| \leq 5/8$,
\[
\begin{aligned}
E\Big[ e^{(2p-1)^2 r z + \frac{1}{2}(2p-1)^2 r^2 z^2} \Big]
&\;\leq\; E\Big[ 1 + (2p-1)^2 r z + \tfrac{1}{2}(2p-1)^2 r^2 z^2 + 0.63\big((2p-1)^2 r z + \tfrac{1}{2}(2p-1)^2 r^2 z^2\big)^2 \Big] \\
&\;\leq\; 1 + q r z + \tfrac{3}{2} q r^2 z^2 \\
&\;\leq\; e^{q r z + \frac{3}{2} q r^2 z^2} \,,
\end{aligned}
\]
for $|z| \leq 1/(2r)$. This proves Lemma 3.1.
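The elementary inequality $e^{a} \leq 1 + a + 0.63a^2$ for $|a| \leq 5/8$ used above is easy to check numerically (a quick sanity check, not part of the proof):

    import numpy as np

    a = np.linspace(-5/8, 5/8, 100_001)
    print(float(np.max(np.exp(a) - (1 + a + 0.63 * a**2))))   # non-positive, up to floating-point error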

3.3 Proof of Lemma 2.3

Each iteration of the iterative algorithm requires $O(ml)$ operations: to compute the messages $\{x_{i\to j}\}_{j\in\partial i}$, we can first compute $\sum_{j'\in\partial i} A_{ij'}\, y_{j'\to i}$ and then obtain each outgoing message in constant time by subtracting the corresponding incoming message, $x_{i\to j} = \sum_{j'\in\partial i} A_{ij'}\, y_{j'\to i} - A_{ij}\, y_{j\to i}$ for each $j \in \partial i$. We need $O\big(\log(q/\mu^2)/\log(\hat{l}\hat{r}q^2)\big)$ iterations to ensure an error bound which scales as (2). Using the definition of $\sigma_k$ in (1), notice that the relative error is bounded by
\[
\frac{\sigma_k^2 - \sigma_\infty^2}{\sigma_\infty^2} \;\leq\; \Big(\frac{q}{\mu^2}\Big)\big(\hat{l}\hat{r}q^2\big)^{-k+1} \,.
\]
For $\hat{l}\hat{r}q^2 > 1$, as per our assumption, $k = O\big(\log(q/\mu^2)/\log(\hat{l}\hat{r}q^2)\big)$ iterations suffice to ensure that the right-hand side is bounded by a constant.
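For example, under illustrative parameter values (ours, not the paper's), the iteration count prescribed by this bound is small:

    import math

    def iterations_for_constant_relative_error(l, r, q, mu):
        """Smallest k making (q / mu**2) * (lh*rh*q**2) ** (-(k - 1)) at most 1."""
        lh, rh = l - 1, r - 1
        assert lh * rh * q**2 > 1, "outside the regime assumed in Lemma 2.3"
        return 1 + max(0, math.ceil(math.log(q / mu**2) / math.log(lh * rh * q**2)))

    print(iterations_for_constant_relative_error(l=15, r=15, q=0.3, mu=0.3))   # 2 for these values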

3.4 Proof of Lemma 2.4

In this minimax scenario, we consider a distribution of the worker reliability from the spammer-hammer model, and prove a lower bound on the error achieved by an oracle estimator that knowsthe reliability of every worker.

Under the spammer-hammer model there are only two kinds of workers: a hammer with $p_j = 1$ always gives the right answer, and a spammer with $p_j = 1/2$ always gives a uniformly random answer. Each worker is a hammer with probability $q$ and a spammer with probability $1-q$, so that $E[(2p-1)^2] = q$.

Let us consider an oracle estimator that makes the optimal decision based on information provided by an oracle which reveals which workers are spammers and which are hammers. This oracle estimator only makes mistakes on tasks whose neighbors are all spammers, in which case it flips a fair coin to make a decision. If we denote the degree of task node $i$ by $l_i$, the error rate is $P(\hat{s}_i \neq s_i) = \frac{1}{2}(1-q)^{l_i}$.

No algorithm with access only to $\{A_{ij}\}$ can produce a more accurate estimate than this oracle estimator when nature chooses the worst-case $s$. Therefore, for any estimate $\hat{s}$ that is a function of $\{A_{ij}\}_{(i,j)\in E}$, we have the following minimax bound:

\[
\inf_{\mathrm{ALGO}} \;\sup_{s\in\{\pm 1\}^m,\; f \in \mathcal{F}(q)} \; d_m\big(s, \hat{s}_{G,\mathrm{ALGO}}\big) \;\geq\; \frac{1}{m}\sum_{i\in[m]} \frac{1}{2}(1-q)^{l_i} \,,
\]

for any graph $G$ that has $m$ task nodes with degree $l_i$ at node $t_i$. By convexity, the right-hand side is always at least $\frac{1}{2}(1-q)^{|E|/m}$, where $|E| = \sum_i l_i$. This proves the desired claim. Let us emphasize that this minimax lower bound is completely general and holds for any graph, regular or irregular.
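A quick Monte Carlo check of the oracle error rate used above, under the spammer-hammer model (the parameter values below are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def oracle_error_rate(l, q, n_tasks=200_000):
        """Workers are hammers w.p. q (always correct) and spammers otherwise (coin flips).
        The oracle errs only when all l workers assigned to a task are spammers,
        and then it guesses with a fair coin."""
        all_spammers = (rng.random((n_tasks, l)) >= q).all(axis=1)
        wrong_guess = rng.random(n_tasks) < 0.5
        return (all_spammers & wrong_guess).mean()

    l, q = 10, 0.3
    print(oracle_error_rate(l, q), 0.5 * (1 - q) ** l)   # the two values should be close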

3.5 Proof of Lemma 2.5

Now, consider the naive majority voting algorithm, which simply follows what the majority of workers agree on: $\hat{s}_i = \mathrm{sign}\big(\sum_{j\in\partial i} A_{ij}\big)$, where $\partial i$ denotes the neighborhood of node $i$ in the graph, and ties are broken at random. When there are many spammers in the crowd, majority voting is prone to making mistakes, since it gives the same weight to the estimates provided by spammers as to those of diligent workers.
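For reference, this baseline is a one-liner on a dense answer matrix (a sketch; ties are broken toward $+1$ here, rather than at random, for simplicity):

    import numpy as np

    def majority_vote(A):
        """A[i, j] in {+1, -1, 0}; return the majority estimate for each task (row)."""
        return np.where(A.sum(axis=1) >= 0, 1, -1)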

We want to compute a lower bound on $P(\hat{s}_i \neq s_i)$. Let $x_i = \sum_{j\in\partial i} A_{ij}$, where $\partial i$ denotes the neighborhood of task node $t_i$. Assuming $s_i = +1$ without loss of generality, the error rate is lower bounded by $P(x_i < 0)$. After rescaling, $\frac{1}{2}(x_i + l)$ is a standard binomial random variable $\mathrm{Binom}(l, \alpha)$, where $l$ is the number of neighbors of node $i$, $\alpha = E[p_j]$, and by assumption each $A_{ij}$ is $+1$ with probability $\alpha$.

It follows that $P(x_i = -l + 2k) = \binom{l}{k}\, \alpha^{k}(1-\alpha)^{l-k}$. Further, for $k \leq \alpha l - 1$, the probability mass function is monotonically increasing in $k$. Precisely,
\[
\frac{P(x_i = -l + 2(k+1))}{P(x_i = -l + 2k)} \;\geq\; \frac{\alpha(l-k)}{(1-\alpha)(k+1)} \;\geq\; \frac{\alpha(l - \alpha l + 1)}{(1-\alpha)\,\alpha l} \;>\; 1 \,,
\]
where we used the fact that the above ratio is decreasing in $k$, whence the minimum is achieved at $k = \alpha l - 1$ under our assumption.

First, we give an outline of the proof strategy, using a simple Gaussian approximation. In the limit as $l$ grows large, $x_i$ converges in distribution to a Gaussian random variable with mean $(2\alpha-1)l$ and variance $4l\alpha(1-\alpha)$. Note that the Gaussian pdf $g(z) = \big(1/\sqrt{2\pi \mathrm{Var}(x_i)}\big)\exp\big\{-\tfrac{1}{2}(z - E[x_i])^2/\mathrm{Var}(x_i)\big\}$ is monotonically increasing for $z \leq E[x_i]$. Then $P(x_i < 0)$ can be lower bounded by

\[
P(x_i < 0) \;\geq\; \sqrt{l}\; g\big(-\sqrt{l}\big)
\;=\; \frac{1}{\sqrt{8\pi\alpha(1-\alpha)}} \exp\Big\{ -\frac{\big((2\alpha-1)l + \sqrt{l}\big)^2}{8\, l\, \alpha(1-\alpha)} \Big\}
\;=\; \exp\Big\{ -C_1\, l\,(2\alpha-1)^2 + O\big(1 + \sqrt{l}\,(2\alpha-1)\big) \Big\} \,,
\]
for $0 < E[p_j] < 1$ and some constant $C_1$. This shows that, under the Gaussian approximation, we need $l \sim \log(1/\epsilon)/(2\alpha-1)^2$ to ensure that the probability of error is less than $\epsilon \in (0, 1/2)$.

Using a similar strategy, we can prove a concrete bound on $P(x_i < 0)$ for the binomial $x_i$. Let us assume that $l$ is even, so that $x_i$ takes even values.

(When $l$ is odd, the same analysis works, but $x_i$ takes odd values.) Our strategy is to use the simple bound $P(x_i < 0) \geq k\, P(x_i = -2k)$, which holds by the monotonicity established above together with the assumption $\alpha = E[p_j] \geq 1/2$. For the choice $k = \sqrt{l}$, the right-hand side closely approximates the error probability, as we saw in the Gaussian example. By the definition of $x_i$, it follows that

\[
P\big(x_i = -2\sqrt{l}\big) \;=\; \binom{l}{\,l/2 + \sqrt{l}\,}\; \alpha^{\,l/2 - \sqrt{l}}\, \big(1-\alpha\big)^{\,l/2 + \sqrt{l}} \,. \qquad (16)
\]

Applying Stirling's approximation, we can show that
\[
\binom{l}{\,l/2 + \sqrt{l}\,} \;\geq\; \frac{C_2}{\sqrt{l}}\, 2^{l} \,, \qquad (17)
\]
for some numerical constant $C_2$. We are interested in the case where the worker quality is low, that is, $\alpha$ is close to $1/2$. Accordingly, for the second and third terms in (16), we expand in terms of $2\alpha - 1$:

\[
\begin{aligned}
\log\Big( \alpha^{\,l/2-\sqrt{l}}\,(1-\alpha)^{\,l/2+\sqrt{l}} \Big)
&= \Big(\frac{l}{2} - \sqrt{l}\Big)\Big(\log\big(1 + (2\alpha-1)\big) - \log 2\Big)
 + \Big(\frac{l}{2} + \sqrt{l}\Big)\Big(\log\big(1 - (2\alpha-1)\big) - \log 2\Big) \\
&= -\,l\log 2 \;-\; \frac{l(2\alpha-1)^2}{2} \;-\; 2\sqrt{l}\,(2\alpha-1) \;+\; O\big(l(2\alpha-1)^4 + \sqrt{l}\,(2\alpha-1)^3\big) \,. \qquad (18)
\end{aligned}
\]

We can substitute (17) and (18) in (16) to get the following bound:
\[
P(x_i < 0) \;\geq\; \exp\big\{ -C_3\big( l(2\alpha-1)^2 + 1 \big) \big\} \,, \qquad (19)
\]
for some numerical constant $C_3$.

Now, let $l_i$ denote the degree of task node $i$. Then for any $s \in \{\pm 1\}^m$, any distribution of $p$ such that $\mu = E[2p-1] = 2\alpha - 1$, and any graph $G$ with $m$ task nodes, the following lower bound holds:

\[
d_m\big(s, \hat{s}_{G,\mathrm{Majority}}\big) \;\geq\; \frac{1}{m}\sum_{i=1}^{m} e^{-C_3(l_i\mu^2 + 1)} \,.
\]

By convexity, this implies
\[
d_m\big(s, \hat{s}_{G,\mathrm{Majority}}\big) \;\geq\; e^{-C_3\big((|E|/m)\mu^2 + 1\big)} \,.
\]

Under the spammer-hammer model, where $\mu = q$, this gives
\[
\inf_{G\in\mathcal{G}(m,l)} \;\sup_{s,\; f\in\mathcal{F}(q)} \; d_m\big(s, \hat{s}_{G,\mathrm{Majority}}\big) \;\geq\; e^{-C_3(l q^2 + 1)} \,.
\]
This finishes the proof of the lemma.

3.6 Proof of Lemma 2.6

To prove the minimax bound, let nature choose the following distribution of the workers, which we denote the $(a, q)$-model. Under this model, we assume that $q \leq a^2$ and
\[
p_j \;=\;
\begin{cases}
1/2 & \text{with probability } 1 - (q/a^2) \,, \\
\tfrac{1}{2}(1+a) & \text{with probability } q/a^2 \,,
\end{cases}
\]

such that $E[(2p_j - 1)^2] = q$. We assume that each task is chosen i.i.d. and is a positive task with probability $b_i$:
\[
s_i \;=\;
\begin{cases}
+1 & \text{with probability } b_i \,, \\
-1 & \text{with probability } 1 - b_i \,,
\end{cases}
\]

and assume $b_i > 1/2$ without loss of generality. We consider an oracle estimator that knows the reliability of every worker (all the $p_j$'s), the values of the $b_i$'s, the values of $a$ and $q$, and can choose any adaptive task allocation scheme. After we run this oracle estimator, let $L_i$ be the random variable representing the total number of workers assigned to task $t_i$. The oracle estimator knows all the values necessary to compute the error probability of each task; let $E_i$ be the random variable representing the error probability as computed by the oracle estimator, conditioned on the $L_i$ responses we get from the workers and their reliabilities $p_j$. We are interested in how large the average budget $(1/m)\sum_i E[L_i]$ has to be in order to achieve $(1/m)\sum_i E[E_i] \leq \epsilon$. In the following we will show that, for any task $t_i$ and independently of which task allocation scheme is used, it is necessary that
\[
E[L_i] \;\geq\; \frac{C}{q}\, \log\Big(\frac{1}{E[E_i]}\Big) \,, \qquad (20)
\]

for some constant $C$ that only depends on $b_i$. By convexity of $\log(1/x)$ and Jensen's inequality, this implies that
\[
\frac{1}{m}\sum_{i=1}^{m} E[L_i] \;\geq\; \frac{C}{q}\, \log\bigg(\frac{1}{(1/m)\sum_{i=1}^{m} E[E_i]}\bigg) \,.
\]

Hence, in order to achieve $(1/m)\sum_i P(\hat{s}_i \neq s_i) \leq \epsilon$, it is necessary for this oracle estimator to spend a budget which scales as $(1/q)\log(1/\epsilon)$ for each task. This finishes the proof of the lemma.

Now, we are left to prove (20). Focusing on a single task $t_i$, since spammers with $p_j = 1/2$ give us no information about the task, we only need to consider the responses from the reliable workers. Let us consider the case where, after getting responses from $l_i$ reliable workers (those with $p_j = \frac{1}{2}(1+a)$), the oracle estimator computes $\hat{s}_i$ according to an optimal majority rule based on the values of $b_i$ and $a$, and computes the probability of making an error. Although this probability of error depends on what the responses are, there is a simple way to compute a lower bound on it regardless of the actual responses we get.

In the best case, all reliable workers agree on the task $t_i$. Further, since we assumed that $b_i > 1/2$, the probability of making an error is smallest if all the reliable workers agree that the task is a positive task, i.e., that $s_i = +1$.

Then $P\big(s_i = +1 \text{ and all } l_i \text{ reliable workers respond } {+1}\big) = b_i\big(\tfrac{1+a}{2}\big)^{l_i}$ and $P\big(s_i = -1 \text{ and all } l_i \text{ reliable workers respond } {+1}\big) = (1-b_i)\big(\tfrac{1-a}{2}\big)^{l_i}$. Hence,
\[
\begin{aligned}
P\big(\text{error} \,\big|\, l_i \text{ responses from reliable workers}\big)
&\;\geq\; P\big(s_i = -1 \,\big|\, l_i \text{ reliable workers responded } {+1}\big) \\
&\;=\; \frac{(1-b_i)\big(\tfrac{1-a}{2}\big)^{l_i}}{(1-b_i)\big(\tfrac{1-a}{2}\big)^{l_i} + b_i\big(\tfrac{1+a}{2}\big)^{l_i}}
\;=\; \frac{1}{1 + \frac{b_i}{1-b_i}\Big(\frac{1+a}{1-a}\Big)^{l_i}} \,.
\end{aligned}
\]

For a given $l_i$, let $E_i = E\big[\mathbb{I}(\hat{s}_i \neq s_i) \,\big|\, l_i \text{ responses from reliable workers}\big]$ be the random variable representing the conditional error rate. The above inequality implies that
\[
E_i \;\geq\; \frac{1}{1 + \frac{b_i}{1-b_i}\Big(\frac{1+a}{1-a}\Big)^{l_i}}
\]

almost surely. Let $L_i$ be the number of workers we need to recruit in order to see $l_i$ reliable workers. Since on average we need to recruit $a^2/q$ workers to see one reliable worker, we have, almost surely,

\[
\begin{aligned}
E[L_i \mid l_i] \;=\; \frac{a^2}{q}\, l_i
&\;\geq\; \frac{1}{q}\, \frac{a^2}{\log\big((1+a)/(1-a)\big)}\, \log\bigg(\frac{(1-E_i)(1-b_i)}{E_i\, b_i}\bigg) \\
&\;\geq\; \frac{1}{q}\, \frac{a^2}{\log\big((1+a)/(1-a)\big)}\, \log\bigg(\frac{1-b_i}{2\, E_i\, b_i}\bigg) \,,
\end{aligned}
\]

where, in the last inequality, we used the fact that whenever the oracle estimator makes an estimate, the error rate is at most $1/2$ (a random guess already guarantees error probability $1/2$), so that $1 - E_i \geq 1/2$. Since this inequality holds almost surely and for all values of $l_i$, it also holds in expectation (using Jensen's inequality for the convex function $\log(1/x)$), so that

\[
E[L_i] \;\geq\; \frac{1}{q}\, \frac{a^2}{\log\big((1+a)/(1-a)\big)}\, \log\bigg(\frac{1-b_i}{2\, E[E_i]\, b_i}\bigg) \,.
\]

Maximizing over all choices of $a \in (0, 1)$, we get
\[
E[L_i] \;\geq\; \frac{0.29}{q}\, \log\bigg(\frac{1-b_i}{2\, E[E_i]\, b_i}\bigg) \,,
\]
which in particular holds with $a = 0.8$. For this choice of $a$, the result holds in the regime where $q \leq 0.64$, so that the probability of seeing a reliable worker under the $(a, q)$-model is at most one. For any $b_i$ bounded away from one, as per our assumption, this implies (20).
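The constant $0.29$ is easy to verify numerically by maximizing $a^2/\log\big((1+a)/(1-a)\big)$ over $a \in (0,1)$ (a quick check, not part of the proof):

    import numpy as np

    a = np.linspace(0.01, 0.99, 9_999)
    f = a**2 / np.log((1 + a) / (1 - a))
    print(a[f.argmax()], f.max())   # maximizer near a = 0.8, maximum value approximately 0.291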

3.7 Proof of Lemma 2.7

The probability of making an error after one iteration of our algorithm for node $i$ is $P(\hat{s}_i \neq s_i) = P(\hat{x}^{(1)} < 0)$, where $\hat{x}^{(1)}$ is defined in (9) and we assume without loss of generality that $s_i = 1$. From (14) and $E\big[e^{\lambda \hat{x}^{(1)}}\big] = \big(E\big[e^{\lambda x^{(1)}}\big]\big)^{l/\hat{l}}$, it follows that
\[
E\big[e^{\lambda \hat{x}^{(1)}}\big] \;\leq\; e^{l\mu\lambda + l\lambda^2} \;\leq\; e^{-\frac{1}{4} l \mu^2} \,,
\]
where we choose $\lambda = -\mu/2$. By the Chernoff bound, this implies the lemma.

References

[AS08] N. Alon and J. H. Spencer, The probabilistic method, John Wiley, 2008.

[Bol01] B. Bollobas, Random Graphs, Cambridge University Press, January 2001.

[DS79] A. P. Dawid and A. M. Skene, Maximum likelihood estimation of observer error-rates using the EM algorithm, Journal of the Royal Statistical Society, Series C (Applied Statistics) 28 (1979), no. 1, 20–28.

[IPW10] P. G. Ipeirotis, F. Provost, and J. Wang, Quality management on Amazon Mechanical Turk, Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP '10, ACM, 2010, pp. 64–67.

[JG03] R. Jin and Z. Ghahramani, Learning with multiple labels, Advances in Neural Information Processing Systems, 2003, pp. 921–928.

[KOS11] D. R. Karger, S. Oh, and D. Shah, Budget-optimal crowdsourcing using low-rank matrix approximations, Proc. of the Allerton Conf. on Commun., Control and Computing, 2011.

[MM09] M. Mezard and A. Montanari, Information, Physics, and Computation, Oxford University Press, Inc., New York, NY, USA, 2009.

[Pea88] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann Publ., San Mateo, California, 1988.

[RU08] T. Richardson and R. Urbanke, Modern Coding Theory, Cambridge University Press, March 2008.

[RYZ+10] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy, Learning from crowds, J. Mach. Learn. Res. 99 (2010), 1297–1322.

[SFB+95] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi, Inferring ground truth from subjective labelling of Venus images, Advances in Neural Information Processing Systems, 1995, pp. 1085–1092.

[SPI08] V. S. Sheng, F. Provost, and P. G. Ipeirotis, Get another label? Improving data quality and data mining using multiple, noisy labelers, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, ACM, 2008, pp. 614–622.

[WBBP10] P. Welinder, S. Branson, S. Belongie, and P. Perona, The multidimensional wisdom of crowds, Advances in Neural Information Processing Systems, 2010, pp. 2424–2432.

[WRW+09] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan, Whose vote should count more: Optimal integration of labels from labelers of unknown expertise, Advances in Neural Information Processing Systems, vol. 22, 2009, pp. 2035–2043.

[WS67] G. Wyszecki and W. S. Stiles, Color Science: Concepts and Methods, Quantitative Data and Formulae, Wiley-Interscience, 1967.

[YFW03] J. S. Yedidia, W. T. Freeman, and Y. Weiss, Understanding belief propagation and its generalizations, pp. 239–269, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
