
How To Grade a Test Without Knowing the Answers: A Bayesian Graphical Model for Adaptive Crowdsourcing and Aptitude Testing

    Yoram Bachrach

    Microsoft Research, Cambridge, UK

    Tom Minka

    Microsoft Research, Cambridge, UK

    John Guiver

    Microsoft Research, Cambridge, UK

    Thore Graepel

    Microsoft Research, Cambridge, UK


We propose a new probabilistic graphical model that jointly models the difficulties of questions, the abilities of participants and the correct answers to questions in aptitude testing and crowdsourcing settings. We devise an active learning/adaptive testing scheme based on a greedy minimization of expected model entropy, which allows a more efficient resource allocation by dynamically choosing the next question to be asked based on the previous responses. We present experimental results that confirm the ability of our model to infer the required parameters and demonstrate that the adaptive testing scheme requires fewer questions to obtain the same accuracy as a static test scenario.

1. Introduction

Collective decision making is a well-studied topic in social choice, voting and artificial intelligence. It has long been known that decisions based on aggregating the opinions of several agents can be of higher quality than those based on the opinions of single individuals. The Condorcet Jury Theorem (de Caritat et al., 1785), dating back to the 18th century, is concerned with a group of individuals attempting to reach a binary decision by majority vote; it is assumed that one of the

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

two outcomes of the vote is correct, and that each individual independently chooses the correct response with probability p. The theorem states that if p > 1/2, then adding more agents increases the probability of making the correct decision, and the probability that the collective decision will be correct approaches 1 in the limit of infinitely many participating agents.
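The theorem's guarantee is easy to check numerically. The sketch below (not from the paper; the function name is illustrative) computes the exact probability that a majority of n independent voters, each correct with probability p, reaches the right decision:

```python
from math import comb

def majority_correct_prob(n, p):
    """Probability that a majority of n independent voters, each correct
    with probability p, reaches the correct binary decision.
    Assumes n is odd, so there are no ties."""
    k_min = n // 2 + 1  # smallest winning majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

# With p > 1/2 the probability of a correct collective decision
# grows with the crowd size, as the Jury Theorem predicts.
for n in (1, 11, 101):
    print(n, round(majority_correct_prob(n, 0.6), 4))
```

Running this with p = 0.6 shows the probability climbing from 0.6 toward 1 as n grows; with p < 1/2 the trend reverses, which is the other half of Condorcet's observation.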

Recent technological advances make it easier to share opinions and knowledge, enabling us to harness the collective intelligence of crowds for solving tasks. Companies can use crowdsourcing to carry out business tasks, using platforms such as Amazon Mechanical Turk. Such services allow us to collect the opinions of many individuals, but leave open the question of how to aggregate the collected data to reach decisions.

A key technique for solving tasks using collective intelligence is to obtain information from multiple sources and aggregate it into a single complete solution. Consider a crowd of experts who are assigned many similar classification tasks, such as classifying many news articles according to their topic (politics, business, entertainment, etc.). We refer to each expert as a participant and to each classification task as a question. Suppose each participant expresses her opinion regarding the correct answer for each question, in the form of a response, chosen by her from the list of possible answers for that question. Similarly to Condorcet's Jury Theorem, we make the simplifying assumption that for each of the questions only one answer is correct. We call such a domain a multiple problem domain. Given the responses provided by the participants regarding the various questions in a multiple problem domain, how should we best determine the correct answer for each of the items? Which questions are easy and which are hard? How can we find the most competent participants in the crowd? Which questions best test the ability of a participant?
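As a point of reference for such a domain, consider plain unweighted majority voting over a hypothetical response matrix (the data and names below are illustrative, not from the paper); the model proposed here can be viewed as a refinement of this naive aggregator:

```python
from collections import Counter

# Hypothetical responses: responses[i][j] is participant i's chosen
# answer (e.g. "A"/"B"/"C") for question j in a multiple problem domain.
responses = [
    ["A", "B", "C", "A"],
    ["A", "C", "C", "A"],
    ["B", "B", "C", "A"],
]

def majority_answers(responses):
    """Naive aggregation: pick the most common response per question,
    treating all participants as equally reliable."""
    n_questions = len(responses[0])
    answers = []
    for j in range(n_questions):
        votes = Counter(row[j] for row in responses)
        answers.append(votes.most_common(1)[0][0])
    return answers

print(majority_answers(responses))  # ['A', 'B', 'C', 'A']
```

This baseline ignores exactly the quantities the paper infers jointly: how able each participant is and how difficult each question is.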

Given the correct answers to each of the items, it is easy to find the most competent participants, or differentiate the easy and hard questions: a participant who has answered almost all the questions correctly is likely to be more skilled than a participant who had very few correct responses. Typically, however, the correct answers are not known in advance; the whole point of crowdsourcing a classification task is to determine the correct classifications for the items.
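The first half of this observation is straightforward to implement. A small illustrative sketch (all names and data hypothetical) that scores participants by their fraction of correct responses and ranks them best-first:

```python
def rank_participants(responses, correct):
    """Given the correct answer per question, score each participant by
    the fraction of correct responses and return (index, score) pairs,
    ranked best-first."""
    scores = [
        sum(r == c for r, c in zip(row, correct)) / len(correct)
        for row in responses
    ]
    order = sorted(range(len(responses)), key=lambda i: -scores[i])
    return [(i, scores[i]) for i in order]

# Hypothetical example: three participants, four questions.
responses = [["A", "B", "C", "A"], ["A", "C", "C", "A"], ["B", "B", "C", "B"]]
correct = ["A", "B", "C", "A"]
print(rank_participants(responses, correct))  # [(0, 1.0), (1, 0.75), (2, 0.5)]
```

The difficulty, as the paragraph notes, is that `correct` is precisely what is unknown in a crowdsourcing setting.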

A possible solution to the above problem is to first evaluate the skill of each expert by asking her to provide responses to a set of items for which the correct answer is known (sometimes called a gold-set). A prominent example of this, where all correct answers are known, is intelligence testing (Anastasi et al., 1997). Psychologists have studied human intelligence, and designed IQ tests for evaluating the aptitude of individuals. These tests have been shown to be predictive of a person's performance in many domains, such as academic attainment and job success (Anastasi et al., 1997). Such tests typically ask participants to respond to questionnaires composed of many multiple-choice questions, and allow ranking participants according to their individual skill levels after examining the responses. The properties of responses to IQ tests have been widely studied by psychometricians, so such datasets can serve as a testing ground for exploring inference models for multiple problem domains.

Our Contribution: We propose a new family of graphical models for analyzing responses in multiple problem domains and evaluate the models on a data set of completed questionnaires of a standard IQ test. The proposed framework enables us to jointly infer the correct answer for each question (when these are not known in advance), the difficulty levels of the questions, and the ability of each participant. We show how the model can: determine a probability distribution over answers for a given question by aggregating the responses of participants based on their abilities and the questions' difficulties; test the ability levels of participants efficiently by finding the best next question to ask in an adaptive way, depending on the previous responses; automatically calibrate aptitude tests from a set of questions and the responses provided by the participants, determining the relative difficulty levels of the questions and their ability to discriminate between participants of similar but uneven skill levels.
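To make the joint-inference idea concrete, here is a deliberately simplified, EM-style sketch, not the paper's Bayesian graphical model and with all names illustrative, that alternates between inferring answers by ability-weighted voting and re-estimating abilities from agreement with those answers:

```python
def joint_inference(responses, n_iter=20):
    """Simplified sketch of joint inference in a multiple problem domain:
    alternate between (1) inferring each question's answer by an
    ability-weighted vote and (2) re-estimating each participant's
    ability as her agreement with the currently inferred answers.
    This only illustrates the feedback loop that the paper's Bayesian
    model formalizes (which also infers question difficulties)."""
    n = len(responses)
    abilities = [1.0] * n  # start by trusting everyone equally
    answers = None
    for _ in range(n_iter):
        # Step 1: ability-weighted vote per question.
        answers = []
        for j in range(len(responses[0])):
            weights = {}
            for i in range(n):
                r = responses[i][j]
                weights[r] = weights.get(r, 0.0) + abilities[i]
            answers.append(max(weights, key=weights.get))
        # Step 2: ability = fraction of agreement with inferred answers.
        abilities = [
            sum(r == a for r, a in zip(responses[i], answers)) / len(answers)
            for i in range(n)
        ]
    return answers, abilities

# Hypothetical crowd: participants 0 and 3 agree on everything,
# participant 2 is noisier; the loop downweights her votes.
resp = [["A", "B", "C", "A"], ["A", "B", "C", "B"],
        ["C", "B", "A", "A"], ["A", "B", "C", "A"]]
print(joint_inference(resp))
```

Unlike this point-estimate loop, the paper's model maintains full probability distributions over answers, abilities and difficulties, which is what makes entropy-based adaptive question selection possible.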

2. Related Work

Measuring intelligence is a key topic in psychology. Psychologists showed that people's performance on many cognitive tasks is strongly correlated, so a single statistical factor called general intelligence emerges (Anastasi et al., 1997). A measure for the performance of groups of people in joint tasks, called collective intelligence, was investigated (Woolley et al., 2010). This approach focuses on explicit collaboration and interaction between members of the crowd. Although in our setting participants do not interact directly, one can view our model as a method for inferring correct answers to questions given the responses of a crowd of individuals. The number of correct answers inferred can serve as a measure of the intelligence of the crowd. In this sense, our work is somewhat similar to other approaches which also use aggregated responses to IQ tests for measuring collective intelligence (Lyle, 2008; Bachrach et al., 2012; Kosinski et al., 2012).

Psychometricians developed a body of theory called test theory which analyzes outcomes of psychological testing, such as the ability levels of participants or the difficulty of questions in a test, trying to improve reliability in such tests (Anastasi et al., 1997). One paradigm for designing tests of mental abilities is item-response theory (Hambleton et al., 1991) (IRT for short). IRT has been used to develop high-stakes adaptive tests such as the Graduate Management Admission Test (GMAT). IRT is based on the idea that the probability of a participant providing the correct response to a question is a function of both a parameter of the question and a parameter of the person (for example, the question's difficulty and the person's aptitude). When applying aptitude tests, the parameter of the person is latent (cannot be directly observed), and only its manifestation, in the form of the participant's responses, can be directly observed. Our framework relies on a probabilistic graphical model (Koller & Friedman, 2009), using themes similar to IRT.
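As a concrete illustration of the IRT idea, the widely used two-parameter logistic (2PL) response function expresses the probability of a correct response in terms of ability and difficulty. The sketch below uses illustrative parameter names and is not the exact likelihood of the paper's model:

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic (2PL) item response function: the
    probability of a correct response rises with the gap between the
    latent person ability and the item difficulty; discrimination
    controls how sharply the item separates nearby ability levels."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# A more able person is more likely to answer a given item correctly,
# and a harder item is less likely to be solved by any given person.
assert p_correct(2.0, 0.0) > p_correct(0.0, 0.0) > p_correct(-2.0, 0.0)
assert p_correct(0.0, 1.0) < p_correct(0.0, 0.0)
```

Only the responses are observed; both `ability` and `difficulty` are latent and must be inferred, which is the estimation problem IRT and the present paper address.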

Many papers deal with merging opinions, ranging from information aggregation in the semantic web (Kasneci et al., 2010) to prediction markets (Pennock & Sami, 2007). Frameworks such as Probabilistic Relational Models (Getoor et al., 2007) combine a logical representation with probabilistic semantics, and allow inference to aggregate information and opinions. One basic method for collective decision making is voting. Voting was studied in social choice theory (Sen, 1986), which focuses on how participants can manipulate by lying about their preferences. We assume that the experts' responses are their true opinion and focus on the inference problem. One application of our model is aggregating crowdsourced opinions. A machine learning approach for doing so which does not model task difficulty was proposed in (Raykar et al., 2010), and a technique that models task difficulty but uses an EM approach was proposed in (Whitehill et al., 2009). Another method based on graphical models is (Welinder et al., 2010). In that model questions are endowed with features, which could represent concepts or topics, and participants have different areas of expertise matching these topics. Our model focuses on a general domain in the spirit of test theory and IRT, and does not rely on specific features. An active learning approach for labeling data was proposed in (Yan et al., 2011) and is similar to our adaptive IQ testing technique. Another approach akin to ours is the TrueSkill system (Herbrich et al., 2007), which uses a gra