Human Computation, Crowdsourcing and Social: An Industrial Perspective



Omar Alonso, Microsoft

12 November 2014
Human computation, crowdsourcing and social: An industrial perspective

Disclaimer
The views, opinions, positions, or strategies expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft.

Introduction
Crowdsourcing is hot
Lots of interest in the research community
Articles showing good results
Journal special issues (IR, IEEE Internet Computing, etc.)
Workshops and tutorials (SIGIR, NAACL, WSDM, WWW, CHI, RecSys, VLDB, etc.)
HCOMP
CrowdConf
Large companies leveraging crowdsourcing
Big data
Start-ups
Venture capital investment

Crowdsourcing

Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call.
The application of Open Source principles to fields outside of software.
Most successful story: Wikipedia

HUMAN COMPUTATION

Human computation

Not a new idea
Computers before computers
You are a human computer

Some definitions
Human computation is a computation that is performed by a human
A human computation system is a system that organizes human efforts to carry out computation
Crowdsourcing is a tool that a human computation system can use to distribute tasks.

Edith Law and Luis von Ahn. Human Computation. Morgan & Claypool Publishers, 2011.

More examples
ESP game
CAPTCHA: 200M every day
reCAPTCHA: 750M to date

Data is king
Massive free Web data changed how we train learning systems
Crowds provide new access to cheap & labeled big data
But quality also matters

M. Banko and E. Brill. Scaling to Very Very Large Corpora for Natural Language Disambiguation. ACL 2001.
A. Halevy, P. Norvig, and F. Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.

Traditional data collection
Set up data collection software / harness
Recruit participants / annotators / assessors
Pay a flat fee for the experiment or an hourly wage

Characteristics
Slow
Expensive
Difficult and/or tedious
Sample bias

Natural language processing
MTurk annotation for 5 NLP tasks
22K labels for US $26
High agreement between consensus labels and gold-standard labels (see the consensus sketch after the citations below)
Workers as good as experts
R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. EMNLP 2008.

Machine translation
Manual evaluation of translation quality is slow and expensive
High agreement between non-experts and experts
$0.10 to translate a sentence

C. Callison-Burch. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. EMNLP 2009.
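The results above rest on aggregating several cheap, redundant labels into one consensus label per item and then checking it against a gold standard. The sketch below illustrates that aggregation with a simple majority vote; it is a hypothetical illustration, not code from either paper, and the tuple format is an assumption.

```python
# Minimal sketch: aggregate redundant worker labels into a consensus label per
# item via majority vote, then compare consensus against a gold set.
from collections import Counter, defaultdict

def consensus_labels(judgments):
    """judgments: iterable of (item_id, worker_id, label) tuples."""
    per_item = defaultdict(list)
    for item_id, _worker_id, label in judgments:
        per_item[item_id].append(label)
    # Majority vote per item; ties are resolved arbitrarily by Counter ordering.
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in per_item.items()}

def agreement_with_gold(consensus, gold):
    """Fraction of gold items where the consensus label matches the gold label."""
    shared = [i for i in consensus if i in gold]
    return sum(consensus[i] == gold[i] for i in shared) / len(shared)

# Hypothetical data: three workers label two items.
judgments = [(1, "w1", "pos"), (1, "w2", "pos"), (1, "w3", "neg"),
             (2, "w1", "neg"), (2, "w2", "neg"), (2, "w3", "neg")]
print(consensus_labels(judgments))                                             # {1: 'pos', 2: 'neg'}
print(agreement_with_gold(consensus_labels(judgments), {1: "pos", 2: "pos"}))  # 0.5
```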

Soylent

M. Bernstein et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010.

Mechanical Turk
Amazon Mechanical Turk (AMT, MTurk, www.mturk.com)
Crowdsourcing platform
On-demand workforce
"Artificial artificial intelligence": get humans to do the hard part
Named after the faux automaton of the 18th century
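Programmatically, a HIT like the ones shown in the next slides is just a task definition plus a reward posted through the MTurk requester API. The sketch below uses the boto3 MTurk client (which postdates this 2014 talk); the title, reward, and HTML content are placeholders, not the actual tasks from the talk.

```python
# Hedged sketch: posting a simple relevance HIT via the boto3 MTurk client
# (sandbox endpoint). Title, reward, and question HTML are placeholders.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# HTMLQuestion payload; a real task must also copy assignmentId from the worker
# URL into the form before submitting to MTurk's externalSubmit endpoint.
question_xml = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <!DOCTYPE html><html><body>
      <p>Is this page relevant to the query {where to go on vacation}?</p>
      <!-- yes/no radio buttons, an optional feedback box, and the submit form go here -->
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

hit = mturk.create_hit(
    Title="Judge the relevance of a web page",
    Description="Read a query and a page, then answer one yes/no question.",
    Keywords="relevance, search, labeling",
    Reward="0.04",                      # dollars, as a string
    MaxAssignments=5,                   # redundant judgments per item
    AssignmentDurationInSeconds=300,
    LifetimeInSeconds=24 * 3600,
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```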

Multiple channels
Gold-based tests
Only pay for trusted judgments

HIT example

HIT example

{where to go on vacation}

MTurk: 50 answers, $1.80
Quora: 2 answers
Y! Answers: 2 answers
FB: 1 answer

Tons of results
Read title + snippet + URL
Explore a few pages in detail

{where to go on vacation}

Countries
Cities

Flip a coin
Please flip a coin and report the results
Two questions:
Coin type?
Heads or tails?

Results

Heads or tails?   Count
head              57
tail              43
Grand total       100

Coin type         Count
Dollar            56
Euro              11
Other             30
(blank)           3
Grand total       100

(A tally sketch of this kind of aggregation appears after this slide group.)

Why is this interesting?
Easy to prototype and test new experiments
Cheap and fast
No need to set up infrastructure
Introduce experimentation early in the cycle
For new ideas, this is very helpful

Caveats and clarifications
Trust and reliability
Wisdom of the crowd, revisited
Adjust expectations
Crowdsourcing is another data point for your analysis
Complementary to other experiments

Why now?
The Web
Use humans as processors in a distributed system
Address problems that computers aren't good at
Scale
Reach
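The result tables above are just per-question tallies of the raw assignment records returned by the platform. A minimal sketch of that aggregation, with hypothetical response records (real results arrive as a CSV download or via the API):

```python
# Minimal sketch: tally worker answers per question for the coin-flip HIT.
# The response records below are hypothetical examples.
from collections import Counter

responses = [
    {"coin_type": "Dollar", "flip": "head"},
    {"coin_type": "Euro", "flip": "tail"},
    {"coin_type": "Dollar", "flip": "head"},
    # ... one record per completed assignment
]

for question in ("flip", "coin_type"):
    counts = Counter(r.get(question, "(blank)") for r in responses)
    print(question, dict(counts), "total:", sum(counts.values()))
```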

Who are the workers?

A. Baio. The Faces of Mechanical Turk. November 2008.
P. Ipeirotis. The New Demographics of Mechanical Turk. March 2010.
J. Ross et al. Who are the Crowdworkers? CHI 2010.

Issues

ASSESSMENTS AND LABELS

Relevance assessments
Is this document relevant to the query?


Careful with That Axe Data, Eugene
In the area of big data and machine learning: labels -> features -> predictive model -> optimization
Labeling/experimentation perceived as boring
Don't rush labeling
Human and machine
Label quality is very important
Don't outsource it
Own it end to end
Large scale

More on label quality
Data gathering is not a free lunch
Labels for the machine != labels for humans
Emphasis on algorithms, models/optimizations, and mining from labels
Not so much on algorithms for ensuring high-quality labels
Training sets

The importance of labels: IR context

Information Retrieval and Crowdsourcing

Motivating example: relevance judging
Relevance of search results is difficult to judge
Highly subjective
Expensive to measure
Professional editors commonly used

Potential benefits of crowdsourcing
Scalability (time and cost)
Diversity of judgments

Started with a joke: results for {idiot}
February 2011: 5/7 (R), 2/7 (NR)

Relevant
Most of the time those TV reality stars have absolutely no talent. They do whatever they can to make a quick dollar. Most of the time the reality TV stars do not have a mind of their own. (R)
Most are just celebrity wannabees. Many have little or no talent, they just want fame. (R)
Have you seen the knuckledraggers on reality television? They should be required to change their names to idiot after appearing on the show. You could put numbers after the word idiot so we can tell them apart. (R)
Although I have not followed too many of these shows, those that I have encountered have for a great part a very common property. That property is that most of the participants involved exhibit a shallow self-serving personality that borders on social pathological behavior. To perform or act in such an abysmal way could only be an act of an idiot. (R)
I can see this one going both ways. A particular sort of reality star comes to mind, though, one who was voted off Survivor because he chose not to use his immunity necklace. Sometimes the label fits, but sometimes it might be unfair. (R)

Not Relevant

Just because someone else thinks they are an "idiot", doesn't mean that is what the word means. I don't like to think that any one person's photo would be used to describe a certain term. (NR)
While some reality-television stars are genuinely stupid (or cultivate an image of stupidity), that does not mean they can or should be classified as "idiots." Some simply act that way to increase their TV exposure and potential earnings. Other reality-television stars are really intelligent people, and may be considered as idiots by people who don't like them or agree with them. It is too subjective an issue to be a good result for a search engine. (NR)

You have a new idea
Novel IR technique
Don't have access to click data
Can't hire editors
How to test new ideas?

Crowdsourcing and relevance evaluation
Subject pool access: no need to come into the lab
Diversity
Low cost
Agile

Pedal to the metal
You read the papers
You tell your boss (or advisor) that crowdsourcing is the way to go
You now need to produce hundreds of thousands of labels per month
Easy, right?

Ask the right questions
Instructions are key
Workers are not IR experts, so don't assume the same understanding of terminology
Show examples
Hire a technical writer
Prepare to iterate

How not to do things
A lot of work for a few cents
Go here, go there, copy, enter, count

UX design
Time to apply all those usability concepts
Need to grab attention

Generic tips
The experiment should be self-contained. Keep it short and simple.
Be very clear with the task. Engage with the worker. Avoid boring stuff.
Always ask for feedback (open-ended question) in an input box.
Localization

Payments
How much is a HIT?
Delicate balance
Too little, no interest
Too much, you attract spammers
Heuristics
Start with something and wait to see if there is interest or feedback ("I'll do this for X amount")
Payment based on user effort. Example: $0.04 (2 cents to answer a yes/no question, 2 cents if you provide feedback that is not mandatory). See the cost sketch below.
Bonus
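Budgeting follows directly from these heuristics: items, times redundant judgments per item, times reward, plus the platform fee. A back-of-the-envelope sketch; the 20% fee rate is an assumption, so check the platform's current pricing:

```python
# Back-of-the-envelope HIT budget: items * judgments per item * reward, plus platform fee.
def estimate_cost(num_items, judgments_per_item, reward_per_judgment, fee_rate=0.20):
    worker_pay = num_items * judgments_per_item * reward_per_judgment
    return worker_pay * (1 + fee_rate)   # fee_rate is an assumed placeholder

# Example: 1,000 query-document pairs, 5 judgments each, $0.04 per judgment.
print(f"${estimate_cost(1000, 5, 0.04):,.2f}")   # $240.00
```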

Managing crowds

Quality control
Extremely important part of the experiment
Approach it as overall quality, not just for workers
Bi-directional channel
You may think the worker is doing a bad job.
The same worker may think you are a lousy requester.
Test with a gold standard

When to assess work quality?
Beforehand (prior to main task activity)
How: qualification tests or a similar mechanism
Purpose: screening, selection, recruiting, training
During
How: assess labels as the worker produces them
Like random checks on a manufacturing line
Purpose: calibrate, reward/penalize, weight
After
How: compute accuracy metrics post hoc
Purpose: filter, calibrate, weight, retain
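A common way to implement the "during" and "after" checks is to seed the task with items whose trusted labels are known and score each worker against them. A minimal sketch; the data format and accuracy threshold are assumptions:

```python
# Minimal sketch: score each worker against gold (trusted) items and flag low performers.
from collections import defaultdict

def worker_accuracy(judgments, gold):
    """judgments: (item_id, worker_id, label) tuples; gold: item_id -> trusted label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item_id, worker_id, label in judgments:
        if item_id in gold:
            totals[worker_id] += 1
            hits[worker_id] += (label == gold[item_id])
    return {w: hits[w] / totals[w] for w in totals}

def flag_workers(judgments, gold, min_accuracy=0.7):
    """Workers whose accuracy on gold items falls below the (assumed) threshold."""
    return {w for w, acc in worker_accuracy(judgments, gold).items() if acc < min_accuracy}
```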

How do we measure work quality?
Compare the worker's label vs.:
A known (correct, trusted) label
Other workers' labels
Model predictions of workers and labels
Verify the worker's label:
Yourself
Tiered approach (e.g., Find-Fix-Verify)

Methods for measuring agreement
Inter-agreement level
Agreement between judges
Agreement between judges and the gold set
Some statistics (see the agreement sketch after this slide group):
Cohen's kappa (2 raters)
Fleiss' kappa (any number of raters)
Krippendorff's alpha
Gray areas
2 workers say relevant and 3 say not relevant
2-tier system

Content quality
People like to work on things that they like
Content and judgments according to modern times
TREC data set: airport security docs are pre-9/11
Document length
Randomize content

Avoid worker fatigue
Judging 100 documents on the same subject can be tiring, leading to decreasing quality
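For two raters, Cohen's kappa corrects raw agreement for the agreement expected by chance. A small sketch of the computation; sklearn.metrics.cohen_kappa_score computes the same statistic, and dedicated packages exist for Fleiss' kappa and Krippendorff's alpha:

```python
# Minimal sketch: Cohen's kappa for two raters labeling the same items.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n           # observed agreement
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    categories = set(dist_a) | set(dist_b)
    p_e = sum((dist_a[c] / n) * (dist_b[c] / n) for c in categories)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two workers judging 10 documents as relevant (R) or not (N).
a = ["R", "R", "N", "R", "N", "N", "R", "R", "N", "R"]
b = ["R", "N", "N", "R", "N", "R", "R", "R", "N", "R"]
print(round(cohens_kappa(a, b), 3))   # 0.583
```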

Was the task difficult?
Ask workers to rate the difficulty of a search topic
50 topics; 5 workers; $0.01 per task

So far
One may say this is all good, but it looks like a ton of work
The original goal: data is king
Data quality and experimental designs are preconditions to make sure we get the right stuff
Don't cut corners

Pause
Crowdsourcing works
Fast turnaround, easy to experiment, a few dollars to test
But: you have to design experiments carefully; quality; platform limitations

Crowdsourcing in production
Large-scale data sets
Continuous execution
Difficult to debug
How do you know the experiment is working?
Goal: a framework for ensuring reliability on crowdsourcing tasks

O. Alonso, C. Marshall, and M. Najork. Crowdsourcing a Subjective Labeling Task: A Human-Centered Framework to Ensure Reliable Results. http://research.microsoft.com/apps/pubs/default.aspx?id=219755

Labeling tweets, an example of a task
Is this tweet interesting?
Subjective activity
Not focused on specific events

Findings
Difficult problem, low inter-rater agreement
Tested many designs, numbers of workers, platforms (MTurk and others)
Multiple contingent factors:
Worker performance
Work
Task design
O. Alonso, C. Marshall, and M. Najork. Are Some Tweets More Interesting Than Others? #HardQuestion. HCIR 2013.

Designs that include an in-task CAPTCHA
Borrowed idea from reCAPTCHA -> use of a control term (HIDDEN)
Adapt your labeling task
2 more questions as control:
1 algorithmic
1 semantic
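One way to operationalize those control questions: record the expected answers for the in-task checks and drop (or down-weight) judgments from assignments that fail them. A hypothetical sketch of that filter; the field names and control answers are assumptions, not the production pipeline from the talk:

```python
# Hypothetical sketch: keep only judgments whose in-task control questions were answered correctly.
def passes_controls(assignment, controls):
    """controls: question_id -> expected answer (e.g., the algorithmic and semantic checks)."""
    return all(assignment["answers"].get(q) == expected for q, expected in controls.items())

def filter_judgments(assignments, controls):
    kept = [a for a in assignments if passes_controls(a, controls)]
    dropped = len(assignments) - len(kept)
    return kept, dropped

# Example with made-up control answers: a word-count check and a "which brand is mentioned" check.
controls = {"q_control_algorithmic": "12", "q_control_semantic": "none"}
```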

Production example #1

Q1 (k = 0.91, alpha = 0.91)

Q2 (k = 0.771, alpha = 0.771)
Q3 (k = 0.033, alpha = 0.035)

In-task CAPTCHA
Tweet de-branded
The main question

Production example #2

Q3 Worthless (alpha = 0.033)
Q3 Trivial (alpha = 0.043)
Q3 Funny (alpha = -0.016)
Q3 Makes me curious (alpha = 0.026)
Q3 Contains useful info (alpha = 0.048)
Q3 Important news (alpha = 0.207)

Q2 (k = 0.728, alpha = 0.728)
Q1 (k = 0.907, alpha = 0.907)
In-task CAPTCHA
Breakdown by categories to get better signal
Tweet de-branded

Once we get here
High-quality labels
Data will later be used for rankers, ML models, evaluations, etc.
Training sets
Scalability and repeatability

CURRENT TRENDS

Algorithms
Bandit problems; explore-exploit (see the sketch after this slide group)
Optimizing the amount of work by workers
Humans have limited throughput
Harder to scale than machines
Selecting the right crowds
Stopping rule

Humans in the loop
Computation loops that mix humans and machines
A kind of active learning
Double goal:
Human checking on the machine
Machine checking on humans
Example: classifiers for social data
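An explore-exploit policy in this setting might look like the epsilon-greedy sketch below: mostly send the next task to the crowd (or worker pool) with the best observed quality, but occasionally explore the others. This is an illustrative bandit baseline under assumed reward signals, not the specific algorithms referenced in the talk.

```python
# Illustrative epsilon-greedy bandit over crowds/worker pools (not a specific published algorithm).
import random

class EpsilonGreedyRouter:
    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.stats = {arm: {"n": 0, "mean": 0.0} for arm in arms}    # observed label quality per arm

    def pick(self):
        if random.random() < self.epsilon:                           # explore
            return random.choice(list(self.stats))
        return max(self.stats, key=lambda a: self.stats[a]["mean"])  # exploit

    def update(self, arm, reward):
        """reward: e.g., 1.0 if the judgment agreed with gold/consensus, else 0.0."""
        s = self.stats[arm]
        s["n"] += 1
        s["mean"] += (reward - s["mean"]) / s["n"]                    # incremental mean

router = EpsilonGreedyRouter(["crowd_twitter", "crowd_quora", "internal_judges"])
arm = router.pick()
router.update(arm, reward=1.0)
```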

Routing
Expertise detection and routing
Social load balancing
When to switch between machines and humans
CrowdSTAR
B. Nushi, O. Alonso, M. Hentschel, and V. Kandylas. CrowdSTAR: A Social Task Routing Framework for Online Communities, 2014. http://arxiv.org/abs/1407.6714

[Diagram: social task routing of tasks A and B across Crowd 1 (Twitter) and Crowd 2 (Quora), using crowd summaries C1 and C2]
Routing across crowds
Routing within a crowd
Question posting: Twitter examples
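To make the routing idea concrete, the sketch below picks a crowd by topic affinity while respecting a load cap, falling back to a machine when no crowd qualifies. It is only an illustration of the idea under assumed data structures, not the CrowdSTAR algorithm from the paper.

```python
# Hypothetical sketch of routing a task to a crowd by topic affinity and current load.
def route_task(task_topics, crowd_summaries, crowd_load, max_load=100):
    """crowd_summaries: crowd -> set of topics it answers well (a 'crowd summary');
    crowd_load: crowd -> number of questions recently posted to it (social load balancing)."""
    candidates = []
    for crowd, topics in crowd_summaries.items():
        if crowd_load.get(crowd, 0) >= max_load:       # don't overload one community
            continue
        affinity = len(task_topics & topics)           # naive topic-overlap score
        candidates.append((affinity, -crowd_load.get(crowd, 0), crowd))
    if not candidates:
        return "machine_fallback"                      # switch from humans to a machine/classifier
    return max(candidates)[2]                          # best affinity, then least loaded

crowds = {"twitter": {"news", "sports", "travel"}, "quora": {"programming", "travel", "health"}}
print(route_task({"travel", "food"}, crowds, {"twitter": 10, "quora": 2}))   # quora
```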

Conclusions
Crowdsourcing at scale works but requires a solid framework
Fast turnaround, easy to experiment, a few dollars to test
But you have to design the experiments carefully
Usability considerations
Lots of opportunities to improve current platforms
Three aspects that need attention: workers, work, and task design
Labeling social data is hard

Conclusions II
Important to know your limitations and be ready to collaborate
Lots of different skills and expertise required:
Social/behavioral science
Human factors
Algorithms
Economics
Distributed systems
Statistics

Thank you - @elunca