
Crowdsourcing: (a bit of) theory and ((quite) some) practice

Karen Fort

karen.fort@paris-sorbonne.fr

enetCollect, September 7th, 2017

1 / 56

Where I’m talking from: main research interests (http://karenfort.org/)

- Language resource creation for natural language processing

- Ethics in natural language processing (NLP)

2 / 56

Where I’m talking from: related actions

Games (with a purpose) I participated in creating:

Language games portal and recurring workshop:

Games4NLP

3 / 56

Crowdsourcing: back to basics

Games with a purpose (GWAPs)

GWAP-ing in practice: ZombiLingo

Conclusion

4 / 56

Crowdsourcing: back to basics
  Definition
  Crowdsourcing
  Beyond the myths: "Crowdsourcing is recent"
  Beyond the myths: "Crowdsourcing implies a crowd"
  Beyond the myths: "Crowdsourcing implies non-experts"

Games with a purpose (GWAPs)

GWAP-ing in practice: ZombiLingo

Conclusion

5 / 56

Crowdsourcing

Crowdsourcing is "the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call." [Howe, 2006]

- no a priori identification or selection of the participants ("open call")

- massive (in production and participation)

- (relatively) cheap

6 / 56


Some remarkable achievements

Wikipedia¹ (September 2017):

- more than 45 million articles in 241 languages

- more than 8 million views per hour for the English version (2014 - could not find more recent data)

Distributed Proofreaders (Gutenberg Project)²:

- nearly 40,000 digitized and corrected books

JeuxDeMots [Lafourcade, 2007]³:

- more than 150 million relations in the lexical network

- more than 2 million terms added by the players

¹ http://stats.wikimedia.org/EN/Sitemap.htm
² http://www.pgdp.net/c/stats/stats_central.php
³ http://www.jeuxdemots.org

9 / 56

A simplified taxonomy (more in [Geiger et al., 2011])

[Diagram] Two axes: remunerated vs. not remunerated, and direct vs. indirect.

10 / 56


Myth #1: "Crowdsourcing is recent"
Instructions pour les voyageurs et les employés des colonies

Citizen science:

- published by the Muséum National d’Histoire Naturelle (Paris, France)

- for the travellers and the colonies’ employees:

"to share the results of their own experiences, to benefit themselves and the scientific world"

- first published in 1824

14 / 56

Other examples

- Ligue pour la Protection des Oiseaux (bird protection league)⁴:
  - monitoring of bird populations
  - created more than a century ago
  - 5,000 active participants

- Longitude Prize⁵ (1714):
  - awarded by the British government to whoever would invent a simple and practical method to determine a ship's longitude
  - still exists: in 2014 the theme was "global antibiotic resistance"

⁴ https://www.lpo.fr
⁵ https://longitudeprize.org/

15 / 56

Myth #2: ”Crowdsourcing implies a crowd of participants”

[Figure] Number of points per player, players ranked according to their score: number of players on Phrase Detectives according to the number of points gained in the game (February 2011 - February 2012).

16 / 56

A crowd of participants? JeuxDeMots

[Figure] Number of points per player, players ranked according to their score: number of players on JeuxDeMots according to their ranking in the game (source: http://www.jeuxdemots.org/generateRanking-4.php).

17 / 56

A crowd of participants? ZombiLingo

18 / 56

A crowd of workers? [Fort et al., 2011]

Number of active Turkers on Amazon Mechanical Turk:

- number of workers registered on the website: more than 500,000

- 80% of the tasks (HITs) are performed by the 20% most active Turkers [Deneme, 2009]

⇒ really active workers: between 15,059 and 42,912

19 / 56

Experts vs non-experts

Example of the annotation of named entities in a microbiologycorpus:

- experts of the domain?
- experts of the corpus (microbiology)?
- experts of the application (NLP)?

20 / 56

Experts vs non-experts

Example of the annotation of named entities in a microbiologycorpus:

- experts of the domain?
- experts of the corpus (microbiology)?
- experts of the application (NLP)?

→ experts of the task

21 / 56

Crowdsourcing

Using a crowd of ”non-experts”?

22 / 56

Crowdsourcing

Using a crowd of ”non-experts”?

→ Finding/training experts (of the task) in the crowd

23 / 56

Crowdsourcing: back to basics

Games with a purpose (GWAPs)
  Using the (basic) knowledge of the crowd
  Using the basic education of the crowd
  Using the learning capabilities of the crowd

GWAP-ing in practice: ZombiLingo

Conclusion

24 / 56

Games with a purpose (GWAPs)

GWAPs allow us to benefit from:

1. the knowledge of the ”world”

2. the basic education

3. the learning capabilities

of the participants (seldom a crowd)

25 / 56

JeuxDeMots: playing word association... to create a lexical network [Lafourcade and Joubert, 2008]

More than 154 million relations (created by approx. 6,000 players), constantly updated

- played in pairs

- more and more complex, typed relations

- challenges

- lawsuits

- etc.

26 / 56

Phrase Detectives: playing detective... to annotate co-reference [Chamberlain et al., 2008]

3.5M decisions from 45k players

- pre-annotated corpus

- detailed instructions

- training

- 2 different playing modes:
  - annotation
  - validation (correction of annotations)

27 / 56

FoldIt: playing protein folding... to solve scientific issues [Khatib et al., 2011]

Solution to the crystal structure of a monomeric retroviral protease (simian AIDS-causing monkey virus)

Solution to an issue that had been unsolved for over a decade:

- found in a couple of weeks

- by a team of players

- which will allow for the creation of antiretroviral drugs

28 / 56

FoldIt: playing protein folding... without any prior knowledge of biochemistry [Cooper et al., 2010]

Step-by-step training

- tutorial decomposed by concept

- puzzles for each concept

- access to the following puzzles is given only if your level is sufficient

29 / 56

Crowdsourcing: back to basics

Games with a purpose (GWAPs)

GWAP-ing in practice: ZombiLingo
  Overview of the game
  Motivating players
  Behind the curtain
  Ensuring quality
  Results

Conclusion

30 / 56

A complex annotation task

- annotation guidelines:
  - 29 relation types
  - approx. 50 pages

- counter-intuitive decisions (not school grammar, but linguistics): aobj = au

[...] avoir recours au type de mesures [...] ("to resort to the type of measures")

i.e. the head of the PP is the preposition

→ decompose the complexity of the task [Fort et al., 2012], do not simplify it!

31 / 56

http://zombilingo.org/

32 / 56

33 / 56

34 / 56

35 / 56

General features

Bring the fun through:

- zombie design

- use of (crazy) objects

- regular challenges (specific corpus and design) on a trendy topic:
  - Star Wars (when the movie was playing)
  - soccer (during the Euro)
  - Pokémon (well...)

36 / 56

Leaderboards (for achievers)

Criteria:

- number of annotations or points

- in total, during the month, during the challenge

37 / 56

Hidden features (for explorers)

- appearing randomly

- with different effects: objects, another game, etc.

38 / 56

Duels (for socializers (and killers?))

- select an enemy

- challenge them on a specific type of relation

39 / 56

Badges (?) (for collectors)

- play all the sentences for a relation type, for a corpus

- play all the sentences from a corpus

40 / 56

Organizing quality assurance

[Diagram] Quality-assurance pipeline. Two corpora are used: an unannotated corpus (Wikipedia) and a reference corpus (Sequoia), the latter split into REFTrain&Control and REFEval. Training phase: TRAINING and CONTROL items (with feedback) are drawn from REFTrain&Control and give the player's confidence. Play phase: raw text, pre-annotated with 2 parsers, is played as ANNOTATION items (no feedback, producing EXPGame), while REFEval sentences are played as EVAL items (no feedback, producing EXPEval).

41 / 56

Preprocessing data (freely available corpora)

[Diagram] Same quality-assurance pipeline as on the previous slide, highlighting the preprocessing step: the raw text of the unannotated corpus (Wikipedia) is pre-annotated with 2 parsers before being played.

42 / 56

Preprocessing data (freely available corpora)

Pre-annotation with two parsers

1. a statistical parser: Talismane [Urieli, 2013]

2. a symbolic parser, based on graph rewriting: FrDep-Parse [Guillaume and Perrier, 2015]

→ play the items for which the two parsers give different annotations
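A minimal sketch of this disagreement-based selection in Python (the data format, with one dict per sentence mapping token index to a (head, relation) pair, and the function name are assumptions for illustration, not the actual ZombiLingo pipeline):

def disagreements(parse_a, parse_b):
    # Return the token indices where the two pre-annotations differ:
    # only these items need to be played in the game.
    return [tok for tok in parse_a if parse_a[tok] != parse_b.get(tok)]

# Toy example (indices and labels are made up):
talismane_out = {1: (2, "suj"), 2: (0, "root"), 3: (2, "obj")}
frdep_out = {1: (2, "suj"), 2: (0, "root"), 3: (2, "aobj")}
print(disagreements(talismane_out, frdep_out))  # -> [3]: the parsers disagree on token 3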

43 / 56

Training, control and evaluation
Reference: 3,099 sentences of the Sequoia corpus [Candito and Seddah, 2012]

[Diagram] The reference corpus (Sequoia) is split as follows:

REFTrain&Control: 50% (1,549 sentences)
REFEval: 25% (776 sentences)
Unused: 25% (774 sentences)

- REFTrain&Control is used to train the players

- REFEval is used like a raw corpus, to evaluate the produced annotations
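As a rough illustration, such a split can be reproduced along these lines (a sketch only; the seed, the function name and the exact sentence assignment are assumptions, only the 50/25/25 proportions come from the slide):

import random

def split_reference(sentences, seed=0):
    # Split the reference corpus into REFTrain&Control (50%),
    # REFEval (25%) and an unused part (25%).
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_control = shuffled[: n // 2]
    ref_eval = shuffled[n // 2 : (3 * n) // 4]
    unused = shuffled[(3 * n) // 4 :]
    return train_control, ref_eval, unused

# With the 3,099 Sequoia sentences this gives roughly 1,549 / 775 / 775,
# close to the 1,549 / 776 / 774 split reported above.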

44 / 56

Training the players

Compulsory for each dependency relation

- sentences are taken from the REFTrain&Control corpus

- feedback is given in case of error

45 / 56

Dealing with cognitive fatigue and long-term players: control mechanism

Sentences from the REFTrain&Control corpus are proposed regularly:

1. if the player fails to find the right answer, feedback with the solution is given

46 / 56

Dealing with cognitive fatigue and long-term players: control mechanism

Sentences from the REFTrain&Control corpus are proposed regularly:

1. if the player fails to find the right answer, feedback with the solution is given

2. after a given number of failures on the same relation, the player cannot play anymore and has to redo the corresponding training

47 / 56

Dealing with cognitive fatigue and long-term players: control mechanism

Sentences from the REFTrain&Control corpus are proposed regularly:

1. if the player fails to find the right answer, feedback with the solution is given

2. after a given number of failures on the same relation, the player cannot play anymore and has to redo the corresponding training

→ we deduce a level of confidence for the player on this relation
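A sketch of how such a control mechanism can be implemented (the failure threshold, the confidence formula and the class below are illustrative assumptions, not the game's actual code):

class RelationControl:
    # Tracks one player's answers on control sentences for one relation type.

    def __init__(self, max_failures=3):  # threshold chosen arbitrarily here
        self.correct = 0
        self.failures = 0
        self.max_failures = max_failures
        self.locked = False  # True -> the player has to redo the training

    def record(self, player_answer, reference_answer):
        if player_answer == reference_answer:
            self.correct += 1
        else:
            self.failures += 1  # feedback with the solution is shown to the player
            if self.failures >= self.max_failures:
                self.locked = True  # no more playing until the training is redone
        return self.confidence()

    def confidence(self):
        # Confidence of the player on this relation (plain accuracy here).
        total = self.correct + self.failures
        return self.correct / total if total else 0.0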

48 / 56

Production: game corpus size, compared to other existing French dependency syntax corpora

As of July 10, 2016

- 647 players (1,150 as of Sept. 5th, 2017)

- who produced 107,719 annotations (more than 390,000 as of Sept. 5th, 2017)

              Sequoia 7.0   UD-French 1.3        FTB-UC       FTB-SPMRL    Game
Licence       free          free                 not "free"   not "free"   free
Validation    validated     after ZL⁶ + errors   validated    validated    validated
Sent.         3,099         16,448               12,351       18,535       5,221
Tok.          67,038        401,960              350,947      557,149      128,046

+ (ever)growing resource!

⁶ ZL 1.0, July 2014 vs UD 1.0, January 2015.

49 / 56

Evaluating quality on the REFEval corpus

[Figure] F-measure per relation type (aux.tps, suj, aux.pass, aff, det, obj.cpl, aobj, mod.rel, dep.coord, obj.pats, p_obj.o, de_obj, coord, obj, mod) for Talismane, FrDep-Parse and the Game.

NB: left part of the figure = annotation density > 1

50 / 56
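The per-relation scores behind such a figure can be computed along these lines (a sketch; it assumes both the reference and the produced annotations are available as sets of (sentence id, dependent, head, relation) tuples, which is an assumption about the data format, not the project's actual evaluation script):

def f_measure_per_relation(reference, produced):
    # Precision, recall and F-measure for each relation label.
    scores = {}
    labels = {t[3] for t in reference} | {t[3] for t in produced}
    for label in labels:
        ref = {t for t in reference if t[3] == label}
        hyp = {t for t in produced if t[3] == label}
        tp = len(ref & hyp)  # dependencies found with the right head and label
        precision = tp / len(hyp) if hyp else 0.0
        recall = tp / len(ref) if ref else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f)
    return scores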

Annotation density on the REFEval corpus

[Figure] Number of answers per annotation, for the same relation types as above.

→ need more annotations on some relations

51 / 56

Next steps: expand to new languages and new annotation types

New languages

- English

- less-resourced languages

New annotation types

- part-of-speech (POS),

- corpus building,

- etc.

Alice Millour (PhD student)

52 / 56

Crowdsourcing: back to basics

Games with a purpose (GWAPs)

GWAP-ing in practice: ZombiLingo

Conclusion

53 / 56

Crowdsourcing for language resource creation

Achievements

- surprisingly good results in terms of quantity and quality

- we demonstrated that we can train people on a complex task

Difficulties

- motivating people → language learning could work (?⁷)

- communication / advertisement

- gamification

⁷ There is no scientific evidence yet that it does.

54 / 56

https://github.com/zombilingo

http://zombilingo.org/export

55 / 56

ZombiLingo’s team and funding

Bruno Guillaume (researcher)

Nicolas Lefebvre (engineer)

56 / 56

Appendix
  Amazon Mechanical Turk: a platform of legends
  Ethical issues
  Varying costs
  Complexity of the tasks

Bibliography

History: von Kempelen’s Mechanical Turk

A mechanical chess player created by J. W. von Kempelen in 1770

History: von Kempelen’s Mechanical Turk

In fact, a human chess master was hiding inside to operate the machine

History: von Kempelen’s Mechanical Turk

It’s artificial artificial intelligence!

And Amazon created AMT

Amazon created, for its own needs, a microworking platform and opened it to everyone in 2005

(taking 10% of the transactions (20% now))

AMT: the dream come true? (for NLP)

[Snow et al., 2008]

AMT: the dream come true? (for NLP)

[Snow et al., 2008]

It’s very cheap, fast, good quality work

and it’s a hobby for the Turkers!

Amazon Mechanical Turk (AMT)

MTurk

Amazon Mechanical Turk (AMT)

MTurk is a crowdsourcing system: work outsourced via the Web, done by many people (the crowd), here the Turkers

Amazon Mechanical Turk (AMT)

MTurk is a crowdsourcing, microworking system: tasks are cut into small pieces (HITs) and their execution is paid for by the Requesters

Amazon Mechanical Turk (AMT)

MTurk is a crowdsourcing, microworking system: tasks are cut into small pieces (HITs) and their execution is paid for.


Is AMT Ethical and/or legal?

Ethics:

- No identification: no relation between Requesters and Turkers, nor among Turkers

- No possibility to unionize, to protest against wrongdoings or to go to court

- No minimum wage (less than $2/hr on average)

- Possibility to refuse to pay the Turkers

Is AMT Ethical and/or legal?

AMT: a hobby for the Turkers?

[Ross et al., 2010, Ipeirotis, 2010] show that:

- Turkers are primarily financially motivated (91%):
  - 20% use AMT as their primary source of income;
  - 50% as their secondary source of income;
  - leisure is important for only a minority (30%).

- 20% of the Turkers spend more than 15 hours a week on AMT, and contribute 80% of the tasks.

- the observed mean hourly wage is below US$2.

[Gupta et al., 2014]: given that training the workers is impossible on AMT, an important part of the Turkers' work is hidden

Turkers are not anonymous [Lease et al., 2013]

The Turkers' id corresponds to their Amazon client id

AMT allows costs to be reduced

Very low wages ⇒ low costs? Yes, but...

- UI development

- creation of protections against spammers

- validation and post-processing costs

Some tasks (for example, in less-resourced languages) generate costs that are similar to the usual costs in the domain, due to a lack of qualified Turkers [Novotney and Callison-Burch, 2010].

Depending on an external platform

Impossible to control:

- the costs

- the participants' working conditions

- the selection of participants

- the conditions of the experiment

HITs (Human Intelligence Tasks): simplified tasks

AMT does not allow for the training of the workers:

- quality is not satisfactory for complex tasks (for example, summarization [Gillick and Liu, 2010])

- for some simple tasks, NLP tools produce better results than AMT [Wais et al., 2010].

⇒ Simplification of the tasks:

- a real textual entailment annotation task (inference, neutral, contradiction) is reduced to 2 phases and 1 question: "Would most people say that if the first sentence is true, then the second sentence must be true?" [Bowman et al., 2015]

Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Candito, M. and Seddah, D. (2012). Le corpus Sequoia : annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical. In Proceedings of the Traitement Automatique des Langues Naturelles (TALN), Grenoble, France.

Chamberlain, J., Poesio, M., and Kruschwitz, U. (2008). Phrase Detectives: a web-based collaborative annotation game. In Proceedings of the International Conference on Semantic Systems (I-Semantics’08), Graz, Austria.

Cooper, S., Treuille, A., Barbero, J., Leaver-Fay, A., Tuite, K., Khatib, F., Snyder, A. C., Beenen, M., Salesin, D., Baker, D., and Popovic, Z. (2010). The challenge of designing scientific discovery games. In Proceedings of the Fifth International Conference on the Foundations of Digital Games, FDG ’10, pages 40–47, New York, NY, USA. ACM.

Deneme (2009). How many Turkers are there? http://groups.csail.mit.edu/uid/deneme/.

Fort, K., Adda, G., and Cohen, K. B. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics (editorial), 37(2):413–420.

Fort, K., Nazarenko, A., and Rosset, S. (2012). Modeling the complexity of manual annotation tasks: a grid of analysis. In International Conference on Computational Linguistics (COLING), pages 895–910, Mumbai, India.

Geiger, D., Seedorf, S., Schulze, T., Nickerson, R. C., and Schader, M. (2011). Managing the crowd: Towards a taxonomy of crowdsourcing processes. In AMCIS 2011 Proceedings.

Gillick, D. and Liu, Y. (2010). Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pages 148–151, Stroudsburg, PA, USA. Association for Computational Linguistics.

Guillaume, B. and Perrier, G. (2015). Dependency parsing with graph rewriting. In Proceedings of IWPT 2015, 14th International Conference on Parsing Technologies, pages 30–39, Bilbao, Spain.

Gupta, N., Martin, D., Hanrahan, B. V., and O’Neill, J. (2014). Turk-life in India. In Proceedings of the 18th International Conference on Supporting Group Work, GROUP ’14, pages 1–11, New York, NY, USA. ACM.

Howe, J. (2006). The rise of crowdsourcing. Wired Magazine, 14(6).

Ipeirotis, P. (2010). The new demographics of Mechanical Turk. http://behind-the-enemy-lines.blogspot.com/2010/03/new-demographics-of-mechanical-turk.html.

Khatib, F., DiMaio, F., Cooper, S., Kazmierczyk, M., Gilski, M., Krzywda, S., Zabranska, H., Pichova, I., Thompson, J., Popovic, Z., et al. (2011). Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nature Structural & Molecular Biology, 18(10):1175–1177.

Lafourcade, M. (2007). Making people play for lexical acquisition. In 7th Symposium on Natural Language Processing (SNLP 2007), Pattaya, Thailand.

Lafourcade, M. and Joubert, A. (2008). JeuxDeMots : un prototype ludique pour l’émergence de relations entre termes. In Proceedings of the Journées internationales d’Analyse statistique des Données Textuelles (JADT), Lyon, France.

Lease, M., Hullman, J., Bigham, J. P., Bernstein, M. S., Kim, J., Lasecki, W., Bakhshi, S., Mitra, T., and Miller, R. C. (2013). Mechanical Turk is not anonymous. Technical report.

Novotney, S. and Callison-Burch, C. (2010). Cheap, fast and good enough: automatic speech recognition with non-expert transcription. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), HLT ’10, pages 207–215, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., and Tomlinson, B. (2010). Who are the crowdworkers? Shifting demographics in Mechanical Turk. In Proceedings of the 28th International Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA ’10, pages 2863–2872, New York, NY, USA. ACM.

Snow, R., O’Connor, B., Jurafsky, D., and Ng, A. Y. (2008). Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP 2008, pages 254–263.

Urieli, A. (2013). Robust French syntax analysis: reconciling statistical methods and linguistic knowledge in the Talismane toolkit. PhD thesis, Université de Toulouse II le Mirail, France.

Wais, P., Lingamneni, S., Cook, D., Fennell, J., Goldenberg, B., Lubarov, D., Marin, D., and Simons, H. (2010). Towards building a high-quality workforce with Mechanical Turk. In Proceedings of Computational Social Science and the Wisdom of Crowds (NIPS).
