TRANSCRIPT
CROWDSOURCING
Massimo Poesio
Part 2: Games with a Purpose
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com:
– ESP
– Verbosity
– TagATune
• Other games:
– Peekaboom
– Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003/2004)
• The problem: obtain accurate descriptions of images to be used
– To train image search engines
– To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP: THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
– Hence, the ESP game
• If any of the strings typed by one player matches a string typed by the other player, they score points
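The matching rule above can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual ESP implementation; in particular, the case-insensitive comparison is an assumption:

```python
def first_match(labels_a, labels_b):
    """Return the first label typed by player A that player B also
    typed (case-insensitive), or None if the pair never agree."""
    typed_b = {label.strip().lower() for label in labels_b}
    for label in labels_a:
        if label.strip().lower() in typed_b:
            return label.strip().lower()
    return None

# Both players see the same image and type guesses independently:
print(first_match(["vehicle", "red car", "car"], ["automobile", "Car"]))  # car
```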
THE TASK
SCORING BY MATCHING
THE CHALLENGE: SCORES
• One of the motivating factors is to try to score as many points as possible
• Hourly, daily, weekly, and monthly scores are shown
SCORES
THE CHALLENGE: TIMING
• Partners try to agree on as many images as they can during 2 ½ minutes
• The thermometer on the side indicates how many images they have agreed on
• If they agree on 15 images they score bonus points
TABOO WORDS
• To ensure the production of a large number of specific labels, some words are declared TABOO and not allowed
• Taboo words are obtained from the game itself: any word that has been agreed upon by players who were shown a picture earlier becomes a taboo word for that image
TABOO WORDS
PASSING
GOOD LABELS, COMPLETING AN IMAGE
• A label is considered “good” when more than N players produce it (with N a parameter of the game)
• An image is “done” when its list of taboo words is so extensive that most players pass on it
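The label lifecycle described in the slides (an agreed label becomes taboo for later pairs, a label becomes "good" once more than N pairs produce it, an image is "done" when most pairs pass on it) can be sketched as follows. `n_good` and `pass_rate` are hypothetical stand-ins for the game's unpublished parameters:

```python
from collections import Counter

class ImageLabels:
    """Sketch of the label lifecycle for one ESP image."""

    def __init__(self, n_good=2, pass_rate=0.5):
        self.n_good = n_good
        self.pass_rate = pass_rate
        self.counts = Counter()   # agreed label -> number of pairs
        self.taboo = set()        # labels later pairs may not type
        self.passes = 0
        self.rounds = 0

    def record_round(self, agreed_label=None):
        """Record one pair's round; agreed_label=None means they passed."""
        self.rounds += 1
        if agreed_label is None:
            self.passes += 1
        else:
            self.counts[agreed_label] += 1
            self.taboo.add(agreed_label)

    def good_labels(self):
        return {lab for lab, c in self.counts.items() if c > self.n_good}

    def done(self):
        return self.rounds > 0 and self.passes / self.rounds >= self.pass_rate

img = ImageLabels(n_good=2)
for agreed in ["car", "car", "car", "red"]:
    img.record_round(agreed)
img.record_round(None)            # one pair passed
print(sorted(img.good_labels()))  # ['car']
print(img.done())                 # False
```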
IMPLEMENTATION
• Pre-recorded game play
– Especially at the beginning, and at quiet times, there won't always be players to pair with
– In these cases a player is paired against a recorded 'hand' of a previous game with the same picture
• Cheating
– Players could cheat in a number of ways, including agreeing on labels / playing against themselves
– A number of mechanisms are in place against those cases
• Selecting images
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003:
– 13,630 players
– 1.2 million labels for 293,760 images
– 80% of players played more than once
• By 2008:
– 200,000 players
– 50 million labels
ANALYSIS
• The numbers indicate that the game is fun to play
• Exciting factors:
– Playing with a partner
– Playing against time
QUALITY OF THE LABELS
• For IMAGE SEARCH:
– choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
– 15 participants, 20 images among the 1000 with more than 5 labels
– 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
– 15 participants, 20 images
– 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
VERBOSITY
• … or, the game approach to collecting commonsense knowledge
• Motivation: slow progress both on CYC (5 million facts collected) and on Open Mind Commonsense (around 700,000 facts)
THE GAME
• Based on an existing game, TABOO:
– Players have to guess a word
– One of the players gives hints concerning the word
• In Verbosity, you have two players, the DESCRIBER and the GUESSER, and a SECRET WORD
THE GAME
TEMPLATES IN VERBOSITY
• As in Open Mind Commonsense, templates are used to ensure that the relations / properties of interest are collected
• The Describer produces hints by filling in a template
GUESSING ATTRIBUTES
PRODUCING A DESCRIPTION
TEMPLATES
• _ is a kind of _
• _ is used for _
• _ is typically near/in/on _
• _ is the opposite of _ / _ is related to _
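Template filling can be sketched as follows. The masking of the secret word in the hint shown to the Guesser is an assumption about the interface, and the template strings are taken from the slide above:

```python
# Templates as on the slide; {} marks the blanks the Describer fills.
TEMPLATES = [
    "{} is a kind of {}",
    "{} is used for {}",
    "{} is typically near {}",
    "{} is the opposite of {}",
    "{} is related to {}",
]

def make_hint(secret_word, template, filler):
    """Fill the template with the secret word and the Describer's
    filler, then mask the secret word before showing the Guesser."""
    hint = template.format(secret_word, filler)
    return hint.replace(secret_word, "___")

print(make_hint("laptop", TEMPLATES[0], "computer"))  # ___ is a kind of computer
```

Each successful round then yields a relational fact such as (laptop, IS-A, computer), which is why the templates are constrained to the relations of interest.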
EMULATION
• As in the ESP game, pre-recorded games are used when a player cannot be paired with another player
• The asymmetry of the game causes a problem not encountered in the ESP game
– Describer: can just repeat the behavior of a previous describer
– Guesser: not so easy
RESULTS
• The only published results I'm aware of predate the actual release of the game, so I don't know about the QUANTITY
• Quality:
– Ask six raters whether 200 facts collected using Verbosity are 'true'
– Around 85% success
PEEKABOOM
• Objective: collect data about the presence of objects in images in order to train vision algorithms for object detection
THE GAME
• Two players
• They take turns at playing 'Peek' and 'Boom'
• 'Boom' gets a picture with an associated word; 'Peek' has to guess the associated word
• 'Boom' reveals parts of the picture to 'Peek' by clicking on it (each click reveals a circular area with a 20-pixel radius)
THE GAME: PEEK
THE GAME
PINGS
HINTS
IMPLEMENTATION
• Images and their labels come from ESP
• Cheating:
– Player queue (wait until the next 'matching interval' – one every 10 seconds – to start playing)
– IP address checks (to make sure players are not paired with themselves)
– Blocking bots: 'seed images' (previously annotated) and a blacklist
EVALUATION: USER STATISTICS
• Usage:
– 1 month in 2005
– 14,153 players
– 1,122,998 completed rounds
– The average person played around 158 images (or 72 minutes)
EVALUATION: ACCURACY OF DATA
• Accuracy of bounding boxes
– Choose 50 images played by at least two pairs
– Have four volunteers make bounding boxes
– OVERLAP(A, B) = AREA(A ∩ B) / AREA(A ∪ B)
– Average: 0.75
• Accuracy of pings
– 50 images as above
– Three subjects decide if the ping is 'inside the object'
– Result: 100%
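The bounding-box agreement measure can be written out directly. Representing boxes as (x1, y1, x2, y2) corner coordinates is an assumption about the data format, not stated in the paper:

```python
def overlap(a, b):
    """OVERLAP(A, B) = AREA(A intersect B) / AREA(A union B) for two
    axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

print(overlap((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333...
```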
SOME GENERAL LESSONS
• von Ahn & Dabbish (2008) discuss the general approach and some lessons they took from their work
THREE TEMPLATES
• OUTPUT-AGREEMENT GAMES
– Generalization of ESP
• INVERSION-PROBLEM GAMES
• INPUT-AGREEMENT GAMES
OUTPUT AGREEMENT GAMES
• Two strangers are chosen among all potential players. They cannot see each other or communicate with each other.
• In each round, both are given the same input
• Game instructions say that players should produce the same output as their partners
• Winning condition: they produce the same output, possibly after a few attempts
E.g.: ESP GAME.
INVERSION-PROBLEM GAMES
• Two strangers are chosen among all potential players. They cannot see each other or communicate with each other.
• In each round, one player is designated as the DESCRIBER whereas the other is designated as the GUESSER. The output from the describer should help the guesser guess the original input
• WINNING CONDITION: The guesser correctly guesses the input originally assigned to the describer.
E.g.: VERBOSITY. Based on ‘20 Questions’.
INPUT AGREEMENT GAMES
• Two strangers are chosen among all potential players. They cannot see each other or communicate with each other.
• In each round, both are given input that is known by the game (but not by the players) to be the same or different
• Game instructions say that players should produce output describing their input so that they can decide whether input is same or different
• Winning condition: playing partners correctly decide whether input is same or different.
E.g.: TagATune.
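The three winning conditions can be sketched as predicates. These are hypothetical formalizations of the descriptions above, not von Ahn & Dabbish's notation, and the argument types are assumptions:

```python
def output_agreement_win(outputs_a, outputs_b):
    """e.g. ESP: the pair wins when any output of one player matches
    an output of the other (possibly after several attempts)."""
    return bool(set(outputs_a) & set(outputs_b))

def inversion_problem_win(original_input, guesses):
    """e.g. Verbosity: the Guesser recovers the Describer's input."""
    return original_input in guesses

def input_agreement_win(inputs_same, verdict_a, verdict_b):
    """e.g. TagATune: both players correctly decide same vs. different."""
    return verdict_a == verdict_b == inputs_same

print(output_agreement_win(["car", "auto"], ["automobile", "car"]))  # True
print(inversion_problem_win("laptop", ["computer", "laptop"]))       # True
print(input_agreement_win(False, False, False))                      # True
```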
INCREASE ENJOYMENT
• Games designed so as to make the task enjoyable
• GWAPs by von Ahn et al. attempt to do this by giving players a CHALLENGE:
– TIMED RESPONSE
– SCORE KEEPING
– SKILL LEVELS
– HIGH-SCORE LISTS
OUTPUT ACCURACY
• Mechanisms to ensure correctness and avoid collusion (e.g., two players agreeing to always produce the same label)
– Random matching (players don't know each other's identity)
– Player testing (assess the quality of a particular player's input by matching their output against already annotated data)
– Repetition (output only considered correct if many players produced it)
– Taboo words
MISCELLANEOUS
• Other useful ideas
• Evaluation:
– Efficiency: THROUGHPUT (T)
– 'Enjoyability': AVERAGE LIFETIME PLAY (ALP)
– Combined measure: EXPECTED CONTRIBUTION = T × ALP
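The evaluation measures combine straightforwardly. The numbers below are illustrative only, not measurements from the paper:

```python
def throughput(labels_produced, human_hours):
    """T: average number of problem instances solved per human-hour."""
    return labels_produced / human_hours

def expected_contribution(t, alp_hours):
    """EXPECTED CONTRIBUTION = T * ALP: the work an average player is
    expected to contribute over their lifetime with the game."""
    return t * alp_hours

# Illustrative: 233 labels per human-hour, average lifetime play 1.2 hours
print(expected_contribution(throughput(233, 1.0), 1.2))  # ~279.6 labels per player
```

A high-throughput game with a low ALP (players try it once and leave) can thus contribute less than a slower game people keep playing, which is why both measures matter.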
OTHER GAMES
• On gwap.com:
– TagATune
• Elsewhere:
– FoldIt
– Karaoke Callout
– Phetch
– Spectral Game
FOLDIT
THE PROBLEM: PROTEIN FOLDING
Petsko G.A., Ringe, D., Protein Structure and Function 2004, figure 5-5, pg. 173.
REPRESENTING PROTEIN STRUCTURE
[Figure: wire diagram, ribbon diagram, ball & stick of featured area, space filling (van der Waals), and surface representation (GRASP image; blue: positive, red: negative)]
THE GAME
INTRO: https://www.youtube.com/watch?v=bo99JjnfdA8
DETAILED EXAMPLE: https://www.youtube.com/watch?v=lGYJyur4FUA
EVALUATION
PROBLEMS SOLVED BY FOLDIT PLAYERS
GWAPs for NLP
• Lexical Resource Creation:
– (Verbosity)
– Jeux de Mots
– Groningen Meaning Bank
• Corpus annotation:
– The GIVE challenge
– Phratris
– Phrase Detectives (next lecture)
– The sentiment game
JEUX DE MOTS
• A game to acquire a 'lexical-semantic network': a knowledge base with information about
– Concepts
– Their lexical associations
– Their conceptual relations (ISA, PART-OF, etc.)
• Developed by Mathieu Lafourcade
• In operation since 2007
BASICS
• A two-player game
• The players do not know each other (as in Verbosity etc.)
ENTERING LEXICAL ASSOCIATIONS
• Both players are shown the same target word plus instructions ('mot cible + consigne' in the original French interface) and independently type propositions
SCORING
• The intersection of the two players' propositions determines their accordance
• Players are rewarded for propositions in the intersection
RESULTS OF A GAME
RESULTS SO FAR
• 1,375,432 games played since 2007
– Over 9 million relations entered
• Results of the game(s): a dictionary called DIKO
THE GIVE CHALLENGE
THE GIVE CHALLENGE
• Generating Instructions in Virtual Environments
• A shared task for the NLG community
• Users evaluate systems by playing a game in which the instructions are generated by NLG systems
REFERENCES
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, 51(8), 58–67.
• L. von Ahn and L. Dabbish (2004). Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 319–326.
• L. von Ahn, R. Liu, and M. Blum (2006). Peekaboom: a game for locating objects in images. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 55–64.
• www.gwap.com
• Luis von Ahn's talk on Human Computation at Google Talks