
Crowdsourcing Ambiguity-Aware Ground Truth

Chris Welty, Anca Dumitrache, Oana Inel, Benjamin Timmermans, Lora Aroyo

June 15th, 2017
Collective Intelligence Conference

Crowdsourcing Myth: Disagreement is Bad

• traditionally, disagreement is considered a measure of poor quality in the annotation task, because:

– the task is poorly defined, or

– the annotators lack training

• this makes the elimination of disagreement a goal, rather than accepting disagreement as a natural property of semantic interpretation

What if it is GOOD?

Crowdsourcing Myth: All Examples Are Created Equal

• typically, annotators are asked whether a binary property holds for each example

• they are often not given a chance to say that the property may partially hold, or holds but is not clearly expressed

• the mathematics of using ground truth treats every example the same - it either matches the correct result or it does not

• poor quality examples tend to generate high disagreement

• disagreement allows us to weight sentences, giving us the ability to both train and evaluate a machine in a more flexible way

What if they are DIFFERENT?

Related Work on Annotating Ambiguity

http://CrowdTruth.org

● Jurgens (2013): For word-sense disambiguation, the crowd with ambiguity modeling was able to achieve expert-level quality of annotations.

● Cheatham et al. (2014): Current benchmarks in ontology alignment and evaluation are not designed to model uncertainty caused by disagreement between annotators, both expert and crowd.

● Plank et al. (2014): For part-of-speech tagging, most inter-annotator disagreement pointed to debatable cases in linguistic theory.

● Chang et al. (2017): In a workflow of tasks for collecting and correcting labels for text and images, they found that many ambiguous cases cannot be resolved by better annotation guidelines or through worker quality control.

CrowdTruth

http://CrowdTruth.org

• Annotator disagreement is signal, not noise.

• It is indicative of the variation in human semantic interpretation of signs.

• It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.

Medical Relation Extraction

1. workers select the medical relations that hold between the given 2 terms in a sentence

2. closed task - workers pick from given set of 14 top UMLS relations

3. 975 sentences from PubMed abstracts + distant supervision for term pair extraction

4. 15 workers /sentence

Twitter Event Extraction

1. workers select the events expressed in the tweet

2. closed task - workers pick from set of 8 events

3. 2,019 English tweets, crawled based on event hashtags

4. 7 workers /tweet (*) no expert annotators

News Events Identification

1. workers highlight words/phrases that describe an event in the sentence

2. open-ended task - workers can pick any words

3. 200 sentences from English TimeBank corpus

4. 15 workers /sentence

Sound Interpretation

1. workers give tags that describe a sound

2. open-ended task - workers can pick any tag

3. 284 sounds from Freesound database

4. 10 workers / sound

http://CrowdTruth.org
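The four setups above differ only in a few parameters. As a minimal sketch (with hypothetical field names, not the authors' code), the task configurations could be captured like this:

```python
from dataclasses import dataclass

@dataclass
class TaskConfig:
    """One CrowdTruth annotation task, as described on the overview slide."""
    name: str
    open_ended: bool        # True if workers may enter any annotation
    label_set_size: int     # size of the closed annotation set (0 for open tasks)
    num_units: int          # media units: sentences, tweets, or sounds
    workers_per_unit: int

TASKS = [
    TaskConfig("Medical Relation Extraction", False, 14, 975, 15),
    TaskConfig("Twitter Event Extraction",    False,  8, 2019, 7),
    TaskConfig("News Events Identification",  True,   0,  200, 15),
    TaskConfig("Sound Interpretation",        True,   0,  284, 10),
]
```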

Medical Relation Extraction

http://CrowdTruth.org

Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1

Is ACUTE FEVER – related to → INFLUENZA AH1N1?


Worker Vector - Closed Task

http://CrowdTruth.org

Medical Relation Extraction

[figure: each worker's judgment shown as a binary vector over the candidate relations, with a 1 for every relation that worker selected for the sentence]

Media Unit Vector - Closed Task

http://CrowdTruth.org

[figure: summing the worker vectors gives the media unit vector for the sentence, here 0 1 1 0 0 4 3 0 0 5 1 0]
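A minimal sketch of this closed-task aggregation (illustrative names only, not the CrowdTruth.org implementation): each worker's selections become a binary vector over the candidate relations, and the media unit vector is their sum.

```python
import numpy as np

# Illustrative relation labels; the real task uses the 14 top UMLS relations.
RELATIONS = ["treats", "prevents", "causes", "symptom_of", "diagnosed_by",
             "associated_with", "side_effect_of", "manifestation_of",
             "contraindicates", "is_a", "part_of", "other"]

def worker_vector(selected_relations, relations=RELATIONS):
    """Binary vector with a 1 for every relation the worker selected."""
    return np.array([1 if r in selected_relations else 0 for r in relations])

def media_unit_vector(judgments, relations=RELATIONS):
    """Sum of all worker vectors collected for one media unit (sentence)."""
    return sum(worker_vector(j, relations) for j in judgments)

# e.g. three workers' selections for one sentence
print(media_unit_vector([{"causes"}, {"causes", "symptom_of"}, {"associated_with"}]))
```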

Medical Relation Extraction

http://CrowdTruth.org

Unclear relationship between the two arguments reflected in the disagreement

Medical Relation Extraction

http://CrowdTruth.org

Clearly expressed relation between the two arguments reflected in the agreement

Twitter Event Extraction

http://CrowdTruth.org

The tension building between China, Japan, U.S., and Vietnam is reaching new heights this week - let's stay tuned! #APAC #powerstruggle

Which of the following EVENTS can you identify in the tweet?

[figure: worker votes split across several events, e.g. "islands disputed between China and Japan" 4, "anti China protests in Vietnam" 6, "None" 2]

Unclear description of events in tweet reflected in the disagreement

Twitter Event Extraction

http://CrowdTruth.org

RT @jc_stubbs: Another tragic day in #Ukraine - More than 50 rebels killed as new leader unleashes assault: http://t.co/wcfU3kyAFX

Which of the following EVENTS can you identify in the tweet?

[figure: worker votes concentrated on a single event: "Ukraine crisis 2014" 6, "None" 1]

Clear description of events in tweet reflected in the agreement

Media Unit Vector - Open Task

http://CrowdTruth.org

Sound Interpretation

Worker 1: balloon exploding

Worker 2: gun

Worker 3: gunshot

Worker 4: loud noise, pop

Worker 5: gun, balloon

+ clustering (e.g. word2vec; see the sketch after this example)

[figure: clustered tags with distinct-worker counts: balloon 2, gun/gunshot 3, loud noise 1, pop 1]

Multiple interpretations of the sound reflected in the disagreement
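In open-ended tasks the annotation space is not fixed in advance, so free-text tags are first grouped into clusters (the slide suggests word2vec-style similarity) and worker counts are taken per cluster. A minimal sketch with a toy similarity predicate standing in for word2vec (all names hypothetical, not the authors' pipeline):

```python
from collections import Counter

def cluster_tags(worker_tags, similar):
    """Greedily group free-text tags into clusters using a pairwise
    similarity predicate (a word2vec-based one in the real pipeline)."""
    clusters = []                      # each cluster: list of (worker_id, tag)
    for worker_id, tag in worker_tags:
        for cluster in clusters:
            if any(similar(tag, other) for _, other in cluster):
                cluster.append((worker_id, tag))
                break
        else:
            clusters.append([(worker_id, tag)])
    return clusters

def unit_vector_from_clusters(clusters):
    """Count distinct workers per tag cluster, as in the example above."""
    return Counter({"/".join(sorted({t for _, t in c})): len({w for w, _ in c})
                    for c in clusters})

def toy_similar(a, b):
    """Crude stand-in for word2vec similarity: shared word or shared 3-letter prefix."""
    wa, wb = set(a.split()), set(b.split())
    return bool(wa & wb) or any(x[:3] == y[:3] for x in wa for y in wb)

tags = [(1, "balloon exploding"), (2, "gun"), (3, "gunshot"),
        (4, "loud noise"), (4, "pop"), (5, "gun"), (5, "balloon")]
print(unit_vector_from_clusters(cluster_tags(tags, toy_similar)))
# cluster counts match the slide: 2, 3, 1, 1
```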

Sound Interpretation

http://CrowdTruth.org

Worker 1: siren

Worker 2: police siren

Worker 3: siren, alarm

Worker 4: siren

Worker 5: annoying

+ clustering (e.g. word2vec)

[figure: clustered tags with distinct-worker counts: siren 4, alarm 1, annoying 1]

Clear meaning of the sound reflected in the agreement

News Event Identification

http://CrowdTruth.org

Other than usage in business, Internet technology is also beginning to infiltrate the lifestyle domain.

HIGHLIGHT the words/phrases that refer to an EVENT.

[figure: worker highlight counts spread across many different words of the sentence, e.g. "usage" 3, "domain" 3, "business" 1, "Other" 0]

Unclear specification of events in sentence reflected in the disagreement

News Event Identification

http://CrowdTruth.org

Most of the city's monuments were destroyed including a magnificent tiled mosque which dominated the skyline for centuries.

HIGHLIGHT the words/phrases that refer to an EVENT.

[figure: worker highlight counts concentrated on "destroyed" (12 workers), with only scattered counts on other words such as "were" 5, "monuments" 5, "dominated" 4]

Clear specification of events in sentence reflected in the agreement
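For the highlighting task, the media unit vector is built over the words of the sentence: each worker highlights spans, and the per-word counts above are the number of workers whose highlights cover each token. A minimal sketch (the span representation is a hypothetical one):

```python
from collections import Counter

def word_counts(sentence, worker_spans):
    """Count, for every token, how many workers highlighted a span covering it.

    worker_spans: one list of (start_token, end_token) index pairs per worker
    (end exclusive) - a hypothetical representation of the highlights.
    """
    tokens = sentence.split()
    counts = Counter()
    for spans in worker_spans:
        covered = set()
        for start, end in spans:
            covered.update(range(start, end))
        for i in covered:
            counts[tokens[i]] += 1
    return [(tok, counts[tok]) for tok in tokens]

sentence = "Most of the city's monuments were destroyed including a mosque"
spans = [[(6, 7)], [(5, 7)], [(6, 7), (4, 5)]]   # three workers' highlights
print(word_counts(sentence, spans))              # "destroyed" gets 3, most words 0
```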

Media Unit - Annotation Score

http://CrowdTruth.org

Measures how clearly a media unit expresses an annotation.

Unit vector for annotation A6: 0 0 0 0 0 1 0 0 0 0 0 0

Media Unit Vector: 0 1 1 0 0 4 3 0 0 5 1 0

Cosine = .55
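As shown on the slide, the media unit - annotation score is the cosine between the annotation's one-hot unit vector and the media unit vector; a quick check (illustrative code, not the authors') reproduces the 0.55:

```python
import numpy as np

def unit_annotation_score(media_unit_vec, annotation_index):
    """Cosine between the media unit vector and the one-hot vector of one annotation."""
    v = np.asarray(media_unit_vec, dtype=float)
    one_hot = np.zeros_like(v)
    one_hot[annotation_index] = 1.0
    return float(v @ one_hot / (np.linalg.norm(v) * np.linalg.norm(one_hot)))

v = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
print(round(unit_annotation_score(v, 5), 2))   # annotation A6 (index 5) -> 0.55
```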

Experimental Setup

http://CrowdTruth.org

• Goal: which data labeling method is the most accurate? (see the sketch after this list)

○ CrowdTruth: media unit - annotation score (continuous value)

○ Majority Vote: decision of the majority of workers (discrete)

○ Single: decision of a single worker, randomly sampled (discrete)

○ Expert: decision of a domain expert (discrete)

• Approach: evaluate against a trusted label set built from either:

○ agreement between crowd and expert, or

○ manual evaluation in cases of disagreement
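A minimal sketch (not the authors' code) of how three of the labeling methods could be derived from the same worker vectors; the Expert labels come from an external annotation and are not shown:

```python
import numpy as np

def crowdtruth_labels(media_unit_vec, threshold):
    """Continuous unit-annotation scores, thresholded into discrete labels."""
    v = np.asarray(media_unit_vec, dtype=float)
    scores = v / np.linalg.norm(v)          # cosine with each one-hot annotation vector
    return scores >= threshold

def majority_vote_labels(worker_vectors):
    """An annotation is positive if more than half of the workers selected it."""
    w = np.asarray(worker_vectors)
    return w.sum(axis=0) > w.shape[0] / 2

def single_worker_labels(worker_vectors, rng=np.random.default_rng(0)):
    """Decision of a single, randomly sampled worker."""
    w = np.asarray(worker_vectors)
    return w[rng.integers(w.shape[0])].astype(bool)
```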

Evaluation: F1 score

CrowdTruth performs better than Majority Vote, and at least as well as Expert.

Each task has a different best threshold on the media unit - annotation score (see the sketch below).
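The per-task best threshold can be found by sweeping the media unit - annotation score cutoff and scoring each resulting labeling against the trusted labels. A minimal sketch using scikit-learn's f1_score (the flat data layout is an assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(scores, trusted, thresholds=np.arange(0.0, 1.01, 0.05)):
    """Sweep the unit-annotation score threshold and return (best_threshold, best_F1).

    scores, trusted: flat arrays with one entry per (media unit, annotation) pair.
    """
    results = [(t, f1_score(trusted, scores >= t)) for t in thresholds]
    return max(results, key=lambda r: r[1])
```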

Evaluation: number of workers

Each task reaches a stable F1 with a different number of workers.

Majority Vote never beats CrowdTruth.

Sound Interpretation needs more workers.
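The number-of-workers curves can be approximated by repeatedly subsampling k workers per media unit, rebuilding the vectors, and re-running the same F1 evaluation. A minimal sketch of the subsampling step (hypothetical data layout, not the authors' code):

```python
import numpy as np

def subsample_unit_vector(worker_vectors, k, rng=np.random.default_rng(0)):
    """Media unit vector built from a random subset of k of the unit's workers."""
    w = np.asarray(worker_vectors)
    idx = rng.choice(w.shape[0], size=k, replace=False)
    return w[idx].sum(axis=0)
```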

Experiments proved that:

http://CrowdTruth.org

• CrowdTruth performs just as well as domain experts

– the crowd is also cheaper

– the crowd is always available

• capturing ambiguity is essential

– majority voting discards important signals in the data

• using only a few annotators for ground truth is faulty

– the optimal number of workers / media unit is task dependent

CrowdTruth.org

Dumitrache et al.: Empirical Methodology for Crowdsourcing Ground Truth. Semantic Web Journal, Special Issue on Human Computation and Crowdsourcing (HC&C) in the Context of the Semantic Web (in review).

Ambiguity in Crowdsourcing

http://CrowdTruth.org

[figure: the CrowdTruth triangle linking Media Unit, Annotation, and Worker]

Disagreement can indicate ambiguity!