
Crowdsourcing Ambiguity-Aware Ground Truth

Chris Welty, Anca Dumitrache, Oana Inel, Benjamin Timmermans, Lora Aroyo

June 15th, 2017
Collective Intelligence Conference

Crowdsourcing Myth: Disagreement is Bad

• traditionally, disagreement is considered a measure of poor quality in the annotation task, because:

– the task is poorly defined, or

– the annotators lack training

• this makes the elimination of disagreement a goal, rather than accepting disagreement as a natural property of semantic interpretation

What if it is GOOD?

Crowdsourcing Myth: All Examples Are Created Equal

• typically, annotators are asked whether a binary property holds for each example

• they are often not given a chance to say that the property may partially hold, or holds but is not clearly expressed

• the mathematics of using ground truth treats every example the same - it either matches the correct result or it does not

• poor quality examples tend to generate high disagreement

• disagreement allows us to weight sentences, giving us the ability to both train and evaluate a machine in a more flexible way

What if they are DIFFERENT?

Related Work on Annotating Ambiguity

http://CrowdTruth.org

● Jurgens (2013): For word-sense disambiguation, the crowd with ambiguity modeling was able to achieve expert-level quality of annotations.

● Cheatham et al. (2014): Current benchmarks in ontology alignment and evaluation are not designed to model uncertainty caused by disagreement between annotators, both expert and crowd.

● Plank et al. (2014): For part-of-speech tagging, most inter-annotator disagreement pointed to debatable cases in linguistic theory.

● Chang et al. (2017): In a workflow of tasks for collecting and correcting labels for text and images, they found that many ambiguous cases cannot be resolved by better annotation guidelines or through worker quality control.

CrowdTruth

http://CrowdTruth.org

• Annotator disagreement is signal, not noise.

• It is indicative of the variation in human semantic interpretation of signs.

• It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.

Medical Relation Extraction

1. workers select the medical relations that hold between the given 2 terms in a sentence

2. closed task - workers pick from given set of 14 top UMLS relations

3. 975 sentences from PubMed abstracts + distant supervision for term pair extraction

4. 15 workers /sentence

Twitter Event Extraction

1. workers select the events expressed in the tweet

2. closed task - workers pick from set of 8 events

3. 2,019 English tweets, crawled based on event hashtags

4. 7 workers /tweet (*) no expert annotators

News Events Identification

1. workers highlight words/phrases that describe an event in the sentence

2. open-ended task - workers can pick any words

3. 200 sentences from English TimeBank corpus

4. 15 workers /sentence

Sound Interpretation

1. workers give tags that describe a sound

2. open-ended task - workers can pick any tag

3. 284 sounds from Freesound database

4. 10 workers / sound

http://CrowdTruth.org
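The four setups above differ only in a few parameters. As a minimal sketch (with hypothetical field names, not the authors' code), the task configurations could be captured like this:

```python
from dataclasses import dataclass

@dataclass
class TaskConfig:
    """One CrowdTruth annotation task, as described on the overview slide."""
    name: str
    open_ended: bool        # True if workers may enter any annotation
    label_set_size: int     # size of the closed annotation set (0 for open tasks)
    num_units: int          # media units: sentences, tweets, or sounds
    workers_per_unit: int

TASKS = [
    TaskConfig("Medical Relation Extraction", False, 14, 975, 15),
    TaskConfig("Twitter Event Extraction",    False,  8, 2019, 7),
    TaskConfig("News Events Identification",  True,   0,  200, 15),
    TaskConfig("Sound Interpretation",        True,   0,  284, 10),
]
```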

Medical Relation Extraction

http://CrowdTruth.org

Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1

Is ACUTE FEVER – related to → INFLUENZA AH1N1?


Worker Vector - Closed Task

http://CrowdTruth.org

Medical Relation Extraction

[figure: each worker's judgment shown as a binary vector over the candidate relations, with a 1 for every relation that worker selected for the sentence]

Media Unit Vector - Closed Task

http://CrowdTruth.org

[figure: summing the worker vectors gives the media unit vector for the sentence, here 0 1 1 0 0 4 3 0 0 5 1 0]
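A minimal sketch of this closed-task aggregation (illustrative names only, not the CrowdTruth.org implementation): each worker's selections become a binary vector over the candidate relations, and the media unit vector is their sum.

```python
import numpy as np

# Illustrative relation labels; the real task uses the 14 top UMLS relations.
RELATIONS = ["treats", "prevents", "causes", "symptom_of", "diagnosed_by",
             "associated_with", "side_effect_of", "manifestation_of",
             "contraindicates", "is_a", "part_of", "other"]

def worker_vector(selected_relations, relations=RELATIONS):
    """Binary vector with a 1 for every relation the worker selected."""
    return np.array([1 if r in selected_relations else 0 for r in relations])

def media_unit_vector(judgments, relations=RELATIONS):
    """Sum of all worker vectors collected for one media unit (sentence)."""
    return sum(worker_vector(j, relations) for j in judgments)

# e.g. three workers' selections for one sentence
print(media_unit_vector([{"causes"}, {"causes", "symptom_of"}, {"associated_with"}]))
```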

Medical Relation Extraction

http://CrowdTruth.org

Unclear relationship between the two arguments reflected in the disagreement

Medical Relation Extraction

http://CrowdTruth.org

Clearly expressed relation between the two arguments reflected in the agreement

Twitter Event Extraction

http://CrowdTruth.org

The tension building between China, Japan, U.S., and Vietnam is reaching new heights this week - let's stay tuned! #APAC #powerstruggle

Which of the following EVENTS can you identify in the tweet?

[figure: worker votes split across several events, e.g. "islands disputed between China and Japan" 4, "anti China protests in Vietnam" 6, "None" 2]

Unclear description of events in tweet reflected in the disagreement

Twitter Event Extraction

http://CrowdTruth.org

RT @jc_stubbs: Another tragic day in #Ukraine - More than 50 rebels killed as new leader unleashes assault: http://t.co/wcfU3kyAFX

Which of the following EVENTS can you identify in the tweet?

[figure: worker votes concentrated on a single event: "Ukraine crisis 2014" 6, "None" 1]

Clear description of events in tweet reflected in the agreement

Media Unit Vector - Open Task

http://CrowdTruth.org

Sound Interpretation

Worker 1: balloon exploding

Worker 2: gun

Worker 3: gunshot

Worker 4: loud noise, pop

Worker 5: gun, balloon

+ clustering (e.g. word2vec; see the sketch after this example)

[figure: clustered tags with distinct-worker counts: balloon 2, gun/gunshot 3, loud noise 1, pop 1]

Multiple interpretations of the sound reflected in the disagreement
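In open-ended tasks the annotation space is not fixed in advance, so free-text tags are first grouped into clusters (the slide suggests word2vec-style similarity) and worker counts are taken per cluster. A minimal sketch with a toy similarity predicate standing in for word2vec (all names hypothetical, not the authors' pipeline):

```python
from collections import Counter

def cluster_tags(worker_tags, similar):
    """Greedily group free-text tags into clusters using a pairwise
    similarity predicate (a word2vec-based one in the real pipeline)."""
    clusters = []                      # each cluster: list of (worker_id, tag)
    for worker_id, tag in worker_tags:
        for cluster in clusters:
            if any(similar(tag, other) for _, other in cluster):
                cluster.append((worker_id, tag))
                break
        else:
            clusters.append([(worker_id, tag)])
    return clusters

def unit_vector_from_clusters(clusters):
    """Count distinct workers per tag cluster, as in the example above."""
    return Counter({"/".join(sorted({t for _, t in c})): len({w for w, _ in c})
                    for c in clusters})

def toy_similar(a, b):
    """Crude stand-in for word2vec similarity: shared word or shared 3-letter prefix."""
    wa, wb = set(a.split()), set(b.split())
    return bool(wa & wb) or any(x[:3] == y[:3] for x in wa for y in wb)

tags = [(1, "balloon exploding"), (2, "gun"), (3, "gunshot"),
        (4, "loud noise"), (4, "pop"), (5, "gun"), (5, "balloon")]
print(unit_vector_from_clusters(cluster_tags(tags, toy_similar)))
# cluster counts match the slide: 2, 3, 1, 1
```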

Sound Interpretation

http://CrowdTruth.org

Worker 1: siren

Worker 2: police siren

Worker 3: siren, alarm

Worker 4: siren

Worker 5: annoying

+ clustering (e.g. word2vec)

[figure: clustered tags with distinct-worker counts: siren 4, alarm 1, annoying 1]

Clear meaning of the sound reflected in the agreement

News Event Identification

http://CrowdTruth.org

Other than usage in business, Internet technology is also beginning to infiltrate the lifestyle domain.

HIGHLIGHT the words/phrases that refer to an EVENT.

[figure: worker highlight counts spread across many different words of the sentence, e.g. "usage" 3, "domain" 3, "business" 1, "Other" 0]

Unclear specification of events in sentence reflected in the disagreement

News Event Identification

http://CrowdTruth.org

Most of the city's monuments were destroyed including a magnificent tiled mosque which dominated the skyline for centuries.

HIGHLIGHT the words/phrases that refer to an EVENT.

[figure: worker highlight counts concentrated on "destroyed" (12 workers), with only scattered counts on other words such as "were" 5, "monuments" 5, "dominated" 4]

Clear specification of events in sentence reflected in the agreement
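For the highlighting task, the media unit vector is built over the words of the sentence: each worker highlights spans, and the per-word counts above are the number of workers whose highlights cover each token. A minimal sketch (the span representation is a hypothetical one):

```python
from collections import Counter

def word_counts(sentence, worker_spans):
    """Count, for every token, how many workers highlighted a span covering it.

    worker_spans: one list of (start_token, end_token) index pairs per worker
    (end exclusive) - a hypothetical representation of the highlights.
    """
    tokens = sentence.split()
    counts = Counter()
    for spans in worker_spans:
        covered = set()
        for start, end in spans:
            covered.update(range(start, end))
        for i in covered:
            counts[tokens[i]] += 1
    return [(tok, counts[tok]) for tok in tokens]

sentence = "Most of the city's monuments were destroyed including a mosque"
spans = [[(6, 7)], [(5, 7)], [(6, 7), (4, 5)]]   # three workers' highlights
print(word_counts(sentence, spans))              # "destroyed" gets 3, most words 0
```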

Media Unit - Annotation Score

http://CrowdTruth.org

Measures how clearly a media unit expresses an annotation.

Unit vector for annotation A6: 0 0 0 0 0 1 0 0 0 0 0 0

Media Unit Vector: 0 1 1 0 0 4 3 0 0 5 1 0

Cosine = .55
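As shown on the slide, the media unit - annotation score is the cosine between the annotation's one-hot unit vector and the media unit vector; a quick check (illustrative code, not the authors') reproduces the 0.55:

```python
import numpy as np

def unit_annotation_score(media_unit_vec, annotation_index):
    """Cosine between the media unit vector and the one-hot vector of one annotation."""
    v = np.asarray(media_unit_vec, dtype=float)
    one_hot = np.zeros_like(v)
    one_hot[annotation_index] = 1.0
    return float(v @ one_hot / (np.linalg.norm(v) * np.linalg.norm(one_hot)))

v = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
print(round(unit_annotation_score(v, 5), 2))   # annotation A6 (index 5) -> 0.55
```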

Experimental Setup

http://CrowdTruth.org

• Goal: which data labeling method is the most accurate? (see the sketch after this list)

○ CrowdTruth: media unit - annotation score (continuous value)

○ Majority Vote: decision of the majority of workers (discrete)

○ Single: decision of a single worker, randomly sampled (discrete)

○ Expert: decision of a domain expert (discrete)

• Approach: evaluate against a trusted label set built from either:

○ agreement between crowd and expert, or

○ manual evaluation in cases of disagreement
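A minimal sketch (not the authors' code) of how three of the labeling methods could be derived from the same worker vectors; the Expert labels come from an external annotation and are not shown:

```python
import numpy as np

def crowdtruth_labels(media_unit_vec, threshold):
    """Continuous unit-annotation scores, thresholded into discrete labels."""
    v = np.asarray(media_unit_vec, dtype=float)
    scores = v / np.linalg.norm(v)          # cosine with each one-hot annotation vector
    return scores >= threshold

def majority_vote_labels(worker_vectors):
    """An annotation is positive if more than half of the workers selected it."""
    w = np.asarray(worker_vectors)
    return w.sum(axis=0) > w.shape[0] / 2

def single_worker_labels(worker_vectors, rng=np.random.default_rng(0)):
    """Decision of a single, randomly sampled worker."""
    w = np.asarray(worker_vectors)
    return w[rng.integers(w.shape[0])].astype(bool)
```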

Evaluation: F1 score

CrowdTruth performs better than Majority Vote, and at least as well as Expert.

Each task has a different best threshold on the media unit - annotation score (see the sketch below).
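The per-task best threshold can be found by sweeping the media unit - annotation score cutoff and scoring each resulting labeling against the trusted labels. A minimal sketch using scikit-learn's f1_score (the flat data layout is an assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(scores, trusted, thresholds=np.arange(0.0, 1.01, 0.05)):
    """Sweep the unit-annotation score threshold and return (best_threshold, best_F1).

    scores, trusted: flat arrays with one entry per (media unit, annotation) pair.
    """
    results = [(t, f1_score(trusted, scores >= t)) for t in thresholds]
    return max(results, key=lambda r: r[1])
```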

Evaluation: number of workers

Each task reaches a stable F1 with a different number of workers.

Majority Vote never beats CrowdTruth.

Sound Interpretation needs more workers.
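The number-of-workers curves can be approximated by repeatedly subsampling k workers per media unit, rebuilding the vectors, and re-running the same F1 evaluation. A minimal sketch of the subsampling step (hypothetical data layout, not the authors' code):

```python
import numpy as np

def subsample_unit_vector(worker_vectors, k, rng=np.random.default_rng(0)):
    """Media unit vector built from a random subset of k of the unit's workers."""
    w = np.asarray(worker_vectors)
    idx = rng.choice(w.shape[0], size=k, replace=False)
    return w[idx].sum(axis=0)
```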

Experiments proved that:

http://CrowdTruth.org

• CrowdTruth performs just as well as domain experts

– the crowd is also cheaper

– the crowd is always available

• capturing ambiguity is essential

– majority voting discards important signals in the data

• using only a few annotators for ground truth is faulty

– the optimal number of workers / media unit is task dependent

CrowdTruth.org

Dumitrache et al.: Empirical Methodology for Crowdsourcing Ground Truth. Semantic Web Journal, Special Issue on Human Computation and Crowdsourcing (HC&C) in the Context of the Semantic Web (in review).

Ambiguity in Crowdsourcing

http://CrowdTruth.org

[figure: the CrowdTruth triangle linking Media Unit, Annotation, and Worker]

Disagreement can indicate ambiguity!