CROWDSOURCING Massimo Poesio Part 4: Dealing with crowdsourced data


TRANSCRIPT

Page 1

CROWDSOURCING

Massimo Poesio
Part 4: Dealing with crowdsourced data

Page 2

THE DATA

• The result of crowdsourcing, in whatever form, is a mass of often inconsistent judgments

• We need techniques for identifying reliable annotations and reliable annotators
– In the Phrase Detectives context, to discriminate between genuine ambiguity and disagreement due to error

Page 3

THE ANDROCLES EXAMPLE

Page 4

SOME APPROACHES

• Majority voting (a minimal sketch follows this list)
– But: it ignores the substantial differences in behavior between annotators
• Alternatives:
– Removing bad annotators, e.g. using clustering
– Weighting annotators
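A minimal sketch of majority voting over an annotation table of (item, coder, label) triples, as on Page 13; the names are illustrative, and ties are broken arbitrarily:

from collections import Counter, defaultdict

def majority_vote(annotations):
    """Pick the most frequent label per item; annotations are (item, coder, label)."""
    votes = defaultdict(Counter)
    for item, _coder, label in annotations:
        votes[item][label] += 1
    # most_common(1) breaks ties arbitrarily, one symptom of the problems noted above
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}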

Page 5

SNOW ET AL

Page 6

SNOW ET AL: WEIGHTING ANNOTATORS
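The transcript does not reproduce this slide's content, so Snow et al.'s exact weighting scheme is not shown here. One common variant, offered only as an illustrative sketch, estimates each coder's accuracy on a small gold-labeled subset and weights votes by smoothed log-odds:

import math
from collections import defaultdict

def weighted_vote(annotations, gold):
    """annotations: (item, coder, label) triples; gold: dict item -> true label."""
    seen, correct = defaultdict(int), defaultdict(int)
    for item, coder, label in annotations:
        if item in gold:
            seen[coder] += 1
            correct[coder] += (label == gold[item])
    # Laplace-smoothed accuracy turned into a log-odds weight per coder;
    # coders never observed on gold items get weight 0 in this sketch
    weight = {c: math.log((correct[c] + 1.0) / (seen[c] - correct[c] + 1.0))
              for c in seen}
    scores = defaultdict(lambda: defaultdict(float))
    for item, coder, label in annotations:
        scores[item][label] += weight.get(coder, 0.0)
    return {item: max(s, key=s.get) for item, s in scores.items()}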

Page 7

LATENT MODELS OF ANNOTATION QUALITY

• The problem of reaching a conclusion on the basis of judgments by separate experts that may often be in disagreement is a longstanding one in epidemiology
• A number of techniques have been developed, including
– Dawid and Skene 1979 (also used by Passonneau & Carpenter)
– the latent annotation model (Uebersax 1994)
– Raykar et al 2010
• Recently, Carpenter (2008) has developed an explicit hierarchical Bayesian model

Page 8

DAWID AND SKENE 1979

• The model consists of a likelihood for
1. annotations (labels from annotators)
2. categories (true labels) for items, given
3. annotator accuracies and biases
4. prevalence of labels
• Frequentists estimate 2–4 given 1 (see the EM sketch below)
• Optional regularization of the estimates (for 3 and 4)
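A compact EM sketch of the Dawid & Skene estimator, assuming numpy, items indexed 0..I-1, coders 0..J-1, labels 0..K-1, and at least one annotation per item; the smoothing constant plays the role of the optional regularization:

import numpy as np

def dawid_skene(annotations, I, J, K, n_iter=50, smooth=0.01):
    """EM for the Dawid & Skene model over (item, coder, label) triples."""
    # t[i, k]: current posterior that item i has true category k,
    # initialized from raw vote proportions (assumes every item is annotated)
    t = np.zeros((I, K))
    for i, j, y in annotations:
        t[i, y] += 1.0
    t /= t.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: prevalence pi and annotator response matrices theta[j, k, k']
        pi = (t.sum(axis=0) + smooth) / (I + K * smooth)
        theta = np.full((J, K, K), smooth)      # additive smoothing regularizes theta
        for i, j, y in annotations:
            theta[j, :, y] += t[i]
        theta /= theta.sum(axis=2, keepdims=True)

        # E-step: t[i, k] proportional to pi_k * prod_j theta[j, k, y_{i,j}]
        log_t = np.tile(np.log(pi), (I, 1))
        for i, j, y in annotations:
            log_t[i] += np.log(theta[j, :, y])
        t = np.exp(log_t - log_t.max(axis=1, keepdims=True))
        t /= t.sum(axis=1, keepdims=True)
    return t, pi, theta

t.argmax(axis=1) then gives the estimated true categories, pi the estimated prevalence, and theta[j] annotator j's estimated response matrix.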

Page 9

A GENERATIVE MODEL OF THE ANNOTATION TASK

• What all of these models do is provide an EXPLICIT PROBABILISTIC MODEL of the observations in terms of annotators, labels, and items

Page 10

THE DATA

• K: the number of possible labels
• J: the number of annotators
• I: the number of items
• N: the total number of annotations of the I items produced by the J annotators
• y_{i,j}: the label produced for item i by coder j

Page 11

THE DATA: BY ITEM

ITEM   CODER 1   CODER 2   …   CODER J
1      y_{1,1}   y_{1,2}   …
2      y_{2,1}   …
3
4                y_{4,2}       y_{4,J}
…
I      y_{I,1}   …

Page 12

THE DATA: BY ANNOTATIONS

ANNOTATION   LABEL
1            y_{1,1}
2            y_{1,2}
3            y_{2,3}
4            …
…
N

Page 13

THE ANNOTATION TABLE

ANNOTATION   ii_n   jj_n   y_{ii_n,jj_n}
1            1      1      A
2            1      2      A
3            2      3      B
4            …
…
N

Page 14

A GENERATIVE MODEL OF THE ANNOTATION TASK

• The probabilistic model specifies the probability of a particular label on the basis of PARAMETERS specifying the behavior of the annotators, the prevalence of the labels, etc.

• In Bayesian models, these parameters are specified in terms of PROBABILITY DISTRIBUTIONS

Page 15

THE PARAMETERS OF THE MODEL

• z_i: the ACTUAL category of item i
• Θ_{j,k,k'}: ANNOTATOR RESPONSE
– the probability that annotator j labels an item as k' when it belongs to category k
• π_k: PREVALENCE
– the probability that an item belongs to category k

Page 16

DISTRIBUTIONS

• Each of the parameters is characterized in terms of a PROBABILITY DISTRIBUTION
• When we have some information about the data, these distributions can be used to encode it
– E.g., the annotators may all be equally good, or there may be a skew
• Otherwise, defaults are used

Page 17

DISTRIBUTIONS

• Prevalence of labels (PRIOR):
– π ~ Dir(α)
• Annotator j's response to items of category k (PRIOR):
– Θ_{j,k} ~ Dir(β_k)
• True category of item i (LIKELIHOOD):
– z_i ~ Categorical(π)
• Label from j for item i (LIKELIHOOD):
– y_{i,j} ~ Categorical(Θ_{j,z_i})

(a small simulation of this generative process is sketched below)
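A small simulation of the generative process above, assuming numpy; the sizes and hyperparameters are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
I, J, K = 1000, 5, 3                        # items, annotators, labels

alpha = np.ones(K)                          # uniform prior on prevalence
beta = np.eye(K) * 5 + 1                    # rows of beta favor accurate coders

pi = rng.dirichlet(alpha)                   # pi ~ Dir(alpha)
theta = np.array([[rng.dirichlet(beta[k])   # Theta_{j,k} ~ Dir(beta_k)
                   for k in range(K)] for _ in range(J)])

z = rng.choice(K, size=I, p=pi)             # z_i ~ Categorical(pi)
y = np.array([[rng.choice(K, p=theta[j, z[i]])  # y_{i,j} ~ Categorical(Theta_{j,z_i})
               for j in range(J)] for i in range(I)])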

Page 18

TYPES OF ANNOTATORS: SPAMMY

(RESPONSE TO ALL ITEMS THE SAME)

Page 19

TYPES OF ANNOTATORS: BIASED

(HAS SKEW IN RESPONSE – COMMON IN LOW PREVALENCE DATA)
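The figures for these two slides are not reproduced in the transcript; as an illustration (with invented numbers), the corresponding response matrices Θ_j for a binary task would look like:

import numpy as np

# rows = true category k, columns = produced label k'
theta_spammy = np.array([[0.8, 0.2],   # same response distribution...
                         [0.8, 0.2]])  # ...whatever the true category is

theta_biased = np.array([[0.9, 0.1],   # skewed toward label 0: misses most
                         [0.6, 0.4]])  # positives, common in low-prevalence data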

Page 20

QUICK INTRO TO DIRICHLET

• The Dirichlet is often seen in Bayesian models (e.g., Latent Dirichlet Allocation, LDA) because it is a CONJUGATE PRIOR of the MULTINOMIAL distribution

Page 21

BINOMIAL AND MULTINOMIAL
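The body of this slide is not in the transcript; for reference, the standard definitions it presumably displayed are

P(X = x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}   (Binomial)

P(x_1, \ldots, x_K \mid n, \pi) = \frac{n!}{x_1! \cdots x_K!} \prod_{k=1}^{K} \pi_k^{x_k}   (Multinomial)

with the Bernoulli and categorical distributions (Page 24) as the n = 1 special cases.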

Page 22

CONJUGATE PRIOR

• In Bayesian inference the objective is to compute a POSTERIOR on the basis of a LIKELIHOOD and a PRIOR:

P(B|A) = P(A|B) P(B) / P(A)

• A CONJUGATE PRIOR of a distribution D is a distribution such that, if it is used for the prior, then the posterior also has that shape
– E.g., 'Dirichlet is a conjugate prior of the multinomial' means that if the likelihood is a multinomial and the prior is Dirichlet, then the posterior is also Dirichlet.
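A worked instance of this conjugacy: if π ~ Dir(α) and we observe multinomial label counts x = (x_1, …, x_K), the posterior is obtained simply by adding the counts to the prior parameters,

\pi \mid x \sim \mathrm{Dir}(\alpha_1 + x_1, \ldots, \alpha_K + x_K)

e.g. with K = 2, prior Dir(1, 1), and counts (7, 3), the posterior is Dir(8, 4).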

Page 23

DIRICHLET DISTRIBUTION
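The figure for this slide is not reproduced in the transcript; the density being illustrated, defined over probability vectors π (π_k ≥ 0, Σ_k π_k = 1), is

p(\pi \mid \alpha) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}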

Page 24

CATEGORICAL

• The categorical distribution is a generalization, to K possible outcomes, of the Bernoulli distribution, which specifies the probability of a given outcome for a binary trial
– E.g., the probability of getting a head in a coin toss
– Cf. the BINOMIAL distribution, which specifies the probability of getting N heads

Page 25

A GRAPHICAL VIEW OF THE MODEL

Page 26

THE PROBABILISTIC MODEL OF A GIVEN LABEL
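The slide's formula is not in the transcript; assembling the distributions listed on Page 17, the joint model presumably factorizes as

p(\pi, \Theta, z, y) = p(\pi \mid \alpha) \prod_{j=1}^{J} \prod_{k=1}^{K} p(\Theta_{j,k} \mid \beta_k) \prod_{i=1}^{I} p(z_i \mid \pi) \prod_{(i,j)} p(y_{i,j} \mid \Theta_{j, z_i})

with the last product running over the observed annotations only.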

Page 27

AN EXAMPLE

Page 28

PROBABILISTIC INFERENCE

• Probabilistic inference techniques are used to INFER the parameters, and hence the probabilities of interest, from the data
– Often: Expectation Maximization (EM)
• The EM implementation in R used by Carpenter & Passonneau to estimate the parameters is available from
– https://github.com/bob-carpenter/anno

Page 29

APPLICATION TO WORD SENSE DISTRIBUTION (CARPENTER & PASSONNEAU, 2013, 2014)

• Carpenter and Passonneau used the Dawid and Skene model to compare trained manual annotators with turkers on the word sense annotation of the MASC corpus

Page 30

THE MASC CORPUS

• Manually Annotated SubCorpus (MASC)
– 500K word subset of the Open American National Corpus (OANC)
• Multiple genres: technical manuals, poetry, news, dialogue, etc.
• 16 types of annotation (not all manual)
– part of speech, phrases, word sense, named entities, ...
• 100 item word-sense corpus
– balanced by genre and part of speech (noun, verb, adjective)

Page 31

MASC WORDSENSE

• 100 words balanced between adjectives, nouns, and verbs
• 1000 sentences for each word
• Annotated using WordNet senses for these words
• ~1M tokens

Page 32

MASC Wordsense: annotation using trained annotators

• pre-training on 50 items
• independent labeling of 1000 items
• 100 items labeled by 3 or 4 annotators
• agreement on these 100 items reported
• only a single round of annotation; most items singly annotated

Page 33

Annotation using trained annotators

• College students from Vassar, Barnard, Columbia
• 2–3 years of work on the project
• General training plus per-word training
• Supervised by
– Becky Passonneau
– Nancy Ide (maintainer of MASC)
– Christiane Fellbaum (maintainer of WordNet)

Page 34

Annotation using crowdsourcing

• 45 randomly selected words, balanced across nouns, verbs, and adjectives, were reannotated using crowdsourcing
• 1000 instances per word
• 25+ annotators per instance
• the high number of annotators makes it possible to
– estimate difficulty
– reject independence of labels

Page 35

Differences from the trained situation

• Annotators not trained
• Not told to look at WordNet
• Each HIT:
– 10 sentences for the same word
– WordNet senses listed under the word

Page 36

METHODS

• Passonneau & Carpenter used their model to
– evaluate the prevalence of labels in different ways
– evaluate annotator response

Page 37

PREVALENCE ESTIMATION

Page 38

ASSESSMENT OF QUALITY

Page 39

ANNOTATOR RESPONSE

Page 40

AGREEMENT RATES

Page 41

OTHER MODELS

• Raykar et al, 2010
• Carpenter, 2008

Page 42

RAYKAR ET AL 2010

• Simultaneously ESTIMATES THE GROUND TRUTH from noisy labels, produces an ASSESSMENT OF THE ANNOTATORS, and LEARNS A CLASSIFIER
– based on logistic regression
• Bayesian (includes priors on the annotators)

Page 43

ANNOTATORS

• Annotator j is characterized by her/his
– SENSITIVITY: the ability to recognize positive cases
• α_j = P(y_j = 1 | y = 1)
– SPECIFICITY: the ability to recognize negative cases
• β_j = P(y_j = 0 | y = 0)

Page 44

RAYKAR ET AL

Raykar et al propose a version of the EM algorithm that can be used to estimate P(O|θ) as well as the sensitivity and specificity of each annotator (a compact sketch follows below):

P(O | θ) = ∏_{i=1}^{N} P(y_i^1, …, y_i^R | x_i, θ)

P(O | θ) = ∏_{i=1}^{N} [ a_i p_i + b_i (1 − p_i) ]

where p_i is the classifier's probability that item i is positive, and a_i and b_i collect the annotators' sensitivities and specificities for the labels observed on item i.

Carpenter developed a fully Bayesian version of the approach based on gradient descent.
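A compact sketch of the maximum-likelihood EM variant for the binary case (without the priors on annotators mentioned on Page 42), assuming numpy and scikit-learn; Y holds one binary label per annotator and item:

import numpy as np
from sklearn.linear_model import LogisticRegression

def raykar_em(X, Y, n_iter=30):
    """X: (N, D) features; Y: (N, R) binary labels from R annotators."""
    N, R = Y.shape
    mu = Y.mean(axis=1)                 # soft ground truth, initialized by vote
    clf = LogisticRegression()
    for _ in range(n_iter):
        # M-step: per-annotator sensitivity/specificity, then the classifier
        alpha = (mu[:, None] * Y).sum(axis=0) / mu.sum()
        beta = ((1 - mu)[:, None] * (1 - Y)).sum(axis=0) / (1 - mu).sum()
        # fit p_i = P(y_i = 1 | x_i) on soft labels via duplicated weighted rows
        clf.fit(np.vstack([X, X]),
                np.r_[np.ones(N), np.zeros(N)],
                sample_weight=np.r_[mu, 1 - mu])
        p = clf.predict_proba(X)[:, 1]
        # E-step: posterior that each item is truly positive (the a_i, b_i above)
        a = np.prod(alpha ** Y * (1 - alpha) ** (1 - Y), axis=1)
        b = np.prod(beta ** (1 - Y) * (1 - beta) ** Y, axis=1)
        mu = a * p / (a * p + b * (1 - p))
    return mu, alpha, beta, clf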

Page 45

CARPENTER

Page 46

DISAGREEMENT IN INTERPRETATION

Page 47

AMBIGUITY: REFERENT

15.12 M: we're gonna take the engine E3
15.13 : and shove it over to Corning
15.14 : hook [it] up to [the tanker car]
15.15 : _and_
15.16 : send it back to Elmira

(from the TRAINS-91 dialogues collected at the University of Rochester)

Page 48

AMBIGUITY: REFERENT

About 160 workers at a factory that made paper for the Kent filters were exposed to asbestos in the 1950s. Areas of the factory were particularly dusty where the crocidolite was used. Workers dumped large burlap sacks of the imported material into a huge bin, poured in cotton and acetate fibers and mechanically mixed the dry fibers in a process used to make filters. Workers described "clouds of blue dust" that hung over parts of the factory, even though exhaust fans ventilated the area.

Page 49

AMBIGUITY: EXPLETIVES

'I beg your pardon!' said the Mouse, frowning, but very politely: 'Did you speak?'

'Not I!' said the Lory hastily.

'I thought you did,' said the Mouse. '--I proceed. "Edwin and Morcar, the earls of Mercia and Northumbria, declared for him: and even Stigand, the patriotic archbishop of Canterbury, found it advisable--"'

'Found WHAT?' said the Duck.

'Found IT,' the Mouse replied rather crossly: 'of course you know what "it" means.'

Page 50

OTHER DATA: WORDSENSE DISAMBIGUATION (Passonneau et al 2010)

And our ideas of what constitutes a FAIR wage or a FAIR return on capital are historically contingent … {sense1, sense1, sense1, sense2, sense2, sense2}

… the federal government … is wrangling for its FAIR share of the dividend … {sense1, sense1, sense2, sense2, sense8, sense8}

Page 51

OTHER DATA: POS (Plank, Hovy & Søgaard 2014)

Noam goes OUT tonight {ADP, PRT}

Noam likes SOCIAL media {ADJ, NOUN}

Page 52

REFERENCES

• Passonneau, R. J. & Carpenter, B. (2014). The benefits of a model of annotation. Transactions of the Association for Computational Linguistics. To appear.

• Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., & Moy, L. (2010). Learning from crowds. Journal of Machine Learning Research, 11.