Comparing Bayesian Models of Annotation
Silviu Paun1, Bob Carpenter2, Jon Chamberlain3, Dirk Hovy4, Udo Kruschwitz3, Massimo Poesio1
1School of Electronic Engineering and Computer Science, Queen Mary University of London
2Department of Statistics, Columbia University
3School of Computer Science and Electronic Engineering, University of Essex
4Department of Marketing, Bocconi University
Introduction
• Crowdsourcing is increasingly used as an alternative to traditional expert annotation
• Crowdsourced annotations require aggregation methods
• Previous methods of analysis include majority voting aggregation and agreement statistics
Introduction
• Probabilistic models of annotation can solve many of the problems faced by previous practices
• A large number of models have been proposed
• The literature comparing models of annotation is limited
Introduction
• We compare six existing models of annotation with distinct prior and likelihood structures and a diverse set of effects
• The evaluation is done in both gold-dependent and gold-independent settings
• We use three NLP datasets that are standard testbeds for crowdsourcing
• We also include an additional dataset produced using a game with a purpose
Reviewed models
• A pooled model: all annotators share the same ability
• Ability parameterized in terms of a confusion matrix (the Multinomial model)
• Unpooled models: each annotator has their own response parameter
• Different annotator ability structures (a likelihood sketch follows this list)
• Confusion matrix (the Dawid and Skene model)
• Credibility and spamming preference (the MACE model)
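A minimal sketch of the likelihood behind these models, in generic notation rather than the paper's exact symbols (jj[i,n] denotes the annotator who produced the n-th label on item i):

  c_i \sim \mathrm{Categorical}(\pi)                                   % true class of item i, with prevalence \pi
  y_{i,n} \mid c_i \sim \mathrm{Categorical}(\beta_{jj[i,n],\, c_i})   % Dawid and Skene: row c_i of annotator jj[i,n]'s confusion matrix
  y_{i,n} \mid c_i \sim \mathrm{Categorical}(\beta_{c_i})              % Multinomial: a single confusion matrix shared by all annotators

The pooled Multinomial model is thus the special case of Dawid and Skene in which every annotator is tied to one shared confusion matrix; MACE replaces the full confusion matrix with a credibility parameter and a spamming preference.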
Reviewed models
• Partially-pooled models: assume both individual and hierarchical structure
• Partial pooling across annotators (the Hierarchical Dawid and Skene model)
• Partial pooling across item difficulties (the Item Difficulty model)
• Partial pooling across annotators and item difficulties (the Logistic Random Effects model)
The “Logistic Random Effects” model
[Plate diagram: item classes c_i drawn from prevalence π; annotations y_{i,n} depend on the true class c_i, per-annotator effects β_{j,k} (with hyperparameters ζ_k, Ω_k), and item difficulties θ_i (with hyperparameter χ_k)]
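Read as equations, the diagram corresponds roughly to the following generative story (a sketch based on one reading of the diagram; β_{j,k} and θ_i are vectors over the K candidate labels, and the exact priors are those given in the paper):

  c_i \sim \mathrm{Categorical}(\pi)                                        % true class of item i
  \beta_{j,k} \sim \mathrm{Normal}(\zeta_k, \Omega_k)                       % ability of annotator j on items of true class k
  \theta_i \sim \mathrm{Normal}(0, \chi)                                    % difficulty of item i
  y_{i,n} \mid c_i \sim \mathrm{Categorical}\big(\mathrm{softmax}(\beta_{jj[i,n],\, c_i} - \theta_i)\big)

Abilities and difficulties combine on the log-odds scale, which is what makes the model "logistic", and the hierarchical priors on β and θ provide the partial pooling.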
The data used in the evaluation
• Datasets produced using a standard crowdsourcing platform
• The Snow et al. (2008) datasets
• Used often in the crowdsourcing literature
• Data produced using a game with a purpose
• The Phrase Detectives (Chamberlain et al., 2016) corpus
• Less artificial, i.e., more variation
Dataset Statistics (min, median, max)
Comparison against a gold standard
• The models that assume some form of annotator structure got the best results
• Having a richer annotator structure can be more beneficial
• Ignoring the annotator structure generally leads to poor results
• Unless the data is produced by annotators with similar behavior
[Accuracy against the gold standard plotted for the RTE dataset and the PD dataset]
Predictive accuracy evaluation
• Ambiguity can affect the reliability of gold standard datasets
• Posterior predictions are a standard assessment method for Bayesian models
• We measure the predictive performance of each model using the log predictive density (lpd), in a Bayesian K-fold cross-validation setting (the quantity is sketched after this list)
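The lpd referred to above is the standard quantity from the Bayesian cross-validation literature (sketched here in generic notation): held-out annotations are scored under the posterior fitted on the remaining folds,

  \mathrm{lpd} = \sum_{i \in \text{held-out}} \log p(y_i \mid y_{\text{train}})
               \approx \sum_{i \in \text{held-out}} \log \frac{1}{S} \sum_{s=1}^{S} p\big(y_i \mid \theta^{(s)}\big)

where \theta^{(1)}, \dots, \theta^{(S)} are posterior draws obtained from the training folds; higher values indicate better predictive performance.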
Predictive accuracy evaluation
• Generally, the models that assume some form of annotator structure got the best results
• In particular, the partially pooled models are the most consistent
• The unpooled models are prone to overfitting
• Ignoring the annotator structure leads to poor predictive performance
• Except for the WSD dataset, where all annotators are highly proficient (above 95% accuracy)
Results
An analysis of different player types
• The PD corpus comes with a list of spammers and a list of good, established players
[Figures: estimates for a typical "spammer" and a typical "good player"]
Take away points
• Majority voting and agreement statistics lead to biased estimates
• Probabilistic models of annotation address these problems
• They model different effects (annotator accuracy and bias, item difficulty)
• Best architecture: partially pooled models with annotator structure
Thank you!
Technical notes
• Non-centred parameterizations
• Problem: in hierarchical models, the curvature of the posterior can make sampling difficult (especially with sparse data or large inter-group variances)
• Solution: separate the local parameters from their parents (sketched after this bullet)
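A minimal sketch of what separating the local parameters from their parents looks like, using hierarchical annotator effects with location ζ_k and scale Ω_k as an example (the paper's implementation may differ in details):

  \text{centred: } \beta_{j,k} \sim \mathrm{Normal}(\zeta_k, \Omega_k)
  \text{non-centred: } \beta^{\mathrm{raw}}_{j,k} \sim \mathrm{Normal}(0, 1), \qquad \beta_{j,k} = \zeta_k + \Omega_k \, \beta^{\mathrm{raw}}_{j,k}

The sampler explores the standard-normal β^raw, whose geometry no longer depends on ζ_k and Ω_k, and β is recovered deterministically.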
• Label-switching
• Problem: the likelihood is invariant under permutation of the class labels
• Occurs in mixture models; makes the models non-identifiable
• Solution: gold alignment (see the sketch after this list)
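One common way to implement this kind of gold alignment (a sketch, not the authors' code) is to pick the permutation of inferred class indices that maximises agreement with the gold labels on the items that have them, e.g. via the Hungarian algorithm:

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_to_gold(inferred, gold, num_classes):
    """Map inferred class indices onto gold class indices.

    inferred, gold: integer label arrays over the items that have gold labels.
    Returns mapping such that mapping[inferred_label] is the aligned label.
    """
    # agreement[a, b]: number of items labelled a by the model and b by the gold standard
    agreement = np.zeros((num_classes, num_classes))
    for a, b in zip(inferred, gold):
        agreement[a, b] += 1
    # The Hungarian algorithm minimises cost, so negate to maximise agreement
    rows, cols = linear_sum_assignment(-agreement)
    mapping = np.empty(num_classes, dtype=int)
    mapping[rows] = cols
    return mapping

# Example (hypothetical variable names): relabel the model's class assignments
# mapping = align_to_gold(model_labels[has_gold], gold_labels, num_classes=K)
# aligned_labels = mapping[model_labels]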