Comparing Bayesian Models of Annotation
Silviu Paun1, Bob Carpenter2, Jon Chamberlain3, Dirk Hovy4, Udo Kruschwitz3, Massimo Poesio1
1School of Electronic Engineering and Computer Science, Queen Mary University of London
2Department of Statistics, Columbia University
3School of Computer Science and Electronic Engineering, University of Essex
4Department of Marketing, Bocconi University
Introduction
• Crowdsourcing is increasingly used as an alternative to traditional expert annotation
• Crowdsourced annotations require aggregation methods
• Previous methods of analysis include majority voting aggregation and agreement statistics
Introduction
• Probabilistic models of annotation can solve many of the problems faced by previous practices
• A large number of models have been proposed
• The literature comparing models of annotation is limited
Introduction
• We compare six existing models of annotation with distinct prior and likelihood structures and a diverse set of effects
• The evaluation is done in both gold-dependent and gold-independent settings
• We use three NLP datasets that are standard testbeds for crowdsourcing
• We also include an additional dataset produced using a game with a purpose
Reviewed models
• A pooled model: all annotators share the same ability
• Ability parameterized in terms of a confusion matrix (the Multinomial model)
• Unpooled models: each annotator has their own response parameter
• Different annotator ability structures (a likelihood sketch follows this list)
• Confusion matrix (the Dawid and Skene model)
• Credibility and spamming preference (the MACE model)
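A minimal sketch of the likelihood behind these models, in generic notation rather than the paper's exact symbols (jj[i,n] denotes the annotator who produced the n-th label on item i):

  c_i \sim \mathrm{Categorical}(\pi)                                   % true class of item i, with prevalence \pi
  y_{i,n} \mid c_i \sim \mathrm{Categorical}(\beta_{jj[i,n],\, c_i})   % Dawid and Skene: row c_i of annotator jj[i,n]'s confusion matrix
  y_{i,n} \mid c_i \sim \mathrm{Categorical}(\beta_{c_i})              % Multinomial: a single confusion matrix shared by all annotators

The pooled Multinomial model is thus the special case of Dawid and Skene in which every annotator is tied to one shared confusion matrix; MACE replaces the full confusion matrix with a credibility parameter and a spamming preference.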
Reviewed models
• Partially-pooled models: assume both individual and hierarchical structure
• Partial pooling across annotators (the Hierarchical Dawid and Skene model)
• Partial pooling across item difficulties (the Item Difficulty model)
• Partial pooling across annotators and item difficulties (the Logistic Random Effects model)
The “Logistic Random Effects” model
[Plate diagram: item classes c_i drawn from prevalence π; annotations y_{i,n} depend on the true class c_i, per-annotator effects β_{j,k} (with hyperparameters ζ_k, Ω_k), and item difficulties θ_i (with hyperparameter χ_k)]
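Read as equations, the diagram corresponds roughly to the following generative story (a sketch based on one reading of the diagram; β_{j,k} and θ_i are vectors over the K candidate labels, and the exact priors are those given in the paper):

  c_i \sim \mathrm{Categorical}(\pi)                                        % true class of item i
  \beta_{j,k} \sim \mathrm{Normal}(\zeta_k, \Omega_k)                       % ability of annotator j on items of true class k
  \theta_i \sim \mathrm{Normal}(0, \chi)                                    % difficulty of item i
  y_{i,n} \mid c_i \sim \mathrm{Categorical}\big(\mathrm{softmax}(\beta_{jj[i,n],\, c_i} - \theta_i)\big)

Abilities and difficulties combine on the log-odds scale, which is what makes the model "logistic", and the hierarchical priors on β and θ provide the partial pooling.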
The data used in the evaluation
• Datasets produced using a standard crowdsourcing platform
• The Snow et al. (2008) datasets
• Used often in the crowdsourcing literature
• Data produced using a game with a purpose
• The Phrase Detectives (Chamberlain et al., 2016) corpus
• Less artificial, i.e., more variation
Dataset Statistics (min, median, max)
Comparison against a gold standard
• The models that assume some form of annotator structure got the best results
• Having a richer annotator structure can be more beneficial
• Ignoring the annotator structure generally leads to poor results
• Unless the data is produced by annotators with similar behavior
[Accuracy against the gold standard plotted for the RTE dataset and the PD dataset]
Predictive accuracy evaluation
• Ambiguity can affect the reliability of gold standard datasets
• Posterior predictions are a standard assessment method for Bayesian models
• We measure the predictive performance of each model using the log predictive density (lpd), in a Bayesian K-fold cross-validation setting (the quantity is sketched after this list)
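The lpd referred to above is the standard quantity from the Bayesian cross-validation literature (sketched here in generic notation): held-out annotations are scored under the posterior fitted on the remaining folds,

  \mathrm{lpd} = \sum_{i \in \text{held-out}} \log p(y_i \mid y_{\text{train}})
               \approx \sum_{i \in \text{held-out}} \log \frac{1}{S} \sum_{s=1}^{S} p\big(y_i \mid \theta^{(s)}\big)

where \theta^{(1)}, \dots, \theta^{(S)} are posterior draws obtained from the training folds; higher values indicate better predictive performance.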
Predictive accuracy evaluation
• Generally, the models that assume some form of annotator structure got the best results
• In particular, the partially pooled models are the most consistent
• The unpooled models are prone to overfitting
• Ignoring the annotator structure leads to poor predictive performance
• Except for the WSD dataset, where all annotators are highly proficient (above 95% accuracy)
Results
An analysis of different player types
• The PD corpus comes with a list of spammers and a list of good, established players
[Figures: estimates for a typical "spammer" and a typical "good player"]
Take away points
• Majority voting and agreement statistics lead to biased estimates
• Probabilistic models of annotation address these problems
• They model different effects (annotator accuracy and bias, item difficulty)
• Best architecture: partially pooled models with annotator structure
Thank you!
Technical notes
• Non-centred parameterizations
• Problem: in hierarchical models, the curvature of the posterior can make sampling difficult (especially with sparse data or large inter-group variances)
• Solution: separate the local parameters from their parents (sketched after this bullet)
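A minimal sketch of what separating the local parameters from their parents looks like, using hierarchical annotator effects with location ζ_k and scale Ω_k as an example (the paper's implementation may differ in details):

  \text{centred: } \beta_{j,k} \sim \mathrm{Normal}(\zeta_k, \Omega_k)
  \text{non-centred: } \beta^{\mathrm{raw}}_{j,k} \sim \mathrm{Normal}(0, 1), \qquad \beta_{j,k} = \zeta_k + \Omega_k \, \beta^{\mathrm{raw}}_{j,k}

The sampler explores the standard-normal β^raw, whose geometry no longer depends on ζ_k and Ω_k, and β is recovered deterministically.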
• Label-switching
• Problem: the likelihood is invariant under permutation of the class labels
• Occurs in mixture models; makes the models non-identifiable
• Solution: gold alignment (see the sketch after this list)
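One common way to implement this kind of gold alignment (a sketch, not the authors' code) is to pick the permutation of inferred class indices that maximises agreement with the gold labels on the items that have them, e.g. via the Hungarian algorithm:

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_to_gold(inferred, gold, num_classes):
    """Map inferred class indices onto gold class indices.

    inferred, gold: integer label arrays over the items that have gold labels.
    Returns mapping such that mapping[inferred_label] is the aligned label.
    """
    # agreement[a, b]: number of items labelled a by the model and b by the gold standard
    agreement = np.zeros((num_classes, num_classes))
    for a, b in zip(inferred, gold):
        agreement[a, b] += 1
    # The Hungarian algorithm minimises cost, so negate to maximise agreement
    rows, cols = linear_sum_assignment(-agreement)
    mapping = np.empty(num_classes, dtype=int)
    mapping[rows] = cols
    return mapping

# Example (hypothetical variable names): relabel the model's class assignments
# mapping = align_to_gold(model_labels[has_gold], gold_labels, num_classes=K)
# aligned_labels = mapping[model_labels]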