evaluation in nlp zdeněk Žabokrtský. intro the goal of nlp evaluation is to measure one or more...
TRANSCRIPT
![Page 1: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/1.jpg)
Evaluation in NLP
Zdeněk Žabokrtský
![Page 2: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/2.jpg)
Intro
The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system
Definition of proper evaluation criteria is one way to specify precisely an NLP problem
Need for evaluation metrics and evaluation data (language data resources).
![Page 3: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/3.jpg)
Automatic vs. manual evaluation
automatic– comparing the system's output with the gold standard
output – the cost of producing the gold standard data... – ... but then easily repeatable without additional cost
manual– for many NLP problems, the definition of a gold
standard can prove impossible (e.g., when inter-annotator agreement is insufficient)
– manual evaluation is performed by human judges, which are instructed to estimate the quality of a system, based on a number of criteria
![Page 4: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/4.jpg)
Intrinsic vs. extrinsic evaluation
Intrinsic evaluation– considers an isolated NLP system and
characterizes its performance mainly with respect to a gold standard result.
Extrinsic evaluation– considers the NLP system as a component in
a more complex setting.
![Page 5: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/5.jpg)
Lower and upper bounds
Naturally, the performance of our system is expected to be inside the interval given by
– lower bound - result of a baseline solution (less complex or even trivial system the performance of which is supposed to be easily surpassed)
– upper bound - inter-annotator agreement
![Page 6: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/6.jpg)
Evaluation metrics
the simplest case:
– if the number of task instances is known– if the system gives exactly one answer for
each instance– if there is exactly one possible clearly correct
answer for each instance– if all errors are equally wrong
But what if not?
%100instances all
instances predictedcorrectly Accuracy
![Page 7: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/7.jpg)
Precision and recall
But– we are in 2D now – new issue: precision-recall tradeoff
%100given answers
given answerscorrect Precision
%100possible answerscorrect
given answerscorrect Recall
![Page 8: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/8.jpg)
F-measure
one way how to get back to 1D weighted harmonic mean of P and R
usually evenly weighted
RP
PRF
2
2 1
RP
PRF
2
![Page 9: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/9.jpg)
Evaluation in classification tasks
confusion matrix
![Page 10: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/10.jpg)
Evaluationin phrase-structure parsing
nontrivial, because– number added nonterminals is not known in advance– it is not clear what should be treated as the (atomic)
task instance
GEIG metric– Grammar Evaluation Interest Group; used in Parseval– counts the proportion of bracketings which group the
same sequences of words in both trees
LA metric– leaf-ancestor metric - similarity in sequences of node
labels along the paths from terminal nodes to tree root.
![Page 11: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/11.jpg)
Evaluationin dependency parsing
straighforward delimitatation of problem instance -- no nonterminal nodes are added
unlabeled accuracy– proportion of correctly attached nodes (nodes with
correctly predicted parent)
labeled accuracy– proportion of nodes which are correctly attached and
whose labels (dependency relation) are correct too
![Page 12: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/12.jpg)
Evaluation in automatic speech recognition
not only the recognized words might be incorrect, but the number of recognized words might be different too
WER - Word Error Rate
%100anscriptcorrect tr in wordstotal
onssubstituti deletions insertionsWER
![Page 13: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/13.jpg)
Evaluationin machine translation
obviously highly non-trivial– there are always more translations possible– there are always more criteria to be judged
(translation fidelity, grammatical/pragmatic/stylistic correctness...?)
– the essence of translation -- transfering the same meaning from one natural language to annother -- cannot be evaluated by the contemporary machines at all !!!
current approaches– either to use human judges– or to use reference (human-made) translations and
string-wise comparison metrics which are hoped to correlate with human judgement: BLEU, NIST, METEOR ...
![Page 14: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/14.jpg)
BLEU zkratka?
N - maximal considered n-gram length (usually 4) pn - precision on n-gram using (a set of) reference
translation(s) wn - positive weight (typically 1/N) BP - brevity penalty (to compensate easier n-gram
precision on shorter candidate sentences):
r - length of reference translation c - length of candidate translation
N
nnn pwBPBLEU
1logexp
c
rBP 1exp,1min
![Page 15: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/15.jpg)
Interannotator agreement IAA - measure to tell how good human experts
perform when given a specific task (to measure the reliability of manual annotations)
e.g. F-measure on data from two annotators (one of them virtually treated as gold standard, symmetric if F1)
But: nonzero value is obtained even if annotators' decisions are uncorrelated
example– two annotators making classifications into two classes– 1st annotator: 80% A, 20% B– 2nd annotator 85% A, 15% B– probability of agreement by chance: 0.8*0.85 + 0.2*0.15
= 71%
desired measure: 1 if they agree in all decisions, 0 if their agreement is equal to agreement by chance
![Page 16: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/16.jpg)
Cohen's Kappa
takes into account the agreement occurring by chance
Pa - relative observed agreement between annotators
Pe - probability of agreement by chance
but kappa -- as a means for quantifying actual level of agreement -- is still a source of much controversy
e
ea
P
PP
1
![Page 17: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/17.jpg)
Evaluation rounding Number of significant digits is linked to experiment setting and
reflects its result uncertainty.
Writing more digits in an answer than justified by the number of digits in the data is bad. Do not say that error rate of your system is 42.8571% if it has made 3 errors in 7 task instances (superfluous precision).
Basic rules for rounding:– Multiplication/division - the number of significant digits in an
answer should equal the least number of significant digits in any one of the numbers being multiplied/divided.
– Addition/subtraction - the number of decimal places (not significant digits) in the answer should be the same as the least number of decimal places in any of the numbers being added or subtracted.
But: the number of significant digits in a value provides only a very rough indication of its precision -- better to use confidence interval (e.g. 3.28±.05) at certain probability level (typically at 95%).
![Page 18: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/18.jpg)
Towardsmore robust evaluation
K-fold cross validation (usually K=10):
– 1) partition the data into K roughly equally sized
subsamples
– 2) perform cyclically K iterations
• use K-1 subsamples for training
• use 1 subsample for testing
– 3) average the iterations' results
more reliable results, especially if you have only
small or in some sense non-uniform data
![Page 19: Evaluation in NLP Zdeněk Žabokrtský. Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system Definition of proper](https://reader035.vdocuments.us/reader035/viewer/2022072011/56649e3b5503460f94b2d1f2/html5/thumbnails/19.jpg)
"Shared Tasks" in NLP contests in implementing systems for a specified task some of them quite popular in the NLP community (e.g. CoNLL) conditioned by existence of training and evaluation data and of
evaluation metrics Examples:
– Message Understanding Conferences (MUCs)• 1980-1990's, information retrieval, named entity recognition
– Parseval• 1991, phrase-structure parsing
– Senseval, Semeval• word sense disambiguation
– WMT Shared Task• ACL Workshop in machine translation, 2006-2008
– CoNLLL Shared Task• Conference on Computational Natural Language Learning• named entity resolution (2003)• semantic role labeling (2004,2005)• multilingual dependency parsing (2006, 2007)• Joint Parsing of Syntactic and Semantic Dependencies - 2008