Translation Quality Assessment:
Evaluation and Estimation
Lucia Specia
University of Sheffield
[email protected]
EXPERT Winter School, 12 November 2013
Overview
“Machine Translation evaluation is better understood than Machine Translation”
(Carbonell and Wilks, 1991)
Outline
1 Translation quality
2 Manual metrics
3 Task-based metrics
4 Reference-based metrics
5 Quality estimation
6 Conclusions
Why is evaluation important?
Compare MT systems
Measure progress of MT systems over time
Diagnosis of MT systems
Assess (and pay) human translators
Quality assurance
Tuning of SMT systems
Decision on fitness-for-purpose
...
Why is evaluation hard?
What does quality mean?
Fluent? Adequate? Easy to post-edit?
Quality for whom/what?
End-user: gisting (Google Translate), internal communications, or publication (dissemination)
MT system: tuning or diagnosis
Post-editor: draft translations (light vs heavy post-editing)
Other applications, e.g. CLIR
Overview
Ref: Do not buy this product, it’s their craziest invention!
MT: Do buy this product, it’s their craziest invention!
Severe if end-user does not speak source language
Trivial to post-edit by translators
Ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes.
MT: Six-hour battery, 30 minutes to full charge last.
Ok for gisting - meaning preserved
Very costly for post-editing if style is to be preserved
Overview
How do we measure quality?
Manual metrics:
Error counts, ranking, acceptability, 1-N judgements on fluency/adequacy
Task-based human metrics: productivity tests (HTER, PE time, keystrokes), user satisfaction, reading comprehension
Automatic metrics:
Based on human references: BLEU, METEOR, TER, ...
Reference-less: quality estimation
Judgements in an n-point scale
Adequacy using 5-point scale (NIST-like)
5 All meaning expressed in the source fragment appears in the translation fragment.
4 Most of the source fragment meaning is expressed in the translation fragment.
3 Much of the source fragment meaning is expressed in the translation fragment.
2 Little of the source fragment meaning is expressed in the translation fragment.
1 None of the meaning expressed in the source fragment is expressed in the translation fragment.
Judgements in an n-point scale
Fluency using 5-point scale (NIST-like)
5 Native language fluency. No grammar errors, good word choice and syntactic structure in the translation fragment.
4 Near native language fluency. Few terminology or grammar errors which don’t impact the overall understanding of the meaning.
3 Not very fluent. About half of the translation contains errors.
2 Little fluency. Wrong word choice, poor grammar and syntactic structure.
1 No fluency. Absolutely ungrammatical and for the most part doesn’t make any sense.
Judgements in an n-point scale
Issues:
Subjective judgements
Hard to reach significant agreement
Is it reliable at all? Can we use multiple annotators?
Are fluency and adequacy really separable?
Ref: Absolutely ungrammatical and for the most part doesn’t make any sense.
MT: Absolutely sense doesn’t ungrammatical for the and most make any part.
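One way to address the reliability concern above is to measure inter-annotator agreement before trusting the judgements. A minimal sketch (plain Python, hypothetical 1-5 adequacy scores) computing observed agreement and Cohen's kappa for two annotators:

```python
from collections import Counter

def cohens_kappa(scores_a, scores_b, labels=range(1, 6)):
    """Cohen's kappa for two annotators judging the same items on a 1-5 scale."""
    assert len(scores_a) == len(scores_b)
    n = len(scores_a)
    # Observed agreement: proportion of items given identical scores
    p_o = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 1-5 adequacy judgements for 10 translations
ann1 = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
ann2 = [5, 3, 4, 2, 2, 5, 2, 4, 3, 3]
print(round(cohens_kappa(ann1, ann2), 3))
```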
Ranking
WMT-13 Appraise tool: rank translations best-worst (w. ties)
Ranking
Issues:
Subjective judgements: what does “best” mean?
Hard to judge for long sentences
Ref: The majority of existing work focuses on predicting some form of post-editing effort to help professional translators.
MT1: Few of the existing work focuses on predicting some form of post-editing effort to help professional translators.
MT2: The majority of existing work focuses on predicting some form of post-editing effort to help machine translation.
Rankings only serve comparison purposes - the best system might not be good enough
Absolute evaluation can do both
Error counts
More fine-grained
Aimed at diagnosis of MT systems, quality control of human translation.
E.g.: Multidimensional Quality Metrics (MQM)
Machine and human translation quality
Takes quality of source text into account
Actual metric is based on a specification
MQM
Issues selected based on a given specification (dimensions):
Language/locale
Subject field/domain
Text Type
Audience
Purpose
Register
Style
Content correspondence
Output modality, ...
MQM
Issue types (core):
Altogether: 120 categories
MQM
Issue types: http://www.qt21.eu/launchpad/content/high-level-structure-0
Combining issue types: $TQ = 100 - Acc_P - (Flu_{PT} - Flu_{PS}) - (Ver_{PT} - Ver_{PS})$
translate5: open source graphical (Web) interface for inlineerror annotation: www.translate5.net
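A minimal sketch of the score combination above, assuming the accuracy, fluency and verity penalties have already been derived from weighted issue counts under a given specification (variable names and values are illustrative):

```python
def mqm_tq(acc_pt, flu_pt, flu_ps, ver_pt, ver_ps):
    """TQ = 100 - Acc_P - (Flu_PT - Flu_PS) - (Ver_PT - Ver_PS).

    Fluency and verity penalties observed on the source (PS) are subtracted
    from the target-side penalties (PT), so the translation is not blamed
    for issues already present in the source text.
    """
    return 100 - acc_pt - (flu_pt - flu_ps) - (ver_pt - ver_ps)

# Hypothetical penalty values (e.g. weighted issue counts per 100 words)
print(mqm_tq(acc_pt=4.0, flu_pt=6.5, flu_ps=1.5, ver_pt=1.0, ver_ps=0.5))  # 90.5
```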
Error counts
Issues:
Time consuming
Requires training, esp. to distinguish between fine-grained error types
Different errors are more relevant for different specifications: need to select and weight them accordingly
Post-editing
Productivity analysis: measure translation quality within a task. E.g. Autodesk - productivity test through post-editing
2-day translation and post-editing, 37 participants
In-house Moses (Autodesk data: software)
Time spent on each segment
Post-editing
PET: Records time, keystrokes, edit distance
Post-editing - PET
How often post-editing (PE) a translation tool output is faster than translating from scratch (HT):

System    Faster than HT
Google    94%
Moses     86.8%
Systran   81.2%
Trados    72.4%

Comparing the time to translate from scratch with the time to PE MT, in seconds:

Annotator  HT (s)  PE (s)  HT/PE  PE/HT
Average    31.89   18.82   1.73   0.59
Deviation   9.99    6.79   0.26   0.09
User satisfaction
Solving a problem: e.g. Intel measuring user satisfaction with unedited MT
Translation is good if customer can solve problem
MT for Customer Support websites
Overall customer satisfaction: 75% for English→Chinese
95% reduction in cost
Project cycle from 10 days to 1 day
From 300 to 60,000 words translated/hour
Customers in China using MT texts were more satisfied with support than natives using original texts (68%)!
Reading comprehension
Defense language proficiency test (Jones et al., 2005):
Reading comprehension
MT quality: function of
1 Text passage comprehension, as measured by answer accuracy, and
2 Time taken to complete a test item (read a passage + answer its questions)
Reading comprehension
Compared to Human Translation (HT):
Task-based metrics
Issues:
Final goal needs to be very clear
Can be more cost/time consuming
Final task has to have a meaningful metric
Other elements may affect the final quality measurement (e.g. Chinese vs. American users in the Intel study)
Automatic metrics
Compare output of an MT system to one or more reference (usually human) translations: how close is the MT output to the reference translation?
Numerous metrics: BLEU, NIST, etc.
Advantages:
Fast and cheap, minimal human labour, no need for bilingual speakers
Once test set is created, can be reused many times
Can be used on an on-going basis during system development to test changes
Automatic metrics
Disadvantages:
Very few metrics look at variable ways of saying the same thing (word-level): stems, synonyms, paraphrases
Individual sentence scores are not very reliable; aggregate scores on a large test set are required
Very few of these metrics penalise different mismatches differently
Reference translations are only a subset of the possible good translations
String matching
BLEU: BiLingual Evaluation Understudy
Most widely used metric, both for MT system evaluation/comparison and SMT tuning
Matching of n-grams between MT and Ref: rewards same words in equal order
$\#_{clip}(g)$: count of n-grams $g$ in a hypothesis sentence $h$ that also occur in its reference, clipped by the number of times $g$ appears in the reference sentence for $h$; $\#(g')$ = number of n-grams in the hypotheses
n-gram precision $p_n$ for a set of MT translations $H$:
$p_n = \dfrac{\sum_{h \in H} \sum_{g \in \mathrm{ngrams}(h)} \#_{clip}(g)}{\sum_{h \in H} \sum_{g' \in \mathrm{ngrams}(h)} \#(g')}$
BLEU
Combine the (mean of the log) 1..n n-gram precisions: $\frac{1}{N}\sum_{n=1}^{N} \log p_n$
Bias towards translations with fewer words (denominator)
Brevity penalty to penalise MT sentences that are shorter than the reference
Compares the overall number of words $w_h$ of the entire hypothesis set with the reference length $w_r$:
$BP = \begin{cases} 1 & \text{if } w_h \ge w_r \\ e^{(1 - w_r/w_h)} & \text{otherwise} \end{cases}$
$\mathrm{BLEU} = BP \cdot \exp\Big(\frac{1}{N}\sum_{n=1}^{N} \log p_n\Big)$
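A minimal corpus-level sketch of the computation above (uniform weights, whitespace tokenisation, one reference per segment; real implementations also handle multiple references and tokenisation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: clipped n-gram precisions p_1..p_n plus brevity penalty."""
    matches = [0] * max_n   # clipped n-gram matches, per order n
    totals = [0] * max_n    # hypothesis n-gram counts, per order n
    w_h = w_r = 0           # hypothesis / reference word counts
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        w_h, w_r = w_h + len(h), w_r + len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = Counter(ngrams(h, n)), Counter(ngrams(r, n))
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:   # any zero precision -> log undefined -> BLEU = 0
        return 0.0
    bp = 1.0 if w_h >= w_r else math.exp(1 - w_r / w_h)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = ["the Iraqi weapons are to be handed over to the army within two weeks"]
mt = ["in two weeks Iraq's weapons will give army"]
print(bleu(mt, ref))  # 0.0: no 3-gram or 4-gram matches at sentence level
```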
BLEU
Scale: 0-1, but highly dependent on the test set
Rewards fluency by matching high n-grams (up to 4)
Adequacy rewarded by unigrams and brevity penalty – poor model of recall
Synonyms and paraphrases only handled if they are in any of the reference translations
All tokens are equally weighted: missing out on a content word = missing out on a determiner
Better for evaluating changes in the same system than comparing different MT architectures
BLEU
Not good at sentence-level, unless smoothing is applied:
Ref: the Iraqi weapons are to be handed over to the army within two weeks
MT: in two weeks Iraq’s weapons will give army
1-gram precision: 4/8
2-gram precision: 1/7
3-gram precision: 0/6
4-gram precision: 0/5
BLEU = 0
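One common remedy is to smooth the higher-order n-gram counts so that a single zero precision does not zero out the whole score. A minimal sketch using simple add-one smoothing (one of several smoothing variants) on the example above:

```python
import math
from collections import Counter

def sentence_bleu_add1(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing on n-gram precisions for n > 1."""
    h, r = hyp.split(), ref.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        h_ng = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
        r_ng = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        match = sum(min(c, r_ng[g]) for g, c in h_ng.items())
        total = sum(h_ng.values())
        if n > 1:                      # smooth higher-order counts only
            match, total = match + 1, total + 1
        log_p += math.log(match / total) / max_n
    bp = 1.0 if len(h) >= len(r) else math.exp(1 - len(r) / len(h))
    return bp * math.exp(log_p)

ref = "the Iraqi weapons are to be handed over to the army within two weeks"
mt = "in two weeks Iraq's weapons will give army"
print(round(sentence_bleu_add1(mt, ref), 3))  # small but non-zero score
```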
BLEU
Importance of clipping and brevity penalty
Ref1: the Iraqi weapons are to be handed over to the army within two weeks
Ref2: the Iraqi weapons will be surrendered to the army in two weeks
MT: the the the the
Count for “the” should be clipped at 2: the max count of the word in any reference. Unigram score = 2/4 (not 4/4)
MT: Iraqi weapons will be
1-gram precision: 4/4
2-gram precision: 3/3
3-gram precision: 2/2
4-gram precision: 1/1
Precision ($p_n$) = 1
Precision score penalised by the brevity penalty because $w_h < w_r$
Edit distance
WER: Word Error Rate:
Levenshtein edit distance
Minimum proportion of insertions, deletions, and substitutions needed to transform an MT sentence into the reference sentence
Heavily penalises reorderings: correct translation in the wrong location = deletion + insertion
$\mathrm{WER} = \dfrac{S + D + I}{N}$
PER: Position-independent word Error Rate:
Does not penalise reorderings: output and reference sentences are treated as unordered sets
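A minimal WER sketch using token-level Levenshtein distance (standard dynamic programming), normalised by the reference length N:

```python
def wer(hyp, ref):
    """Word Error Rate: token-level Levenshtein distance / reference length."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = min edits turning the first i hypothesis tokens into the first j reference tokens
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = d[i - 1][j - 1] + (h[i - 1] != r[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # substitution, deletion, insertion
    return d[len(h)][len(r)] / len(r)

ref = "the Iraqi weapons are to be handed over to the army within two weeks"
mt = "in two weeks Iraq's weapons will give army"
print(round(wer(mt, ref), 2))
```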
Edit distance: TER
TER: Translation Error Rate
Adds shift operation
REF: SAUDI ARABIA denied this week information published in the AMERICAN new york times
HYP: [this week] the saudis denied information published in the ***** new york times
1 Shift, 2 Substitutions, 1 Deletion → 4 Edits: TER = 4/13 = 0.31
Human-targeted TER (HTER)
TER between MT and its post-edited version
Alignment-based
METEOR:
Unigram Precision and Recall
Align MT output with reference. Take best scoring pair for multiple refs.
Matching considers word inflection variations (stems), synonyms/paraphrases
Fluency addressed via a direct penalty: fragmentation of the matching
METEOR score = F-mean score discounted for fragmentation = F-mean * (1 - DF)
METEOR
Example:
Ref: the Iraqi weapons are to be handed over to the army within two weeks
MT: in two weeks Iraq’s weapons will give army
Matching:
Ref: Iraqi weapons army two weeks
MT: two weeks Iraq’s weapons army
P = 5/8 = 0.625
R = 5/14 = 0.357
F-mean = 10*P*R/(9P+R) = 0.3731
Fragmentation: 3 frags of 5 words = (3)/(5) = 0.6
Discounting factor: DF = 0.5 * (0.6**3) = 0.108
METEOR: F-mean * (1 - DF) = 0.373 * 0.892 = 0.333
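A sketch reproducing this arithmetic given the matched words and chunk count (the actual METEOR aligner, with stemming and paraphrase matching, is not implemented here):

```python
def meteor_like(matches, hyp_len, ref_len, chunks, alpha=9, beta=3, gamma=0.5):
    """F-mean weighted towards recall, discounted by a fragmentation penalty."""
    p, r = matches / hyp_len, matches / ref_len
    f_mean = (alpha + 1) * p * r / (alpha * p + r)   # 10PR / (9P + R)
    frag = chunks / matches                          # fewer chunks = more fluent matching
    penalty = gamma * frag ** beta                   # 0.5 * frag^3
    return f_mean * (1 - penalty)

# Example from the slide: 5 matched words in 3 chunks, |MT| = 8, |Ref| = 14
print(round(meteor_like(matches=5, hyp_len=8, ref_len=14, chunks=3), 3))  # ~0.333
```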
Others
WMT shared task on metrics:
TerrorCat
DepRef
MEANT and TINE
TESLA
LEPOR
ROSE
AMBER
Many other linguistically motivated metrics where matching is not done at the word-level (only)
...
Overview
Quality estimation (QE): metrics that provide an estimate of the quality of unseen translations
No access to reference translations
Quality defined by the data
Quality = Can we publish it as is?
Quality = Can a reader get the gist?
Quality = Is it worth post-editing it?
Quality = How much effort to fix it?
Framework
[Diagram: training stage: examples of source texts and translations (X) with quality scores (Y) → feature extraction → machine learning → QE model]
[Diagram: prediction stage: source text and its MT translation → feature extraction → QE model → predicted quality score]
Framework
Main components to build a QE system:
1 Definition of quality: what to predict
2 (Human) labelled data (for quality)
3 Features
4 Machine learning algorithm
Definition of quality
Predict 1-N absolute scores for adequacy/fluency
Predict 1-N absolute scores for post-editing effort
Predict average post-editing time per word
Predict relative rankings
Predict relative rankings for same source
Predict percentage of edits needed for sentence
Predict word-level edits and their types
Predict BLEU, etc. scores for document
Datasets
SHEF (several): http://staffwww.dcs.shef.ac.uk/people/L.Specia/resources.html
LIG (10K, fr-en): http://www-clips.imag.fr/geod/User/marion.potet/index.php?page=download
LIMSI (14K, fr-en, en-fr, 2 post-editors): http://web.limsi.fr/Individu/wisniews/recherche/index.html
Features
[Diagram: feature types: complexity indicators (source text), fluency indicators (translation), confidence indicators (MT system), adequacy indicators (source and translation)]
QuEst
Goal: framework to explore features for QE
Feature extractors for 150+ features of all types: Java
Machine learning: wrappers for a number of algorithms in the scikit-learn toolkit, grid search, feature selection
Open source: http://www.quest.dcs.shef.ac.uk/
State of the art in QE
WMT12-13 shared tasks
Sentence- and word-level estimation of PE effort
Datasets and language pairs:

Quality                                 Year      Languages
1-5 subjective scores                   WMT12     en-es
Ranking all sentences best-worst        WMT12/13  en-es
HTER scores                             WMT13     en-es
Post-editing time                       WMT13     en-es
Word-level edits: change/keep           WMT13     en-es
Word-level edits: keep/delete/replace   WMT13     en-es
Ranking 5 MTs per source                WMT13     en-es; de-en

Evaluation metric:
$\mathrm{MAE} = \dfrac{\sum_{i=1}^{N} |H(s_i) - V(s_i)|}{N}$
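The evaluation metric itself is simple to compute; a small sketch with hypothetical predicted and gold effort scores:

```python
def mae(predicted, gold):
    """Mean absolute error between predicted and true quality scores."""
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)

print(mae([3.1, 4.0, 2.2, 4.8], [3.0, 4.5, 1.0, 5.0]))  # 0.5
```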
Baseline system
Features:
number of tokens in the source and target sentences
average source token length
average number of occurrences of words in the target
number of punctuation marks in source and target sentences
LM probability of source and target sentences
average number of translations per source word
% of source 1-grams, 2-grams and 3-grams in frequency quartiles 1 and 4
% of seen source unigrams
SVM regression with RBF kernel with the parameters γ, ε and C optimised using grid search and 5-fold cross validation on the training set
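A minimal sketch of this setup with scikit-learn, using only a few of the shallow features listed above (token counts, average source token length, punctuation counts) and a small illustrative grid; the full baseline also needs LM probabilities, translation tables and frequency statistics, and the sentence pairs and labels below are hypothetical toy data:

```python
import string
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def shallow_features(src, tgt):
    """A few of the baseline features: lengths, avg source token length, punctuation."""
    s, t = src.split(), tgt.split()
    def punct(tokens):
        return sum(tok in string.punctuation for tok in tokens)
    return [len(s), len(t),
            sum(len(tok) for tok in s) / len(s),
            punct(s), punct(t)]

# X: feature vectors for (source, translation) pairs; y: gold quality labels
# (e.g. 1-5 effort scores or HTER) -- toy data for illustration only
pairs = [("the house is small .", "la casa es pequeña ."),
         ("he said it would rain today .", "dijo que iba a llover hoy ."),
         ("do not buy this product !", "no compre este producto !"),
         ("the battery lasts six hours .", "la batería dura seis .")]
X = np.array([shallow_features(s, t) for s, t in pairs])
y = np.array([4.5, 4.0, 4.8, 2.5])

grid = {"C": [1, 10], "gamma": [0.01, 0.1], "epsilon": [0.1, 0.2]}
model = GridSearchCV(SVR(kernel="rbf"), grid, cv=2)  # 5-fold CV in the real setup
model.fit(X, y)
print(model.best_params_, model.predict(X))
```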
Results - scoring sub-task (WMT12)
System ID                   MAE   RMSE
• SDLLW M5PbestDeltaAvg     0.61  0.75
UU best                     0.64  0.79
SDLLW SVM                   0.64  0.78
UU bltk                     0.64  0.79
Loria SVMlinear             0.68  0.82
UEdin                       0.68  0.82
TCD M5P-resources-only*     0.68  0.82
Baseline bb17 SVR           0.69  0.82
Loria SVMrbf                0.69  0.83
SJTU                        0.69  0.83
WLV-SHEF FS                 0.69  0.85
PRHLT-UPV                   0.70  0.85
WLV-SHEF BL                 0.72  0.86
DCU-SYMC unconstrained      0.75  0.97
DFKI grcfs-mars             0.82  0.98
DFKI cfs-plsreg             0.82  0.99
UPC 1                       0.84  1.01
DCU-SYMC constrained        0.86  1.12
UPC 2                       0.87  1.04
TCD M5P-all                 2.09  2.32
Results - scoring sub-task (WMT13)
System ID                MAE    RMSE
• SHEF FS                12.42  15.74
SHEF FS-AL               13.02  17.03
CNGL SVRPLS              13.26  16.82
LIMSI                    13.32  17.22
DCU-SYMC combine         13.45  16.64
DCU-SYMC alltypes        13.51  17.14
CMU noB                  13.84  17.46
CNGL SVR                 13.85  17.28
FBK-UEdin extra          14.38  17.68
FBK-UEdin rand-svr       14.50  17.73
LORIA inctrain           14.79  18.34
Baseline bb17 SVR        14.81  18.22
TCD-CNGL open            14.81  19.00
LORIA inctraincont       14.83  18.17
TCD-CNGL restricted      15.20  19.59
CMU full                 15.25  18.97
UMAC                     16.97  21.94
Open issues
Agreement between annotators
Absolute value judgements: difficult to achieve consistency even in highly controlled settings
WMT12: 30% of initial dataset discarded
Remaining annotations had to be scaled
Open issues
Annotation costs: active learning to select subset of instances to be annotated (Beck et al., ACL 2013)
Open issues
Curse of dimensionality: feature selection to identify relevant info for dataset (Shah et al., MT Summit 2013)
[Figure: prediction error for feature sets BL, AF, BL+PR, AF+PR and FS across datasets (WMT12, EAMT-11 en-es, EAMT-11 fr-en, EAMT-09 s1–s4, GALE-11 s1–s2)]
Common feature set identified, but nuanced subsets for specific datasets
Open issues
How to use estimated PE effort scores? Do users prefer detailed estimates (sub-sentence level), an overall estimate for the complete sentence, or not seeing bad sentences at all?
Too much information vs hard-to-interpret scores
IBM’s Goodness metric
MATECAT project investigating it
Conclusions
(Machine) Translation evaluation & estimation: still an open problem
Different metrics for: different purposes/users, different needs, different notions of quality
Quality estimation: learning of these different notions, but requires labelled data
Solution:
Think of what quality means in your scenario
Measure significance
Measure agreement if manual metrics
Use various metrics
Invent your own metric!