02 qe lucia specia
TRANSCRIPT
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 1/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Translation Quality Assessment:
Evaluation and Estimation
Lucia Specia
University of [email protected]
9 September 2013
Translation Quality Assessment: Evaluation and Estimation 1 / 33
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 2/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Overview
“MT Evaluation is better understood than MT” (Carbonelland Wilks, 1991)
Translation Quality Assessment: Evaluation and Estimation 2 / 33
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 3/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Outline
1 Translation Quality
2 QTLaunchPad Project
3 Quality Estimation
4 State of the art in QE
5 Open issues
6 Conclusions
Translation Quality Assessment: Evaluation and Estimation 3 / 33
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 4/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Outline
1 Translation Quality
2 QTLaunchPad Project
3 Quality Estimation
4 State of the art in QE
5 Open issues
6 Conclusions
Translation Quality Assessment: Evaluation and Estimation 4 / 33
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 5/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Overview
What does quality mean?
Fluent?Adequate?Easy to post-edit?
Translation Quality Assessment: Evaluation and Estimation 5 / 33
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 6/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Overview
What does quality mean?
Fluent?Adequate?Easy to post-edit?
Quality for whom?
End-userMT-system (tuning)Post-editorOther applications (e.g. CLIR)
Translation Quality Assessment: Evaluation and Estimation 5 / 33
T l i Q li QTL hP d P j Q li E i i S f h i QE O i C l i
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 7/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Overview
What does quality mean?
Fluent?Adequate?Easy to post-edit?
Quality for whom?
End-userMT-system (tuning)Post-editorOther applications (e.g. CLIR)
Quality for what?
Internal communicationsDissemination (publishing)Gisting (Google Translate)Draft translations (light vs heavy post-editing)MT system improvement (diagnosis)
Translation Quality Assessment: Evaluation and Estimation 5 / 33
T l ti Q lit QTL hP d P j t Q lit E ti ti St t f th t i QE O i C l i
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 8/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Overview
ref Do not buy this product, it’s their craziest invention!
sys Do buy this product, it’s their craziest invention!
Translation Quality Assessment: Evaluation and Estimation 6 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 9/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Overview
ref Do not buy this product, it’s their craziest invention!
sys Do buy this product, it’s their craziest invention!
Severe if end-user does not speak source language
Trivial to post-edit by translators
Translation Quality Assessment: Evaluation and Estimation 6 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 10/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Overview
ref Do not buy this product, it’s their craziest invention!
sys Do buy this product, it’s their craziest invention!
Severe if end-user does not speak source language
Trivial to post-edit by translators
ref The battery lasts 6 hours and it can be fully recharged
in 30 minutes.
sys Six-hours battery, 30 minutes to full charge last.
Translation Quality Assessment: Evaluation and Estimation 6 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 11/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Overview
ref Do not buy this product, it’s their craziest invention!
sys Do buy this product, it’s their craziest invention!
Severe if end-user does not speak source language
Trivial to post-edit by translators
ref The battery lasts 6 hours and it can be fully recharged
in 30 minutes.
sys Six-hours battery, 30 minutes to full charge last.
Ok for gisting - meaning preserved
Very costly for post-editing if style is to be preserved
Translation Quality Assessment: Evaluation and Estimation 6 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 12/62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Overview
How do we measure quality?
Human metrics: error counts (which?), ranking,acceptability, 1-N fluency/adequacy
Automatic metrics based on human references: (BLEU,METEOR, TER, etc.
Semi-automatic metrics based on post-editions: HTER,PE time, eye-tracking, etc.
Translation Quality Assessment: Evaluation and Estimation 7 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 13/62
Q y Q j Q y Q p
Overview
How do we measure quality?
Human metrics: error counts (which?), ranking,acceptability, 1-N fluency/adequacy
Automatic metrics based on human references: (BLEU,METEOR, TER, etc.
Semi-automatic metrics based on post-editions: HTER,PE time, eye-tracking, etc.
Automatic metrics without references: qualityestimation
Translation Quality Assessment: Evaluation and Estimation 7 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 14/62
y j y p
Outline
1 Translation Quality
2 QTLaunchPad Project
3 Quality Estimation
4 State of the art in QE
5 Open issues
6 Conclusions
Translation Quality Assessment: Evaluation and Estimation 8 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 15/62
QTLaunchPad project
http://www.qt21.eu/launchpad/
Multidimensional Quality Metrics (MQM) based on aspecification
Machine and human translation quality
Manual and (semi-)automatic assessmentTakes quality of source text into account
MT system improvement, gisting, dissemination, etc.
Translation Quality Assessment: Evaluation and Estimation 9 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 16/62
Multidimensional Quality Metrics (MQM)
Issues selected based on a given specification (dimensions):Language/locale
Subject field/domain
Terminology
Text TypeAudience
Purpose
Register
Style
Content correspondence
Output modality, ...
Translation Quality Assessment: Evaluation and Estimation 10 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 17/62
Multidimensional Quality Metrics (MQM)
Issue types (core):
Translation Quality Assessment: Evaluation and Estimation 11 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 18/62
Multidimensional Quality Metrics (MQM)
Issue types: http://www.qt21.eu/launchpad/content/high-level-structure-0
Combining issue types:TQ = 100 − AccP − (FluP T − FluP S ) − (VerP T − VerP S )
Translation Quality Assessment: Evaluation and Estimation 12 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 19/62
Outline
1 Translation Quality
2 QTLaunchPad Project
3 Quality Estimation
4 State of the art in QE
5 Open issues
6 Conclusions
Translation Quality Assessment: Evaluation and Estimation 13 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 20/62
Overview
Quality estimation (QE): metrics that provide an
estimate on the quality of unseen translations
Translation Quality Assessment: Evaluation and Estimation 14 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 21/62
Overview
Quality estimation (QE): metrics that provide an
estimate on the quality of unseen translations
Quality defined by the data
Translation Quality Assessment: Evaluation and Estimation 14 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 22/62
Overview
Quality estimation (QE): metrics that provide an
estimate on the quality of unseen translations
Quality defined by the data
Quality = Can we publish it as is?
Translation Quality Assessment: Evaluation and Estimation 14 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 23/62
Overview
Quality estimation (QE): metrics that provide an
estimate on the quality of unseen translations
Quality defined by the data
Quality = Can we publish it as is?
Quality = Can a reader get the gist?
Translation Quality Assessment: Evaluation and Estimation 14 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 24/62
Overview
Quality estimation (QE): metrics that provide an
estimate on the quality of unseen translations
Quality defined by the data
Quality = Can we publish it as is?
Quality = Can a reader get the gist?
Quality = Is it worth post-editing it?
Translation Quality Assessment: Evaluation and Estimation 14 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
O
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 25/62
Overview
Quality estimation (QE): metrics that provide an
estimate on the quality of unseen translations
Quality defined by the data
Quality = Can we publish it as is?
Quality = Can a reader get the gist?
Quality = Is it worth post-editing it?
Quality = How much effort to fix it?
Translation Quality Assessment: Evaluation and Estimation 14 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
O i
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 26/62
Overview
Quality estimation (QE): metrics that provide an
estimate on the quality of unseen translations
Quality defined by the data
Quality = Can we publish it as is?
Quality = Can a reader get the gist?
Quality = Is it worth post-editing it?
Quality = How much effort to fix it?
Quality = What’s this translation’s MQM score?
Translation Quality Assessment: Evaluation and Estimation 14 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
F k
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 27/62
Framework
Translation Quality Assessment: Evaluation and Estimation 15 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
F k
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 28/62
Framework
Translation Quality Assessment: Evaluation and Estimation 15 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
F k
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 29/62
Framework
Main components to build a QE system:
1 Definition of quality: what to predict
2 (Human) labelled data (for quality)3 Features
4 Machine learning algorithm
Translation Quality Assessment: Evaluation and Estimation 16 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
D fi iti f lit
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 30/62
Definition of quality
Predict 1-N absolute scores for adequacy/fluency
Predict 1-N absolute scores for post-editing effort
Predict average post-editing time per word
Predict relative rankings
Predict relative rankings for same source
Predict percentage of edits needed for sentence
Predict word-level edits and its types
Predict BLEU, etc. scores for document
Translation Quality Assessment: Evaluation and Estimation 17 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
D t s ts
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 31/62
Datasets
SHEF (several): http://staffwww.dcs.shef.ac.uk/people/L.Specia/resources.html
LIG (10K, fr-en): http://www-clips.imag.fr/geod/User/marion.potet/index.php?page=download
LMSI (14K, fr-en, en-fr, 2 post-editors):http://web.limsi.fr/Individu/wisniews/
recherche/index.html
Translation Quality Assessment: Evaluation and Estimation 18 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Features
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 32/62
Features
Translation Quality Assessment: Evaluation and Estimation 19 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
QuEst
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 33/62
QuEst
Goal: framework to explore features for QE
Feature extractors for 150+ features of all types: Java
Machine learning: Gaussian Processes & scikit-learntoolkit (Python), with wrappers for a number of algorithms, grid search, feature selection
Open source:http://www.quest.dcs.shef.ac.uk/
Translation Quality Assessment: Evaluation and Estimation 20 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Outline
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 34/62
Outline
1 Translation Quality
2 QTLaunchPad Project
3 Quality Estimation
4 State of the art in QE
5 Open issues
6 Conclusions
Translation Quality Assessment: Evaluation and Estimation 21 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Shared Task
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 35/62
Shared Task
WMT12-13 – with Radu Soricut & Christian Buck
Translation Quality Assessment: Evaluation and Estimation 22 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Shared Task
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 36/62
Shared Task
WMT12-13 – with Radu Soricut & Christian Buck
Sentence- and word-level estimation of PE effort
Translation Quality Assessment: Evaluation and Estimation 22 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Shared Task
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 37/62
Shared Task
WMT12-13 – with Radu Soricut & Christian Buck
Sentence- and word-level estimation of PE effortDatasets and language pairs:
Quality Year Languages
1-5 subjective scores WMT12 en-es
Ranking all sentences best-worst WMT12/13 en-esHTER scores WMT13 en-es
Post-editing time WMT13 en-es
Word-level edits: change/keep WMT13 en-es
Word-level edits: keep/delete/replace WMT13 en-es
Ranking 5 MTs per source WMT13 en-es; de-en
Translation Quality Assessment: Evaluation and Estimation 22 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Shared Task
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 38/62
Shared Task
WMT12-13 – with Radu Soricut & Christian Buck
Sentence- and word-level estimation of PE effortDatasets and language pairs:
Quality Year Languages
1-5 subjective scores WMT12 en-es
Ranking all sentences best-worst WMT12/13 en-esHTER scores WMT13 en-es
Post-editing time WMT13 en-es
Word-level edits: change/keep WMT13 en-es
Word-level edits: keep/delete/replace WMT13 en-es
Ranking 5 MTs per source WMT13 en-es; de-en
Evaluation metric:
MAE =
N
i =1 |H (s i ) − V (s i )|
N Translation Quality Assessment: Evaluation and Estimation 22 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Baseline system
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 39/62
Baseline system
Features:
number of tokens in the source and target sentences
average source token length
average number of occurrences of words in the target
number of punctuation marks in source and target sentences
LM probability of source and target sentences
average number of translations per source word
% of source 1-grams, 2-grams and 3-grams in frequencyquartiles 1 and 4
% of seen source unigrams
Translation Quality Assessment: Evaluation and Estimation 23 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Baseline system
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 40/62
Baseline system
Features:
number of tokens in the source and target sentences
average source token length
average number of occurrences of words in the target
number of punctuation marks in source and target sentences
LM probability of source and target sentences
average number of translations per source word
% of source 1-grams, 2-grams and 3-grams in frequencyquartiles 1 and 4
% of seen source unigrams
SVM regression with RBF kernel with the parameters γ , and C
optimised using a grid-search and 5-fold cross validation on thetraining set
Translation Quality Assessment: Evaluation and Estimation 23 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Results - scoring sub-task (WMT12)
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 41/62
Results scoring sub task (WMT12)
System ID MAE RMSE• SDLLW M5PbestDeltaAvg 0.61 0.75
UU best 0.64 0.79SDLLW SVM 0.64 0.78
UU bltk 0.64 0.79Loria SVMlinear 0.68 0.82
UEdin 0.68 0.82TCD M5P-resources-only* 0.68 0.82
Baseline bb17 SVR 0.69 0.82Loria SVMrbf 0.69 0.83SJTU 0.69 0.83
WLV-SHEF FS 0.69 0.85PRHLT-UPV 0.70 0.85
WLV-SHEF BL 0.72 0.86DCU-SYMC unconstrained 0.75 0.97
DFKI grcfs-mars 0.82 0.98DFKI cfs-plsreg 0.82 0.99
UPC 1 0.84 1.01DCU-SYMC constrained 0.86 1.12
UPC 2 0.87 1.04TCD M5P-all 2.09 2.32
Translation Quality Assessment: Evaluation and Estimation 24 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Results - scoring sub-task (WMT13)
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 42/62
Results scoring sub task (WMT13)
System ID MAE RMSE• SHEF FS 12.42 15.74
SHEF FS-AL 13.02 17.03CNGL SVRPLS 13.26 16.82
LIMSI 13.32 17.22DCU-SYMC combine 13.45 16.64DCU-SYMC alltypes 13.51 17.14
CMU noB 13.84 17.46CNGL SVR 13.85 17.28
FBK-UEdin extra 14.38 17.68FBK-UEdin rand-svr 14.50 17.73
LORIA inctrain 14.79 18.34Baseline bb17 SVR 14.81 18.22
TCD-CNGL open 14.81 19.00
LORIA inctraincont 14.83 18.17TCD-CNGL restricted 15.20 19.59
CMU full 15.25 18.97UMAC 16.97 21.94
Translation Quality Assessment: Evaluation and Estimation 25 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Outline
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 43/62
1 Translation Quality
2 QTLaunchPad Project
3 Quality Estimation
4 State of the art in QE
5 Open issues
6 Conclusions
Translation Quality Assessment: Evaluation and Estimation 26 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Agreement between annotators
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 44/62
g
Absolute value judgements: difficult to achieveconsistency even in highly controlled settings
30% of initial dataset discardedRemaining annotations had to be scaled
Translation Quality Assessment: Evaluation and Estimation 27 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Agreement between annotators
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 45/62
g
Absolute value judgements: difficult to achieveconsistency even in highly controlled settings
30% of initial dataset discardedRemaining annotations had to be scaled
More objective absolute scores
Post-editing time, HTER, editsAlso subject to huge variance (WPTP 2013, Wisniewskiet. al, MT-Summit 2013)Multi-task learning to address this variance (Cohn andSpecia, ACL 2013)
Translation Quality Assessment: Evaluation and Estimation 27 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Agreement between annotators
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 46/62
g
Absolute value judgements: difficult to achieveconsistency even in highly controlled settings
30% of initial dataset discardedRemaining annotations had to be scaled
More objective absolute scores
Post-editing time, HTER, editsAlso subject to huge variance (WPTP 2013, Wisniewskiet. al, MT-Summit 2013)Multi-task learning to address this variance (Cohn andSpecia, ACL 2013)
Relative scores
Different task altogetherWMT13: better results than reference-based metrics
Translation Quality Assessment: Evaluation and Estimation 27 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Annotation costs
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 47/62
Active learning to select subset of instances to be annotated(Beck et al., ACL 2013)
Translation Quality Assessment: Evaluation and Estimation 28 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Curse of dimensionality
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 48/62
Feature selection to identify relevant info for dataset (Shah
et al., MT Summit 2013) ! " # $ % &
! " ( $ ) *
! " ( ( ! &
! " ) + & +
! " ( # & (
! " ) + + ,
! " + ) , &
! " ) ( # %
! " ) ) (
! " # * & *
! " ( * & ,
! " ( % , %
! " ) % # )
! " ( * ( & !
" ) ( + *
! " + ) * $
! " ) + , ,
! " ) ( ! &
! " # % ,
!
" ( ( ,
! " ( &
$ +
! " ) & % +
!
" ( ( , + !
" ) & & +
! " + ( ! &
! " ) + ! &
! " ) ( ! &
! " # + % (
! " ( ( * &
! " ( & # ,
! " ) & ! ,
! " ( # ! , !
" ) + ! ,
! " + ( ! ,
! " ) % ( ,
! " ) % ( ,
! " # & + &
! " (
+ %
! " ( & &
! " ) ! % )
! " ( ( &
! " ) ! #
! " + + *
! " ) % &
! " ) & , (
!"+
!"+)
!"(
!"()
!")
!"))
!"#
!"#)
!"*
- . / &
%
0 1 . / 2 & & 3
4 5 2 4 6 7
0 1 . /
2 & & 3 8 9 2
4 5 7
0 1 .
/ 2 ! , 2
6 &
0 1 .
/ 2 ! , 2
6 %
0 1 .
/ 2 ! , 2
6 +
0 1 .
/ 2 ! , 2
6 (
: 1 ; 0
& & 2 6 &
:
1 ; 0 &
& 2 6 %
<; 1= <;>?@ 1=>?@ =A
Translation Quality Assessment: Evaluation and Estimation 29 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Curse of dimensionality
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 49/62
Feature selection to identify relevant info for dataset (Shah
et al., MT Summit 2013) ! " # $ % &
! " ( $ ) *
! " ( ( ! &
! " ) + & +
! " ( # & (
! " ) + + ,
! " + ) , &
! " ) ( # %
! " ) ) (
! " # * & *
! " ( * & ,
! " ( % , %
! " ) % # )
! " ( * ( & !
" ) ( + *
! " + ) * $
! " ) + , ,
! " ) ( ! &
! " # % ,
!
" ( ( ,
! " ( &
$ +
! " ) & % +
!
" ( ( , + !
" ) & & +
! " + ( ! &
! " ) + ! &
! " ) ( ! &
! " # + % (
! " ( ( * &
! " ( & # ,
! " ) & ! ,
! " ( # ! , !
" ) + ! ,
! " + ( ! ,
! " ) % ( ,
! " ) % ( ,
! " # & + &
! " (
+ %
! " ( & &
! " ) ! % )
! " ( ( &
! " ) ! #
! " + + *
! " ) % &
! " ) & , (
!"+
!"+)
!"(
!"()
!")
!"))
!"#
!"#)
!"*
- . / &
%
0 1 . / 2 & & 3
4 5 2 4 6 7
0 1 . /
2 & & 3 8 9 2
4 5 7
0 1 .
/ 2 ! , 2
6 &
0 1 .
/ 2 ! , 2
6 %
0 1 .
/ 2 ! , 2
6 +
0 1 .
/ 2 ! , 2
6 (
: 1 ; 0
& & 2 6 &
: 1 ; 0
& & 2 6 %
<; 1= <;>?@ 1=>?@ =A
Common feature set identified, but nuanced subsets forspecific datasets
Translation Quality Assessment: Evaluation and Estimation 29 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
How to use estimated PE effort scores?
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 50/62
Do users prefer detailed estimates (sub-sentence level) or anoverall estimate for the complete sentence or not seeing
bad sentences at all?
Too much information vs hard-to-interpret scores
Translation Quality Assessment: Evaluation and Estimation 30 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
How to use estimated PE effort scores?
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 51/62
Do users prefer detailed estimates
(sub-sentence level) or anoverall estimate for the complete sentence or not seeing
bad sentences at all?
Too much information vs hard-to-interpret scores
IBM’s Goodness
metric
Translation Quality Assessment: Evaluation and Estimation 30 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
How to use estimated PE effort scores?
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 52/62
Do users prefer detailed estimates
(sub-sentence level) or anoverall estimate for the complete sentence or not seeing
bad sentences at all?
Too much information vs hard-to-interpret scores
IBM’s Goodness
metric
MATECAT project investigating it
Translation Quality Assessment: Evaluation and Estimation 30 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Feature engineering
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 53/62
Two families of features missing in current work:
Can we benefit from contextual, document-wide
information?
Can we predict human translation quality?
Translation Quality Assessment: Evaluation and Estimation 31 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Feature engineering
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 54/62
Two families of features missing in current work:
Can we benefit from contextual, document-wide
information?
Can we predict human translation quality?
Don’t panic, you can help!
Two (sub-) QuEst projects at MTM-2013 :-)
Translation Quality Assessment: Evaluation and Estimation 31 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Outline
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 55/62
1 Translation Quality
2 QTLaunchPad Project
3 Quality Estimation
4 State of the art in QE
5 Open issues
6 Conclusions
Translation Quality Assessment: Evaluation and Estimation 32 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 56/62
(Machine) Translation evaluation is still an open problem
Translation Quality Assessment: Evaluation and Estimation 33 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 57/62
(Machine) Translation evaluation is still an open problemDifferent purposes/users, different needs, different notionsof quality
Translation Quality Assessment: Evaluation and Estimation 33 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 58/62
(Machine) Translation evaluation is still an open problemDifferent purposes/users, different needs, different notionsof quality
Quality estimation: learning of these different notions
Translation Quality Assessment: Evaluation and Estimation 33 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 59/62
(Machine) Translation evaluation is still an open problemDifferent purposes/users, different needs, different notionsof quality
Quality estimation: learning of these different notions
Estimates have been used in real applicationsRanking translations: filter out bad quality translationsSelecting translations from multiple MT systems
Translation Quality Assessment: Evaluation and Estimation 33 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 60/62
(Machine) Translation evaluation is still an open problemDifferent purposes/users, different needs, different notionsof quality
Quality estimation: learning of these different notions
Estimates have been used in real applicationsRanking translations: filter out bad quality translationsSelecting translations from multiple MT systems
Commercial interest
Translation Quality Assessment: Evaluation and Estimation 33 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 61/62
(Machine) Translation evaluation is still an open problemDifferent purposes/users, different needs, different notionsof quality
Quality estimation: learning of these different notions
Estimates have been used in real applicationsRanking translations: filter out bad quality translationsSelecting translations from multiple MT systems
Commercial interest
SDL LW: TrustScoreMultilizer: MT-Qualifier
Translation Quality Assessment: Evaluation and Estimation 33 / 33
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions
Conclusions
8/12/2019 02 Qe Lucia Specia
http://slidepdf.com/reader/full/02-qe-lucia-specia 62/62
(Machine) Translation evaluation is still an open problemDifferent purposes/users, different needs, different notionsof quality
Quality estimation: learning of these different notions
Estimates have been used in real applicationsRanking translations: filter out bad quality translationsSelecting translations from multiple MT systems
Commercial interest
SDL LW: TrustScoreMultilizer: MT-Qualifier
Interesting open issues: join the QuEst projects!
Translation Quality Assessment: Evaluation and Estimation 33 / 33