Translation Quality Assessment:
Evaluation and Estimation
Lucia Specia
University of Sheffield
[email protected]
EXPERT Winter School, 12 November 2013
Overview
“Machine Translation evaluation is better understood than Machine Translation”
(Carbonell and Wilks, 1991)
Outline
1 Translation quality
2 Manual metrics
3 Task-based metrics
4 Reference-based metrics
5 Quality estimation
6 Conclusions
Why is evaluation important?
Compare MT systems
Measure progress of MT systems over time
Diagnosis of MT systems
Assess (and pay) human translators
Quality assurance
Tuning of SMT systems
Decision on fitness-for-purpose
...
Why is evaluation hard?
What does quality mean?
Fluent? Adequate? Easy to post-edit?
Quality for whom/what?
End-user: gisting (Google Translate), internal communications, or publication (dissemination)
MT system: tuning or diagnosis
Post-editor: draft translations (light vs heavy post-editing)
Other applications, e.g. CLIR
Overview
Ref: Do not buy this product, it’s their craziest invention!
MT: Do buy this product, it’s their craziest invention!
Severe if end-user does not speak source language
Trivial to post-edit by translators
Ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes.
MT: Six-hour battery, 30 minutes to full charge last.
Ok for gisting - meaning preserved
Very costly for post-editing if style is to be preserved
Overview
How do we measure quality?
Manual metrics:
Error counts, ranking, acceptability, 1-N judgements on fluency/adequacy
Task-based human metrics: productivity tests (HTER, PE time, keystrokes), user satisfaction, reading comprehension
Automatic metrics:
Based on human references: BLEU, METEOR, TER, ...
Reference-less: quality estimation
Judgements in an n-point scale
Adequacy using 5-point scale (NIST-like)
5 All meaning expressed in the source fragment appears in the translation fragment.
4 Most of the source fragment meaning is expressed in the translation fragment.
3 Much of the source fragment meaning is expressed in the translation fragment.
2 Little of the source fragment meaning is expressed in the translation fragment.
1 None of the meaning expressed in the source fragment is expressed in the translation fragment.
Judgements in an n-point scale
Fluency using 5-point scale (NIST-like)
5 Native language fluency. No grammar errors, good word choice and syntactic structure in the translation fragment.
4 Near native language fluency. Few terminology or grammar errors which don’t impact the overall understanding of the meaning.
3 Not very fluent. About half of the translation contains errors.
2 Little fluency. Wrong word choice, poor grammar and syntactic structure.
1 No fluency. Absolutely ungrammatical and for the most part doesn’t make any sense.
Judgements in an n-point scale
Issues:
Subjective judgements
Hard to reach significant agreement
Is it reliable at all? Can we use multiple annotators?
Are fluency and adequacy really separable?
Ref: Absolutely ungrammatical and for the most part doesn’t make any sense.
MT: Absolutely sense doesn’t ungrammatical for the and most make any part.
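One way to address the reliability concern above is to measure inter-annotator agreement before trusting the judgements. A minimal sketch (plain Python, hypothetical 1-5 adequacy scores) computing observed agreement and Cohen's kappa for two annotators:

```python
from collections import Counter

def cohens_kappa(scores_a, scores_b, labels=range(1, 6)):
    """Cohen's kappa for two annotators judging the same items on a 1-5 scale."""
    assert len(scores_a) == len(scores_b)
    n = len(scores_a)
    # Observed agreement: proportion of items given identical scores
    p_o = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 1-5 adequacy judgements for 10 translations
ann1 = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
ann2 = [5, 3, 4, 2, 2, 5, 2, 4, 3, 3]
print(round(cohens_kappa(ann1, ann2), 3))
```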
Ranking
WMT-13 Appraise tool: rank translations best-worst (w. ties)
Ranking
Issues:
Subjective judgements: what does “best” mean?
Hard to judge for long sentences
Ref: The majority of existing work focuses on predicting some form of post-editing effort to help professional translators.
MT1: Few of the existing work focuses on predicting some form of post-editing effort to help professional translators.
MT2: The majority of existing work focuses on predicting some form of post-editing effort to help machine translation.
Rankings only serve comparison purposes - the best system might not be good enough
Absolute evaluation can do both
Error counts
More fine-grained
Aimed at diagnosis of MT systems, quality control of human translation.
E.g.: Multidimensional Quality Metrics (MQM)
Machine and human translation quality
Takes quality of source text into account
Actual metric is based on a specification
MQM
Issues selected based on a given specification (dimensions):
Language/locale
Subject field/domain
Text Type
Audience
Purpose
Register
Style
Content correspondence
Output modality, ...
MQM
Issue types (core):
Altogether: 120 categories
MQM
Issue types: http://www.qt21.eu/launchpad/content/high-level-structure-0
Combining issue types: $TQ = 100 - Acc_P - (Flu_{PT} - Flu_{PS}) - (Ver_{PT} - Ver_{PS})$
translate5: open source graphical (Web) interface for inlineerror annotation: www.translate5.net
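A minimal sketch of the score combination above, assuming the accuracy, fluency and verity penalties have already been derived from weighted issue counts under a given specification (variable names and values are illustrative):

```python
def mqm_tq(acc_pt, flu_pt, flu_ps, ver_pt, ver_ps):
    """TQ = 100 - Acc_P - (Flu_PT - Flu_PS) - (Ver_PT - Ver_PS).

    Fluency and verity penalties observed on the source (PS) are subtracted
    from the target-side penalties (PT), so the translation is not blamed
    for issues already present in the source text.
    """
    return 100 - acc_pt - (flu_pt - flu_ps) - (ver_pt - ver_ps)

# Hypothetical penalty values (e.g. weighted issue counts per 100 words)
print(mqm_tq(acc_pt=4.0, flu_pt=6.5, flu_ps=1.5, ver_pt=1.0, ver_ps=0.5))  # 90.5
```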
Error counts
Issues:
Time consuming
Requires training, esp. to distinguish between fine-grained error types
Different errors are more relevant for different specifications: need to select and weight them accordingly
Post-editing
Productivity analysis: measure translation quality within a task. E.g. Autodesk - productivity test through post-editing
2-day translation and post-editing, 37 participants
In-house Moses (Autodesk data: software)
Time spent on each segment
Post-editing
PET: Records time, keystrokes, edit distance
Post-editing - PET
How often post-editing (PE) a translation tool output is faster than translating from scratch (HT):

System    Faster than HT
Google    94%
Moses     86.8%
Systran   81.2%
Trados    72.4%

Comparing the time to translate from scratch with the time to PE MT, in seconds:

Annotator  HT (s)  PE (s)  HT/PE  PE/HT
Average    31.89   18.82   1.73   0.59
Deviation   9.99    6.79   0.26   0.09
User satisfaction
Solving a problem: e.g. Intel measuring user satisfaction with unedited MT
Translation is good if customer can solve problem
MT for Customer Support websites
Overall customer satisfaction: 75% for English→Chinese
95% reduction in cost
Project cycle from 10 days to 1 day
From 300 to 60,000 words translated/hour
Customers in China using MT texts were more satisfied with support than natives using original texts (68%)!
Reading comprehension
Defense language proficiency test (Jones et al., 2005):
Reading comprehension
MT quality: function of
1 Text passage comprehension, as measured by answer accuracy, and
2 Time taken to complete a test item (read a passage + answer its questions)
Reading comprehension
Compared to Human Translation (HT):
Task-based metrics
Issues:
Final goal needs to be very clear
Can be more cost/time consuming
Final task has to have a meaningful metric
Other elements may affect the final quality measurement (e.g. Chinese vs. American users in the Intel study)
Automatic metrics
Compare output of an MT system to one or more reference (usually human) translations: how close is the MT output to the reference translation?
Numerous metrics: BLEU, NIST, etc.
Advantages:
Fast and cheap, minimal human labour, no need for bilingual speakers
Once test set is created, can be reused many times
Can be used on an on-going basis during system development to test changes
Automatic metrics
Disadvantages:
Very few metrics look at variable ways of saying the same thing (word-level): stems, synonyms, paraphrases
Individual sentence scores are not very reliable; aggregate scores on a large test set are required
Very few of these metrics penalise different mismatches differently
Reference translations are only a subset of the possible good translations
String matching
BLEU: BiLingual Evaluation Understudy
Most widely used metric, both for MT system evaluation/comparison and SMT tuning
Matching of n-grams between MT and Ref: rewards same words in equal order
$\#_{clip}(g)$: count of n-grams $g$ in a hypothesis sentence $h$ that also occur in its reference, clipped by the number of times $g$ appears in the reference sentence for $h$; $\#(g')$ = number of n-grams in the hypotheses
n-gram precision $p_n$ for a set of MT translations $H$:
$p_n = \dfrac{\sum_{h \in H} \sum_{g \in \mathrm{ngrams}(h)} \#_{clip}(g)}{\sum_{h \in H} \sum_{g' \in \mathrm{ngrams}(h)} \#(g')}$
BLEU
Combine the (mean of the log) 1..n n-gram precisions: $\frac{1}{N}\sum_{n=1}^{N} \log p_n$
Bias towards translations with fewer words (denominator)
Brevity penalty to penalise MT sentences that are shorter than the reference
Compares the overall number of words $w_h$ of the entire hypothesis set with the reference length $w_r$:
$BP = \begin{cases} 1 & \text{if } w_h \ge w_r \\ e^{(1 - w_r/w_h)} & \text{otherwise} \end{cases}$
$\mathrm{BLEU} = BP \cdot \exp\Big(\frac{1}{N}\sum_{n=1}^{N} \log p_n\Big)$
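A minimal corpus-level sketch of the computation above (uniform weights, whitespace tokenisation, one reference per segment; real implementations also handle multiple references and tokenisation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: clipped n-gram precisions p_1..p_n plus brevity penalty."""
    matches = [0] * max_n   # clipped n-gram matches, per order n
    totals = [0] * max_n    # hypothesis n-gram counts, per order n
    w_h = w_r = 0           # hypothesis / reference word counts
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        w_h, w_r = w_h + len(h), w_r + len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = Counter(ngrams(h, n)), Counter(ngrams(r, n))
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:   # any zero precision -> log undefined -> BLEU = 0
        return 0.0
    bp = 1.0 if w_h >= w_r else math.exp(1 - w_r / w_h)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = ["the Iraqi weapons are to be handed over to the army within two weeks"]
mt = ["in two weeks Iraq's weapons will give army"]
print(bleu(mt, ref))  # 0.0: no 3-gram or 4-gram matches at sentence level
```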
BLEU
Scale: 0-1, but highly dependent on the test set
Rewards fluency by matching high n-grams (up to 4)
Adequacy rewarded by unigrams and brevity penalty – poor model of recall
Synonyms and paraphrases only handled if they are in any of the reference translations
All tokens are equally weighted: missing out on a content word = missing out on a determiner
Better for evaluating changes in the same system than comparing different MT architectures
BLEU
Not good at sentence-level, unless smoothing is applied:
Ref: the Iraqi weapons are to be handed over to the army within two weeks
MT: in two weeks Iraq’s weapons will give army
1-gram precision: 4/8
2-gram precision: 1/7
3-gram precision: 0/6
4-gram precision: 0/5
BLEU = 0
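One common remedy is to smooth the higher-order n-gram counts so that a single zero precision does not zero out the whole score. A minimal sketch using simple add-one smoothing (one of several smoothing variants) on the example above:

```python
import math
from collections import Counter

def sentence_bleu_add1(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing on n-gram precisions for n > 1."""
    h, r = hyp.split(), ref.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        h_ng = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
        r_ng = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        match = sum(min(c, r_ng[g]) for g, c in h_ng.items())
        total = sum(h_ng.values())
        if n > 1:                      # smooth higher-order counts only
            match, total = match + 1, total + 1
        log_p += math.log(match / total) / max_n
    bp = 1.0 if len(h) >= len(r) else math.exp(1 - len(r) / len(h))
    return bp * math.exp(log_p)

ref = "the Iraqi weapons are to be handed over to the army within two weeks"
mt = "in two weeks Iraq's weapons will give army"
print(round(sentence_bleu_add1(mt, ref), 3))  # small but non-zero score
```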
BLEU
Importance of clipping and brevity penalty
Ref1: the Iraqi weapons are to be handed over to the army within two weeks
Ref2: the Iraqi weapons will be surrendered to the army in two weeks
MT: the the the the
Count for “the” should be clipped at 2: the max count of the word in any reference. Unigram score = 2/4 (not 4/4)
MT: Iraqi weapons will be
1-gram precision: 4/4
2-gram precision: 3/3
3-gram precision: 2/2
4-gram precision: 1/1
Precision ($p_n$) = 1
Precision score penalised by the brevity penalty because $w_h < w_r$
Edit distance
WER: Word Error Rate:
Levenshtein edit distance
Minimum proportion of insertions, deletions, and substitutions needed to transform an MT sentence into the reference sentence
Heavily penalises reorderings: correct translation in the wrong location = deletion + insertion
$\mathrm{WER} = \dfrac{S + D + I}{N}$
PER: Position-independent word Error Rate:
Does not penalise reorderings: output and reference sentences are treated as unordered sets
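A minimal WER sketch using token-level Levenshtein distance (standard dynamic programming), normalised by the reference length N:

```python
def wer(hyp, ref):
    """Word Error Rate: token-level Levenshtein distance / reference length."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = min edits turning the first i hypothesis tokens into the first j reference tokens
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = d[i - 1][j - 1] + (h[i - 1] != r[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # substitution, deletion, insertion
    return d[len(h)][len(r)] / len(r)

ref = "the Iraqi weapons are to be handed over to the army within two weeks"
mt = "in two weeks Iraq's weapons will give army"
print(round(wer(mt, ref), 2))
```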
Edit distance: TER
TER: Translation Error Rate
Adds shift operation
REF: SAUDI ARABIA denied this week information published in the AMERICAN new york times
HYP: [this week] the saudis denied information published in the ***** new york times
1 Shift, 2 Substitutions, 1 Deletion → 4 Edits: TER = 4/13 = 0.31
Human-targeted TER (HTER)
TER between MT and its post-edited version
Alignment-based
METEOR:
Unigram Precision and Recall
Align MT output with reference. Take best scoring pair for multiple refs.
Matching considers word inflection variations (stems), synonyms/paraphrases
Fluency addressed via a direct penalty: fragmentation of the matching
METEOR score = F-mean score discounted for fragmentation = F-mean * (1 - DF)
METEOR
Example:
Ref: the Iraqi weapons are to be handed over to the army within two weeks
MT: in two weeks Iraq’s weapons will give army
Matching:
Ref: Iraqi weapons army two weeks
MT: two weeks Iraq’s weapons army
P = 5/8 = 0.625
R = 5/14 = 0.357
F-mean = 10*P*R/(9P+R) = 0.3731
Fragmentation: 3 frags of 5 words = (3)/(5) = 0.6
Discounting factor: DF = 0.5 * (0.6**3) = 0.108
METEOR: F-mean * (1 - DF) = 0.373 * 0.892 = 0.333
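A sketch reproducing this arithmetic given the matched words and chunk count (the actual METEOR aligner, with stemming and paraphrase matching, is not implemented here):

```python
def meteor_like(matches, hyp_len, ref_len, chunks, alpha=9, beta=3, gamma=0.5):
    """F-mean weighted towards recall, discounted by a fragmentation penalty."""
    p, r = matches / hyp_len, matches / ref_len
    f_mean = (alpha + 1) * p * r / (alpha * p + r)   # 10PR / (9P + R)
    frag = chunks / matches                          # fewer chunks = more fluent matching
    penalty = gamma * frag ** beta                   # 0.5 * frag^3
    return f_mean * (1 - penalty)

# Example from the slide: 5 matched words in 3 chunks, |MT| = 8, |Ref| = 14
print(round(meteor_like(matches=5, hyp_len=8, ref_len=14, chunks=3), 3))  # ~0.333
```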
Others
WMT shared task on metrics:
TerrorCat
DepRef
MEANT and TINE
TESLA
LEPOR
ROSE
AMBER
Many other linguistically motivated metrics where matching is not done at the word-level (only)
...
Overview
Quality estimation (QE): metrics that provide an estimate of the quality of unseen translations
No access to reference translations
Quality defined by the data
Quality = Can we publish it as is?
Quality = Can a reader get the gist?
Quality = Is it worth post-editing it?
Quality = How much effort to fix it?
Framework
[Diagram: training stage: examples of source texts and translations (X) with quality scores (Y) → feature extraction → machine learning → QE model]
[Diagram: prediction stage: source text and its MT translation → feature extraction → QE model → predicted quality score]
Framework
Main components to build a QE system:
1 Definition of quality: what to predict
2 (Human) labelled data (for quality)
3 Features
4 Machine learning algorithm
Definition of quality
Predict 1-N absolute scores for adequacy/fluency
Predict 1-N absolute scores for post-editing effort
Predict average post-editing time per word
Predict relative rankings
Predict relative rankings for same source
Predict percentage of edits needed for sentence
Predict word-level edits and their types
Predict BLEU, etc. scores for document
Datasets
SHEF (several): http://staffwww.dcs.shef.ac.uk/people/L.Specia/resources.html
LIG (10K, fr-en): http://www-clips.imag.fr/geod/User/marion.potet/index.php?page=download
LIMSI (14K, fr-en, en-fr, 2 post-editors): http://web.limsi.fr/Individu/wisniews/recherche/index.html
Features
[Diagram: feature types: complexity indicators (source text), fluency indicators (translation), confidence indicators (MT system), adequacy indicators (source and translation)]
QuEst
Goal: framework to explore features for QE
Feature extractors for 150+ features of all types: Java
Machine learning: wrappers for a number of algorithms in the scikit-learn toolkit, grid search, feature selection
Open source: http://www.quest.dcs.shef.ac.uk/
State of the art in QE
WMT12-13 shared tasks
Sentence- and word-level estimation of PE effort
Datasets and language pairs:

Quality                                 Year      Languages
1-5 subjective scores                   WMT12     en-es
Ranking all sentences best-worst        WMT12/13  en-es
HTER scores                             WMT13     en-es
Post-editing time                       WMT13     en-es
Word-level edits: change/keep           WMT13     en-es
Word-level edits: keep/delete/replace   WMT13     en-es
Ranking 5 MTs per source                WMT13     en-es; de-en

Evaluation metric:
$\mathrm{MAE} = \dfrac{\sum_{i=1}^{N} |H(s_i) - V(s_i)|}{N}$
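The evaluation metric itself is simple to compute; a small sketch with hypothetical predicted and gold effort scores:

```python
def mae(predicted, gold):
    """Mean absolute error between predicted and true quality scores."""
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)

print(mae([3.1, 4.0, 2.2, 4.8], [3.0, 4.5, 1.0, 5.0]))  # 0.5
```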
Baseline system
Features:
number of tokens in the source and target sentences
average source token length
average number of occurrences of words in the target
number of punctuation marks in source and target sentences
LM probability of source and target sentences
average number of translations per source word
% of source 1-grams, 2-grams and 3-grams in frequency quartiles 1 and 4
% of seen source unigrams
SVM regression with RBF kernel with the parameters γ, ε and C optimised using grid search and 5-fold cross validation on the training set
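A minimal sketch of this setup with scikit-learn, using only a few of the shallow features listed above (token counts, average source token length, punctuation counts) and a small illustrative grid; the full baseline also needs LM probabilities, translation tables and frequency statistics, and the sentence pairs and labels below are hypothetical toy data:

```python
import string
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def shallow_features(src, tgt):
    """A few of the baseline features: lengths, avg source token length, punctuation."""
    s, t = src.split(), tgt.split()
    def punct(tokens):
        return sum(tok in string.punctuation for tok in tokens)
    return [len(s), len(t),
            sum(len(tok) for tok in s) / len(s),
            punct(s), punct(t)]

# X: feature vectors for (source, translation) pairs; y: gold quality labels
# (e.g. 1-5 effort scores or HTER) -- toy data for illustration only
pairs = [("the house is small .", "la casa es pequeña ."),
         ("he said it would rain today .", "dijo que iba a llover hoy ."),
         ("do not buy this product !", "no compre este producto !"),
         ("the battery lasts six hours .", "la batería dura seis .")]
X = np.array([shallow_features(s, t) for s, t in pairs])
y = np.array([4.5, 4.0, 4.8, 2.5])

grid = {"C": [1, 10], "gamma": [0.01, 0.1], "epsilon": [0.1, 0.2]}
model = GridSearchCV(SVR(kernel="rbf"), grid, cv=2)  # 5-fold CV in the real setup
model.fit(X, y)
print(model.best_params_, model.predict(X))
```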
Results - scoring sub-task (WMT12)
System ID                   MAE   RMSE
• SDLLW M5PbestDeltaAvg     0.61  0.75
UU best                     0.64  0.79
SDLLW SVM                   0.64  0.78
UU bltk                     0.64  0.79
Loria SVMlinear             0.68  0.82
UEdin                       0.68  0.82
TCD M5P-resources-only*     0.68  0.82
Baseline bb17 SVR           0.69  0.82
Loria SVMrbf                0.69  0.83
SJTU                        0.69  0.83
WLV-SHEF FS                 0.69  0.85
PRHLT-UPV                   0.70  0.85
WLV-SHEF BL                 0.72  0.86
DCU-SYMC unconstrained      0.75  0.97
DFKI grcfs-mars             0.82  0.98
DFKI cfs-plsreg             0.82  0.99
UPC 1                       0.84  1.01
DCU-SYMC constrained        0.86  1.12
UPC 2                       0.87  1.04
TCD M5P-all                 2.09  2.32
Results - scoring sub-task (WMT13)
System ID                MAE    RMSE
• SHEF FS                12.42  15.74
SHEF FS-AL               13.02  17.03
CNGL SVRPLS              13.26  16.82
LIMSI                    13.32  17.22
DCU-SYMC combine         13.45  16.64
DCU-SYMC alltypes        13.51  17.14
CMU noB                  13.84  17.46
CNGL SVR                 13.85  17.28
FBK-UEdin extra          14.38  17.68
FBK-UEdin rand-svr       14.50  17.73
LORIA inctrain           14.79  18.34
Baseline bb17 SVR        14.81  18.22
TCD-CNGL open            14.81  19.00
LORIA inctraincont       14.83  18.17
TCD-CNGL restricted      15.20  19.59
CMU full                 15.25  18.97
UMAC                     16.97  21.94
Open issues
Agreement between annotators
Absolute value judgements: difficult to achieve consistency even in highly controlled settings
WMT12: 30% of initial dataset discarded
Remaining annotations had to be scaled
Open issues
Annotation costs: active learning to select subset of instances to be annotated (Beck et al., ACL 2013)
Open issues
Curse of dimensionality: feature selection to identify relevant info for dataset (Shah et al., MT Summit 2013)
[Figure: prediction error for feature sets BL, AF, BL+PR, AF+PR and FS across datasets (WMT12, EAMT-11 en-es, EAMT-11 fr-en, EAMT-09 s1–s4, GALE-11 s1–s2)]
Common feature set identified, but nuanced subsets for specific datasets
Open issues
How to use estimated PE effort scores? Do users prefer detailed estimates (sub-sentence level), an overall estimate for the complete sentence, or not seeing bad sentences at all?
Too much information vs hard-to-interpret scores
IBM’s Goodness metric
MATECAT project investigating it
Conclusions
(Machine) Translation evaluation & estimation: still an open problem
Different metrics for: different purposes/users, different needs, different notions of quality
Quality estimation: learning of these different notions, but requires labelled data
Solution:
Think of what quality means in your scenario
Measure significance
Measure agreement if manual metrics
Use various metrics
Invent your own metric!