
Challenges in Predicting Machine Translation Utility for Human Post-Editors

Michael Denkowski and Alon Lavie

Language Technologies Institute, Carnegie Mellon University

October 29, 2012

[Figure: an MT system maps the source text to a fast translation, while human translators map the source text to a good translation. Can we get a good, fast translation?]

MT with Human Post-Editing

[Figure: Source Text → MT System → Fast Translation → Translators → Good Fast Translation, or else a Very Slow Re-Translation]

Introduction

Utility prediction: We need to reliably predict the usability of automatic translations.

“Referenceless” utility prediction:

• Corresponds to confidence estimation task

• Confidence Estimation for post-editing (Specia, 2011)

• WMT 2012 Shared Quality (for post-editing) Estimation Task (Callison-Burch et al., 2012)

Reference-aided utility prediction:

• Corresponds to MT evaluation task

• This work


This Work

Machine translation as a starting point for human translators

• Goal is utility for post-editing

• Compare post-editing to traditional adequacy-driven tasks

Examine results of a post-editing experiment

• Simulate a real-world localization scenario

• Examine challenges in predicting translation usefulness for human translators

Adequacy Tasks

Adequacy: semantic similarity to reference translations

Significant research efforts on improving end quality of machine translation:

• ACL Workshops on Statistical Machine Translation (Callison-Burch et al., 2011)

• NIST Open Machine Translation Evaluations (Przybocki et al., 2009)

Measured by absolute scores or rankings

Motivation: MT for user consumption, input for other NLP tasks

Post-Editing

Human-targeted translation edit rate (HTER, Snover et al., 2006)

1. Human translators correct MT output

2. Automatically calculate number of edits using TER

TER = (# of edits) / (# of reference words)

Edits: insertion, deletion, substitution, block shift
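As a rough illustration, here is a minimal Python sketch of an edit-rate computation using only insertions, deletions, and substitutions; real TER additionally searches for block shifts, each counted as a single edit. The function name and example strings are illustrative only.

```python
def edit_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by reference length
    (a simplified TER without block shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = minimum edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i                      # delete all hypothesis words
    for j in range(len(ref) + 1):
        dp[0][j] = j                      # insert all reference words
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[len(hyp)][len(ref)] / max(len(ref), 1)

# HTER applies the same idea with the post-edited translation acting
# as the reference for the raw MT output.
print(edit_rate("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words
```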

Translation Example (WMT 2011 Czech–English Track)

Ref: He was supposed to pay half a million to Lubos G.

1: He had for Lubosi G. to pay half a million crowns. (0.27)

2: He had to pay lubosi G. half a million kronor. (0.09)

Translation Example (WMT 2011 Czech–English Track)

Ref: The problem is that life of the lines is two to four years.

1: The problem is that life is two lines, up to four years. (0.49 / 0.29)

2: The problem is that the durability of lines is two or four years. (0.34 / 0.14)

MT Post-Editing Experiment

90 sentences from Google Docs documentation

Translated from English to Spanish by two systems:

• Microsoft Translator

• Moses system (Europarl)

180 MT outputs total

Sent to human translators at Kent State Institute for Applied Linguistics for post-editing

Translators never saw the reference translations


MT Post-Editing Experiment

Data collected from professional translators (in training):

Post-edited translations

Expert post-editing ratings:
1: No editing required
2: Minor editing, meaning preserved
3: Major editing, meaning lost
4: Re-translate

From parallel data:

Independent reference translations

MT Post-Editing Experiment

Evaluate post-edited results using standard MT evaluation metrics:

BLEU (Papineni et al., 2002):

• n-gram precision with a brevity penalty

TER (Snover et al., 2006):

• Minimum edit distance

Meteor (Denkowski and Lavie, 2011):

• Tunable alignment-based metric

Task: Reference-aided utility prediction
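For concreteness, a self-contained sketch of sentence-level BLEU (clipped n-gram precision with a brevity penalty); production implementations add smoothing and aggregate statistics over the whole corpus, so the numbers will not match the tables below exactly.

```python
import math
from collections import Counter

def sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Geometric mean of clipped 1..max_n-gram precisions times a
    brevity penalty (unsmoothed)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
        if matches == 0:
            return 0.0                     # any zero precision zeroes the score
        log_precisions += math.log(matches / sum(hyp_ngrams.values()))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precisions / max_n)
```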

MT Post-Editing Results

Average rating: 1.69

Average HTER: 12.4

Automatic metric scores:

              BLEU   TER    Meteor
Post-edited   79.2   12.4   90.0
MT vs Ref     31.7   49.5   58.2
Post vs Ref   34.1   48.3   59.2


MT Post-Editing Results

r          4-pt   BLEU   TER    Meteor
4-point    –      0.32   0.28   0.33
HTER       0.49   0.26   0.24   0.27

Metric correlation with post-editing scores
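The correlations above can be reproduced with a simple per-sentence comparison once metric scores and post-editing scores are aligned; a sketch assuming Pearson's r and purely hypothetical score lists:

```python
import numpy as np

# Hypothetical per-sentence scores: each position is one MT output.
metric_scores = np.array([0.71, 0.42, 0.88, 0.55, 0.33])   # e.g. Meteor
hter_scores   = np.array([0.10, 0.35, 0.05, 0.22, 0.48])   # edits after post-editing

# Pearson correlation between the metric and HTER; the sign is negative
# for this toy data because higher metric scores go with fewer edits.
r = np.corrcoef(metric_scores, hter_scores)[0, 1]
print(f"r = {r:.2f}")
```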

MT Post-Editing Experiment

Oracle experiment: tune Meteor to maximize correlation

How well can we (over)fit expert post-editing ratings?

The Meteor Metric

Flexible alignment:

Scoring features:

• Precision/Recall contribution (insertions, deletions)

• Fragmentation penalty (reordering)

• Content/function word contribution

• Flexible match weights
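A minimal sketch of how these features combine into a sentence score, assuming the alignment has already been summarized as weighted precision, recall, and fragmentation; the parameter values below are placeholders, not the metric's actual defaults.

```python
def meteor_style_score(precision: float, recall: float,
                       chunks: int, matches: int,
                       alpha: float = 0.85, beta: float = 2.0,
                       gamma: float = 0.45) -> float:
    """Combine precision, recall, and a fragmentation penalty in the
    style of Meteor; alpha, beta, gamma are tunable parameters."""
    if matches == 0 or precision == 0.0 or recall == 0.0:
        return 0.0
    # Harmonic mean of precision and recall, weighted by alpha.
    fmean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation: fewer, longer matched chunks mean less reordering.
    frag = chunks / matches
    penalty = gamma * (frag ** beta)
    return (1 - penalty) * fmean

# Tuning amounts to searching alpha/beta/gamma and the match and
# content/function-word weights for the values that best correlate with
# post-editing scores; fitting on the same data gives the oracle
# (overfit) upper bound reported below.
print(meteor_style_score(precision=0.8, recall=0.7, chunks=3, matches=7))
```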

MT Post-Editing Results

r          4-pt   BLEU   TER    Meteor   Meteor-oracle
4-point    –      0.32   0.28   0.33     0.35
HTER       0.49   0.26   0.24   0.27     0.34

Metric correlation with post-editing scores

MT Post-Editing Experiment

Additional experiment: translation usability

Divide translations into two groups:

• Suitable for post-editing (1-2)

• Not suitable for post-editing (3-4)

Examine metric score distribution of each group

Assess metric ability to distinguish between usable and non-usable translations

Unfair advantage: reference translations
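A sketch of the usability split, assuming per-sentence metric scores paired with expert ratings (all names and values here are illustrative):

```python
def split_by_usability(scores, ratings):
    """Group metric scores by expert rating: 1-2 = suitable for
    post-editing, 3-4 = not suitable."""
    usable     = [s for s, r in zip(scores, ratings) if r <= 2]
    non_usable = [s for s, r in zip(scores, ratings) if r >= 3]
    return usable, non_usable

# Comparing the two score distributions (histograms, means, or overlap)
# shows how well a reference-based metric separates usable from
# non-usable translations.
usable, non_usable = split_by_usability([0.82, 0.31, 0.64, 0.20], [1, 4, 2, 3])
print(sum(usable) / len(usable), sum(non_usable) / len(non_usable))
```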


Usability Experiment Results

[Figure: histograms of sentence counts over BLEU score (left) and oracle Meteor score (right), 0.0–1.0, for usable vs. non-usable translations]

Larger Data Set

Are our results skewed by the small size of the data (180 sentences)?

WMT12 Quality Estimation Task:

1832 English-to-Spanish MT outputs

HTER scores and 5-point multiple-expert ratings

Run usability experiment with this data


WMT 2012 Quality Estimation Task Data

[Figure: histograms of sentence counts over BLEU score (left) and oracle Meteor score (right) for usable vs. non-usable translations]

Usability vs HTER

How well do experts and HTER agree?

[Figure: histograms of sentence counts over HTER for usable vs. non-usable translations, Kent State data (left) and WMT 2012 data (right)]

Usability vs HTER (WMT12)

[Figure: scatter plot of expert rating (1–5) against HTER (0–100) for the WMT 2012 data]

Conclusions

MT for post-editing utility is a significantly different task from MT for adequacy

Current MT tools under-perform on predicting post-editing usability

Even metrics that use post-editing information (HTER) don't match expert assessments

To improve post-editing usability, we need better data, better metrics, better MT systems

Conclusions

www.transcenter.info

