
Challenges in Predicting Machine Translation Utility for Human Post-Editors

Michael Denkowski and Alon Lavie

Language Technologies Institute, Carnegie Mellon University

October 29, 2012

[Figure: an MT system maps the source text to a fast translation, while human translators map the source text to a good translation. Can we get a good, fast translation?]

MT with Human Post-Editing

[Figure: Source Text → MT System → Fast Translation → Translators → Good Fast Translation, or else a Very Slow Re-Translation]

Introduction

Utility prediction: We need to reliably predict the usability of automatic translations.

“Referenceless” utility prediction:

• Corresponds to confidence estimation task

• Confidence Estimation for post-editing (Specia, 2011)

• WMT 2012 Shared Quality (for post-editing) Estimation Task (Callison-Burch et al., 2012)

Reference-aided utility prediction:

• Corresponds to MT evaluation task

• This work


This Work

Machine translation as a starting point for human translators

• Goal is utility for post-editing

• Compare post-editing to traditional adequacy-driven tasks

Examine results of a post-editing experiment

• Simulate a real-world localization scenario

• Examine challenges in predicting translation usefulness for human translators

Adequacy Tasks

Adequacy: semantic similarity to reference translations

Significant research efforts on improving end quality of machine translation:

• ACL Workshops on Statistical Machine Translation (Callison-Burch et al., 2011)

• NIST Open Machine Translation Evaluations (Przybocki et al., 2009)

Measured by absolute scores or rankings

Motivation: MT for user consumption, input for other NLP tasks

Post-Editing

Human-targeted translation edit rate (HTER, Snover et al., 2006)

1. Human translators correct MT output

2. Automatically calculate number of edits using TER

TER = (# of edits) / (# of reference words)

Edits: insertion, deletion, substitution, block shift
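As a rough illustration, here is a minimal Python sketch of an edit-rate computation using only insertions, deletions, and substitutions; real TER additionally searches for block shifts, each counted as a single edit. The function name and example strings are illustrative only.

```python
def edit_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by reference length
    (a simplified TER without block shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = minimum edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i                      # delete all hypothesis words
    for j in range(len(ref) + 1):
        dp[0][j] = j                      # insert all reference words
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[len(hyp)][len(ref)] / max(len(ref), 1)

# HTER applies the same idea with the post-edited translation acting
# as the reference for the raw MT output.
print(edit_rate("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words
```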

Translation Example (WMT 2011 Czech–English Track)

Ref: He was supposed to pay half a million to Lubos G.

1: He had for Lubosi G. to pay half a million crowns. (0.27)

2: He had to pay lubosi G. half a million kronor. (0.09)

Translation Example (WMT 2011 Czech–English Track)

Ref: The problem is that life of the lines is two to four years.

1: The problem is that life is two lines, up to four years. (0.49 / 0.29)

2: The problem is that the durability of lines is two or four years. (0.34 / 0.14)

MT Post-Editing Experiment

90 sentences from Google Docs documentation

Translated from English to Spanish by two systems:

• Microsoft Translator

• Moses system (Europarl)

180 MT outputs total

Sent to human translators at Kent State Institute for Applied Linguistics for post-editing

Translators never saw the reference translations


MT Post-Editing Experiment

Data collected from professional translators (in training):

Post-edited translations

Expert post-editing ratings:
1: No editing required
2: Minor editing, meaning preserved
3: Major editing, meaning lost
4: Re-translate

From parallel data:

Independent reference translations

MT Post-Editing Experiment

Evaluate post-edited results using standard MT evaluation metrics:

BLEU (Papineni et al., 2002):

• n-gram precision with a brevity penalty

TER (Snover et al., 2006):

• Minimum edit distance

Meteor (Denkowski and Lavie, 2011):

• Tunable alignment-based metric

Task: Reference-aided utility prediction
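For concreteness, a self-contained sketch of sentence-level BLEU (clipped n-gram precision with a brevity penalty); production implementations add smoothing and aggregate statistics over the whole corpus, so the numbers will not match the tables below exactly.

```python
import math
from collections import Counter

def sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Geometric mean of clipped 1..max_n-gram precisions times a
    brevity penalty (unsmoothed)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
        if matches == 0:
            return 0.0                     # any zero precision zeroes the score
        log_precisions += math.log(matches / sum(hyp_ngrams.values()))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precisions / max_n)
```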

MT Post-Editing Results

Average rating: 1.69

Average HTER: 12.4

Automatic metric scores:

              BLEU   TER    Meteor
Post-edited   79.2   12.4   90.0
MT vs Ref     31.7   49.5   58.2
Post vs Ref   34.1   48.3   59.2


MT Post-Editing Results

r          4-pt   BLEU   TER    Meteor
4-point    –      0.32   0.28   0.33
HTER       0.49   0.26   0.24   0.27

Metric correlation with post-editing scores
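The correlations above can be reproduced with a simple per-sentence comparison once metric scores and post-editing scores are aligned; a sketch assuming Pearson's r and purely hypothetical score lists:

```python
import numpy as np

# Hypothetical per-sentence scores: each position is one MT output.
metric_scores = np.array([0.71, 0.42, 0.88, 0.55, 0.33])   # e.g. Meteor
hter_scores   = np.array([0.10, 0.35, 0.05, 0.22, 0.48])   # edits after post-editing

# Pearson correlation between the metric and HTER; the sign is negative
# for this toy data because higher metric scores go with fewer edits.
r = np.corrcoef(metric_scores, hter_scores)[0, 1]
print(f"r = {r:.2f}")
```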

MT Post-Editing Experiment

Oracle experiment: tune Meteor to maximize correlation

How well can we (over)fit expert post-editing ratings?

The Meteor Metric

Flexible alignment:

Scoring features:

• Precision/Recall contribution (insertions, deletions)

• Fragmentation penalty (reordering)

• Content/function word contribution

• Flexible match weights
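A minimal sketch of how these features combine into a sentence score, assuming the alignment has already been summarized as weighted precision, recall, and fragmentation; the parameter values below are placeholders, not the metric's actual defaults.

```python
def meteor_style_score(precision: float, recall: float,
                       chunks: int, matches: int,
                       alpha: float = 0.85, beta: float = 2.0,
                       gamma: float = 0.45) -> float:
    """Combine precision, recall, and a fragmentation penalty in the
    style of Meteor; alpha, beta, gamma are tunable parameters."""
    if matches == 0 or precision == 0.0 or recall == 0.0:
        return 0.0
    # Harmonic mean of precision and recall, weighted by alpha.
    fmean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation: fewer, longer matched chunks mean less reordering.
    frag = chunks / matches
    penalty = gamma * (frag ** beta)
    return (1 - penalty) * fmean

# Tuning amounts to searching alpha/beta/gamma and the match and
# content/function-word weights for the values that best correlate with
# post-editing scores; fitting on the same data gives the oracle
# (overfit) upper bound reported below.
print(meteor_style_score(precision=0.8, recall=0.7, chunks=3, matches=7))
```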

MT Post-Editing Results

r          4-pt   BLEU   TER    Meteor   Meteor-oracle
4-point    –      0.32   0.28   0.33     0.35
HTER       0.49   0.26   0.24   0.27     0.34

Metric correlation with post-editing scores

MT Post-Editing Experiment

Additional experiment: translation usability

Divide translations into two groups:

• Suitable for post-editing (1-2)

• Not suitable for post-editing (3-4)

Examine metric score distribution of each group

Assess metric ability to distinguish between usable and non-usable translations

Unfair advantage: reference translations
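A sketch of the usability split, assuming per-sentence metric scores paired with expert ratings (all names and values here are illustrative):

```python
def split_by_usability(scores, ratings):
    """Group metric scores by expert rating: 1-2 = suitable for
    post-editing, 3-4 = not suitable."""
    usable     = [s for s, r in zip(scores, ratings) if r <= 2]
    non_usable = [s for s, r in zip(scores, ratings) if r >= 3]
    return usable, non_usable

# Comparing the two score distributions (histograms, means, or overlap)
# shows how well a reference-based metric separates usable from
# non-usable translations.
usable, non_usable = split_by_usability([0.82, 0.31, 0.64, 0.20], [1, 4, 2, 3])
print(sum(usable) / len(usable), sum(non_usable) / len(non_usable))
```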


Usability Experiment Results

[Figure: histograms of sentence counts over BLEU score (left) and oracle Meteor score (right), 0.0–1.0, for usable vs. non-usable translations]

Larger Data Set

Are our results skewed by the small size of the data (180 sentences)?

WMT12 Quality Estimation Task:

1832 English-to-Spanish MT outputs

HTER scores and 5-point multiple-expert ratings

Run usability experiment with this data


WMT 2012 Quality Estimation Task Data

[Figure: histograms of sentence counts over BLEU score (left) and oracle Meteor score (right) for usable vs. non-usable translations]

Usability vs HTER

How well do experts and HTER agree?

[Figure: histograms of sentence counts over HTER for usable vs. non-usable translations, Kent State data (left) and WMT 2012 data (right)]

Usability vs HTER (WMT12)

[Figure: scatter plot of expert rating (1–5) against HTER (0–100) for the WMT 2012 data]

Conclusions

MT for post-editing utility is a significantly different task from MT for adequacy

Current MT tools under-perform on predicting post-editing usability

Even metrics that use post-editing information (HTER) don't match expert assessments

To improve post-editing usability, we need better data, better metrics, better MT systems

Conclusions

www.transcenter.info

