goodness: a method for measuring machine translation ...nbach/papers/acl2011-slides.pdf–good word...

Goodness: A Method for Measuring Machine Translation

Confidence

Nguyen Bach Fei Huang and Yaser Al-Onaizan

Carnegie Mellon University IBM T.J. Watson Research Center

1

you totally different from zaid amr , and not to deprive yourself in a

basement of imitation and assimilation .

أنت مختلف تمامأ عن زيد وعمرو فال تحشر نفسك في سرداب

التقليد والمحاكاة والذوبان

MT output

Source

you totally different from zaid amr , and not to

deprive yourself in a basement of imitation and

assimilation .

We predict

and visualize

you are quite different from zaid and amr , so do not cram yourself in

the tunnel of simulation , imitation and assimilation . Human

correction

2

Why does it matter?

• Current MT systems don’t have efficient ways to predict machine translation confidence.

• Applications – Post-editing

– Commercial practice for end-users

– Down-stream processing: QA, IE

– MT engine: reranking

3

Outline

• Introduction

• Predicting

– Review

– Our models

– Feature sets

• Experiments

• N-best list reranking

• Visualizing

4

5

Confidence Measure

Back Translation (Gao et al. 2006, Bach et al. 2007)

Word Posterior Probability

(Ueffing&Ney, CL 2007)

Predicting BLEU (Soricut&Echihabi,

ACL 2010)

Binary Classification (Blazt et al., JHU 2003; Quirk, LREC

2004; Xiong, ACL 2010)

Features:

- Source & target LM

- WPP

- IBM Model 1

- Target POS

Our work belongs to this category

Can we do a better job?

• Can we use a rich feature set such as dependency structures, source-side information, and alignment context to improve error prediction performance?

• Can we predict translation error types?

• Do confidence measures help the MT system to select a better translation?

• How confidence score can be presented to improve end-user perception?

6

Confidence Measure Models

• Confidence estimation ≈ sequential labelling task

– Word Sequence = MT output

• Two classifiers

– Binary: Good/Bad

– Multi classes: Insertion/Substitution/Shift/Good.

• Supervised training

• Word -> Sentence confidence

7

Word-level Model

• A feature-rich classifier for a given word

• Discriminatively trained by MIRA

8

where f is its feature vector, y is the label, wy is

its feature weight vector

Sentence-level Model

• Given a sentence S, use word-level models to obtain labels for each words

• Use forward and backward values to compute marginal probability of Good label

• Normalize

9

Type 1: Source & Target Dependency Structures

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces

null

10




Source Parent = an

Target Parent = that

Are parent aligned: Yes

11

Child-Parent Agreement




Children Agreement: 2

12

Type 2: Source POS and Phrases



VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

null

13

Type 2: Source POS & Phrases




MT output

Source POS

Source

14



1 if source-POS-sequence = “DT DTNN”

f125(target-word = “process”) =

0 otherwise



MT output

Source POS

Source

15





MT output

Source POS

Source

16

Type 3: Alignment Context



null

PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS


MT output

Source POS

Source

Target POS

17





He adds that this process also refers to the inability of the multinational naval forces MT output

Source POS

Source

Target POS

18






Source POS

Source

Target POS

19






Source POS

Source

Target POS

20






Source POS

Source

Target POS

21






Source POS

Source

Target POS

22

Recap

• Source & Target Dependency

• Source POS and Phrases

• Alignment Context

23

Experiment Setup

you totally different from zaid amr , and not to deprive yourself in a basement of imitation

and assimilation .

أنت مختلف تمامأ عن زيد وعمرو فال تحشر نفسك في سرداب التقليد والمحاكاة والذوبان

MT output

Source

Human

correction

you are quite different from zaid and amr , so do not cram yourself in the tunnel of

simulation , imitation and assimilation .

24

Experiment Setup

• Training data – Human correction Arabic-English system: 72K sentences with 2.2M words;

mix newswire and weblog

• Dev and Unseen test set – Dev: 2,707 sentences, 80K words – Unseen: 2,707 sentences, 79K words

• WPP: implement WPP algorithm in (Ueffing&Ney, 2007)

• WPP+target-POS: similar feature set as in (Xiong, ACL 2010)

• Learner: MIRA, 100 iterations, hyper-param = 5, cut-off = 1

• Evaluation metric: F-score

25

How does the proposed method compare with previous work?

26

Performance of binary classifiers

27

59.4 59.3

69.3 68.7

69.6 69.1

72.1 71.5

72.4 72 72.6 72.2

58

60

62

64

66

68

70

72

74

dev test

F-sc

ore

Test sets

Predicting Good/Bad words

All-Good

WPP (Ueffing&Ney,CL 2007)

WPP+ target POS (Xiong,ACL 2010)

our features

WPP+ our features

WPP+ target POS + ourfeatures

Performance of 4-class classifiers

- Improve from WPP (Ueffing & Ney, CL 2007) and WPP+targetPOS (Xiong et

al., ACL 2010)

- Our features alone already outperformed previous work

28

59.4 59.3

64.4

63.7

64.4 63.9

66.2 65.6

66.6

65.9

66.8

66.1

58

59

60

61

62

63

64

65

66

67

68

dev test

F-sc

ore

Test sets

Predicting Good/Insertion/Substitution/Shift words

All-Good

WPP (Ueffing&Ney,CL 2007)

WPP+ target POS (Xiong,ACL 2010)

our features

WPP+ our features

WPP+ target POS + ourfeatures

What are the contributions of individual feature sets?

29

69.3

68.7

72.1

71.6 71.4

70.9

69.9

69.5

67

68

69

70

71

72

73

dev test

F-sc

ore

Test sets

Contributions of features for predicting G/B words

WPP (Ueffing&Ney, CL 2007) WPP + source features

WPP+ alignment features WPP+ dependency features

64.4

63.7

66.2

65.7 65.7

65.3

64.9

64.3

62

62.5

63

63.5

64

64.5

65

65.5

66

66.5

dev test

F-sc

ore

Test sets

Contributions of features for predicting G/I/S/Sh

WPP (Ueffing&Ney, CL 2007) WPP + source features

WPP+ alignment features WPP+ dependency features

Where are improvements coming from?

30

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 10 20 30 40 50 60 70 80 90 100

Go

od

ne

ss

HTER

Good Bad Linear fit

Goodness vs. HTER

Pearson correlation = 0.6

31

Outline

• Introduction

• Predicting

– Review

– Our models

– Feature sets

• Experiments


• Visualizing

32

N-best list reranking

• Reranking n-best list by goodness

33

Dev Test TER BLEU TER BLEU

Baseline 49.9 31 50.2 30.6

2-best 49.5 31.4 49.9 30.8 5-best 49.2 31.4 49.6 30.8 50-best 49.1 30.9 49.4 30.5 100-best 49 30.9 49.3 30.5



34


Baseline 49.9 31 50.2 30.6




35


Baseline 49.9 31 50.2 30.6


Outline

• Introduction

• Predicting

– Review

– Our models

– Feature sets

• Experiments


• Visualizing

36

Designing Objectives

• Minimize post editing efforts in word-level and sentence-level

• Simple and Intuitive

– Inspired by Wordle ©IBM

37

Designing Objectives

• Font size

–Big: bad – Small: good

– Medium: decent

• Color – Red: bad – Black: good – Orange: decent

• Threshold – Good word > 0.8; Bad word < 0.45 – Good sentence > 0.7; Bad sentence < 0.5

38

Bad translations: editing these sentences first

40

the poll also showed that most of the participants in the developing countries are ready

to introduce qualitative changes in the pattern of their lives for the sake of reducing the

effects of climate change.

واظهر االستطالع ايضا ان معظم المشاركين في الدول النامية مستعدون الدخال تغييرات نوعية على نمط حياتهم في

. سبيل خفض تأثيرات التغير المناخي

MT output

Source

the poll also showed that most of the participants in the developing countries are ready

to introduce qualitative changes in the pattern of their lives for the sake of

reducing the effects of climate change.

We predict

and visualize

the survey also showed that most of the participants in developing countries are ready

to introduce changes to the quality of their lifestyle in order to reduce the effects of

climate change .

Human

correction

41

perhaps racism of heads of human diseases and the most difficult treatment, survives,

only used rope,

لعل العنصرية من اوسأ امراض البشرية ومن اصعبها عالجا وال ينجو منها اال من استعان بحبل من هللا

MT output

Source

perhaps racism of heads of human diseases and the most difficult treatment,

survives, only used rope,

We predict

and visualize

racism is but the worst human disease and the hardest to recover ; just he who is

faithful to allah is able to recover . Human

correction

42

Contributions

• Source & Target Dependency Structures

• Source POS and Phrases

• Word Alignment Context

• Experimental results – Our features outperformed features described in the previous work

• Improve MT quality through n-best list reranking

• Visualizing confidence in word and sentence level

43

Thanks!

44

goodness: a method for measuring machine translation ...nbach/papers/acl2011-slides.pdf–good word...

Documents