goodness: a method for measuring machine translation ...nbach/papers/acl2011-slides.pdf–good word...

44
Goodness: A Method for Measuring Machine Translation Confidence Nguyen Bach Fei Huang and Yaser Al-Onaizan Carnegie Mellon University IBM T.J. Watson Research Center 1

Upload: others

Post on 06-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Goodness: A Method for Measuring Machine Translation

Confidence

Nguyen Bach Fei Huang and Yaser Al-Onaizan

Carnegie Mellon University IBM T.J. Watson Research Center

1

Page 2: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

you totally different from zaid amr , and not to deprive yourself in a

basement of imitation and assimilation .

أنت مختلف تمامأ عن زيد وعمرو فال تحشر نفسك في سرداب

التقليد والمحاكاة والذوبان

MT output

Source

you totally different from zaid amr , and not to

deprive yourself in a basement of imitation and

assimilation .

We predict

and visualize

you are quite different from zaid and amr , so do not cram yourself in

the tunnel of simulation , imitation and assimilation . Human

correction

2

Page 3: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Why does it matter?

• Current MT systems don’t have efficient ways to predict machine translation confidence.

• Applications – Post-editing

– Commercial practice for end-users

– Down-stream processing: QA, IE

– MT engine: reranking

3

Page 4: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Outline

• Introduction

• Predicting

– Review

– Our models

– Feature sets

• Experiments

• N-best list reranking

• Visualizing

4

Page 5: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

5

Confidence Measure

Back Translation (Gao et al. 2006, Bach et al. 2007)

Word Posterior Probability

(Ueffing&Ney, CL 2007)

Predicting BLEU (Soricut&Echihabi,

ACL 2010)

Binary Classification (Blazt et al., JHU 2003; Quirk, LREC

2004; Xiong, ACL 2010)

Features:

- Source & target LM

- WPP

- IBM Model 1

- Target POS

Our work belongs to this category

Page 6: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Can we do a better job?

• Can we use a rich feature set such as dependency structures, source-side information, and alignment context to improve error prediction performance?

• Can we predict translation error types?

• Do confidence measures help the MT system to select a better translation?

• How confidence score can be presented to improve end-user perception?

6

Page 7: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Confidence Measure Models

• Confidence estimation ≈ sequential labelling task

– Word Sequence = MT output

• Two classifiers

– Binary: Good/Bad

– Multi classes: Insertion/Substitution/Shift/Good.

• Supervised training

• Word -> Sentence confidence

7

Page 8: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Word-level Model

• A feature-rich classifier for a given word

• Discriminatively trained by MIRA

8

where f is its feature vector, y is the label, wy is

its feature weight vector

Page 9: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Sentence-level Model

• Given a sentence S, use word-level models to obtain labels for each words

• Use forward and backward values to compute marginal probability of Good label

• Normalize

9

Page 10: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 1: Source & Target Dependency Structures

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces

null

10

Page 11: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 1: Source & Target Dependency Structures

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces

Source Parent = an

Target Parent = that

Are parent aligned: Yes

11

Child-Parent Agreement

Page 12: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 1: Source & Target Dependency Structures

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces

Children Agreement: 2

12

Page 13: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 2: Source POS and Phrases

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces

VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

null

13

Page 14: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 2: Source POS & Phrases

He adds that this process also refers to the inability of the multinational naval forces

VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

MT output

Source POS

Source

14

Page 15: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 2: Source POS and Phrases

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

1 if source-POS-sequence = “DT DTNN”

f125(target-word = “process”) =

0 otherwise

He adds that this process also refers to the inability of the multinational naval forces

VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

MT output

Source POS

Source

15

Page 16: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 2: Source POS and Phrases

VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

He adds that this process also refers to the inability of the multinational naval forces

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

MT output

Source POS

Source

16

Page 17: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 3: Alignment Context

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces

null

PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS

VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

MT output

Source POS

Source

Target POS

17

Page 18: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 3: Alignment Context

PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS

VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces MT output

Source POS

Source

Target POS

18

Page 19: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 3: Alignment Context

PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS

VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces MT output

Source POS

Source

Target POS

19

Page 20: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 3: Alignment Context

PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS

VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces MT output

Source POS

Source

Target POS

20

Page 21: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 3: Alignment Context

PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS

VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces MT output

Source POS

Source

Target POS

21

Page 22: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Type 3: Alignment Context

PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS

VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ

wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt

He adds that this process also refers to the inability of the multinational naval forces MT output

Source POS

Source

Target POS

22

Page 23: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Recap

• Source & Target Dependency

• Source POS and Phrases

• Alignment Context

23

Page 24: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Experiment Setup

you totally different from zaid amr , and not to deprive yourself in a basement of imitation

and assimilation .

أنت مختلف تمامأ عن زيد وعمرو فال تحشر نفسك في سرداب التقليد والمحاكاة والذوبان

MT output

Source

Human

correction

you are quite different from zaid and amr , so do not cram yourself in the tunnel of

simulation , imitation and assimilation .

24

Page 25: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Experiment Setup

• Training data – Human correction Arabic-English system: 72K sentences with 2.2M words;

mix newswire and weblog

• Dev and Unseen test set – Dev: 2,707 sentences, 80K words – Unseen: 2,707 sentences, 79K words

• WPP: implement WPP algorithm in (Ueffing&Ney, 2007)

• WPP+target-POS: similar feature set as in (Xiong, ACL 2010)

• Learner: MIRA, 100 iterations, hyper-param = 5, cut-off = 1

• Evaluation metric: F-score

25

Page 26: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

How does the proposed method compare with previous work?

26

Page 27: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Performance of binary classifiers

27

59.4 59.3

69.3 68.7

69.6 69.1

72.1 71.5

72.4 72 72.6 72.2

58

60

62

64

66

68

70

72

74

dev test

F-sc

ore

Test sets

Predicting Good/Bad words

All-Good

WPP (Ueffing&Ney,CL 2007)

WPP+ target POS (Xiong,ACL 2010)

our features

WPP+ our features

WPP+ target POS + ourfeatures

Page 28: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Performance of 4-class classifiers

- Improve from WPP (Ueffing & Ney, CL 2007) and WPP+targetPOS (Xiong et

al., ACL 2010)

- Our features alone already outperformed previous work

28

59.4 59.3

64.4

63.7

64.4 63.9

66.2 65.6

66.6

65.9

66.8

66.1

58

59

60

61

62

63

64

65

66

67

68

dev test

F-sc

ore

Test sets

Predicting Good/Insertion/Substitution/Shift words

All-Good

WPP (Ueffing&Ney,CL 2007)

WPP+ target POS (Xiong,ACL 2010)

our features

WPP+ our features

WPP+ target POS + ourfeatures

Page 29: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

What are the contributions of individual feature sets?

29

Page 30: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

69.3

68.7

72.1

71.6 71.4

70.9

69.9

69.5

67

68

69

70

71

72

73

dev test

F-sc

ore

Test sets

Contributions of features for predicting G/B words

WPP (Ueffing&Ney, CL 2007) WPP + source features

WPP+ alignment features WPP+ dependency features

64.4

63.7

66.2

65.7 65.7

65.3

64.9

64.3

62

62.5

63

63.5

64

64.5

65

65.5

66

66.5

dev test

F-sc

ore

Test sets

Contributions of features for predicting G/I/S/Sh

WPP (Ueffing&Ney, CL 2007) WPP + source features

WPP+ alignment features WPP+ dependency features

Where are improvements coming from?

30

Page 31: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 10 20 30 40 50 60 70 80 90 100

Go

od

ne

ss

HTER

Good Bad Linear fit

Goodness vs. HTER

Pearson correlation = 0.6

31

Page 32: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Outline

• Introduction

• Predicting

– Review

– Our models

– Feature sets

• Experiments

• N-best list reranking

• Visualizing

32

Page 33: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

N-best list reranking

• Reranking n-best list by goodness

33

Dev Test TER BLEU TER BLEU

Baseline 49.9 31 50.2 30.6

2-best 49.5 31.4 49.9 30.8 5-best 49.2 31.4 49.6 30.8 50-best 49.1 30.9 49.4 30.5 100-best 49 30.9 49.3 30.5

Page 34: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

N-best list reranking

• Reranking n-best list by goodness

34

Dev Test TER BLEU TER BLEU

Baseline 49.9 31 50.2 30.6

2-best 49.5 31.4 49.9 30.8 5-best 49.2 31.4 49.6 30.8 50-best 49.1 30.9 49.4 30.5 100-best 49 30.9 49.3 30.5

Page 35: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

N-best list reranking

• Reranking n-best list by goodness

35

Dev Test TER BLEU TER BLEU

Baseline 49.9 31 50.2 30.6

2-best 49.5 31.4 49.9 30.8 5-best 49.2 31.4 49.6 30.8 50-best 49.1 30.9 49.4 30.5 100-best 49 30.9 49.3 30.5

Page 36: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Outline

• Introduction

• Predicting

– Review

– Our models

– Feature sets

• Experiments

• N-best list reranking

• Visualizing

36

Page 37: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Designing Objectives

• Minimize post editing efforts in word-level and sentence-level

• Simple and Intuitive

– Inspired by Wordle ©IBM

37

Page 38: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Designing Objectives

• Font size

–Big: bad – Small: good

– Medium: decent

• Color – Red: bad – Black: good – Orange: decent

• Threshold – Good word > 0.8; Bad word < 0.45 – Good sentence > 0.7; Bad sentence < 0.5

38

Page 39: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

39

Page 40: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Bad translations: editing these sentences first

40

Page 41: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

the poll also showed that most of the participants in the developing countries are ready

to introduce qualitative changes in the pattern of their lives for the sake of reducing the

effects of climate change.

واظهر االستطالع ايضا ان معظم المشاركين في الدول النامية مستعدون الدخال تغييرات نوعية على نمط حياتهم في

. سبيل خفض تأثيرات التغير المناخي

MT output

Source

the poll also showed that most of the participants in the developing countries are ready

to introduce qualitative changes in the pattern of their lives for the sake of

reducing the effects of climate change.

We predict

and visualize

the survey also showed that most of the participants in developing countries are ready

to introduce changes to the quality of their lifestyle in order to reduce the effects of

climate change .

Human

correction

41

Page 42: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

perhaps racism of heads of human diseases and the most difficult treatment, survives,

only used rope,

لعل العنصرية من اوسأ امراض البشرية ومن اصعبها عالجا وال ينجو منها اال من استعان بحبل من هللا

MT output

Source

perhaps racism of heads of human diseases and the most difficult treatment,

survives, only used rope,

We predict

and visualize

racism is but the worst human disease and the hardest to recover ; just he who is

faithful to allah is able to recover . Human

correction

42

Page 43: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Contributions

• Source & Target Dependency Structures

• Source POS and Phrases

• Word Alignment Context

• Experimental results – Our features outperformed features described in the previous work

• Improve MT quality through n-best list reranking

• Visualizing confidence in word and sentence level

43

Page 44: Goodness: A Method for Measuring Machine Translation ...nbach/papers/acl2011-slides.pdf–Good word > 0.8; Bad word < 0.45 –Good sentence > 0.7; Bad sentence < 0.5 38

Thanks!

44