
Page 1:

Measuring Confidence Intervals for MT Evaluation Metrics

Ying Zhang (Joy), Stephan Vogel

Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Page 2:

Outline

• Automatic Machine Translation Evaluation
  – BLEU
  – Modified BLEU
  – NIST MTEval

• Confidence Intervals based on Bootstrap Percentile
  – Algorithm
  – Comparing two MT systems
  – Implementation

• Discussions
  – How much testing data is needed?
  – How many reference translations are needed?
  – How many bootstrap samples are needed?

Page 3:

Automatic Machine Translation Evaluation

• Subjective MT evaluations
  – Fluency and Adequacy scored by human judges
  – Very expensive in time and money

• Objective automatic MT evaluations
  – Inspired by the Word Error Rate metric used in ASR research
  – Measure the "closeness" between the MT hypothesis and the human reference translations
  – Precision: n-gram precision
  – Recall:
    • Against the best-matched reference
    • Approximated by the brevity penalty
  – Cheap and fast
  – Highly correlated with subjective evaluations
  – MT research has greatly benefited from automatic evaluations
  – Typical metrics: IBM BLEU, CMU M-BLEU, CMU METEOR, NIST MTeval, NYU GTM

Page 4:

BLEU Metrics

• Proposed by IBM's SMT group (Papineni et al., 2002)

• Widely used in MT evaluations
  – DARPA TIDES MT evaluation
  – IWSLT evaluation
  – TC-Star

• BLEU metric:
  – p_n: modified n-gram precision
  – Geometric mean of p_1, p_2, ..., p_N
  – BP: brevity penalty
  – Usually N = 4 and w_n = 1/N

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}$$

c: length of the MT hypothesis; r: effective reference length

Page 5:

BLEU Metric

• Example:
  – MT hypothesis: the gunman was shot dead by police .
  – Reference 1: The gunman was shot to death by the police .
  – Reference 2: The gunman was shot to death by the police .
  – Reference 3: Police killed the gunman .
  – Reference 4: The gunman was shot dead by the police .

• Precision: p1 = 1.0 (8/8), p2 = 0.86 (6/7), p3 = 0.67 (4/6), p4 = 0.6 (3/5)

• Brevity penalty: c = 8, r = 9, BP = 0.8825

• Final score:
  $$\mathrm{BLEU} = 0.8825 \cdot (1.0 \cdot 0.86 \cdot 0.67 \cdot 0.6)^{1/4} \approx 0.68$$

• Usually the n-gram precisions and BP are calculated at the test-set level
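The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of the test-set-level computation, not the official BLEU script; the precisions and lengths are the ones from this example:

```python
import math

def bleu(precisions, c, r):
    """BLEU = BP * exp(sum of w_n * log(p_n)), with uniform weights w_n = 1/N."""
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# p1..p4 and the hypothesis/reference lengths from the example above:
print(bleu([8/8, 6/7, 4/6, 3/5], c=8, r=9))  # ~0.68
```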

Page 6:

Modified BLEU Metric

• BLEU weighs long n-grams heavily because of the geometric mean

• Example:

          p1     p2     p3     p4     BLEU
  MT1     1.0    0.21   0.11   0.06   0.19
  MT2     0.35   0.32   0.28   0.26   0.30

• Modified BLEU Metric (Zhang, 2004)
  – Arithmetic mean of the n-gram precisions
  – More balanced contribution from different n-grams

$$\mathrm{M\text{-}BLEU} = \mathrm{BP} \cdot \sum_{n=1}^{N} w_n p_n$$
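A small sketch (assuming BP = 1, as the table above implies) shows how the choice of mean flips the ranking: the geometric mean punishes MT1's weak 4-gram precision, while the arithmetic mean credits its perfect unigram precision:

```python
import math

def geometric_mean(ps):   # BLEU-style combination
    return math.exp(sum(math.log(p) for p in ps) / len(ps))

def arithmetic_mean(ps):  # M-BLEU-style combination
    return sum(ps) / len(ps)

mt1 = [1.0, 0.21, 0.11, 0.06]
mt2 = [0.35, 0.32, 0.28, 0.26]
print(geometric_mean(mt1), geometric_mean(mt2))    # ~0.19 vs. ~0.30
print(arithmetic_mean(mt1), arithmetic_mean(mt2))  # ~0.35 vs. ~0.30
```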

Page 7:

NIST MTEval Metric

• Motivation
  – "Weight more heavily those n-grams that are more informative" (NIST 2002)
  – Use an arithmetic mean of the n-gram scores

• Pros: more sensitive than BLEU

• Cons:
  – The information gain for 2-grams and up is not very meaningful
    • 80% of the score comes from unigram matches
    • Most matched 5-grams have information gain 0!
  – Scores increase as the test-set size increases

$$\mathrm{NIST} = \mathrm{BP} \cdot \sum_{n=1}^{N} \frac{\displaystyle\sum_{\text{all } w_1 \ldots w_n \text{ that co-occur}} \mathrm{Info}(w_1 \ldots w_n)}{\displaystyle\sum_{\text{all } w_1 \ldots w_n \text{ in hyp}} 1}$$

$$\mathrm{Info}(w_1 \ldots w_n) = \log_2 \left( \frac{\#\,\text{occurrences of } w_1 \ldots w_{n-1}}{\#\,\text{occurrences of } w_1 \ldots w_n} \right)$$
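As an illustration of the information-gain formula (not the official mteval code), a Python sketch that estimates Info from n-gram counts over the reference corpus; for unigrams, the numerator is the total number of reference words:

```python
import math
from collections import Counter

def make_info(references, max_n=5):
    """references: list of tokenized reference sentences (lists of words)."""
    counts = Counter()
    for sent in references:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    total_words = sum(len(sent) for sent in references)

    def info(ngram):
        # Info is only applied to n-grams that occur in the references,
        # so the counts below are nonzero.
        ngram = tuple(ngram)
        if len(ngram) == 1:
            return math.log2(total_words / counts[ngram])
        return math.log2(counts[ngram[:-1]] / counts[ngram])

    return info
```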

Page 8:

Questions Regarding MT Evaluation Metrics

• Do they rank MT systems in the same way as human judges?
  – IBM showed a strong correlation between BLEU and human judgments

• How reliable are the automatic evaluation scores?

• How sensitive is a metric?
  – Sensitivity: the metric should be able to distinguish between systems of similar performance

• Is the metric consistent?
  – Consistency: the difference between systems is not affected by the selection of testing/reference data

• How many reference translations are needed?

• How much testing data is sufficient for evaluation?

• If we can measure the confidence intervals of the evaluation scores, we can answer the above questions

Page 9:

Outline

• Overview of Automatic Machine Translation Evaluation
  – BLEU
  – Modified BLEU
  – NIST MTEval

• Confidence Intervals based on Bootstrap Percentile
  – Algorithm
  – Comparing two MT systems
  – Implementation

• Discussions
  – How much testing data is needed?
  – How many reference translations are needed?
  – How many bootstrap samples are needed?

Page 10:

Measuring the Confidence Intervals

• One BLEU/M-BLEU/NIST score per test set

• How accurate is this score?

• To measure a confidence interval, a population of scores is required

• Building a test set with multiple human reference translations is expensive

• Solution: bootstrapping (Efron 1986)
  – Introduced in 1979 as a computer-based method for estimating the standard error of a statistical estimate
  – Resampling: create an artificial population by sampling with replacement
  – Proposed by Franz Och (2003) for measuring confidence intervals of automatic MT evaluation metrics

Page 11:

A Schematic of the Bootstrapping Process

[Figure: schematic of the bootstrapping process — the original test set yields Score0; resampled test sets each yield their own score, forming a distribution of scores.]

Page 12:

An Efficient Implementation

• Translate and evaluate 2,000 test sets?
  – No way!

• Resample the n-gram precision information for the sentences
  – Most MT systems are context independent at the sentence level
  – MT evaluation metrics are based on information collected for each test sentence
  – E.g., for BLEU/M-BLEU and NIST, per sentence:

      RefLen:         17 20 19 24
      ClosestRefLen:  17
      1-gram:         15 10 89.34
      2-gram:         14  4  9.04
      3-gram:         13  3  3.65
      4-gram:         12  2  2.43

  – Similar for human judgments and other MT metrics

• Approximation for the NIST information gain

• Scripts available at: http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
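A sketch of this idea in Python, assuming the statistics above are cached per sentence (the field names here are illustrative, not the released scripts' format): each bootstrap replicate is scored by summing cached counts, so nothing is re-translated or re-matched.

```python
import math
import random

def bleu_from_counts(sentences):
    """sentences: dicts of cached per-sentence statistics:
    'match'[n] and 'total'[n] (n-gram matches/totals for n = 0..3),
    'hyp_len', 'closest_ref_len'."""
    match, total = [0] * 4, [0] * 4
    c = r = 0
    for s in sentences:
        for n in range(4):
            match[n] += s['match'][n]
            total[n] += s['total'][n]
        c += s['hyp_len']
        r += s['closest_ref_len']
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(m / t) for m, t in zip(match, total)) / 4)

def bootstrap_scores(sentences, B=2000):
    """Score B resampled test sets using only the cached statistics."""
    n = len(sentences)
    return [bleu_from_counts(random.choices(sentences, k=n)) for _ in range(B)]
```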

Page 13:

Algorithm

Original test suite T0 with N segments and R reference translations.

Represent the i-th segment of T0 as a tuple:

    T0[i] = <s_i, r_i1, r_i2, ..., r_iR>

for (b = 1; b <= B; b++) {
    for (i = 1; i <= N; i++) {
        s = random(1, N);    // draw a segment index with replacement
        Tb[i] = T0[s];
    }
    calculate BLEU/M-BLEU/NIST for Tb;
}

Sort the B BLEU/M-BLEU/NIST scores.

Output the scores ranked at the 2.5th and 97.5th percentiles.
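A runnable Python equivalent of the pseudocode, assuming a `score` function that maps a resampled test suite to a metric score (for instance, `bleu_from_counts` from the implementation slide):

```python
import random

def bootstrap_percentile_interval(test_suite, score, B=2000, alpha=0.025):
    """95% bootstrap percentile interval (for alpha = 0.025)."""
    n = len(test_suite)
    scores = sorted(score(random.choices(test_suite, k=n)) for _ in range(B))
    return scores[int(alpha * B)], scores[int((1.0 - alpha) * B) - 1]
```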

Page 14:

Confidence Intervals

• 7 Chinese-English MT systems from the June 2002 TIDES evaluation

• Observations:
  – Relative confidence interval: NIST < M-BLEU < BLEU
  – NIST scores have more discriminative power than BLEU
  – The strong impact of long n-grams makes the BLEU score less stable (or: introduces more noise)

Page 15:

Are Two MT Systems Different?

• Comparing two MT systems' performance
  – Use a method similar to that for a single system (see the sketch below)
  – E.g., Diff(Sys1 − Sys2): median = −1.7355, 95% interval [−1.9056, −1.5453]
  – If the confidence interval overlaps 0, the two systems are not significantly different

• M-BLEU and NIST have more discriminative power than BLEU

• The automatic metrics correlate well with the human ranking

• Human judges prefer system E (a syntactic system) over system B (a statistical system), but the automatic metrics do not
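A sketch of the paired comparison in Python. Drawing the same resampled segment indices for both systems, so that the difference is not dominated by test-set composition, is the standard approach and an assumption here; `score` maps a list of per-segment statistics to a metric score:

```python
import random

def paired_bootstrap_diff(segs_sys1, segs_sys2, score, B=2000):
    """Median and 95% interval of the score difference Sys1 - Sys2."""
    n = len(segs_sys1)
    diffs = []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]  # same sample for both
        diffs.append(score([segs_sys1[i] for i in idx]) -
                     score([segs_sys2[i] for i in idx]))
    diffs.sort()
    lo, hi = diffs[int(0.025 * B)], diffs[int(0.975 * B) - 1]
    return diffs[B // 2], (lo, hi)  # interval containing 0 => not significant
```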

Page 16:

Outline

• Overview of Automatic Machine Translation Evaluation
  – BLEU
  – Modified BLEU
  – NIST MTEval

• Confidence Intervals based on Bootstrap Percentile
  – Algorithm
  – Comparing two MT systems
  – Implementation

• Discussions
  – How much testing data is needed?
  – How many reference translations are needed?
  – How many bootstrap samples are needed?
  – Non-parametric intervals or normal/t-intervals?

Page 17:

How much testing data is needed

[Figure: four panels plotting scores for MT systems A–G against the percentage of testing data used (10%–100%): NIST scores (range 3.5–8), BLEU scores (0–0.3), M-BLEU scores (0.05–0.17), and Fluency+Adequacy human judgments (4–6).]

Page 18:

How much testing data is needed

• NIST scores increase steadily with growing test-set size

• The distances between the scores of the different systems remain stable once 40% or more of the test set is used

• The confidence intervals become narrower for larger test sets

• Rule of thumb: doubling the testing data narrows the confidence interval by about 30% (theoretically justified, as sketched below)

* System A (bootstrap size B = 2000)
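The theoretical justification, in brief: the standard error of a test-set score over N segments scales as $1/\sqrt{N}$, so doubling N shrinks the interval width by a factor of $1/\sqrt{2}$:

$$\mathrm{width}(2N) \approx 2\, z^{(1-\alpha)} \frac{\sigma}{\sqrt{2N}} = \frac{1}{\sqrt{2}}\,\mathrm{width}(N) \approx 0.71\,\mathrm{width}(N)$$

i.e. roughly 30% narrower.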

Page 19:

Effects of Using Multiple References

• A single reference from one translator may favor some systems

• Increasing the number of references narrows the relative confidence interval

[Figure: BLEU scores (0–0.25) of MT systems CE01–CE07 measured against each single reference (REF01–REF04) and against all four references (4-REF).]

Page 20:

How Many Reference Translations are Sufficient?

• Confidence intervals become narrower with more reference translations

• Roughly equivalent settings: 100% of the data with 1 reference ≈ 80–90% with 2 references ≈ 70–80% with 3 references ≈ 60–70% with 4 references

• One additional reference translation compensates for 10–15% of the testing data

* System A (bootstrap size B = 2000)

Page 21:

Do We Really Need Multiple References?

• Parallel multiple references: every translator translates every source sentence

• Single reference from multiple translators*: each source sentence is translated by only one of several translators
  – Reduces the bias from individual translators
  – Yields the same confidence interval/reliability as parallel multiple references
  – Costs only half the effort of building a parallel multiple-reference set

* Originally proposed in IBM's BLEU report

Page 22:

Single Reference from Multiple Translators

• Reduced bias by mixing references from different translators

• Yields the same confidence intervals as parallel multiple references

[Figure: BLEU scores (0–0.25) of MT systems CE01–CE07 for eight single-reference sets mixed from multiple translators (mixedREF1–mixedREF8), compared with the full 4-REF set.]

Page 23:

Bootstrap-t Interval vs. Normal/t Interval

• Normal distribution / t-distribution

  Assuming that
  $$Z = \frac{\hat{\theta} - \theta}{\hat{se}} \sim N(0, 1),$$
  the interval is
  $$[\,\hat{\theta} - z^{(1-\alpha)} \cdot \hat{se},\;\; \hat{\theta} - z^{(\alpha)} \cdot \hat{se}\,]$$

• Student's t-interval (when n is small)

  Assuming that
  $$Z = \frac{\hat{\theta} - \theta}{\hat{se}} \sim t_{n-1},$$
  the interval is
  $$[\,\hat{\theta} - t_{n-1}^{(1-\alpha)} \cdot \hat{se},\;\; \hat{\theta} - t_{n-1}^{(\alpha)} \cdot \hat{se}\,]$$

• Bootstrap-t interval
  – For each bootstrap sample b, calculate
    $$Z^{*}(b) = \frac{\hat{\theta}^{*}(b) - \hat{\theta}}{\hat{se}^{*}(b)}$$
  – The alpha-th percentile of Z*(b) is estimated by the value $\hat{t}^{(\alpha)}$ such that
    $$\#\{Z^{*}(b) \le \hat{t}^{(\alpha)}\} / B = \alpha$$
  – The bootstrap-t interval is
    $$[\,\hat{\theta} - \hat{t}^{(1-\alpha)} \cdot \hat{se},\;\; \hat{\theta} - \hat{t}^{(\alpha)} \cdot \hat{se}\,]$$
  – E.g., if B = 1000, the 50th largest and the 950th largest values of Z*(b) give the bootstrap-t interval
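A sketch of the bootstrap-t computation in Python, assuming each replicate's standard error $\hat{se}^{*}(b)$ is available (e.g., estimated from the sentence-level scores within that replicate); note the percentile reversal when forming the interval:

```python
def bootstrap_t_interval(theta_hat, se_hat, theta_star, se_star, alpha=0.025):
    """theta_star, se_star: lists of B replicate scores and standard errors."""
    z = sorted((t - theta_hat) / s for t, s in zip(theta_star, se_star))
    B = len(z)
    t_lo = z[int(alpha * B)]              # alpha-th percentile of Z*(b)
    t_hi = z[int((1.0 - alpha) * B) - 1]  # (1-alpha)-th percentile
    # The upper percentile of Z* gives the lower bound, and vice versa.
    return theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat
```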

Page 24:

Bootstrap-t interval vs. Normal/t interval (Cont.)

• The bootstrap-t interval assumes no distribution, but
  – It can give erratic results
  – It can be heavily influenced by a few outlying data points

• When B is large, the bootstrap sample scores are quite close to a normal distribution

• Assuming a normal distribution gives more reliable intervals; e.g., for the BLEU relative confidence interval (B = 500):
  – STDEV = 0.27 for the bootstrap-t interval
  – STDEV = 0.14 for the normal/Student's t interval

[Figure: histogram of 2,000 bootstrap BLEU scores (frequency vs. BLEU score), roughly bell-shaped.]

Page 25:

The Number of Bootstrap Replications B

• The ideal bootstrap estimate of the confidence interval takes B → ∞

• Computational time increases linearly with B

• The greater B is, the smaller the standard deviation of the estimated confidence intervals; e.g., for BLEU's relative confidence interval:
  – STDEV = 0.60 when B = 100; STDEV = 0.27 when B = 500

• Two rules of thumb:
  – Even a small B, say B = 100, is usually informative
  – B > 1000 gives quite satisfactory results

Page 26:

Conclusions

• Used the bootstrapping method to measure confidence intervals for MT evaluation metrics

• Used confidence intervals to study the characteristics of an MT evaluation metric
  – Correlation with human judgments
  – Sensitivity
  – Consistency

• Modified BLEU is a better metric than BLEU

• A single reference from multiple translators is as good as parallel multiple references and costs only half the effort

Page 27:

References

• Efron, B. and R. Tibshirani: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1, pp. 54-77.

• Och, F. J.: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', in Proc. of ACL 2003, Sapporo, Japan.

• Bisani, M. and H. Ney: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', in Proc. of ICASSP 2004, Montreal, Canada, Vol. 1, pp. 409-412.

• Leusch, G., N. Ueffing and H. Ney: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', in Proc. of the 9th MT Summit, New Orleans, LA.

• Melamed, I. D., R. Green and J. P. Turian: 2003, 'Precision and Recall of Machine Translation', in Proc. of NAACL/HLT 2003, Edmonton, Canada.

• King, M., A. Popescu-Belis and E. Hovy: 2003, 'FEMTI: Creating and Using a Framework for MT Evaluation', in Proc. of the 9th MT Summit, New Orleans, LA, USA.

• Nießen, S., F. J. Och, G. Leusch and H. Ney: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', in Proc. of LREC 2000, Athens, Greece.

• NIST Report: 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics', http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf

• Papineni, K., S. Roukos et al.: 2002, 'BLEU: A Method for Automatic Evaluation of Machine Translation', in Proc. of the 40th Annual Meeting of the ACL.

• Zhang, Y., S. Vogel and A. Waibel: 2004, 'Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?', in Proc. of LREC 2004, Lisbon, Portugal.

Page 28:

Questions and Comments?

Page 29:

N-gram Contributions to NIST Score