![Page 1: BLEU, Its Variants & Its Critics Arthur Chan Prepared for Advanced MT Seminar](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d375503460f94a10a9a/html5/thumbnails/1.jpg)
BLEU, Its Variants & Its Critics
Arthur Chan
Prepared for Advanced MT Seminar
This Talk
- Original BLEU scores (Papineni 2002)
  - Motivation
  - Procedure
- NIST: as a major BLEU variant
- Critics of BLEU
  - From alternate evaluation metrics
    - METEOR (Lavie 2004, Banerjee 2005)
  - From analysis of BLEU (Culy 2002)
- METEOR will be covered by Alon (next talk)
Bilingual Evaluation Understudy (BLEU)
Motivation of Automatic Evaluation in MT
- Human evaluations of MT weigh many aspects, such as
  - Adequacy
  - Fidelity
  - Fluency
- Human evaluation is expensive
- Human evaluation can take a long time
  - while systems need to change daily
- Good automatic evaluation could save human effort
BLEU – Why is it Important?
- Some reasons:
  - It was proposed by IBM
    - IBM has a long history of proposing evaluation standards
  - It was verified and improved by NIST
    - so a variant of it is used in evaluations
  - It is widely used
    - it appears everywhere in the MT literature after 2001
  - It is quite useful
    - it gives good feedback on the adequacy and fluency of translation results
- It is not perfect
  - It is a subject of criticism (the critics make some sense in this case)
  - It is a subject of extension
BLEU – Its Motivation
- Central idea:
  - “The closer a machine translation is to a professional human translation, the better it is.”
- Implication
  - An evaluation metric can itself be evaluated
  - If it correlates with human evaluation, it is a useful metric
- BLEU was proposed as an aid
  - as a quick substitute for humans when needed
BLEU – What is it? A Big Picture
- Requires multiple good reference translations
- Depends on modified n-gram precision (or co-occurrence)
  - Co-occurrence: an n-gram of the translated sentence matches an n-gram in any of the reference sentences
- Per-corpus n-gram co-occurrence is computed
- n can take several values, and the per-n precisions are combined by a weighted sum (in BLEU, a weighted sum of their logarithms, i.e. a weighted geometric mean)
- Brevity of the translation is penalized
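The way these pieces fit together can be sketched as follows. This is a minimal sketch, not the reference implementation: it assumes the per-corpus modified n-gram precisions have already been computed, uses uniform weights by default, and follows the combination rule of Papineni et al. (2002), where the score is the brevity penalty times the weighted geometric mean of the precisions. The function names `brevity_penalty` and `combine_bleu` are illustrative and do not come from the slides.

```python
import math


def brevity_penalty(cand_len, ref_len):
    """Return 1.0 if the candidate corpus is longer than the effective
    reference length, otherwise exp(1 - r/c)."""
    if cand_len == 0:
        return 0.0  # assumption: an empty candidate gets the worst score
    if cand_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / cand_len)


def combine_bleu(precisions, cand_len, ref_len, weights=None):
    """Combine modified n-gram precisions (one per n) into a single score.

    precisions: list of per-corpus modified precisions p_1 .. p_N
    cand_len:   total length c of the candidate corpus, in words
    ref_len:    effective reference length r, in words
    weights:    one weight per n; uniform (1/N) if omitted
    """
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    # log(0) is undefined, so a zero precision at any order gives a zero score.
    if any(p <= 0.0 for p in precisions):
        return 0.0
    # Weighted sum of log precisions = log of the weighted geometric mean.
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(cand_len, ref_len) * math.exp(log_mean)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the effect of the penalty.
    print(combine_bleu([0.8, 0.6, 0.4, 0.3], cand_len=100, ref_len=100))
    print(combine_bleu([0.8, 0.6, 0.4, 0.3], cand_len=80, ref_len=100))
```

With the same precisions, the shorter candidate corpus receives the lower score, which is the intended effect of penalizing brevity.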
BLEU – N-gram Precision: a Motivating Example
Candidate 1: It is a guide to action which ensures that the military always obey the commands the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.
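To make the comparison concrete, the sketch below counts how many n-grams of each candidate occur in any of the three references (plain, unclipped n-gram precision). The sentences are copied as they appear on this slide. The tokenizer (lowercase, letters only) and the helper names `tokenize`, `ngrams` and `plain_precision` are assumptions made for illustration.

```python
import re
from fractions import Fraction


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def plain_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that occur in at least one reference.
    Assumes the candidate contains at least one n-gram."""
    cand = ngrams(tokenize(candidate), n)
    seen = set()
    for ref in references:
        seen.update(ngrams(tokenize(ref), n))
    hits = sum(1 for gram in cand if gram in seen)
    return Fraction(hits, len(cand))


CANDIDATE_1 = ("It is a guide to action which ensures that the military "
               "always obey the commands the party.")
CANDIDATE_2 = ("It is to insure the troops forever hearing the activity "
               "guidebook that party direct.")
REFERENCES = [
    "It is a guide to action that ensures that the military will forever "
    "heed Party commands.",
    "It is the guiding principle which guarantees the military forces "
    "always being under the command of the Party.",
    "It is the practical guide for the army always to heed directions of "
    "the party.",
]

if __name__ == "__main__":
    for name, cand in (("Candidate 1", CANDIDATE_1), ("Candidate 2", CANDIDATE_2)):
        for n in (1, 2):
            print(name, f"{n}-gram precision:", plain_precision(cand, REFERENCES, n))
```

Candidate 1 shares far more words and phrases with the references than Candidate 2, so it receives the higher precision, in line with the intuition that it is the better translation.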
BLEU – Modified N-gram Precision
- Issue with plain n-gram precision
  - It gives a very good score to over-generated n-grams
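The over-generation problem and its fix can be illustrated with the well-known example from Papineni et al. (2002), which is not taken from this slide: a candidate that only repeats the word "the". In the sketch below, each candidate word count is clipped to the largest count of that word in any single reference before dividing by the candidate length. The function name `modified_unigram_precision` is illustrative.

```python
from collections import Counter
from fractions import Fraction


def modified_unigram_precision(candidate_tokens, reference_token_lists):
    """Clipped unigram precision: each candidate word is credited at most
    as many times as it occurs in the reference that contains it most often."""
    cand_counts = Counter(candidate_tokens)
    max_ref_counts = Counter()
    for ref in reference_token_lists:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    return Fraction(clipped, sum(cand_counts.values()))


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

if __name__ == "__main__":
    # Every candidate word occurs in a reference, so plain precision is 7/7.
    vocab = {word for ref in references for word in ref}
    plain = Fraction(sum(1 for w in candidate if w in vocab), len(candidate))
    print("plain precision:   ", plain)
    # "the" occurs at most twice in a single reference, so only 2 of 7 count.
    print("modified precision:", modified_unigram_precision(candidate, references))
```

Clipping removes the reward for repeating a matching word more often than any reference uses it.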
BLEU – Brevity Penalty
BLEU – The “Trouble” with Recall
BLEU – Recall and Brevity Penalty
BLEU – Paradigm of Evaluation
BLEU – Evaluation of the Metric
BLEU – The Human Evaluation
BLEU – BLEU vs Human Evaluation
NIST – As a BLEU Variant
Usage of BLEU on Character-based Language
Critics of BLEU – From Analysis of BLEU
Critics of BLEU – A Glance at Metrics Beyond BLEU
Critics of BLEU – Summary of BLEU’s Issues
Discussion - Should BLEU be the Standard Metric of MT?
References
Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL-02. 2002
George Doddington, Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics.
Etienne Denoual, Yves Lepage, BLEU in Characters: Towards Automatic MT Evaluation in Languages without Word Delimiters.
Alon Lavie, Kenji Sagae, Shyamsundar Jayaraman, The Significance of Recall in Automatic Metrics for MT Evaluation.
Christopher Culy, Susanne Z. Riehemann, The Limits of N-Gram Translation Evaluation Metrics.
Satanjeev Banerjee, Alon Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.