Edinburgh MT lecture 8: Measuring translation accuracy
![Page 1: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/1.jpg)
Measuring translation accuracy
![Page 2: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/2.jpg)
More has been written about machine translation
evaluation than about machine translation itself.
Yorick Wilks
![Page 3: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/3.jpg)
• Why evaluate?
• Rank systems.
• Evaluate incremental changes.
• Assess new ideas empirically.
• Metric must be objective!
![Page 4: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/4.jpg)
![Page 5: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/5.jpg)
© 2010 IBM Corporation, IBM Research
What It Takes to Compete against Top Human Jeopardy! Players
Our Analysis Reveals the Winner’s Cloud
[Figure: each dot is an actual historical human Jeopardy! game; labeled regions: Winning Human Performance, Grand Champion Human Performance, 2007 QA Computer System; axis from More Confident to Less Confident]
Computers? Not So Good.
![Page 6: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/6.jpg)
DeepQA: Incremental Progress in Answering Precision on the Jeopardy Challenge: 6/2007-11/2010
IBM Watson Playing in the Winners Cloud
[Figure: one curve per version: Baseline 12/06, v0.1 12/07, v0.2 05/08, v0.3 08/08, v0.4 12/08, v0.5 05/09, v0.6 10/09, v0.7 04/10, v0.8 11/10]
![Page 7: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/7.jpg)
Claim: evaluation by humans is good and ???
![Page 8: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/8.jpg)
Chinese people in the traditional Spring Festival is approaching, the CPC Central Committee this afternoon in Zhongnanhai on the 22nd non-Party
personages to convene a forum in Spring Festival, invited the central committees of democratic parties, the leadership of the National Federation
of Industry and Commerce and personages without party affiliation on behalf of comrades gathered together State yes, talked in length about the
friendship, to greet the Chinese New Year. CPC Central Committee General Secretary and State President and Central Military Commission Chairman
Hu Jintao on behalf of the CPC Central Committee, the State Council, to the central committees of democratic parties, leaders of the National Federation
of Industry and Commerce and personages without party affiliation, to members of the united front, to extend my New Year's blessing.
![Page 9: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/9.jpg)
Design of the WMT Evaluation (2008-2011)
Systems: system A, system B, system C, system D, system E, system F, system G
Ranking:
1. system C
2. system D
3. system A
4. system B
5. system G
6. system F
7. system E
➡ Costly: 361 hours of human effort in 2011.
![Page 10: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/10.jpg)
Design of the WMT Evaluation (2008-2011)
Are you sure this is the correct ranking?
• In the example above (7 systems), there are 7! = 5040 possible rankings.
• With 10 systems: over 3.6 million possible rankings.
• With 20 systems: about 2.4 quintillion possible rankings.
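The counts on this slide are factorials: n systems can be ordered in n! ways when ties are not allowed. A minimal sketch to reproduce them (the function name `num_rankings` is only illustrative):

```python
import math

def num_rankings(n_systems: int) -> int:
    """Number of distinct total orderings (no ties) of n_systems systems."""
    return math.factorial(n_systems)

for n in (7, 10, 20):
    print(f"{n} systems: {num_rankings(n):,} possible rankings")
```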
![Page 11: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/11.jpg)
Design of the WMT Evaluation (2008-2011)
Systems: system A, system B, system C, system D, system E, system F, system G, plus the reference
While (evaluation period is not over):
➡ Sample input sentence.
➡ Sample five translators of it from Systems ∪ {Reference}.
➡ Sample an assessor.
➡ Receive (partial) ranking of translations from assessor.
Example ranking:
1. reference
2. system C
3. system A, system F
4. system D
WMT raw data: pairwise rankings (> means ranked better, = means tied)
reference > system A
reference > system C
reference > system D
reference > system F
system A < system C
system A > system D
system A = system F
system C > system D
system C > system F
system D < system F
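One way to picture how a single partial ranking from an assessor turns into pairwise data is the sketch below. The tier-list representation and the function name `pairwise_from_ranking` are assumptions made for illustration, not the WMT tooling.

```python
from itertools import combinations

def pairwise_from_ranking(tiers):
    """Expand a partial ranking into pairwise outcomes.

    tiers: list of lists, best tier first; names in the same tier are tied.
    Returns (a, b, outcome) triples, where outcome is '>' if a was ranked
    better than b, '<' if worse, and '=' if the two were tied.
    """
    tier_of = {name: t for t, names in enumerate(tiers) for name in names}
    outcomes = []
    for a, b in combinations(sorted(tier_of), 2):
        if tier_of[a] < tier_of[b]:
            outcomes.append((a, b, ">"))
        elif tier_of[a] > tier_of[b]:
            outcomes.append((a, b, "<"))
        else:
            outcomes.append((a, b, "="))
    return outcomes

# The example ranking from the slide: A and F are tied in third place.
ranking = [["reference"], ["system C"], ["system A", "system F"], ["system D"]]
for a, b, outcome in pairwise_from_ranking(ranking):
    print(a, outcome, b)
```

Five ranked translations always yield ten pairwise comparisons.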
![Page 12: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/12.jpg)
Tournaments
[Figure: tournament graph on system A, system B, system C, system D]
• Directed edge between every pair of vertices.
• Edge from A to B if A beats B in pairwise comparison.
• Widely used to model: sports, web results, elections.
• Used to model all WMT '10-'11 rankings (25 tasks).
Landau, 1951. On dominance relations and the structure of animal societies.
![Page 13: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/13.jpg)
Tournaments
[Figure: tournament graph on system A, system B, system C, system D]
If tournament is acyclic: topological sort
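A minimal sketch of the acyclic case, using Kahn's algorithm for the topological sort. The `beats` dictionary and the four-system example are hypothetical; the edges of the slide's figure are not recoverable from the transcript.

```python
from collections import deque

def topological_ranking(beats):
    """Rank systems best-first from an acyclic tournament.

    beats: dict mapping every system to the set of systems it beats.
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {s: 0 for s in beats}
    for losers in beats.values():
        for loser in losers:
            indegree[loser] += 1
    # Systems that nobody beats can be placed first.
    queue = deque(s for s, d in indegree.items() if d == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for loser in beats[s]:
            indegree[loser] -= 1
            if indegree[loser] == 0:
                queue.append(loser)
    if len(order) != len(beats):
        raise ValueError("tournament contains a cycle")
    return order

# Hypothetical acyclic tournament: C beats everyone, D loses to everyone.
beats = {"A": {"B", "D"}, "B": {"D"}, "C": {"A", "B", "D"}, "D": set()}
print(topological_ranking(beats))
```

In an acyclic tournament the resulting order is unique, because every pair of systems is connected by an edge.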
![Page 14: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/14.jpg)
![Page 15: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/15.jpg)
![Page 16: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/16.jpg)
Tournaments
system A
system B
system C
system D
If tournament is acyclic: topological sort
![Page 17: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/17.jpg)
![Page 18: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/18.jpg)
Tournaments
[Figure: tournament graph on system A, system B, system C, system D with edge weights 1, 2, 3, 1, 1, 2]
What if tournament contains cycles?
16 out of 25 tasks in WMT ’10-’11 contain cycles!
![Page 19: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/19.jpg)
Tournaments
What if tournament contains cycles?
One solution: Reverse a set of edges such that:
(a) Resulting graph is acyclic.
(b) Sum of reversed edge weights is minimized.
![Page 20: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/20.jpg)
![Page 21: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/21.jpg)
Tournaments
What if tournament contains cycles?
Set of reversed edges = minimum feedback arc set (MFAS).
In theory, this optimization is NP-hard (Karp, 1972).
In practice, it’s not too hard.
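For a handful of systems the minimum feedback arc set can be found by brute force: try every ordering and keep the one whose backward edges have the smallest total weight. The sketch below does exactly that. It is not the solver used for the WMT analysis (the slides do not say which one was used), it only scales to a few systems, and the edge weights are made up for illustration.

```python
from itertools import permutations

def mfas_ranking(weights):
    """Brute-force minimum feedback arc set ranking.

    weights: dict {(winner, loser): weight} describing the directed edges.
    Returns (order, cost): the best-first order that minimizes the total
    weight of edges that must be reversed, and that total weight.
    """
    systems = sorted({s for edge in weights for s in edge})
    best_order, best_cost = None, float("inf")
    for order in permutations(systems):
        pos = {s: i for i, s in enumerate(order)}
        # An edge is violated when its winner is placed after its loser.
        cost = sum(w for (win, lose), w in weights.items() if pos[win] > pos[lose])
        if cost < best_cost:
            best_order, best_cost = list(order), cost
    return best_order, best_cost

# Hypothetical weighted tournament with the cycle A -> B -> C -> A.
weights = {
    ("A", "B"): 1, ("B", "C"): 2, ("C", "A"): 3,
    ("A", "D"): 1, ("B", "D"): 1, ("C", "D"): 2,
}
print(mfas_ranking(weights))
```

Here the cheapest way to break the cycle is to reverse the weight-1 edge from A to B, which gives the order B, C, A, D.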
![Page 22: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/22.jpg)
Tournaments
What if tournament contains cycles?
Important detail: What should the weight be?
Following analysis uses #(wins - losses).
Dumb, but counts each observation equally.
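A sketch of the #(wins - losses) weighting, computed from a list of pairwise judgments. The input format (one `(winner, loser)` tuple per observation, ties left out) and the choice to drop pairs with equal wins are assumptions for illustration.

```python
from collections import Counter

def tournament_weights(judgments):
    """Build weighted tournament edges from pairwise judgments.

    judgments: iterable of (winner, loser) tuples, one per observation.
    Returns {(a, b): wins(a over b) - wins(b over a)} for every pair
    in which a has strictly more wins than b.
    """
    wins = Counter(judgments)
    weights = {}
    for (a, b), n_ab in wins.items():
        margin = n_ab - wins.get((b, a), 0)
        if margin > 0:
            weights[(a, b)] = margin
    return weights

# Hypothetical observations: A beat B twice, B beat A once, C beat A once.
judgments = [("A", "B"), ("A", "B"), ("B", "A"), ("C", "A")]
print(tournament_weights(judgments))
```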
![Page 23: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/23.jpg)
Example: French-English 2010
[Figure: "Task Rankings" and "MFAS" orderings of the systems onlineB, rwth-combo, cmu-hyposel-combo, cambridge, lium, dcu-combo, cmu-heafield-combo, upv-combo, nrc, uedin, jhu, limsi, jhu-combo, lium-combo, rali, lig, bbn-combo, rwth, cmu-statxfer, onlineA, huicong, dfki, cu-zeman, geneva]
![Page 24: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/24.jpg)
Has WMT solved these problems?
Human evaluation is too slow and expensive!
➡ With crowdsourcing, WMT has made a good dent in this problem.
Human evaluation isn’t reproducible!
➡ Empirically true in the WMT data.
![Page 25: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/25.jpg)
Human evaluation is fast and cheap!
![Page 26: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/26.jpg)
美国愿和北韩谈判但拒绝再付出报酬
US willing to negotiate with North Korea but not to pay more compensation.
The United States is willing to hold talks with North Korea but refused to pay
remuneration.
![Page 27: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/27.jpg)
“Progress” postponed because of mechanical hand into the sky.
Launch of “Endeavour” delayed by robotic arm problems.
“奋进”号因机械手故障推迟到升空
![Page 28: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/28.jpg)
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
![Page 29: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/29.jpg)
![Page 30: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/30.jpg)
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
Edit distance = 16: 3 substitutions, 8 deletions, 5 insertions
![Page 31: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/31.jpg)
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
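Edit distance = 16: 3 substitutions, 8 deletions, 5 insertions
$$
\mathrm{ed}(i, j) = \min \begin{cases}
\mathrm{ed}(i-1, j) + \mathrm{del}(w_i) \\
\mathrm{ed}(i, j-1) + \mathrm{ins}(w'_j) \\
\mathrm{ed}(i-1, j-1) + \mathrm{sub}(w_i, w'_j)
\end{cases}
$$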
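The recurrence can be filled in as a dynamic-programming table. The sketch below assumes unit costs for del and ins, and a sub cost of 0 when the two words are equal and 1 otherwise; the slide leaves the cost functions open, so a different choice would give different totals.

```python
def edit_distance(hyp, ref):
    """Word-level edit distance between two token sequences (unit costs)."""
    m, n = len(hyp), len(ref)
    # ed[i][j] = cost of turning the first i hyp tokens into the first j ref tokens.
    ed = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        ed[i][0] = i  # delete every hyp token
    for j in range(1, n + 1):
        ed[0][j] = j  # insert every ref token
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            ed[i][j] = min(
                ed[i - 1][j] + 1,        # del(w_i)
                ed[i][j - 1] + 1,        # ins(w'_j)
                ed[i - 1][j - 1] + sub,  # sub(w_i, w'_j)
            )
    return ed[m][n]

print(edit_distance("the sky was clear".split(), "the sky remained very clear".split()))
```

The table has (m+1)(n+1) cells and each is filled in constant time, so the cost is quadratic in sentence length.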
![Page 32: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/32.jpg)
![Page 33: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/33.jpg)
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
Precision: 7/15 tokens = 47%
Recall: 7/12 tokens = 58%
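These two numbers can be reproduced by counting matched tokens, where each reference token can be matched at most once. A minimal sketch, assuming whitespace tokenization and case-sensitive matching (the function name is illustrative):

```python
from collections import Counter

def token_precision_recall(hyp, ref):
    """Token-level precision and recall of hyp against a single reference."""
    # Counter intersection keeps the minimum count of each shared token.
    matched = sum((Counter(hyp) & Counter(ref)).values())
    return matched / len(hyp), matched / len(ref)

hyp = "Although the northern wind shrieked across the sky , it was still very clear .".split()
ref = "However , the sky remained clear under the strong north wind .".split()
precision, recall = token_precision_recall(hyp, ref)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")  # precision = 47%, recall = 58%
```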
![Page 34: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/34.jpg)
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens
![Page 35: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/35.jpg)
sky very northern shrieked clear wind Although across the the , still was it .
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens
![Page 36: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/36.jpg)
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens, 4/14 bigrams, 1/13 trigrams
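With several references, an n-gram in the output counts as matched if it appears in any reference, and its count is clipped to the largest count seen in a single reference. A sketch under the same assumptions as before (whitespace tokens, case-sensitive matching; function names are illustrative):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, refs, n):
    """Return (matched, total) clipped n-gram counts for hyp against refs."""
    hyp_counts = ngrams(hyp, n)
    max_ref = Counter()
    for ref in refs:
        max_ref |= ngrams(ref, n)  # element-wise maximum over references
    matched = sum((hyp_counts & max_ref).values())
    return matched, sum(hyp_counts.values())

hyp = "Although the northern wind shrieked across the sky , it was still very clear .".split()
refs = [s.split() for s in (
    "However , the sky remained clear under the strong north wind .",
    "Despite the strong northerly winds , the sky remains very clear .",
    "The sky was still crystal clear , though the north wind was howling .",
    "Although a north wind was howling , the sky remained clear and blue .",
)]
for n in (1, 2, 3):
    matched, total = modified_precision(hyp, refs, n)
    print(f"{n}-grams: {matched}/{total}")
```

Clipping is what keeps a degenerate output that repeats a common word such as "the" from scoring full token precision.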
![Page 37: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/37.jpg)
sky very northern shrieked clear wind Although across the the , still was it .
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens, 0/14 bigrams, 0/13 trigrams
![Page 38: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/38.jpg)
very clear .
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 3/3 tokens, 2/2 bigrams, 1/1 trigrams
![Page 39: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/39.jpg)
very clear . shrieked was still Although wind , across it northern the the sky
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens, 4/14 bigrams, 1/13 trigrams
![Page 40: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/40.jpg)
a north . the was and was the the the though the , the sky
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens, 4/14 bigrams, 1/13 trigrams
![Page 41: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/41.jpg)
BLEU
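$$
\mathrm{BP} = \begin{cases}
1 & \text{if } c > r \\
e^{1 - r/c} & \text{if } c \le r
\end{cases}
$$
$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
$$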
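A sketch of the formula for a single sentence. Several choices here are assumptions rather than something the slide specifies: uniform weights w_n = 1/N, r taken as the reference length closest to the candidate length c, no smoothing (the score is 0 as soon as one p_n is 0), and scoring one sentence instead of a whole test set.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, refs, n):
    """Return (matched, total) clipped n-gram counts for hyp against refs."""
    hyp_counts = ngrams(hyp, n)
    max_ref = Counter()
    for ref in refs:
        max_ref |= ngrams(ref, n)
    return sum((hyp_counts & max_ref).values()), sum(hyp_counts.values())

def bleu(hyp, refs, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    c = len(hyp)
    if c == 0:
        return 0.0
    # r: reference length closest to c (the shorter one on ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matched, total = modified_precision(hyp, refs, n)
        if matched == 0:
            return 0.0  # log p_n is undefined; unsmoothed BLEU is 0
        log_sum += (1 / max_n) * math.log(matched / total)
    return bp * math.exp(log_sum)

hyp = "Although the northern wind shrieked across the sky , it was still very clear .".split()
refs = [s.split() for s in (
    "However , the sky remained clear under the strong north wind .",
    "Despite the strong northerly winds , the sky remains very clear .",
    "The sky was still crystal clear , though the north wind was howling .",
    "Although a north wind was howling , the sky remained clear and blue .",
)]
print(bleu(hyp, refs, max_n=3), bleu(hyp, refs, max_n=4))
```

The example output shares no 4-gram with any reference, so with N = 4 the unsmoothed score is 0. This is one reason BLEU is normally computed over a whole test set.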
![Page 42: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/42.jpg)
Details matter
[Figure: brevity penalty (BP) plotted against length]
![Page 43: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/43.jpg)
Details matter
![Page 44: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/44.jpg)
Influence of BLEU
![Page 45: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/45.jpg)
BLEU-1, BLEU-4, BLEU-v11b, BLEU-v12, METEOR-v0.6, NIST-v11b, TER-v0.7.25, 4-GRR, ATEC1, Amber, ATEC3, ATEC4, Meteor-v0.7, TerrorCat, BEwT-E, Badger, BadgerLite, Bleu-sbp, BleuSP, CDer, DP-Or, DP-Orp, DR-Or, EDPM, LET, METEOR-ranking, MaxSim, RTE, Rose, SEPIA1, SEPIA2, SNR, SR-Or, SVM-Rank, TERp, ULCh, ULCopt, invWer, mBLEU, mTER
![Page 46: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/46.jpg)
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
TER: Translation (Error|Edit) Rate
![Page 47: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/47.jpg)
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
TER: Translation (Error|Edit) Rate
Basically edit distance with swaps
![Page 48: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/48.jpg)
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
TER: Translation (Error|Edit) Rate
Basically edit distance with swaps
How hard is it to compute this?
![Page 49: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/49.jpg)
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
TER: Translation (Error|Edit) Rate
Basically edit distance with swaps
How hard is it to compute this?
$$
\mathrm{ter}(i, j) = \min \begin{cases}
\mathrm{ter}(i-1, j) + \mathrm{del}(w_i) \\
\mathrm{ter}(i, j-1) + \mathrm{ins}(w'_j) \\
\mathrm{ter}(i-1, j-1) + \mathrm{sub}(w_i, w'_j) \\
\max_k \mathrm{ter}(i-1, [1, \ldots, k-1, k+1, \ldots, j]) + 1
\end{cases}
$$
![Page 50: Edinburgh MT lecture 8: Measuring translation accuracy](https://reader033.vdocuments.us/reader033/viewer/2022052700/55b3bf20bb61ebd5088b4744/html5/thumbnails/50.jpg)
Automatic evaluation is fast(er) and cheap(er).