wmt14_sakaguchi
TRANSCRIPT
- 1. Efficient Elicitation of Annotations for Human Evaluation of Machine Translation. Keisuke Sakaguchi, Matt Post, Benjamin Van Durme. ACL 2014, Ninth Workshop on Statistical Machine Translation (WMT).
- 2. WMT Competition: 5-way ranking.
- 3. WMT Competition: 5-way ranking → pairwise comparisons.
- 4. WMT Competition: 5-way ranking → pairwise comparisons (example ranking: #1 System A, System E; #3 System B; ...; #12 System J, System C).
- 5. Problem: needs lots of data (~94k judgments) to get good clusters.
- 6. Problem: needs lots of data (~94k judgments) to get good clusters. TrueSkill lets us do it with about 1/3 of the data.
- 7. Models: 1. Expected Wins, 2. Hopkins and May, 3. TrueSkill.
- 8. Models: Expected Wins, Hopkins and May, TrueSkill. TrueSkill can 1. rank and cluster with much less data, 2. predict pairwise comparisons with higher accuracy, and 3. be learned by online learning.
- 9. Existing Models
- 10. Expected Wins: the average relative frequency of wins; systems are ranked by this score, and ties are ignored.
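The Expected Wins score on this slide can be sketched in a few lines of Python. This is a minimal illustration: the function name and toy data are invented here, and the actual WMT metric is computed per opponent rather than by pooling all of a system's games.

```python
from collections import defaultdict

def expected_wins(judgments):
    """Score each system by its relative frequency of wins.
    judgments: list of (winner, loser, is_tie) comparisons."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser, is_tie in judgments:
        if is_tie:
            continue  # ties are ignored, as on the slide
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {s: wins[s] / games[s] for s in games}

data = [("A", "B", False), ("A", "C", False), ("B", "C", False), ("A", "B", True)]
scores = expected_wins(data)  # systems are then ranked by this score
```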
- 11. Hopkins and May: Overview. Example source: "The cat sat on the couch." Annotators rank candidate translations; ranks reflect relative translation quality.
- 12. Hopkins and May: Inference. Each system is modeled as N(μ, σ²): μ is the relative ability (inferred), while the variance σ² is fixed.
- 13. TrueSkill
- 14. TrueSkill (Herbrich et al. 2006): leaderboard example. #1 Player A, 1,000,000; #2 Player B, 900,000; #3 Player C, 850,000; #4 Player D, 800,000; #5 Player E, 790,000.
- 15. TrueSkill (Herbrich et al. 2006): each system is modeled as N(μ, σ²), where μ is the system ability and σ is the uncertainty of μ.
- 16. TrueSkill (Herbrich et al. 2006): both μ and σ are inferred.
- 17. How to update? Systems S1 and S2, each with a Gaussian skill belief.
- 18. How to update? Observation: S1 wins against S2 → shift the two means apart and reduce the variances.
- 19. How much do we update μ and σ? Δμ = σ² · surprisal; σ²_new = σ² · (1 − σ² · surprisal).
- 20. Compute surprisals (for μ): plots of the surprisal as a function of t = μ_S1 − μ_S2, for the two cases "S1 wins S2" and "S1 ties S2".
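The update described on slides 17 to 20 can be sketched as a win-case TrueSkill update in Python. This is a sketch under assumptions: the performance noise `beta` and draw margin `eps` are placeholder defaults, not the paper's tuned values, and the tie case (which uses its own surprisal functions) is omitted.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def v_win(t, eps):
    """Surprisal term driving the mean update after a win."""
    return phi(t - eps) / Phi(t - eps)

def w_win(t, eps):
    """Surprisal term driving the variance update after a win."""
    v = v_win(t, eps)
    return v * (v + t - eps)

def update_win(mu1, sg1, mu2, sg2, beta=1.0, eps=0.1):
    """One online update after observing S1 beats S2."""
    c = math.sqrt(2.0 * beta ** 2 + sg1 ** 2 + sg2 ** 2)
    t = (mu1 - mu2) / c
    mu1 += (sg1 ** 2 / c) * v_win(t, eps / c)   # winner's mean rises
    mu2 -= (sg2 ** 2 / c) * v_win(t, eps / c)   # loser's mean falls
    w = w_win(t, eps / c)
    sg1 *= math.sqrt(max(1.0 - (sg1 ** 2 / c ** 2) * w, 1e-9))  # uncertainty shrinks
    sg2 *= math.sqrt(max(1.0 - (sg2 ** 2 / c ** 2) * w, 1e-9))
    return mu1, sg1, mu2, sg2

mu1, sg1, mu2, sg2 = update_win(0.0, 1.0, 0.0, 1.0)
```

Starting from identical beliefs, a single observed win shifts the two means apart and reduces both uncertainties, exactly the "shift and reduce" behavior on slide 18.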
- 21. Data Collection
- 22. Data Collection in Batch Models: judgments are collected uniformly over all system pairs (inefficient). (Heatmap of judgment counts for systems 1–12.)
- 23. Data Collection in TrueSkill: match quality. Good matches pair players with similar scores (e.g., Players S, T, U at #1001, #1002, #1003 with 300,000, 299,900, 299,800); bad matches pair distant ones (e.g., #1 Player A, 1,000,000 vs. a player in the #1001 group).
- 24. Match Selection
- 25. Match Selection: select S1, the system with the highest σ (most uncertain).
- 26. Match Selection: compute the match (draw) probability of S1 against each candidate opponent, with normalizing (e.g., p_draw = 0.7 vs. p_draw = 0.3).
- 27. Match Selection: draw an opponent S2 according to these probabilities.
- 28. Match Selection: update μ and σ from the observed judgment of S1 vs. S2.
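The four selection steps above can be sketched as one routine. `match_quality` is the standard TrueSkill draw-quality formula; using it as the normalized sampling weight, and the function and system names, are illustrative assumptions consistent with the slides.

```python
import math
import random

def match_quality(mu1, sg1, mu2, sg2, beta=1.0):
    """TrueSkill draw probability (match quality): near 1 for even matches."""
    c2 = 2.0 * beta ** 2 + sg1 ** 2 + sg2 ** 2
    return math.sqrt(2.0 * beta ** 2 / c2) * math.exp(-((mu1 - mu2) ** 2) / (2.0 * c2))

def select_match(systems, beta=1.0, rng=random):
    """systems: {name: (mu, sigma)}. Pick S1 = highest sigma, score every
    opponent by match quality, normalize, then sample S2."""
    s1 = max(systems, key=lambda n: systems[n][1])
    mu1, sg1 = systems[s1]
    others = [n for n in systems if n != s1]
    weights = [match_quality(mu1, sg1, systems[n][0], systems[n][1], beta)
               for n in others]
    total = sum(weights)
    probs = [w / total for w in weights]  # normalized draw probabilities
    s2 = rng.choices(others, weights=probs)[0]
    return s1, s2
```

Sampling S2 (rather than always taking the best match) keeps some exploration while still favoring competitive pairs.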
- 29. Experiment
- 30. Experiment: Setting. WMT13 dataset (10 language pairs); annotation by researchers; training sizes: 400, 800, 1600, 3200, 6400; test size: 2,000; evaluation: perplexity and accuracy.
- 31. Probability of a pairwise judgment: p(>, =, < | S1, S2) is computed from the Gaussians N_S1 and N_S2 using a decision radius r; r is tuned on a held-out development set.
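A sketch of this judgment probability under the stated model: the difference of two Gaussian quality samples is itself Gaussian, and the decision radius carves it into win/tie/loss regions. The default `r=0.5` here is an arbitrary placeholder, not the tuned value from the development set.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def judgment_probs(mu1, sg1, mu2, sg2, r=0.5):
    """P(S1 wins / ties / loses): the sampled quality difference
    d ~ N(mu1 - mu2, sg1^2 + sg2^2) is compared against radius r."""
    m = mu1 - mu2
    sd = math.sqrt(sg1 ** 2 + sg2 ** 2)
    p_lose = Phi((-r - m) / sd)       # d < -r : annotator prefers S2
    p_win = 1.0 - Phi((r - m) / sd)   # d >  r : annotator prefers S1
    p_tie = 1.0 - p_win - p_lose      # |d| <= r : judged a tie
    return p_win, p_tie, p_lose

p_win, p_tie, p_lose = judgment_probs(0.0, 1.0, 0.0, 1.0)
```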
- 32. Experimental Result: Perplexity (training sizes 1000–6000). Hopkins & May showed lower perplexity than TrueSkill.
- 33. Experimental Result: Accuracy (training sizes 1000–6000). TrueSkill > Hopkins & May > Expected Wins; TrueSkill reaches higher accuracy with small data.
- 34. Further Analysis
- 35. Efficient Data Collection: in the first 20% of training, matches are uniformly distributed; in the last 20%, they concentrate on the diagonal, i.e., competitive matches. TrueSkill makes efficient use of judgments.
- 36. Clustering: Experimental Setting. Bootstrap resampling; rank ranges from the 95% confidence band; systems are clustered if their rank ranges overlap; vary the training size and count the resulting clusters.
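The clustering procedure on this slide can be sketched as below. Ranking each bootstrap sample by raw win counts is a simplification standing in for refitting the full model on every resample, and all names are illustrative.

```python
import random

def rank_once(judgments, systems, rng):
    """Resample judgments with replacement and rank systems by win count
    (a stand-in for refitting the full model on each bootstrap sample)."""
    sample = [rng.choice(judgments) for _ in judgments]
    wins = {s: 0 for s in systems}
    for winner, _ in sample:
        wins[winner] += 1
    order = sorted(systems, key=lambda s: -wins[s])
    return {s: i + 1 for i, s in enumerate(order)}

def cluster(judgments, systems, n_boot=1000, seed=0):
    """Cluster systems whose 95% rank ranges overlap (transitively)."""
    rng = random.Random(seed)
    ranks = {s: [] for s in systems}
    for _ in range(n_boot):
        for s, rk in rank_once(judgments, systems, rng).items():
            ranks[s].append(rk)
    lo_hi = {}
    for s, rs in ranks.items():
        rs.sort()
        lo_hi[s] = (rs[int(0.025 * n_boot)], rs[int(0.975 * n_boot) - 1])
    # merge systems into one cluster whenever their rank ranges overlap
    ordered = sorted(systems, key=lambda s: lo_hi[s][0])
    clusters, cur, cur_hi = [], [ordered[0]], lo_hi[ordered[0]][1]
    for s in ordered[1:]:
        lo, hi = lo_hi[s]
        if lo <= cur_hi:
            cur.append(s)
            cur_hi = max(cur_hi, hi)
        else:
            clusters.append(cur)
            cur, cur_hi = [s], hi
    clusters.append(cur)
    return clusters
```

With clearly separated systems the rank ranges do not overlap, so each system lands in its own cluster; noisier data widens the bands and merges systems together.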
- 37. Clustering: Result. Clustering efficiency (more clusters, lower variance): TrueSkill > Expected Wins > Hopkins & May.
- 38. Summary. TrueSkill is able to 1. rank and cluster with less data, 2. predict pairwise comparisons with higher accuracy, and 3. be learned by online learning. Code is available at http://github.com/keisks/wmt-trueskill
- 39. Future Directions: sentence-level quality estimation; parameterizing translation difficulty (cf. Item Response Theory, 2PL). Thanks to Mark Hopkins for his comments.
- 41. Auxiliary Slides
- 42. Model comparison at a glance:
    Learning: Exp. Wins = Batch; Hopkins & May = Batch + iteration; TrueSkill = Online
    Ties allowed: Exp. Wins = No; Hopkins & May = Yes; TrueSkill = Yes
    Variance: Exp. Wins = None; Hopkins & May = Fixed; TrueSkill = Learned
- 43. Update comparison: TrueSkill (online, updating after each judgment) vs. Hopkins & May (batch, iterating over the full dataset).
- 44. How to update? Systems S1 and S2 with beliefs N(μ1, σ1²) and N(μ2, σ2²).
- 45. How to update? N(μ1, σ1²) and N(μ2, σ2²) model the quality of each system's translations.
- 46. How to update? Observation: S1 wins against S2 (the quality difference exceeds the decision radius).
- 47. How to update?
- 48. How to update? Observation: S1 draws with S2 (the quality difference falls within the decision radius).
- 49. How to update? Update μ and σ for each iteration.
- 50. How to update? Update μ and σ for each iteration.
- 51. Clustering: Result. Training with 1K vs. 25K judgments: clusters are generated with far less training data (cf. 80K in WMT13 fr-en), and the ordering stays the same (TrueSkill is stable and accurate).
- 54. Accuracies when training with N-way free-for-all models, fixing the number of matches.