wmt14_sakaguchi
TRANSCRIPT
- 1. Efficient Elicitation of Annotations for Human Evaluation of Machine Translation. Keisuke Sakaguchi, Matt Post, Benjamin Van Durme. ACL 2014, Ninth Workshop on Statistical Machine Translation (WMT).
- 2. WMT Competition: 5-way ranking.
- 3. WMT Competition: 5-way ranking → pairwise comparisons.
- 4. WMT Competition: 5-way ranking → pairwise comparisons (example ranking: #1 System A, System E; #3 System B; ...; #12 System J, System C).
- 5. Problem: needs lots of data (~94k judgments) to get good clusters.
- 6. Problem: needs lots of data (~94k judgments) to get good clusters. TrueSkill lets us do it with about 1/3 of the data.
- 7. Models: 1. Expected Wins, 2. Hopkins and May, 3. TrueSkill.
- 8. Models: Expected Wins, Hopkins and May, TrueSkill. TrueSkill can 1. rank and cluster with much less data, 2. predict pairwise comparisons with higher accuracy, and 3. be learned by online learning.
- 9. Existing Models
- 10. Expected Wins: the average relative frequency of wins; systems are ranked by this score, and ties are ignored.
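The Expected Wins score on this slide can be sketched in a few lines of Python. This is a minimal illustration: the function name and toy data are invented here, and the actual WMT metric is computed per opponent rather than by pooling all of a system's games.

```python
from collections import defaultdict

def expected_wins(judgments):
    """Score each system by its relative frequency of wins.
    judgments: list of (winner, loser, is_tie) comparisons."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser, is_tie in judgments:
        if is_tie:
            continue  # ties are ignored, as on the slide
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {s: wins[s] / games[s] for s in games}

data = [("A", "B", False), ("A", "C", False), ("B", "C", False), ("A", "B", True)]
scores = expected_wins(data)  # systems are then ranked by this score
```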
- 11. Hopkins and May: Overview. Example source: "The cat sat on the couch." Annotators rank candidate translations; ranks reflect relative translation quality.
- 12. Hopkins and May: Inference. Each system is modeled as N(μ, σ²): μ is the relative ability (inferred), while the variance σ² is fixed.
- 13. TrueSkill
- 14. TrueSkill (Herbrich et al. 2006): leaderboard example. #1 Player A, 1,000,000; #2 Player B, 900,000; #3 Player C, 850,000; #4 Player D, 800,000; #5 Player E, 790,000.
- 15. TrueSkill (Herbrich et al. 2006): each system is modeled as N(μ, σ²), where μ is the system ability and σ is the uncertainty of μ.
- 16. TrueSkill (Herbrich et al. 2006): both μ and σ are inferred.
- 17. How to update? Systems S1 and S2, each with a Gaussian skill belief.
- 18. How to update? Observation: S1 wins against S2 → shift the two means apart and reduce the variances.
- 19. How much do we update μ and σ? Δμ = σ² · surprisal; σ²_new = σ² · (1 − σ² · surprisal).
- 20. Compute surprisals (for μ): plots of the surprisal as a function of t = μ_S1 − μ_S2, for the two cases "S1 wins S2" and "S1 ties S2".
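The update described on slides 17 to 20 can be sketched as a win-case TrueSkill update in Python. This is a sketch under assumptions: the performance noise `beta` and draw margin `eps` are placeholder defaults, not the paper's tuned values, and the tie case (which uses its own surprisal functions) is omitted.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def v_win(t, eps):
    """Surprisal term driving the mean update after a win."""
    return phi(t - eps) / Phi(t - eps)

def w_win(t, eps):
    """Surprisal term driving the variance update after a win."""
    v = v_win(t, eps)
    return v * (v + t - eps)

def update_win(mu1, sg1, mu2, sg2, beta=1.0, eps=0.1):
    """One online update after observing S1 beats S2."""
    c = math.sqrt(2.0 * beta ** 2 + sg1 ** 2 + sg2 ** 2)
    t = (mu1 - mu2) / c
    mu1 += (sg1 ** 2 / c) * v_win(t, eps / c)   # winner's mean rises
    mu2 -= (sg2 ** 2 / c) * v_win(t, eps / c)   # loser's mean falls
    w = w_win(t, eps / c)
    sg1 *= math.sqrt(max(1.0 - (sg1 ** 2 / c ** 2) * w, 1e-9))  # uncertainty shrinks
    sg2 *= math.sqrt(max(1.0 - (sg2 ** 2 / c ** 2) * w, 1e-9))
    return mu1, sg1, mu2, sg2

mu1, sg1, mu2, sg2 = update_win(0.0, 1.0, 0.0, 1.0)
```

Starting from identical beliefs, a single observed win shifts the two means apart and reduces both uncertainties, exactly the "shift and reduce" behavior on slide 18.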
- 21. Data Collection
- 22. Data Collection in Batch Models: judgments are collected uniformly over all system pairs (inefficient). (Heatmap of judgment counts for systems 1–12.)
- 23. Data Collection in TrueSkill: match quality. Good matches pair players with similar scores (e.g., Players S, T, U at #1001, #1002, #1003 with 300,000, 299,900, 299,800); bad matches pair distant ones (e.g., #1 Player A, 1,000,000 vs. a player in the #1001 group).
- 24. Match Selection
- 25. Match Selection: select S1, the system with the highest σ (most uncertain).
- 26. Match Selection: compute the match (draw) probability of S1 against each candidate opponent, with normalizing (e.g., p_draw = 0.7 vs. p_draw = 0.3).
- 27. Match Selection: draw an opponent S2 according to these probabilities.
- 28. Match Selection: update μ and σ from the observed judgment of S1 vs. S2.
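The four selection steps above can be sketched as one routine. `match_quality` is the standard TrueSkill draw-quality formula; using it as the normalized sampling weight, and the function and system names, are illustrative assumptions consistent with the slides.

```python
import math
import random

def match_quality(mu1, sg1, mu2, sg2, beta=1.0):
    """TrueSkill draw probability (match quality): near 1 for even matches."""
    c2 = 2.0 * beta ** 2 + sg1 ** 2 + sg2 ** 2
    return math.sqrt(2.0 * beta ** 2 / c2) * math.exp(-((mu1 - mu2) ** 2) / (2.0 * c2))

def select_match(systems, beta=1.0, rng=random):
    """systems: {name: (mu, sigma)}. Pick S1 = highest sigma, score every
    opponent by match quality, normalize, then sample S2."""
    s1 = max(systems, key=lambda n: systems[n][1])
    mu1, sg1 = systems[s1]
    others = [n for n in systems if n != s1]
    weights = [match_quality(mu1, sg1, systems[n][0], systems[n][1], beta)
               for n in others]
    total = sum(weights)
    probs = [w / total for w in weights]  # normalized draw probabilities
    s2 = rng.choices(others, weights=probs)[0]
    return s1, s2
```

Sampling S2 (rather than always taking the best match) keeps some exploration while still favoring competitive pairs.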
- 29. Experiment
- 30. Experiment: Setting. WMT13 dataset (10 language pairs); annotation by researchers; training sizes: 400, 800, 1600, 3200, 6400; test size: 2,000; evaluation: perplexity and accuracy.
- 31. Probability of a pairwise judgment: p(>, =, < | S1, S2) is computed from the Gaussians N_S1 and N_S2 using a decision radius r; r is tuned on a held-out development set.
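A sketch of this judgment probability under the stated model: the difference of two Gaussian quality samples is itself Gaussian, and the decision radius carves it into win/tie/loss regions. The default `r=0.5` here is an arbitrary placeholder, not the tuned value from the development set.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def judgment_probs(mu1, sg1, mu2, sg2, r=0.5):
    """P(S1 wins / ties / loses): the sampled quality difference
    d ~ N(mu1 - mu2, sg1^2 + sg2^2) is compared against radius r."""
    m = mu1 - mu2
    sd = math.sqrt(sg1 ** 2 + sg2 ** 2)
    p_lose = Phi((-r - m) / sd)       # d < -r : annotator prefers S2
    p_win = 1.0 - Phi((r - m) / sd)   # d >  r : annotator prefers S1
    p_tie = 1.0 - p_win - p_lose      # |d| <= r : judged a tie
    return p_win, p_tie, p_lose

p_win, p_tie, p_lose = judgment_probs(0.0, 1.0, 0.0, 1.0)
```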
- 32. Experimental Result: Perplexity (training sizes 1000–6000). Hopkins & May showed lower perplexity than TrueSkill.
- 33. Experimental Result: Accuracy (training sizes 1000–6000). TrueSkill > Hopkins & May > Expected Wins; TrueSkill reaches higher accuracy with small data.
- 34. Further Analysis
- 35. Efficient Data Collection: in the first 20% of training, matches are uniformly distributed; in the last 20%, they concentrate on the diagonal, i.e., competitive matches. TrueSkill makes efficient use of judgments.
- 36. Clustering: Experimental Setting. Bootstrap resampling; rank ranges from the 95% confidence band; systems are clustered if their rank ranges overlap; vary the training size and count the resulting clusters.
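The clustering procedure on this slide can be sketched as below. Ranking each bootstrap sample by raw win counts is a simplification standing in for refitting the full model on every resample, and all names are illustrative.

```python
import random

def rank_once(judgments, systems, rng):
    """Resample judgments with replacement and rank systems by win count
    (a stand-in for refitting the full model on each bootstrap sample)."""
    sample = [rng.choice(judgments) for _ in judgments]
    wins = {s: 0 for s in systems}
    for winner, _ in sample:
        wins[winner] += 1
    order = sorted(systems, key=lambda s: -wins[s])
    return {s: i + 1 for i, s in enumerate(order)}

def cluster(judgments, systems, n_boot=1000, seed=0):
    """Cluster systems whose 95% rank ranges overlap (transitively)."""
    rng = random.Random(seed)
    ranks = {s: [] for s in systems}
    for _ in range(n_boot):
        for s, rk in rank_once(judgments, systems, rng).items():
            ranks[s].append(rk)
    lo_hi = {}
    for s, rs in ranks.items():
        rs.sort()
        lo_hi[s] = (rs[int(0.025 * n_boot)], rs[int(0.975 * n_boot) - 1])
    # merge systems into one cluster whenever their rank ranges overlap
    ordered = sorted(systems, key=lambda s: lo_hi[s][0])
    clusters, cur, cur_hi = [], [ordered[0]], lo_hi[ordered[0]][1]
    for s in ordered[1:]:
        lo, hi = lo_hi[s]
        if lo <= cur_hi:
            cur.append(s)
            cur_hi = max(cur_hi, hi)
        else:
            clusters.append(cur)
            cur, cur_hi = [s], hi
    clusters.append(cur)
    return clusters
```

With clearly separated systems the rank ranges do not overlap, so each system lands in its own cluster; noisier data widens the bands and merges systems together.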
- 37. Clustering: Result. Clustering efficiency (more clusters, lower variance): TrueSkill > Expected Wins > Hopkins & May.
- 38. Summary. TrueSkill is able to 1. rank and cluster with less data, 2. predict pairwise comparisons with higher accuracy, and 3. be learned by online learning. Code is available at http://github.com/keisks/wmt-trueskill
- 39. Future Directions: sentence-level quality estimation; parameterizing translation difficulty (cf. Item Response Theory, 2PL). Thanks to Mark Hopkins for his comments.
- 41. Auxiliary Slides
- 42. Model comparison at a glance:
    Learning: Exp. Wins = Batch; Hopkins & May = Batch + iteration; TrueSkill = Online
    Ties allowed: Exp. Wins = No; Hopkins & May = Yes; TrueSkill = Yes
    Variance: Exp. Wins = None; Hopkins & May = Fixed; TrueSkill = Learned
- 43. Update comparison: TrueSkill (online, updating after each judgment) vs. Hopkins & May (batch, iterating over the full dataset).
- 44. How to update? Systems S1 and S2 with beliefs N(μ1, σ1²) and N(μ2, σ2²).
- 45. How to update? N(μ1, σ1²) and N(μ2, σ2²) model the quality of each system's translations.
- 46. How to update? Observation: S1 wins against S2 (the quality difference exceeds the decision radius).
- 47. How to update?
- 48. How to update? Observation: S1 draws with S2 (the quality difference falls within the decision radius).
- 49. How to update? Update μ and σ for each iteration.
- 50. How to update? Update μ and σ for each iteration.
- 51. Clustering: Result. Training with 1K vs. 25K judgments: clusters are generated with far less training data (cf. 80K in WMT13 fr-en), and the ordering stays the same (TrueSkill is stable and accurate).
- 54. Accuracies when training with N-way free-for-all models, fixing the number of matches.