wmt14_sakaguchi

Efficient Elicitation of Annotations for Human Evaluation of Machine Translation Keisuke Sakaguchi, Matt Post, Benjamin Van Durme ACL 2014 NINTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION (WMT)

Upload: keisuke-sakaguchi

Post on 16-Aug-2015

TRANSCRIPT

  1. Efficient Elicitation of Annotations for Human Evaluation of Machine Translation. Keisuke Sakaguchi, Matt Post, Benjamin Van Durme. ACL 2014 NINTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION (WMT)
  2. WMT Competition: 5-way ranking
  3. WMT Competition: 5-way ranking; pairwise comparisons
  4. WMT Competition: 5-way ranking; pairwise comparisons; resulting ranking (rank 1: System A, System E; rank 3: System B; ...; rank 12: System J, System C)
  5. Problem: needs lots of data (94k) to get good clusters (rank 1: System A, System E; rank 3: System B; ...; rank 12: System J, System C)
  6. Problem: needs lots of data (94k) to get good clusters. TrueSkill lets us do it with less data (1/3).
  7. Models: 1. Expected Wins; 2. Hopkins and May; 3. TrueSkill
  8. Models: 1. Expected Wins; 2. Hopkins and May; 3. TrueSkill. TrueSkill is able to 1. rank and cluster with much less data; 2. predict pairwise comparisons with higher accuracy; 3. be learned by online learning
  9. Existing Models
  10. Expected Wins: average relative frequency of wins; systems are ranked by this score; ties are ignored.
  11. Hopkins and May: Overview. Source: "The cat sat on the couch." [Figure: rank vs. translation quality]
  12. Hopkins and May: Inference. N(μ, σ²): μ = relative ability (inferred); σ² = variance (FIXED)
  13. TrueSkill
  14. TrueSkill (Herbrich et al. 2006). Player A: #1, 1,000,000; Player B: #2, 900,000; Player C: #3, 850,000; Player D: #4, 800,000; Player E: #5, 790,000
  15. TrueSkill (Herbrich et al. 2006). N(μ, σ²): μ = system ability; σ² = uncertainty of μ
  16. TrueSkill (Herbrich et al. 2006). N(μ, σ²): μ = system ability; σ² = uncertainty of μ. Inference on both μ and σ².
  17. How to update? [Figure: S1, S2]
  18. How to update? Observation: S1 wins S2. Shift μ and reduce σ.
  19. How much to update μ and σ? μ = μ ± (σ²/c) · Surprisal; σ² = σ² · (1 − (σ²/c²) · Surprisal)
  20. Compute Surprisals (for μ). [Plots: surprisal against t = μS1 − μS2, one panel for "S1 wins S2" and one for "S1 ties S2"]
  21. Data Collection
  22. Data Collection in Batch Models: uniform data collection (inefficient). [Figure: 12 × 12 matrix of system pairs, color scale 0.0 to 2.0]
  23. Data Collection in TrueSkill. Match quality: good match vs. bad match. Players A and B hold ranks #1 and #2 (1,000,000; 900,000); Players T, S, and U hold ranks #1001 to #1003 (300,000; 299,900; 299,800).
  24. Match Selection
  25. Match Selection: select S1 (highest σ)
  26. Match Selection: compute match probability (with normalization). [Figure: S1 with candidates labeled p_draw = 0.7 and p_draw = 0.3]
  27. Match Selection: draw a system S2. [Figure: S1, S2; p_draw = 0.7, p_draw = 0.3]
  28. Match Selection: update by observation (S1, S2)
  29. Experiment
  30. Experiment: Setting. WMT13 dataset (10 language pairs); annotation: researchers; training size: 400, 800, 1600, 3200, 6400; test size: 2,000; evaluation: perplexity and accuracy
  31. Probability of pairwise judgment, p( · | S1, S2), over the outcomes >, =, <. r0 is tuned on a held-out development set. [Figure labels: N_S1, N_S2, r]
  32. Experimental Result: Perplexity. H&M showed lower perplexity. [Plot: perplexity (2.85 to 3.00) vs. training data size (1000 to 6000) for HopkinsMay and TrueSkill]
  33. Experimental Result: Accuracy. TrueSkill > Hopkins&May and ExpectedWins. TrueSkill: higher accuracy with small data. [Plot: accuracy (0.460 to 0.500) vs. training data size (1000 to 6000) for ExpWins, HopkinsMay, and TrueSkill]
  34. Further Analysis
  35. Efficient Data Collection. First 20%: uniformly distributed. Last 20%: diagonal = competitive matches. Efficient usage of judgments by TrueSkill. [Figure: match distribution over training]
  36. Clustering: Experimental Setting. Bootstrap resampling; rank range (by 95% confidence band); cluster systems (if rank ranges overlap); training size and the number of clusters
  37. Clustering: Result. Efficiency of clustering (more clusters, less variance): TrueSkill > ExpectedWins > Hopkins&May
  38. Summary: TrueSkill is able to 1. rank and cluster with less data; 2. predict pairwise comparisons with higher accuracy; 3. be learned by online learning. Code is available at http://github.com/keisks/wmt-trueskill
  39. Future Directions: sentence-level quality estimation; parameterizing translation difficulty (cf. Item Response Theory, 2PL). Thanks to Mark Hopkins for his comments.
  40. [no text]
  41. Auxiliary Slides
  42. Model comparison at a glance. Learning: Exp. Win = batch; Hopkins&May = batch + iteration; TrueSkill = online. Ties allowed: Exp. Win = no; Hopkins&May = yes; TrueSkill = yes. Variance: Exp. Win = none; Hopkins&May = fixed; TrueSkill = learned.
  43. Update comparison: TS vs. HM. TrueSkill (online); H&M (batch). [Figure axis: Iterations]
  44. How to update? N(μ1, σ1²), N(μ2, σ2²)
  45. How to update? N(μ1, σ1²), N(μ2, σ2²). [Figure label: Translations]
  46. How to update? Observation: S1 wins S2. [Figure label: Decision radius]
  47. How to update?
  48. How to update? Observation: S1 draws S2. [Figure label: Decision radius]
  49. How to update? Update for each iteration
  50. How to update? Update for each iteration
  51. Clustering: Result. Training with 1K vs. training with 25K. Clusters are generated with less training data (cf. 80K in WMT13 fr-en). Same ordering (TS is stable and accurate).
  52. [no text]
  53. [no text]
  54. Accuracies when training with N-way free-for-all models, fixing the number of matches
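
The update step (slides 17 to 20) and the match selection loop (slides 24 to 28) can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the function names, the skill-noise parameter `beta`, and the draw margin `eps` are assumptions introduced here, and the "surprisal" terms are filled in with the standard TrueSkill win-case functions v and w from Herbrich et al. (2006) rather than recovered from the slides. Only the win case is shown; ties need a different pair of functions. Match probability is taken to be the standard two-player TrueSkill match quality.

```python
import math
import random

def pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def update_win(winner, loser, beta=1.0, eps=0.0):
    """Update two (mu, sigma2) pairs after `winner` beats `loser`.

    beta (skill noise) and eps (draw margin) are assumed hyperparameters.
    """
    (mu_w, s2_w), (mu_l, s2_l) = winner, loser
    c = math.sqrt(2.0 * beta ** 2 + s2_w + s2_l)
    t = (mu_w - mu_l) / c
    e = eps / c
    v = pdf(t - e) / cdf(t - e)   # surprisal for the mean: large for upsets
    w = v * (v + t - e)           # surprisal for the variance, in (0, 1)
    new_winner = (mu_w + s2_w / c * v, s2_w * (1.0 - s2_w / c ** 2 * w))
    new_loser = (mu_l - s2_l / c * v, s2_l * (1.0 - s2_l / c ** 2 * w))
    return new_winner, new_loser

def match_quality(a, b, beta=1.0):
    """Two-player TrueSkill match quality (unnormalized draw probability)."""
    (mu_a, s2_a), (mu_b, s2_b) = a, b
    c2 = 2.0 * beta ** 2 + s2_a + s2_b
    return math.sqrt(2.0 * beta ** 2 / c2) * math.exp(-(mu_a - mu_b) ** 2 / (2.0 * c2))

def select_match(systems, rng=random, beta=1.0):
    """Pick S1 with the highest variance, then sample S2 by match quality.

    systems: dict mapping a system name to its (mu, sigma2) pair.
    """
    s1 = max(systems, key=lambda name: systems[name][1])
    others = [name for name in systems if name != s1]
    weights = [match_quality(systems[s1], systems[name], beta) for name in others]
    # random.choices normalizes the weights internally.
    s2 = rng.choices(others, weights=weights, k=1)[0]
    return s1, s2

# Hypothetical usage: three systems, one simulated judgment in which S1 wins.
systems = {"A": (0.0, 1.0), "B": (0.0, 0.5), "C": (3.0, 0.5)}
s1, s2 = select_match(systems, random.Random(0))
systems[s1], systems[s2] = update_win(systems[s1], systems[s2])
```

Because v grows when the observed winner had the lower mean, an upset moves both means further than an expected result, while each observation shrinks both variances. Sampling S2 by match quality concentrates later judgments on closely matched systems, which is the behavior slide 35 describes as "diagonal = competitive matches".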