Proceedings of the 4th Workshop on Asian Translation, pages 1–54,Taipei, Taiwan, November 27, 2017. c©2017 AFNLP
Overview of the 4th Workshop on Asian Translation

Toshiaki Nakazawa
Japan Science and Technology Agency

Shohei Higashiyama and Chenchen Ding
National Institute of Information and Communications Technology
{shohei.higashiyama, chenchen.ding}@nict.go.jp

Hideya Mino and Isao Goto
NHK
{mino.h-gq, goto.i-es}@nhk.or.jp

Hideto Kazawa
Google

Yusuke Oda
Nara Institute of Science and Technology
[email protected]

Graham Neubig
Carnegie Mellon University

Sadao Kurohashi
Kyoto University
Abstract

This paper presents the results of the shared tasks from the 4th workshop on Asian translation (WAT2017), including J↔E and J↔C scientific paper translation subtasks, C↔J, K↔J and E↔J patent translation subtasks, H↔E mixed domain subtasks, J↔E newswire subtasks and J↔E recipe subtasks. For WAT2017, 12 institutions participated in the shared tasks. About 300 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.
1 Introduction

The Workshop on Asian Translation (WAT) is an open evaluation campaign focusing on Asian languages. Following the success of the previous workshops WAT2014 (Nakazawa et al., 2014), WAT2015 (Nakazawa et al., 2015) and WAT2016 (Nakazawa et al., 2016), WAT2017 brings together machine translation researchers and users to try, evaluate, share and discuss brand-new ideas of machine translation. We have been working toward practical use of machine translation among all Asian countries.
For the 4th WAT, we adopted new translation subtasks with an English-Japanese news corpus and an English-Japanese recipe corpus, in addition to the subtasks of WAT2016 1. Furthermore, we invited research papers on topics related to machine translation, especially for Asian languages. The submitted research papers were peer reviewed by three program committee members, and the committee accepted 4 papers, which focus on neural machine translation and on construction and evaluation of language resources. We also launched the small NMT task, which aims to build a small NMT system that keeps reasonable translation quality. There were, however, no submissions to the task this year.

1 This year we did not conduct the Indonesian-English newswire subtask, which was conducted at WAT2016, due to corpus license reasons.
WAT is a unique workshop on Asian-language translation with the following characteristics:
• Open innovation platform
Due to the fixed and open test data, we can repeatedly evaluate translation systems on the same dataset over the years. There is no deadline for submitting translation results with respect to automatic evaluation of translation quality, and WAT receives submissions at any time.
• Domain and language pairs
WAT is the world's first workshop that targets the scientific paper domain and the Chinese↔Japanese and Korean↔Japanese language pairs. In the future, we will add more Asian languages such as Vietnamese, Thai, Burmese and so on.
• Evaluation method
Evaluation is done both automatically and manually. For automatic evaluation, we use three metrics: BLEU, RIBES and AMFM. For human evaluation, we evaluate translation results by pairwise evaluation and JPO adequacy evaluation. The JPO adequacy evaluation is conducted for selected submissions according to the pairwise evaluation results.

Lang  Train      Dev    DevTest  Test
JE    3,008,500  1,790  1,784    1,812
JC    672,315    2,090  2,148    2,107

Table 1: Statistics for ASPEC.
2 Dataset

WAT2017 uses the Asian Scientific Paper Excerpt Corpus (ASPEC) 2, the JPO Patent Corpus (JPC) 3, the JIJI Corpus 4, the IIT Bombay English-Hindi Corpus (IITB Corpus) 5 and the Recipe Corpus 6 as its datasets.
2.1 ASPEC

ASPEC was constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). The corpus consists of a Japanese-English scientific paper abstract corpus (ASPEC-JE), which is used for the J↔E subtasks, and a Japanese-Chinese scientific paper excerpt corpus (ASPEC-JC), which is used for the J↔C subtasks. The statistics for each corpus are shown in Table 1.
2.1.1 ASPEC-JE

The training data for ASPEC-JE was constructed by NICT from approximately two million Japanese-English scientific paper abstracts owned by JST. The data is a comparable corpus, and sentence correspondences were found automatically using the method from (Utiyama and Isahara, 2007). Each sentence
2 http://lotus.kuee.kyoto-u.ac.jp/ASPEC/index.html
3 http://lotus.kuee.kyoto-u.ac.jp/WAT/patent/index.html
4 http://lotus.kuee.kyoto-u.ac.jp/WAT/jiji-corpus/index.html
5 http://www.cfilt.iitb.ac.in/iitb_parallel/index.html
6 http://lotus.kuee.kyoto-u.ac.jp/WAT/recipe-corpus/index.html
pair is accompanied by a similarity score calculated by that method and a field ID that indicates a scientific field. The correspondence between field IDs and field names, along with their frequencies and occurrence ratios in the training data, is described in the README file of ASPEC-JE.
The development, development-test and test data were extracted from parallel sentences in the Japanese-English paper abstracts, excluding the sentences in the training data. Each dataset consists of 400 documents and contains sentences from each field at the same rate. The document alignment was conducted automatically, and only documents with a 1-to-1 alignment are included. It is therefore possible to restore the original documents. The format is the same as that of the training data, except that there is no similarity score.
2.1.2 ASPEC-JC

ASPEC-JC is a parallel corpus consisting of Japanese scientific papers, which come from the literature database and electronic journal site J-STAGE of JST, and their translations into Chinese, obtained with permission from the necessary academic associations. Abstracts and paragraph units are selected from the body text so as to contain the highest overall vocabulary coverage.
The development, development-test and test data are extracted at random from documents containing single paragraphs across the entire corpus. Each set contains 400 paragraphs (documents). There are no documents sharing the same data across the training, development, development-test and test sets.
2.2 JPC

JPC was constructed by the Japan Patent Office (JPO). The corpus consists of a Chinese-Japanese patent description corpus (JPC-CJ), a Korean-Japanese patent description corpus (JPC-KJ) and an English-Japanese patent description corpus (JPC-EJ), covering the sections Chemistry, Electricity, Mechanical engineering, and Physics on the basis of the International Patent Classification (IPC). Each corpus is partitioned into training, development, development-test and test data. This corpus is used for the C↔J, K↔J and E↔J patent subtasks. The statistics for each corpus are shown
Lang  Train      Dev    DevTest  Test
CJ    1,000,000  2,000  2,000    2,000
KJ    1,000,000  2,000  2,000    2,000
EJ    1,000,000  2,000  2,000    2,000

Table 2: Statistics for JPC.
in Table 2.

The sentence pairs in each dataset were randomly extracted from the description parts of comparable patent documents, under the condition that the similarity score between the two sentences was greater than or equal to a threshold value of 0.05. The similarity scores were calculated by the method from (Utiyama and Isahara, 2007), as with ASPEC. Document pairs that were used to extract sentence pairs for one dataset were not used for the other datasets. Furthermore, the sentence pairs were extracted so as to be equal in number across the four sections. The maximum number of sentence pairs extracted from one document pair was limited to 60 for the training data and 20 for the development, development-test and test data.
The training data for JPC-CJ was made from sentence pairs of Chinese-Japanese patent documents published in 2012. For JPC-KJ and JPC-EJ, the training data was extracted from sentence pairs of Korean-Japanese and English-Japanese patent documents published in 2011 and 2012. The development, development-test and test data for JPC-CJ, JPC-KJ and JPC-EJ were respectively made from 100 patent documents published in 2013.
2.3 JIJI Corpus

JIJI Corpus was constructed by Jiji Press, Ltd. in collaboration with NICT. The corpus consists of news text from Jiji Press news in various categories, including politics, economy, nation, business, markets, sports and so on. The corpus is partitioned into training, development, development-test and test data, which consist of Japanese-English sentence pairs. The statistics for each corpus are shown in Table 3.
The sentence pairs in each dataset are identified in the same manner as for ASPEC,
Lang  Train    Dev    DevTest  Test
EJ    200,000  2,000  2,000    2,000

Table 3: Statistics for JIJI Corpus.
Lang  Train      Dev    Test   Mono
H     –          –      –      45,075,279
EH    1,492,827  520    2,507  –
JH    152,692    1,566  2,000  –

Table 4: Statistics for IITB Corpus. “Mono” indicates the monolingual Hindi corpus.
using the method from (Utiyama and Isahara,2007).
2.4 IITB Corpus

The IIT Bombay English-Hindi corpus contains an English-Hindi parallel corpus (IITB-EH) as well as a monolingual Hindi corpus collected from a variety of sources and corpora developed at the Center for Indian Language Technology, IIT Bombay, over the years. This corpus is used for the H↔E mixed domain subtasks. Furthermore, H↔J mixed domain subtasks were added as a pivot language task, with a parallel corpus created using publicly available corpora (IITB-JH) 7. Most sentence pairs in IITB-JH come from the Bible corpus. The statistics for each corpus are shown in Table 4.
2.5 Recipe Corpus

Recipe Corpus was constructed by Cookpad Inc. Each recipe consists of a title, ingredients, steps, a description and a history. Every text in the titles, ingredients and steps is a parallel sentence, while texts in the descriptions and histories are not always parallel sentences. Although all of the texts in the training set can be used for training, only the titles, ingredients and steps in the test set are used for evaluation. The statistics for each corpus are described in Table 5.
3 Baseline Systems

Human evaluations were conducted as pairwise comparisons between the translation results for a specific baseline system and the translation results for each participant's system.
7http://lotus.kuee.kyoto-u.ac.jp/WAT/Hindi-corpus/WAT2017-Ja-Hi.zip
Lang  TextType    Train    Dev    DevTest  Test
EJ    Title       14,779   500    500      500
      Ingredient  127,244  4,274  4,188    3,935
      Step        108,993  3,303  3,086    2,804

Table 5: Statistics for Recipe Corpus.
That is, the specific baseline system was the standard for human evaluation. A phrase-based statistical machine translation (SMT) system was adopted as the specific baseline system at WAT2017; it is the same system as that used from WAT2014 to WAT2016.
In addition to the results for the baseline phrase-based SMT system, we produced results for baseline systems that consisted of a hierarchical phrase-based SMT system, a string-to-tree syntax-based SMT system, a tree-to-string syntax-based SMT system, seven commercial rule-based machine translation (RBMT) systems, and two online translation systems. We also experimentally produced results for a baseline neural machine translation system using the implementation of (Vaswani et al., 2017). The SMT baseline systems consisted of publicly available software, and the procedures for building the systems and for translating using the systems were published on the WAT web page 8. We used Moses (Koehn et al., 2007; Hoang et al., 2009) as the implementation of the baseline SMT systems. The Berkeley parser (Petrov et al., 2006) was used to obtain syntactic annotations. The baseline systems are shown in Table 6.
The commercial RBMT systems and the online translation systems were operated by the organizers. We note that these RBMT companies and online translation companies did not themselves submit systems. Because our objective is not to compare commercial RBMT systems or online translation systems from companies that did not themselves participate, the system IDs of these systems are anonymized in this paper.
8http://lotus.kuee.kyoto-u.ac.jp/WAT/
4
[Table 6: Baseline Systems. The recoverable content lists the baseline systems and their types: SMT Phrase (Moses' phrase-based SMT), SMT Hiero (Moses' hierarchical phrase-based SMT), SMT S2T (Moses' string-to-tree syntax-based SMT with the Berkeley parser), SMT T2S (Moses' tree-to-string syntax-based SMT with the Berkeley parser), the commercial RBMT systems The Honyaku V15, ATLAS V14, PAT-Transer 2009, PC-Transer V13, J-Beijing 7, Hohrai 2011, J-Soul 9 and Korai 2011, the online systems Google translate and Bing translator, and AIAYN (Google's implementation of "Attention Is All You Need" NMT). The per-subtask check marks of the original table are not recoverable from this copy.]
3.1 Training Data

We used the following data for training the SMT baseline systems.
• Training data for the language model: Allof the target language sentences in theparallel corpus.
• Training data for the translation model: Sentences that were 40 words or less in length. (For the ASPEC Japanese–English training data, we only used train-1.txt, which consists of one million parallel sentence pairs with high similarity scores.)
• Development data for tuning: All of thedevelopment data.
3.2 Common Settings for Baseline SMT
We used the following tools for tokenization.
• Juman version 7.0 9 for Japanese segmentation.
• Stanford Word Segmenter version 2014-01-04 10 (Chinese Penn Treebank (CTB) model) for Chinese segmentation.
• The Moses toolkit for English and Indonesian tokenization.
• Mecab-ko 11 for Korean segmentation.
• Indic NLP Library 12 for Hindi segmentation.
To obtain word alignments, GIZA++ and grow-diag-final-and heuristics were used. We used 5-gram language models with modified Kneser-Ney smoothing, which were built using a tool in the Moses toolkit (Heafield et al., 2013).
3.3 Phrase-based SMT

We used the following Moses configuration for the phrase-based SMT system.
• distortion-limit
  – 20 for JE, EJ, JC, and CJ
  – 0 for JK, KJ, HE, and EH
  – 6 for IE and EI
• msd-bidirectional-fe lexicalized reordering
9 http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN
10 http://nlp.stanford.edu/software/segmenter.shtml
11 https://bitbucket.org/eunjeon/mecab-ko/
12 https://bitbucket.org/anoopk/indic_nlp_library
• Phrase score option: GoodTuring
The default values were used for the other sys-tem parameters.
3.4 Hierarchical Phrase-based SMT

We used the following Moses configuration for the hierarchical phrase-based SMT system.
• max-chart-span = 1000
• Phrase score option: GoodTuring
The default values were used for the other sys-tem parameters.
3.5 String-to-Tree Syntax-based SMT

We used the Berkeley parser to obtain target language syntax. We used the following Moses configuration for the string-to-tree syntax-based SMT system.
• max-chart-span = 1000
• Phrase score option: GoodTuring
• Phrase extraction options: MaxSpan = 1000, MinHoleSource = 1, and NonTermConsecSource.
The default values were used for the other sys-tem parameters.
3.6 Tree-to-String Syntax-based SMT

We used the Berkeley parser to obtain source language syntax. We used the following Moses configuration for the baseline tree-to-string syntax-based SMT system.
• max-chart-span = 1000
• Phrase score option: GoodTuring
• Phrase extraction options: MaxSpan = 1000, MinHoleSource = 1, MinWords = 0, NonTermConsecSource, and AllowOnlyUnalignedWords.
The default values were used for the other sys-tem parameters.
4 Automatic Evaluation

4.1 Procedure for Calculating Automatic Evaluation Scores

We evaluated translation results with three metrics: BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and AMFM (Banchs et al., 2015). BLEU scores were calculated using multi-bleu.perl, which was distributed
with the Moses toolkit (Koehn et al., 2007). RIBES scores were calculated using RIBES.py version 1.02.4 13. AMFM scores were calculated using scripts created by the technical collaborators of WAT2017. All scores for each task were calculated using the corresponding references.
Before calculating the automatic evaluation scores, the translation results were tokenized with word segmentation tools for each language. For Japanese segmentation, we used three different tools: Juman version 7.0 (Kurohashi et al., 1994), KyTea 0.4.6 (Neubig et al., 2011) with the Full SVM model 14 and MeCab 0.996 (Kudo, 2005) with the IPA dictionary 2.7.0 15. For Chinese segmentation, we used two different tools: KyTea 0.4.6 with the Full SVM MSR model and the Stanford Word Segmenter (Tseng, 2005) version 2014-06-16 with the Chinese Penn Treebank (CTB) and Peking University (PKU) models 16. For Korean segmentation, we used mecab-ko 17. For English segmentation, we used tokenizer.perl 18 in the Moses toolkit. For Hindi segmentation, we used the Indic NLP Library 19. The detailed procedures for the automatic evaluation are shown on the WAT2017 evaluation web page 20.
4.2 Automatic Evaluation System

The participants submit translation results via an automatic evaluation system deployed on the WAT2017 web page, which automatically gives evaluation scores for the uploaded results. Figure 1 shows the submission interface for participants. The system requires participants to provide the following information when they upload translation results:
• Subtask:
Scientific papers subtask (J↔E, J↔C), Patents subtask (C↔J, K↔J, E↔J),
13 http://www.kecl.ntt.co.jp/icl/lirg/ribes/index.html
14 http://www.phontron.com/kytea/model.html
15 http://code.google.com/p/mecab/downloads/detail?name=mecab-ipadic-2.7.0-20070801.tar.gz
16 http://nlp.stanford.edu/software/segmenter.shtml
17 https://bitbucket.org/eunjeon/mecab-ko/
18 https://github.com/moses-smt/mosesdecoder/tree/RELEASE-2.1.1/scripts/tokenizer/tokenizer.perl
19 https://bitbucket.org/anoopk/indic_nlp_library
20 http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/index.html
Newswire subtask (J↔E), Mixed domain subtask (H↔E, H↔J) or Recipe subtask (J↔E);
• Method:
SMT, RBMT, SMT and RBMT, EBMT, NMT or Other;
• Use of other resources in addition to the provided data (ASPEC / JPC / IITB Corpus / JIJI Corpus / Recipe Corpus);
• Permission to publish automatic evaluation scores on the WAT2017 web page.
Although participants can confirm only the information that they filled in or uploaded, the server for the system stores all submitted information, including translation results and scores. Information about translation results that participants permit to be published is disclosed via the WAT2017 evaluation web page. Participants can also submit results for human evaluation using the same web interface. This automatic evaluation system will remain available even after WAT2017. Anybody can register an account for the system by following the procedures on the registration web page 21.
21http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2017/registration/index.html
Figure 1: The submission web page for participants.
5 Human Evaluation
In WAT2017, we conducted two kinds of human evaluation: pairwise evaluation and JPO adequacy evaluation.
5.1 Pairwise Evaluation

The pairwise evaluation is the same as last year's, except that we did not use crowdsourcing this year. Instead, we asked a professional translation company to conduct the pairwise evaluation. The cost of pairwise evaluation per sentence is almost the same as that of last year.
We randomly chose 400 sentences from the test set for the pairwise evaluation. We used the same sentences as last year for the continuing subtasks. Each submission is compared with the baseline translation (phrase-based SMT, described in Section 3) and given a Pairwise score.
5.1.1 Pairwise Evaluation of Sentences

We conducted pairwise evaluation on each of the 400 test sentences. The input sentence and two translations (the baseline and a submission) are shown to the annotators, and the annotators are asked to judge which of the translations is better, or whether they are of the same quality. The order of the two translations is random.
5.1.2 Voting

To guarantee the quality of the evaluations, each sentence is evaluated by 5 different annotators, and the final decision is made depending on the 5 judgements. We define each judgement ji (i = 1, ..., 5) as:

ji = 1 if better than the baseline; −1 if worse than the baseline; 0 if the quality is the same.

The final decision D is defined as follows, using S = Σi ji:

D = win if S ≥ 2; loss if S ≤ −2; tie otherwise.
5.1.3 Pairwise Score Calculation

Suppose that W is the number of wins compared to the baseline, L is the number of losses and T is the number of ties. The Pairwise score can be calculated by the following formula:

Pairwise = 100 × (W − L) / (W + L + T)

From the definition, the Pairwise score ranges between −100 and 100.
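The voting rule and the Pairwise score formula above can be sketched in code as follows; this is an illustrative reimplementation, not the organizers' actual evaluation script:

```python
def decide(judgements):
    """Aggregate five per-annotator judgements (+1 / -1 / 0) into a decision."""
    s = sum(judgements)          # S = sum of the j_i
    if s >= 2:
        return "win"
    if s <= -2:
        return "loss"
    return "tie"

def pairwise_score(decisions):
    """Pairwise = 100 * (W - L) / (W + L + T); ranges from -100 to 100."""
    w = decisions.count("win")
    l = decisions.count("loss")
    t = decisions.count("tie")
    return 100.0 * (w - l) / (w + l + t)

# Example: three evaluated sentences, five judgements each.
decisions = [decide(j) for j in [[1, 1, 0, 1, -1],    # S = 2  -> win
                                 [-1, -1, 0, -1, 1],  # S = -2 -> loss
                                 [1, -1, 0, 0, 0]]]   # S = 0  -> tie
print(decisions)                  # ['win', 'loss', 'tie']
print(pairwise_score(decisions))  # 0.0
```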
5.1.4 Confidence Interval Estimation

There are several ways to estimate a confidence interval. We chose bootstrap resampling (Koehn, 2004) to estimate the 95% confidence interval. The procedure is as follows:

1. randomly select 300 sentences from the 400 human evaluation sentences, and calculate the Pairwise score of the selected sentences
2. iterate the previous step 1000 times and get 1000 Pairwise scores

3. sort the 1000 scores and estimate the 95% confidence interval by discarding the top 25 scores and the bottom 25 scores
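The three-step procedure above is a standard bootstrap, and can be sketched as follows. This is an illustration rather than the official tooling; it assumes sampling with replacement, the usual reading of bootstrap resampling:

```python
import random

def pairwise_score(decisions):
    # Pairwise = 100 * (W - L) / (W + L + T)
    w = decisions.count("win")
    l = decisions.count("loss")
    return 100.0 * (w - l) / len(decisions)

def bootstrap_ci(decisions, n_sample=300, n_iter=1000, seed=0):
    """Estimate the 95% CI of the Pairwise score by bootstrap resampling."""
    rng = random.Random(seed)
    scores = sorted(
        pairwise_score(rng.choices(decisions, k=n_sample))
        for _ in range(n_iter)
    )
    cut = int(n_iter * 0.025)      # discard 25 of 1000 scores on each side
    return scores[cut], scores[-cut - 1]

# Example: 400 per-sentence decisions whose true Pairwise score is 30.0.
decisions = ["win"] * 220 + ["loss"] * 100 + ["tie"] * 80
lo, hi = bootstrap_ci(decisions)
print(round(lo, 1), round(hi, 1))
```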
5.2 JPO Adequacy Evaluation
The participants' systems that achieved the top 3 highest scores among the pairwise evaluation results of each subtask 22 were also evaluated with the JPO adequacy evaluation. The JPO adequacy evaluation was carried out by translation experts using a quality evaluation criterion for translated patent documents defined by the Japanese Patent Office (JPO). For each system, two annotators evaluated the test sentences to guarantee quality.
5.2.1 Evaluation of Sentences

The number of test sentences for the JPO adequacy evaluation is 200. The 200 test sentences were randomly selected from the 400 test sentences of the pairwise evaluation. Each test item includes the input sentence, the submitted system's translation and the reference translation.
22 The number of systems varies depending on the subtasks.
5  All important information is transmitted correctly. (100%)
4  Almost all important information is transmitted correctly. (80%–)
3  More than half of the important information is transmitted correctly. (50%–)
2  Some of the important information is transmitted correctly. (20%–)
1  Almost all important information is NOT transmitted correctly. (–20%)

Table 7: The JPO adequacy criterion.
5.2.2 Evaluation Criterion

Table 7 shows the JPO adequacy criterion, with grades from 5 to 1. The evaluation is performed subjectively. “Important information” represents the technical factors and their relationships. The degree of importance of each element is also considered in the evaluation. The percentages for each grade are rough indications of the degree to which the meaning of the source sentence is transmitted. The detailed criterion can be found in the JPO document (in Japanese) 23.
6 Participants List

Table 8 shows the list of participants in WAT2017. This includes not only Japanese organizations but also organizations from outside Japan. 12 teams submitted one or more translation results to the automatic evaluation server or for human evaluation.
23http://www.jpo.go.jp/shiryou/toushin/chousa/tokkyohonyaku_hyouka.htm
[Table 8: List of participants who submitted translation results to WAT2017 and their participation in each subtask. The recoverable content lists the teams and organizations: Kyoto-U (Cromieres et al., 2017), Kyoto University; TMU (Matsumura and Komachi, 2017), Tokyo Metropolitan University; EHR (Ehara, 2017), Ehara NLP Research Laboratory; NTT (Morishita et al., 2017), NTT Communication Science Laboratories; JAPIO (Kinoshita et al., 2017), Japan Patent Information Organization; NICT-2 (Imamura and Sumita, 2017), National Institute of Information and Communications Technology; XMUNLP (Wang et al., 2017), Xiamen University; UT-IIS (Neishi et al., 2017), The University of Tokyo; CUNI (Kocmi et al., 2017), Charles University, Institute of Formal and Applied Linguistics; IITB-MTG (Singh et al., 2017), Indian Institute of Technology Bombay; u-tkb (Long et al., 2017), University of Tsukuba; NAIST-NICT (Oda et al., 2017), NAIST/NICT. The per-subtask check marks of the original table are not recoverable from this copy.]
7 Evaluation Results

In this section, the evaluation results for WAT2017 are reported from several perspectives. Some of the results for both the automatic and human evaluations are also accessible at the WAT2017 website 24.
7.1 Official Evaluation Results

Figures 2, 3, 4 and 5 show the official evaluation results of the ASPEC subtasks, Figures 6, 7, 8, 9 and 10 show those of the JPC subtasks, Figures 11 and 12 show those of the IITBC subtasks, Figures 13 and 14 show those of the JIJI subtasks, and Figures 15, 16, 17, 18, 19 and 20 show those of the RECIPE subtasks. Each figure contains the automatic evaluation results (BLEU, RIBES, AM-FM), the pairwise evaluation results with confidence intervals, the correlation between the automatic evaluations and the pairwise evaluation, the JPO adequacy evaluation results and an evaluation summary of the top systems.
The detailed automatic evaluation results for all the submissions are shown in Appendix A. The detailed JPO adequacy evaluation results for the selected submissions are shown in Table 9. The weights for the weighted κ (Cohen, 1968) are defined as |Evaluation1 − Evaluation2|/4.
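The weighted κ with these linear weights can be sketched as follows; this is an illustrative reimplementation for the 5-point adequacy scale, not the organizers' script:

```python
from collections import Counter

def weighted_kappa(a, b, categories=(1, 2, 3, 4, 5)):
    """Cohen's weighted kappa with weights |i - j| / 4 on a 5-point scale."""
    n = len(a)
    pa = Counter(a)  # marginal counts for annotator A
    pb = Counter(b)  # marginal counts for annotator B
    # observed and chance-expected disagreement, weighted by |i - j| / 4
    d_obs = sum(abs(x - y) / 4 for x, y in zip(a, b)) / n
    d_exp = sum(abs(i - j) / 4 * pa[i] * pb[j]
                for i in categories for j in categories) / (n * n)
    return 1 - d_obs / d_exp

# Example: two annotators' adequacy grades for eight sentences.
a = [5, 4, 4, 3, 5, 2, 4, 5]
b = [5, 4, 3, 3, 4, 2, 4, 5]
print(round(weighted_kappa(a, b), 3))  # 0.771
```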
From the evaluation results, the followingcan be observed:
• The translation quality of this year is better than that of last year for all the subtasks.
• There is no big difference between the neural network based translation models according to the JPO adequacy evaluation results for the ASPEC subtasks.
7.2 Statistical Significance Testing of Pairwise Evaluation between Submissions
Tables 10, 11 and 12 show the results of statistical significance testing for the ASPEC subtasks, Tables 13, 14 and 15 show those for the JPC subtasks, Table 16 shows those for the IITBC subtasks, Table 17 shows those for the JIJI subtasks, and Tables 18, 19 and 20 show those for the RECIPE subtasks. ≫, ≫ and > mean that the system in
24http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/
the row is better than the system in the column at a significance level of p < 0.01, 0.05 and 0.1, respectively. Testing is also done by bootstrap resampling, as follows:
1. randomly select 300 sentences from the 400 pairwise evaluation sentences, and calculate the Pairwise scores on the selected sentences for both systems
2. iterate the previous step 1000 times and count the number of wins (W), losses (L) and ties (T)
3. calculate p = L / (W + L)
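The significance-testing procedure above can be sketched for two systems' per-sentence decisions as follows; again this is an illustration, not the official script, and it assumes paired resampling (the same resampled sentences are used for both systems):

```python
import random

def pairwise_score(decisions):
    # Pairwise = 100 * (W - L) / (W + L + T)
    return 100.0 * (decisions.count("win") - decisions.count("loss")) / len(decisions)

def significance(dec_a, dec_b, n_sample=300, n_iter=1000, seed=0):
    """Estimate p = L / (W + L): how often system A fails to beat system B."""
    rng = random.Random(seed)
    wins = losses = 0
    n = len(dec_a)
    for _ in range(n_iter):
        idx = rng.choices(range(n), k=n_sample)   # same sentences for both systems
        sa = pairwise_score([dec_a[i] for i in idx])
        sb = pairwise_score([dec_b[i] for i in idx])
        if sa > sb:
            wins += 1
        elif sb > sa:
            losses += 1
    if wins + losses == 0:        # the two systems never differed on any resample
        return 0.5
    return losses / (wins + losses)

dec_a = ["win"] * 240 + ["tie"] * 100 + ["loss"] * 60
dec_b = ["loss"] * 240 + ["tie"] * 100 + ["win"] * 60
print(significance(dec_a, dec_b))  # a value near 0 -> A significantly better
```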
Inter-annotator Agreement

To assess the reliability of agreement between the workers, we calculated the Fleiss' κ (Fleiss et al., 1971) values. The results are shown in Table 21. We can see that the κ values are larger for X → J translations than for J → X translations. This may be because the majority of the workers are Japanese, and evaluation of one's mother tongue is generally much easier than that of other languages.
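Fleiss' κ for the 5-annotator pairwise judgements (categories: better / same / worse) can be sketched as follows; this is an illustrative computation, not the organizers' script:

```python
def fleiss_kappa(ratings):
    """ratings: list of per-item dicts mapping category -> number of annotators."""
    n_items = len(ratings)
    n_raters = sum(ratings[0].values())            # annotators per item (5 here)
    categories = {c for item in ratings for c in item}
    # overall proportion of assignments to each category
    p = {c: sum(item.get(c, 0) for item in ratings) / (n_items * n_raters)
         for c in categories}
    # mean per-item agreement P_i = (sum n_ij^2 - n) / (n (n - 1))
    p_bar = sum(
        (sum(k * k for k in item.values()) - n_raters) / (n_raters * (n_raters - 1))
        for item in ratings
    ) / n_items
    p_e = sum(v * v for v in p.values())           # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Example: four sentences, five annotators each.
items = [{"better": 4, "same": 1}, {"worse": 3, "same": 2},
         {"better": 2, "same": 2, "worse": 1}, {"same": 5}]
print(round(fleiss_kappa(items), 3))  # 0.274
```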
8 Submitted Data
The number of published automatic evaluation results for the 14 teams exceeded 300 before the start of WAT2017, and 67 translation results for pairwise evaluation were submitted by 12 teams. Furthermore, we selected several translation results from each subtask according to the pairwise evaluation scores and evaluated them with the JPO adequacy evaluation. We will organize all of the submitted data for human evaluation and make it public.
9 Conclusion and Future Perspective
This paper summarizes the shared tasks of WAT2017. We had 12 participants worldwide and collected a large number of submissions that are useful for improving current machine translation systems, through analyzing the submissions and identifying the issues.
For the next WAT workshop, we plan to change the baseline system from PBSMT to NMT because the pairwise scores are saturated for some of the subtasks. Also, we
are planning to do extrinsic evaluation of thetranslations.
Unfortunately, there were no participants in the small NMT task this year. We will brush up the task definition and invite participants for the next WAT.
Appendix A Submissions

Tables 22 to 41 summarize all the submissions listed on the automatic evaluation server at the time of the WAT2017 workshop (November 27, 2017). The OTHER column shows the use of resources such as parallel corpora, monolingual corpora and parallel dictionaries in addition to ASPEC, JPC, the IITB Corpus, the JIJI Corpus and the RECIPE Corpus.
Figure 2: Official evaluation results of ASPEC-JE.
Figure 3: Official evaluation results of ASPEC-EJ.
Figure 4: Official evaluation results of ASPEC-JC.
Figure 5: Official evaluation results of ASPEC-CJ.
Figure 6: Official evaluation results of JPC-JE.
Figure 7: Official evaluation results of JPC-EJ.
Figure 8: Official evaluation results of JPC-JC.
Figure 9: Official evaluation results of JPC-CJ.
Figure 10: Official evaluation results of JPC-KJ.
Figure 11: Official evaluation results of IITBC-HE.
Figure 12: Official evaluation results of IITBC-EH.
Figure 13: Official evaluation results of JIJI-JE.
Figure 14: Official evaluation results of JIJI-EJ.
Figure 15: Official evaluation results of RECIPE-TTL-JE.
Figure 16: Official evaluation results of RECIPE-TTL-EJ.
Figure 17: Official evaluation results of RECIPE-ING-JE.
Figure 18: Official evaluation results of RECIPE-ING-EJ.
Figure 19: Official evaluation results of RECIPE-STE-JE.
Figure 20: Official evaluation results of RECIPE-STE-EJ.
Subtask          SYSTEM ID   DATA ID  A avg  A var  B avg  B var  All avg  κ     wt. κ
ASPEC-JE         NTT         1681     4.15   0.58   4.13   0.52   4.14     0.29  0.41
                 AIAYN       1736     4.16   0.67   4.05   0.75   4.10     0.26  0.42
                 Kyoto-U     1717     4.11   0.69   4.09   0.54   4.10     0.26  0.40
                 2016 best   1246     3.76   0.68   4.01   0.67   3.89     0.21  0.31
ASPEC-EJ         NTT         1729     4.54   0.56   4.28   0.49   4.41     0.33  0.43
                 AIAYN       1737     4.38   0.83   4.21   0.76   4.30     0.36  0.52
                 NICT-2      1479     4.43   0.73   4.16   0.69   4.29     0.35  0.48
                 Kyoto-U     1731     4.37   0.84   4.15   0.74   4.26     0.39  0.54
                 NAIST-NICT  1507     4.36   0.69   4.06   0.57   4.21     0.26  0.36
                 2016 best   1172     3.97   0.76   4.07   0.85   4.02     0.35  0.49
ASPEC-JC         NICT-2      1483     4.25   0.73   3.71   0.98   3.98     0.10  0.18
                 Kyoto-U     1722     4.25   0.79   3.64   1.07   3.95     0.12  0.23
                 AIAYN       1738     4.26   0.69   3.54   1.03   3.90     0.17  0.27
                 2016 best   1071     4.00   1.09   3.76   1.14   3.88     0.20  0.36
ASPEC-CJ         NICT-2      1481     4.63   0.47   3.99   0.98   4.31     0.17  0.23
                 Kyoto-U     1720     4.62   0.56   3.97   0.94   4.30     0.16  0.22
                 AIAYN       1740     4.59   0.61   3.96   1.04   4.27     0.14  0.23
                 2016 best   1256     4.25   1.04   3.64   1.23   3.94     0.23  0.34
JPC-JE           JAPIO       1574     4.80   0.26   4.78   0.51   4.79     0.34  0.42
                 u-tkb       1472     4.24   1.26   4.08   2.27   4.16     0.43  0.64
                 CUNI        1666     4.12   1.49   3.99   2.35   4.05     0.40  0.63
                 2016 best   1149     4.09   0.80   4.51   0.58   4.30     0.25  0.39
JPC-EJ           JAPIO       1454     4.74   0.45   4.76   0.38   4.75     0.32  0.48
                 EHR         1407     4.64   0.61   4.61   0.65   4.63     0.42  0.60
                 u-tkb       1470     4.39   1.07   4.42   0.99   4.40     0.43  0.61
                 2016 best   1098     4.03   0.91   4.51   0.57   4.27     0.23  0.41
JPC-JC           u-tkb       1465     3.99   1.12   4.19   0.94   4.09     0.22  0.32
                 2016 best   1150     3.49   1.72   3.02   1.75   3.25     0.27  0.51
JPC-CJ           JAPIO       1484     4.41   0.68   4.51   0.64   4.46     0.26  0.34
                 EHR         1414     4.27   0.92   4.35   1.03   4.31     0.33  0.48
                 u-tkb       1468     3.84   1.16   4.04   1.36   3.94     0.23  0.43
                 2016 best   1200     3.61   1.89   3.27   1.76   3.44     0.26  0.52
JPC-KJ           JAPIO       1448     4.82   0.24   4.87   0.11   4.84     0.55  0.55
                 EHR         1417     4.76   0.30   4.86   0.23   4.81     0.35  0.47
                 2016 best   1209     4.58   0.32   4.66   0.30   4.62     0.33  0.36
IITBC-HE         XMUNLP      1511     3.43   1.64   3.60   1.74   3.51     0.22  0.45
                 IITB-MTG    1726     2.14   1.45   2.45   1.87   2.29     0.30  0.51
IITBC-EH         XMUNLP      1576     3.95   1.18   3.76   1.85   3.86     0.17  0.36
                 IITB-MTG    1725     2.78   1.74   2.58   1.87   2.68     0.15  0.38
                 2016 best   1032     3.20   1.33   3.53   1.19   3.36     0.10  0.16
JIJI-JE          Online A    1523     3.03   1.60   3.28   2.24   3.15     0.15  0.37
                 NTT         1599     1.87   1.25   2.23   1.69   2.05     0.26  0.46
                 XMUNLP      1442     1.91   1.26   2.19   1.56   2.05     0.24  0.44
JIJI-EJ          Online A    1518     3.31   1.92   3.78   2.06   3.54     0.23  0.50
                 NTT         1679     1.78   1.18   2.28   1.97   2.03     0.29  0.52
                 XMUNLP      1443     1.72   1.02   2.20   1.70   1.96     0.33  0.51
RECIPE-TTL-JE    XMUNLP      1637     3.90   1.98   3.62   1.57   3.76     0.30  0.56
                 Online A    1534     3.52   2.04   3.16   2.07   3.34     0.36  0.60
RECIPE-TTL-EJ    Online B    1533     4.56   0.55   3.84   2.02   4.20     0.25  0.35
                 XMUNLP      1636     4.54   0.62   3.62   2.40   4.08     0.26  0.34
RECIPE-ING-JE    XMUNLP      1635     4.68   0.65   4.58   0.85   4.63     0.47  0.67
                 Online A    1544     4.29   1.44   4.23   1.57   4.26     0.55  0.76
RECIPE-ING-EJ    XMUNLP      1634     4.71   0.43   4.43   1.03   4.57     0.40  0.53
                 Online A    1542     4.50   0.95   4.54   0.91   4.52     0.50  0.65
RECIPE-STE-JE    XMUNLP      1632     4.61   0.76   3.98   0.96   4.29     0.13  0.28
                 Online A    1551     3.34   1.54   2.69   1.21   3.01     0.14  0.36
RECIPE-STE-EJ    XMUNLP      1633     4.75   0.36   4.04   1.33   4.39     0.12  0.21
                 Online A    1549     4.18   0.42   3.16   1.52   3.67     0.11  0.17

Table 9: JPO adequacy evaluation results in detail. “A” and “B” denote Annotator A and Annotator B; avg = average, var = variance; wt. κ = weighted κ.
33
Columns, left to right: NTT (1681), ORGANIZER (1736), NTT (1616), Kyoto-U (1733), NICT-2 (1480), NICT-2 (1476), CUNI (1665), ORGANIZER (1333), TMU (1703), TMU (1695)
Kyoto-U (1717)     -  ≫  ≫  ≫  ≫  ≫  ≫  ≫  ≫  ≫
NTT (1681)            >  ≫  ≫  ≫  ≫  ≫  ≫  ≫  ≫
ORGANIZER (1736)         -  -  ≫  ≫  ≫  ≫  ≫  ≫
NTT (1616)                  -  ≫  ≫  ≫  ≫  ≫  ≫
Kyoto-U (1733)                 ≫  ≫  ≫  ≫  ≫  ≫
NICT-2 (1480)                     -  ≫  ≫  ≫  ≫
NICT-2 (1476)                        ≫  ≫  ≫  ≫
CUNI (1665)                             >  ≫  ≫
ORGANIZER (1333)                           -  ≫
TMU (1703)                                    ≫

Table 10: Statistical significance testing of the ASPEC-JE Pairwise scores.
Columns, left to right: NICT-2 (1479), ORGANIZER (1334), NTT (1684), NAIST-NICT (1507), ORGANIZER (1737), Kyoto-U (1731), UT-IIS (1710), NAIST-NICT (1506), NICT-2 (1475), TMU (1709), TMU (1704)
NTT (1729)           -  -  ≫  ≫  ≫  ≫  ≫  ≫  ≫  ≫  ≫
NICT-2 (1479)           -  ≫  ≫  ≫  ≫  ≫  ≫  ≫  ≫  ≫
ORGANIZER (1334)           -  ≫  ≫  ≫  ≫  ≫  ≫  ≫  ≫
NTT (1684)                    ≫  ≫  ≫  ≫  ≫  ≫  ≫  ≫
NAIST-NICT (1507)                -  -  >  ≫  ≫  ≫  ≫
ORGANIZER (1737)                    -  >  ≫  ≫  ≫  ≫
Kyoto-U (1731)                         >  ≫  ≫  ≫  ≫
UT-IIS (1710)                             ≫  ≫  ≫  ≫
NAIST-NICT (1506)                            -  ≫  ≫
NICT-2 (1475)                                   ≫  ≫
TMU (1709)                                         ≫

Table 11: Statistical significance testing of the ASPEC-EJ Pairwise scores.
ASPEC-JC — columns, left to right: Kyoto-U (1642), ORGANIZER (1738), NICT-2 (1483), NICT-2 (1478), ORGANIZER (1336), TMU (1743)
Kyoto-U (1722)     -  >  ≫  ≫  ≫  ≫
Kyoto-U (1642)        -  >  ≫  ≫  ≫
ORGANIZER (1738)         -  ≫  ≫  ≫
NICT-2 (1483)               >  ≫  ≫
NICT-2 (1478)                  ≫  ≫
ORGANIZER (1336)                  ≫

ASPEC-CJ — columns, left to right: Kyoto-U (1577), NICT-2 (1481), ORGANIZER (1740), NICT-2 (1477), ORGANIZER (1342)
Kyoto-U (1720)     ≫  ≫  ≫  ≫  ≫
Kyoto-U (1577)        -  -  -  ≫
NICT-2 (1481)            -  -  ≫
ORGANIZER (1740)            -  ≫
NICT-2 (1477)                  ≫

Table 12: Statistical significance testing of the ASPEC-JC (left) and ASPEC-CJ (right) Pairwise scores.
JPC-JE — columns, left to right: JAPIO (1574), JAPIO (1578), CUNI (1666), u-tkb (1472)
ORGANIZER (1338)   ≫  ≫  ≫  ≫
JAPIO (1574)          -  ≫  ≫
JAPIO (1578)             ≫  ≫
CUNI (1666)                 ≫

JPC-EJ — columns, left to right: EHR (1406), JAPIO (1454), u-tkb (1470), ORGANIZER (1339), JAPIO (1462)
EHR (1407)         -  ≫  ≫  ≫  ≫
EHR (1406)            -  ≫  ≫  ≫
JAPIO (1454)             ≫  ≫  ≫
u-tkb (1470)                -  ≫
ORGANIZER (1339)               ≫

Table 13: Statistical significance testing of the JPC-JE (left) and JPC-EJ (right) Pairwise scores.

JPC-JC — column: u-tkb (1465)
ORGANIZER (1340)   ≫

JPC-CJ — columns, left to right: EHR (1414), EHR (1408), JAPIO (1447), u-tkb (1468), ORGANIZER (1341)
JAPIO (1484)       ≫  ≫  ≫  ≫  ≫
EHR (1414)            -  ≫  ≫  ≫
EHR (1408)               ≫  ≫  ≫
JAPIO (1447)                ≫  ≫
u-tkb (1468)                   -

Table 14: Statistical significance testing of the JPC-JC (left) and JPC-CJ (right) Pairwise scores.
Columns, left to right: JAPIO (1450), EHR (1417), EHR (1416), ORGANIZER (1344)
JAPIO (1448)   -  ≫  ≫  ≫
JAPIO (1450)      ≫  ≫  ≫
EHR (1417)           ≫  ≫
EHR (1416)              ≫

Table 15: Statistical significance testing of the JPC-KJ Pairwise scores.

IITBC-HE: XMUNLP (1511)  ≫  IITB-MTG (1726)
IITBC-EH: XMUNLP (1576)  ≫  IITB-MTG (1725)

Table 16: Statistical significance testing of the IITBC-HE (left) and IITBC-EH (right) Pairwise scores.
JIJI-JE — columns, left to right: ORGANIZER (1526), NTT (1599), NTT (1677), XMUNLP (1442), ORGANIZER (1396), NICT-2 (1474), NICT-2 (1473), CUNI (1668)
ORGANIZER (1523)   ≫  ≫  ≫  ≫  ≫  ≫  ≫  ≫
ORGANIZER (1526)      ≫  ≫  ≫  ≫  ≫  ≫  ≫
NTT (1599)               ≫  ≫  ≫  ≫  ≫  ≫
NTT (1677)                  ≫  ≫  ≫  ≫  ≫
XMUNLP (1442)                  ≫  ≫  ≫  ≫
ORGANIZER (1396)                  >  ≫  ≫
NICT-2 (1474)                        ≫  ≫
NICT-2 (1473)                           ≫

JIJI-EJ — columns, left to right: ORGANIZER (1514), NTT (1679), NTT (1603), XMUNLP (1443), ORGANIZER (1395)
ORGANIZER (1518)   ≫  ≫  ≫  ≫  ≫
ORGANIZER (1514)      ≫  ≫  ≫  ≫
NTT (1679)               ≫  ≫  ≫
NTT (1603)                  -  ≫
XMUNLP (1443)                  -

Table 17: Statistical significance testing of the JIJI-JE (left) and JIJI-EJ (right) Pairwise scores.
RECIPE-TTL-JE — columns, left to right: XMUNLP (1637), ORGANIZER (1531)
ORGANIZER (1534)   -  ≫
XMUNLP (1637)         ≫

RECIPE-TTL-EJ — columns, left to right: ORGANIZER (1533), ORGANIZER (1528)
XMUNLP (1636)      ≫  ≫
ORGANIZER (1533)      ≫

Table 18: Statistical significance testing of the RECIPE-TTL-JE (left) and RECIPE-TTL-EJ (right) Pairwise scores.
RECIPE-ING-JE — columns, left to right: ORGANIZER (1544), ORGANIZER (1539)
XMUNLP (1635)      ≫  ≫
ORGANIZER (1544)      ≫

RECIPE-ING-EJ — columns, left to right: ORGANIZER (1542), ORGANIZER (1537)
XMUNLP (1634)      ≫  ≫
ORGANIZER (1542)      ≫

Table 19: Statistical significance testing of the RECIPE-ING-JE (left) and RECIPE-ING-EJ (right) Pairwise scores.
RECIPE-STE-JE — columns, left to right: ORGANIZER (1551), ORGANIZER (1548)
XMUNLP (1632)      ≫  ≫
ORGANIZER (1551)      ≫

RECIPE-STE-EJ — columns, left to right: ORGANIZER (1549), ORGANIZER (1546)
XMUNLP (1633)      ≫  ≫
ORGANIZER (1549)      ≫

Table 20: Statistical significance testing of the RECIPE-STE-JE (left) and RECIPE-STE-EJ (right) Pairwise scores.
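The pairwise significance results in the tables above are computed with bootstrap resampling over sentence-level judgments (Koehn, 2004). As a reference only — this is a minimal sketch of the general technique, not the organizers' actual evaluation script, and `paired_bootstrap` and its parameters are illustrative names:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A's total
    score beats system B's, given per-sentence scores for both
    systems on the same test set (Koehn 2004-style test)."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # resample sentence indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples
```

A win fraction above 0.99 would correspond to the "≫" (p < 0.01) level and above 0.95 to ">" (p < 0.05), under the usual one-sided reading of this test.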
ASPEC-JE: Online D (1333) 0.230; AIAYN (1736) 0.204; Kyoto-U (1717) 0.217; Kyoto-U (1733) 0.204; TMU (1695) 0.188; TMU (1703) 0.191; NTT (1616) 0.201; NTT (1681) 0.173; NICT-2 (1476) 0.274; NICT-2 (1480) 0.257; CUNI (1665) 0.241; ave. 0.216
ASPEC-EJ: Online A (1334) 0.290; AIAYN (1737) 0.338; Kyoto-U (1731) 0.321; TMU (1704) 0.269; TMU (1709) 0.260; NTT (1684) 0.353; NTT (1729) 0.341; NICT-2 (1475) 0.315; NICT-2 (1479) 0.395; UT-IIS (1710) 0.305; NAIST-NICT (1506) 0.301; NAIST-NICT (1507) 0.339; ave. 0.319
ASPEC-JC: Online D (1336) 0.189; AIAYN (1738) 0.183; Kyoto-U (1642) 0.159; Kyoto-U (1722) 0.128; TMU (1743) 0.171; NICT-2 (1478) 0.222; NICT-2 (1483) 0.194; ave. 0.178
ASPEC-CJ: Online A (1342) 0.215; AIAYN (1740) 0.310; Kyoto-U (1577) 0.284; Kyoto-U (1720) 0.254; NICT-2 (1477) 0.191; NICT-2 (1481) 0.279; ave. 0.255
JPC-JE: Online A (1338) 0.424; JAPIO (1574) 0.280; JAPIO (1578) 0.296; CUNI (1666) 0.249; u-tkb (1472) 0.380; ave. 0.326
JPC-EJ: Online A (1339) 0.410; EHR (1406) 0.364; EHR (1407) 0.385; JAPIO (1454) 0.409; JAPIO (1462) 0.280; u-tkb (1470) 0.349; ave. 0.366
JPC-JC: Online A (1340) 0.185; u-tkb (1465) 0.176; ave. 0.180
JPC-CJ: Online A (1341) 0.194; EHR (1408) 0.201; EHR (1414) 0.170; JAPIO (1447) 0.257; JAPIO (1484) 0.247; u-tkb (1468) 0.172; ave. 0.207
JPC-KJ: Online A (1344) 0.257; EHR (1416) 0.413; EHR (1417) 0.459; JAPIO (1448) 0.224; JAPIO (1450) 0.235; ave. 0.317
IITBC-HE: XMUNLP (1511) 0.376; IITB-MTG (1726) 0.626; ave. 0.501
IITBC-EH: XMUNLP (1576) 0.269; IITB-MTG (1725) 0.371; ave. 0.320
JIJI-JE: Hiero (1396) 0.117; Online A (1523) 0.035; RBMT B (1526) 0.004; NTT (1599) 0.095; NTT (1677) 0.077; NICT-2 (1473) 0.078; NICT-2 (1474) 0.064; XMUNLP (1442) 0.070; CUNI (1668) 0.060; ave. 0.067
JIJI-EJ: Hiero (1395) 0.104; RBMT A (1514) 0.167; Online A (1518) 0.179; NTT (1603) 0.189; NTT (1679) 0.155; XMUNLP (1443) 0.151; ave. 0.157
RECIPE-TTL-JE: RBMT B (1531) 0.305; Online A (1534) 0.333; XMUNLP (1637) 0.366; ave. 0.334
RECIPE-TTL-EJ: RBMT B (1528) 0.340; Online B (1533) 0.356; XMUNLP (1636) 0.341; ave. 0.345
RECIPE-STE-JE: RBMT B (1548) 0.290; Online A (1551) 0.289; XMUNLP (1632) 0.261; ave. 0.280
RECIPE-STE-EJ: RBMT B (1546) 0.108; Online A (1549) 0.138; XMUNLP (1633) 0.162; ave. 0.136
RECIPE-ING-JE: RBMT B (1539) 0.537; Online B (1544) 0.551; XMUNLP (1635) 0.614; ave. 0.567
RECIPE-ING-EJ: RBMT B (1537) 0.665; Online B (1542) 0.515; XMUNLP (1634) 0.618; ave. 0.599

Table 21: The Fleiss' kappa values for the pairwise evaluation results.
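Table 21 reports Fleiss' kappa (Fleiss et al., 1971) over the pairwise human judgments. For reference, a minimal sketch of the statistic itself — this is the textbook computation, not the organizers' actual tooling, and the function name and input layout are illustrative:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a rating matrix.

    counts[i][j] = number of raters who assigned item i to
    category j; every item must be rated by the same number
    of raters n."""
    N = len(counts)            # number of items
    n = sum(counts[0])         # raters per item
    k = len(counts[0])         # number of categories
    # mean per-item observed agreement
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N
    # chance agreement from marginal category proportions
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement among raters yields kappa = 1, and kappa = 0 indicates agreement no better than chance; the low JIJI values in Table 21 thus reflect near-chance rater agreement on that subtask.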
SYSTEM              ID    METHOD  OTHER RESOURCES  BLEU   RIBES     AMFM      Pair
SMT Hiero           2     SMT     NO               18.72  0.651066  0.588880  +7.75
SMT Phrase          6     SMT     NO               18.45  0.645137  0.590950  —
SMT S2T             9     SMT     NO               20.36  0.678253  0.593410  +25.50
Online D (2014)     35    Other   YES              15.08  0.643588  0.564170  +13.75
RBMT E              76    Other   YES              14.82  0.663851  0.561620  —
RBMT F              79    Other   YES              13.86  0.661387  0.556840  —
Online C (2014)     87    Other   YES              10.64  0.624827  0.466480  —
RBMT D (2014)       96    Other   YES              15.29  0.683378  0.551690  +23.00
Online D (2015)     775   Other   YES              16.85  0.676609  0.562270  +0.25
SMT S2T             877   SMT     NO               20.36  0.678253  0.593410  +7.00
RBMT D (2015)       887   Other   YES              15.29  0.683378  0.551690  +16.75
Online C (2015)     892   Other   YES              10.29  0.622564  0.453370  —
Online D (2016)     1042  Other   YES              16.91  0.677412  0.564270  +28.00
Online D (2016/11)  1333  NMT     YES              22.04  0.733483  0.584390  +63.00
AIAYN               1736  NMT     NO               28.06  0.767577  0.595580  +75.25
Kyoto-U 1           1717  NMT     NO               27.53  0.761403  0.585540  +77.75
Kyoto-U 2           1733  NMT     NO               27.66  0.765464  0.591160  +74.50
TMU 1               1695  NMT     NO               21.00  0.725284  0.585710  +56.75
TMU 2               1703  NMT     NO               23.03  0.741175  0.595260  +61.00
NTT 1               1616  NMT     NO               27.43  0.764831  0.597620  +75.00
NTT 2               1681  NMT     NO               28.36  0.768880  0.597860  +77.25
NICT-2 1            1476  NMT     NO               24.79  0.747335  0.574810  +68.75
NICT-2 2            1480  NMT     NO               26.76  0.741329  0.578150  +69.75
CUNI 1              1665  NMT     NO               23.43  0.741699  0.583780  +66.00

Table 22: ASPEC-JE submissions
BLEU and RIBES are reported for juman / kytea / mecab segmentations; AMFM is identical across the three.

SYSTEM              ID    METHOD  OTHER RESOURCES  BLEU (j/k/m)           RIBES (j/k/m)                   AMFM      Pair
SMT Phrase          5     SMT     NO               27.48 / 29.80 / 28.27  0.683735 / 0.691926 / 0.695390  0.736380  —
SMT T2S             12    SMT     NO               31.05 / 33.44 / 32.10  0.748883 / 0.758031 / 0.760516  0.744370  +34.25
Online A (2014)     34    Other   YES              19.66 / 21.63 / 20.17  0.718019 / 0.723486 / 0.725848  0.695420  +42.50
RBMT B (2014)       66    Other   YES              13.18 / 14.85 / 13.48  0.671958 / 0.680748 / 0.682683  0.622930  +0.75
RBMT A              68    Other   YES              12.86 / 14.43 / 13.16  0.670167 / 0.676464 / 0.678934  0.626940  —
Online B (2014)     91    Other   YES              17.04 / 18.67 / 17.36  0.687797 / 0.693390 / 0.698126  0.643070  —
RBMT C              95    Other   YES              12.19 / 13.32 / 12.14  0.668372 / 0.672645 / 0.676018  0.594380  —
SMT Hiero           367   SMT     NO               30.19 / 32.56 / 30.94  0.734705 / 0.746978 / 0.747722  0.743900  +31.50
Online A (2015)     774   Other   YES              18.22 / 19.77 / 18.46  0.705882 / 0.713960 / 0.718150  0.677200  +34.25
SMT T2S             875   SMT     NO               31.05 / 33.44 / 32.10  0.748883 / 0.758031 / 0.760516  0.744370  +30.00
RBMT B (2015)       883   Other   YES              13.18 / 14.85 / 13.48  0.671958 / 0.680748 / 0.682683  0.622930  +9.75
Online B (2015)     889   Other   YES              17.80 / 19.52 / 18.11  0.693359 / 0.701966 / 0.703859  0.646160  —
Online A (2016)     1041  Other   YES              18.28 / 19.81 / 18.51  0.706639 / 0.715222 / 0.718559  0.677020  +49.75
Online A (2016/11)  1334  NMT     YES              26.19 / 28.22 / 26.68  0.776787 / 0.780217 / 0.782674  0.727040  +74.00
AIAYN               1737  NMT     NO               40.79 / 42.55 / 41.50  0.844896 / 0.847559 / 0.851471  0.768630  +69.75

Table 23: ASPEC-EJ submissions (Organizer)
BLEU and RIBES are reported for juman / kytea / mecab segmentations; AMFM is identical across the three.

SYSTEM        ID    METHOD  OTHER RESOURCES  BLEU (j/k/m)           RIBES (j/k/m)                   AMFM      Pair
Kyoto-U 1     1731  NMT     NO               38.72 / 40.65 / 39.37  0.832472 / 0.835870 / 0.839646  0.754220  +69.75
TMU 1         1704  NMT     NO               32.65 / 35.05 / 33.72  0.802262 / 0.809649 / 0.811057  0.740620  +50.75
TMU 2         1709  NMT     NO               34.05 / 36.69 / 35.32  0.812926 / 0.818443 / 0.821563  0.744890  +56.50
NTT 1         1684  NMT     NO               39.80 / 42.27 / 40.47  0.835806 / 0.839981 / 0.844326  0.757740  +72.25
NTT 2         1729  NMT     NO               40.32 / 42.80 / 40.95  0.838594 / 0.841769 / 0.846486  0.762170  +75.75
NICT-2 1      1475  NMT     NO               36.85 / 38.94 / 37.87  0.826791 / 0.834448 / 0.835255  0.759570  +62.00
NICT-2 2      1479  NMT     NO               40.17 / 42.25 / 41.17  0.842206 / 0.848170 / 0.849929  0.765580  +74.75
UT-IIS 1      1710  NMT     NO               36.26 / 38.93 / 37.06  0.827891 / 0.832054 / 0.836394  0.746910  +68.00
NAIST-NICT 1  1506  NMT     NO               36.47 / 38.54 / 37.30  0.821989 / 0.827225 / 0.830116  0.763310  +63.50
NAIST-NICT 2  1507  NMT     NO               38.25 / 40.29 / 39.05  0.834492 / 0.839321 / 0.842337  0.770480  +70.00

Table 24: ASPEC-EJ submissions (Participants)
BLEU and RIBES are reported for kytea / stanford(ctb) / stanford(pku) segmentations; AMFM is identical across the three.

SYSTEM              ID    METHOD  OTHER RESOURCES  BLEU (ky/ctb/pku)      RIBES (ky/ctb/pku)              AMFM      Pair
SMT Hiero           3     SMT     NO               27.71 / 27.70 / 27.35  0.809128 / 0.809561 / 0.811394  0.745100  +3.75
SMT Phrase          7     SMT     NO               27.96 / 28.01 / 27.68  0.788961 / 0.790263 / 0.790937  0.749450  —
SMT S2T             10    SMT     NO               28.65 / 28.65 / 28.35  0.807606 / 0.809457 / 0.808417  0.755230  +14.00
Online D (2014)     37    Other   YES              9.37 / 8.93 / 8.84     0.606905 / 0.606328 / 0.604149  0.625430  -14.50
Online C (2014)     216   Other   YES              7.26 / 7.01 / 6.72     0.612808 / 0.613075 / 0.611563  0.587820  —
RBMT B (2014)       243   RBMT    NO               17.86 / 17.75 / 17.49  0.744818 / 0.745885 / 0.743794  0.667960  -20.00
RBMT C              244   RBMT    NO               9.62 / 9.96 / 9.59     0.642278 / 0.648758 / 0.645385  0.594900  —
Online D (2015)     777   Other   YES              10.73 / 10.33 / 10.08  0.660484 / 0.660847 / 0.660482  0.634090  -14.75
SMT S2T             881   SMT     NO               28.65 / 28.65 / 28.35  0.807606 / 0.809457 / 0.808417  0.755230  +7.75
RBMT B (2015)       886   Other   YES              17.86 / 17.75 / 17.49  0.744818 / 0.745885 / 0.743794  0.667960  -11.00
Online C (2015)     891   Other   YES              7.44 / 7.05 / 6.75     0.611964 / 0.615048 / 0.612158  0.566060  —
Online D (2016)     1045  Other   YES              11.16 / 10.72 / 10.54  0.665185 / 0.667382 / 0.666953  0.639440  -26.00
Online D (2016/11)  1336  NMT     YES              15.94 / 15.68 / 15.38  0.728453 / 0.728270 / 0.728284  0.673730  +17.75
AIAYN               1738  NMT     NO               34.97 / 34.96 / 34.72  0.850199 / 0.850052 / 0.848394  0.787250  +70.50
Kyoto-U 1           1642  NMT     NO               35.67 / 35.30 / 35.40  0.849464 / 0.848107 / 0.848318  0.779400  +71.50
Kyoto-U 2           1722  NMT     NO               35.31 / 35.37 / 35.06  0.850103 / 0.849168 / 0.847879  0.785420  +72.50
TMU 1               1743  NMT     NO               22.92 / 22.86 / 22.74  0.798681 / 0.798736 / 0.797969  0.700030  +4.25
NICT-2 1            1478  NMT     NO               33.72 / 33.64 / 33.60  0.847223 / 0.846578 / 0.846158  0.779870  +67.25
NICT-2 2            1483  NMT     NO               35.23 / 35.23 / 35.14  0.852084 / 0.851893 / 0.851548  0.785820  +69.50

Table 25: ASPEC-JC submissions
BLEU and RIBES are reported for juman / kytea / mecab segmentations; AMFM is identical across the three.

SYSTEM              ID    METHOD  OTHER RESOURCES  BLEU (j/k/m)           RIBES (j/k/m)                   AMFM      Pair
SMT Hiero           4     SMT     NO               35.43 / 35.91 / 35.64  0.810406 / 0.798726 / 0.807665  0.750950  +4.75
SMT Phrase          8     SMT     NO               34.65 / 35.16 / 34.77  0.772498 / 0.766384 / 0.771005  0.753010  —
SMT T2S             13    SMT     NO               36.52 / 37.07 / 36.64  0.825292 / 0.820490 / 0.825025  0.754870  +16.00
Online A (2014)     36    Other   YES              11.63 / 13.21 / 11.87  0.595925 / 0.598172 / 0.598573  0.658060  -21.75
Online B (2014)     215   Other   YES              10.48 / 11.26 / 10.47  0.600733 / 0.596006 / 0.600706  0.636930  —
RBMT A (2014)       239   RBMT    NO               9.37 / 9.87 / 9.35     0.666277 / 0.652402 / 0.661730  0.626070  -37.75
RBMT D              242   RBMT    NO               8.39 / 8.70 / 8.30     0.641189 / 0.626400 / 0.633319  0.586790  —
Online A (2015)     776   Other   YES              11.53 / 12.82 / 11.68  0.588285 / 0.590393 / 0.592887  0.649860  -19.00
SMT T2S             879   SMT     NO               36.52 / 37.07 / 36.64  0.825292 / 0.820490 / 0.825025  0.754870  +17.25
RBMT A (2015)       885   Other   YES              9.37 / 9.87 / 9.35     0.666277 / 0.652402 / 0.661730  0.626070  -28.00
Online B (2015)     890   Other   YES              10.41 / 11.03 / 10.36  0.597355 / 0.592841 / 0.597298  0.628290  —
Online A (2016)     1043  Other   YES              11.56 / 12.87 / 11.69  0.589802 / 0.589397 / 0.593361  0.659540  -51.25
Online A (2016/11)  1342  NMT     YES              18.75 / 20.64 / 19.04  0.719022 / 0.717173 / 0.720095  0.692820  +22.50
AIAYN               1740  NMT     NO               46.87 / 47.30 / 47.00  0.880815 / 0.875511 / 0.880368  0.798110  +78.50
Kyoto-U 1           1577  NMT     NO               48.43 / 48.84 / 48.51  0.883457 / 0.878964 / 0.884137  0.799520  +79.50
Kyoto-U 2           1720  NMT     NO               48.34 / 48.76 / 48.40  0.884210 / 0.880069 / 0.884745  0.799840  +82.75
NICT-2 1            1477  NMT     NO               44.26 / 44.90 / 44.50  0.871438 / 0.868359 / 0.871736  0.788940  +78.00
NICT-2 2            1481  NMT     NO               46.84 / 47.51 / 47.27  0.882356 / 0.878580 / 0.882195  0.799680  +79.00

Table 26: ASPEC-CJ submissions
SYSTEM              ID    METHOD  OTHER RESOURCES  BLEU   RIBES     AMFM      Pair
SMT Phrase          977   SMT     NO               30.80  0.730056  0.664830  —
SMT Hiero           979   SMT     NO               32.23  0.763030  0.672500  +8.75
SMT S2T             980   SMT     NO               34.40  0.793483  0.672760  +23.00
Online A (2016)     1035  Other   YES              35.77  0.803661  0.673950  +32.25
Online B (2016)     1051  Other   YES              16.00  0.688004  0.486450  —
RBMT C (2016)       1088  Other   YES              21.00  0.755017  0.519210  —
RBMT A (2016)       1090  Other   YES              21.57  0.750381  0.521230  +23.75
RBMT B (2016)       1095  Other   YES              18.38  0.710992  0.518110  —
Online A (2016/11)  1338  NMT     YES              49.35  0.878342  0.722590  +71.50
JAPIO 1             1574  NMT     YES              49.00  0.878298  0.724710  +68.50
JAPIO 2             1578  NMT     YES              48.08  0.873093  0.715560  +67.00
CUNI 1              1666  SMT     NO               38.29  0.837425  0.681520  +58.00
u-tkb 1             1472  NMT     NO               37.31  0.841136  0.697290  +51.50

Table 27: JPC-JE submissions
BLEU and RIBES are reported for juman / kytea / mecab segmentations; AMFM is identical across the three.

SYSTEM              ID    METHOD  OTHER RESOURCES  BLEU (j/k/m)           RIBES (j/k/m)                   AMFM      Pair
SMT Phrase          973   SMT     NO               32.36 / 34.26 / 32.52  0.728539 / 0.728281 / 0.729077  0.711900  —
SMT Hiero           974   SMT     NO               34.57 / 36.61 / 34.79  0.777759 / 0.778657 / 0.779049  0.715300  +21.00
SMT T2S             975   SMT     NO               35.60 / 37.65 / 35.82  0.797353 / 0.796783 / 0.798025  0.717030  +30.75
Online A (2016)     1036  Other   YES              36.88 / 37.89 / 36.83  0.798168 / 0.792471 / 0.796308  0.719110  +20.00
Online B (2016)     1073  Other   YES              21.57 / 22.62 / 21.65  0.743083 / 0.735203 / 0.740962  0.659950  —
RBMT D (2016)       1085  Other   YES              23.02 / 24.90 / 23.45  0.761224 / 0.757341 / 0.760325  0.647730  —
RBMT F (2016)       1086  Other   YES              26.64 / 28.48 / 26.84  0.773673 / 0.769244 / 0.773344  0.675470  +12.75
RBMT E (2016)       1087  Other   YES              21.35 / 23.17 / 21.53  0.743484 / 0.741985 / 0.742300  0.646930  —
Online A (2016/11)  1339  NMT     YES              50.60 / 51.65 / 50.83  0.879382 / 0.877336 / 0.878316  0.770480  +48.50
EHR 1               1406  NMT     NO               44.44 / 45.59 / 44.15  0.860998 / 0.858466 / 0.860659  0.747050  +58.25
EHR 2               1407  NMT     NO               44.63 / 45.94 / 44.53  0.866722 / 0.864256 / 0.866205  0.747770  +60.00
JAPIO 1             1454  NMT     YES              50.27 / 51.23 / 50.17  0.886403 / 0.883481 / 0.885747  0.776790  +56.25
JAPIO 2             1462  SMT     YES              51.79 / 52.23 / 51.75  0.864038 / 0.861596 / 0.862200  0.781150  +41.00
u-tkb 1             1470  NMT     NO               38.91 / 41.12 / 39.11  0.845815 / 0.846888 / 0.845551  0.734010  +49.50

Table 28: JPC-EJ submissions
BLEU and RIBES are reported for kytea / stanford(ctb) / stanford(pku) segmentations; AMFM is identical across the three.

SYSTEM              ID    METHOD  OTHER RESOURCES  BLEU (ky/ctb/pku)      RIBES (ky/ctb/pku)              AMFM      Pair
SMT Phrase          966   SMT     NO               30.60 / 32.03 / 31.25  0.787321 / 0.797888 / 0.794388  0.710940  —
SMT Hiero           967   SMT     NO               30.26 / 31.57 / 30.91  0.788415 / 0.799118 / 0.796685  0.718360  +4.75
SMT S2T             968   SMT     NO               31.05 / 32.35 / 31.70  0.793846 / 0.802805 / 0.800848  0.720030  +4.25
Online A (2016)     1038  Other   YES              23.02 / 23.57 / 23.29  0.754241 / 0.760672 / 0.760148  0.702350  -23.00
Online B (2016)     1069  Other   YES              9.42 / 9.59 / 8.79     0.642026 / 0.651070 / 0.643520  0.527180  —
RBMT C (2016)       1118  Other   YES              12.35 / 13.72 / 13.17  0.688240 / 0.708681 / 0.700210  0.475430  -41.25
Online A (2016/11)  1340  NMT     YES              33.04 / 33.92 / 33.34  0.824829 / 0.829122 / 0.829067  0.735470  +32.50
u-tkb 1             1465  NMT     NO               31.80 / 33.19 / 32.83  0.819791 / 0.826055 / 0.825025  0.706720  +21.75

Table 29: JPC-JC submissions
BLEU and RIBES are reported for juman / kytea / mecab segmentations; AMFM is identical across the three.

SYSTEM              ID    METHOD  OTHER RESOURCES  BLEU (j/k/m)           RIBES (j/k/m)                   AMFM      Pair
SMT Hiero           430   SMT     NO               39.22 / 39.52 / 39.14  0.806058 / 0.802059 / 0.804523  0.729370  —
SMT Phrase          431   SMT     NO               38.34 / 38.51 / 38.22  0.782019 / 0.778921 / 0.781456  0.723110  —
SMT T2S             432   SMT     NO               39.39 / 39.90 / 39.39  0.814919 / 0.811350 / 0.813595  0.725920  +20.75
Online A (2015)     647   Other   YES              26.80 / 27.81 / 26.89  0.712242 / 0.707264 / 0.711273  0.693840  -7.00
Online B (2015)     648   Other   YES              12.33 / 12.72 / 12.44  0.648996 / 0.641255 / 0.648742  0.588380  —
RBMT A (2015)       759   RBMT    NO               10.49 / 10.72 / 10.35  0.674060 / 0.664098 / 0.667349  0.557130  -39.25
RBMT B              760   RBMT    NO               7.94 / 8.07 / 7.73     0.596200 / 0.581837 / 0.586941  0.502100  —
Online A (2016)     1040  Other   YES              26.99 / 27.91 / 27.02  0.707739 / 0.702718 / 0.706707  0.693720  -19.75
Online A (2016/11)  1341  NMT     YES              42.66 / 43.76 / 42.95  0.845858 / 0.844918 / 0.845794  0.747240  +54.25
EHR 1               1408  NMT     NO               47.08 / 47.44 / 46.83  0.859070 / 0.856376 / 0.858888  0.756350  +68.25
EHR 2               1414  NMT     NO               46.52 / 47.17 / 46.35  0.859619 / 0.856784 / 0.858353  0.761370  +69.75
JAPIO 1             1447  SMT     YES              50.52 / 51.25 / 50.57  0.847793 / 0.843774 / 0.846081  0.774660  +60.50
JAPIO 2             1484  NMT     YES              50.06 / 50.51 / 50.00  0.875398 / 0.873390 / 0.874822  0.779420  +80.25
u-tkb 1             1468  NMT     NO               38.79 / 40.47 / 38.99  0.832144 / 0.833610 / 0.831209  0.729580  +55.50

Table 30: JPC-CJ submissions
BLEU and RIBES are reported for juman / kytea / mecab segmentations; AMFM is identical across the three.

SYSTEM              ID    METHOD  OTHER RESOURCES  BLEU (j/k/m)           RIBES (j/k/m)                   AMFM      Pair
SMT Phrase          438   SMT     NO               69.22 / 70.36 / 69.73  0.941302 / 0.939729 / 0.940756  0.856220  —
SMT Hiero           439   SMT     NO               67.41 / 68.65 / 68.00  0.937162 / 0.935903 / 0.936570  0.850560  +2.75
Online B (2015)     651   Other   YES              36.41 / 38.72 / 37.01  0.851745 / 0.852263 / 0.851945  0.728750  —
Online A (2015)     652   Other   YES              55.05 / 56.84 / 55.46  0.909152 / 0.909385 / 0.908838  0.800460  +38.75
RBMT A (2015)       653   Other   YES              42.00 / 43.97 / 42.45  0.876396 / 0.873734 / 0.875146  0.712020  -7.25
RBMT B              654   Other   YES              34.74 / 37.51 / 35.54  0.845712 / 0.849014 / 0.846228  0.643150  —
Online A (2015)     963   Other   YES              55.05 / 56.84 / 55.46  0.909152 / 0.909385 / 0.908838  0.800610  —
RBMT A (2015)       964   Other   YES              42.00 / 43.97 / 42.45  0.876396 / 0.873734 / 0.875146  0.712700  —
Online A (2016)     1039  Other   YES              54.78 / 56.68 / 55.14  0.907320 / 0.907652 / 0.906743  0.798750  +8.00
Online A (2016/11)  1344  NMT     NO               44.42 / 45.14 / 44.72  0.857642 / 0.854158 / 0.857083  0.783850  -55.75
EHR 1               1416  NMT     NO               71.52 / 72.34 / 71.82  0.944516 / 0.942940 / 0.944219  0.866060  +6.25
EHR 2               1417  NMT     NO               71.36 / 72.26 / 71.65  0.946126 / 0.944812 / 0.945888  0.871110  +11.25
JAPIO 1             1448  SMT     YES              73.00 / 73.71 / 73.23  0.946880 / 0.945754 / 0.946645  0.872510  +48.75
JAPIO 2             1450  SMT     YES              73.00 / 73.73 / 73.25  0.946985 / 0.945841 / 0.946745  0.873200  +48.50

Table 31: JPC-KJ submissions
SYSTEM           ID    METHOD  OTHER RESOURCES  BLEU   RIBES     AMFM      Pair
Online A (2016)  1031  Other   YES              21.37  0.714537  0.621100  +44.75
Online B (2016)  1048  Other   YES              15.58  0.683214  0.590520  +14.00
SMT Phrase       1054  SMT     NO               10.32  0.638090  0.574850  0.00
XMUNLP 1         1511  NMT     NO               22.44  0.750921  0.629530  +68.25
IITB-MTG 1       1726  NMT     NO               11.55  0.682902  0.557040  +21.00

Table 32: IITB-HE submissions

SYSTEM           ID    METHOD  OTHER RESOURCES  BLEU   RIBES     AMFM      Pair
Online A (2016)  1032  Other   YES              18.72  0.716788  0.670660  +57.25
Online B (2016)  1047  Other   YES              16.97  0.691298  0.668450  +42.50
SMT Phrase       1252  SMT     NO               10.79  0.651166  0.660860  —
XMUNLP 1         1576  NMT     NO               21.39  0.749660  0.688770  +64.50
IITB-MTG 1       1725  NMT     NO               12.23  0.688606  0.624780  +28.75

Table 33: IITB-EH submissions
SYSTEM      ID    METHOD  OTHER RESOURCES  BLEU   RIBES     AMFM      Pair
SMT Phrase  1394  SMT     NO               15.11  0.554550  0.475740  —
SMT Hiero   1396  SMT     NO               15.67  0.558225  0.470610  +10.25
SMT S2T     1398  SMT     NO               14.54  0.556728  0.477170  —
ONLINE-A 1  1523  NMT     NO               8.19   0.529844  0.450850  +70.00
RBMT-A      1525  RBMT    NO               4.36   0.472312  0.391050  —
RBMT-B      1526  RBMT    NO               4.67   0.475760  0.385600  +51.75
NTT 1       1599  NMT     NO               19.44  0.638841  0.476200  +32.00
NTT 2       1677  NMT     NO               20.90  0.648931  0.474360  +26.75
NICT-2 1    1473  NMT     NO               16.52  0.642379  0.459000  +0.25
NICT-2 2    1474  NMT     NO               18.19  0.632638  0.453420  +7.25
XMUNLP 1    1442  NMT     NO               17.95  0.637059  0.465710  +20.75
CUNI 1      1668  SMT     NO               10.67  0.564797  0.432700  -24.00

Table 34: JIJI-JE submissions

BLEU and RIBES are reported for juman / kytea / mecab segmentations; AMFM is identical across the three.

SYSTEM      ID    METHOD  OTHER RESOURCES  BLEU (j/k/m)           RIBES (j/k/m)                   AMFM      Pair
SMT Phrase  1393  SMT     NO               15.77 / 16.65 / 15.76  0.580284 / 0.584892 / 0.585437  0.545240  —
SMT Hiero   1395  SMT     NO               16.22 / 16.95 / 16.22  0.594923 / 0.601505 / 0.602516  0.550260  +10.25
SMT T2S     1397  SMT     NO               14.95 / 15.38 / 14.79  0.594072 / 0.597791 / 0.599530  0.530370  —
RBMT-A      1514  RBMT    NO               5.31 / 6.68 / 5.69     0.505227 / 0.515050 / 0.513580  0.473940  +31.25
RBMT-B      1515  RBMT    NO               4.72 / 5.98 / 4.97     0.518416 / 0.531603 / 0.532079  0.487320  —
ONLINE-A 1  1518  NMT     NO               11.29 / 13.12 / 11.84  0.597473 / 0.605532 / 0.603374  0.533120  +69.75
NTT 1       1603  NMT     NO               19.13 / 20.47 / 19.41  0.668517 / 0.670920 / 0.676594  0.536970  +14.50
NTT 2       1679  NMT     NO               20.37 / 21.82 / 20.68  0.680598 / 0.684048 / 0.688863  0.537800  +17.75
XMUNLP 1    1443  NMT     NO               19.61 / 20.72 / 20.14  0.684120 / 0.688497 / 0.691056  0.546360  +11.75

Table 35: JIJI-EJ submissions
SYSTEM      ID    METHOD  OTHER RESOURCES  BLEU   RIBES     AMFM      Pair
RBMT-A      1538  RBMT    NO               11.14  0.484800  0.613950  —
RBMT-B      1539  RBMT    NO               12.52  0.520800  0.607190  -24.75
ONLINE-B 1  1544  NMT     NO               20.33  0.563419  0.656630  -15.50
SMT Phrase  1571  SMT     NO               44.42  0.830105  0.859040  —
XMUNLP 1    1635  NMT     NO               46.98  0.831261  0.854970  +3.50

Table 36: RECIPE-ING-JE submissions

BLEU and RIBES are reported for juman / kytea / mecab segmentations; AMFM is identical across the three.

SYSTEM      ID    METHOD  OTHER RESOURCES  BLEU (j/k/m)           RIBES (j/k/m)                   AMFM      Pair
RBMT-A      1536  RBMT    NO               4.52 / 4.74 / 4.29     0.365913 / 0.371584 / 0.355798  0.426550  —
RBMT-B      1537  RBMT    NO               5.44 / 4.95 / 5.07     0.385269 / 0.363344 / 0.372793  0.445370  -48.50
ONLINE-B 1  1542  NMT     NO               15.57 / 14.89 / 14.85  0.548026 / 0.544581 / 0.537147  0.618010  -24.25
SMT Phrase  1570  SMT     NO               31.39 / 30.61 / 29.60  0.749305 / 0.740283 / 0.741249  0.775770  —
XMUNLP 1    1634  NMT     NO               34.88 / 34.26 / 33.19  0.747521 / 0.742770 / 0.739909  0.778530  -3.75

Table 37: RECIPE-ING-EJ submissions
SYSTEM      ID    METHOD  OTHER RESOURCES  BLEU   RIBES     AMFM      Pair
RBMT-A      1547  RBMT    NO               5.37   0.546642  0.315930  —
RBMT-B      1548  RBMT    NO               5.82   0.565086  0.268580  -60.25
ONLINE-A 1  1551  NMT     NO               11.04  0.670484  0.415880  +2.75
SMT Phrase  1569  SMT     NO               22.84  0.705506  0.595290  —
XMUNLP 1    1632  NMT     NO               28.03  0.784235  0.598050  +40.50

Table 38: RECIPE-STE-JE submissions

BLEU and RIBES are reported for juman / kytea / mecab segmentations; AMFM is identical across the three.

SYSTEM      ID    METHOD  OTHER RESOURCES  BLEU (j/k/m)           RIBES (j/k/m)                   AMFM      Pair
RBMT-A      1545  RBMT    NO               3.06 / 4.55 / 3.43     0.518467 / 0.533326 / 0.521713  0.368710  —
RBMT-B      1546  RBMT    NO               3.30 / 5.19 / 4.00     0.525116 / 0.532919 / 0.529113  0.441180  -65.50
ONLINE-A 1  1549  NMT     NO               8.14 / 11.23 / 8.97    0.623865 / 0.635888 / 0.629012  0.526680  -29.25
SMT Phrase  1568  SMT     NO               17.60 / 21.43 / 18.53  0.694179 / 0.698499 / 0.695400  0.626610  —
XMUNLP 1    1633  NMT     NO               22.55 / 26.87 / 24.00  0.776539 / 0.776469 / 0.775689  0.645050  +45.50

Table 39: RECIPE-STE-EJ submissions
SYSTEM      ID    METHOD  OTHER RESOURCES  BLEU   RIBES     AMFM      Pair
RBMT-A      1530  RBMT    NO               0.53   0.086378  0.433380  —
RBMT-B      1531  RBMT    NO               0.56   0.100899  0.445400  -25.75
ONLINE-A 1  1534  NMT     NO               2.19   0.199338  0.509470  +10.25
SMT Phrase  1567  SMT     NO               9.72   0.451707  0.571230  —
XMUNLP 1    1637  NMT     NO               15.57  0.526993  0.542690  +10.25

Table 40: RECIPE-TTL-JE submissions

BLEU and RIBES are reported for juman / kytea / mecab segmentations; AMFM is identical across the three.

SYSTEM      ID    METHOD  OTHER RESOURCES  BLEU (j/k/m)           RIBES (j/k/m)                   AMFM      Pair
RBMT-A      1527  RBMT    NO               0.00 / 0.00 / 0.00     0.128134 / 0.137884 / 0.122371  0.187540  —
RBMT-B      1528  RBMT    NO               2.57 / 2.30 / 2.08     0.299133 / 0.300578 / 0.270482  0.402830  -50.25
ONLINE-B 1  1533  NMT     NO               16.16 / 15.85 / 15.40  0.573771 / 0.559142 / 0.532130  0.590440  +3.75
SMT Phrase  1566  SMT     NO               17.16 / 16.23 / 16.57  0.600503 / 0.576617 / 0.548811  0.571650  —
XMUNLP 1    1636  NMT     NO               19.41 / 18.87 / 18.78  0.592087 / 0.573466 / 0.558997  0.584980  +23.75

Table 41: RECIPE-TTL-EJ submissions
ReferencesRafael E. Banchs, Luis F. D’Haro, and Haizhou
Li. 2015. Adequacy-fluency metrics: Evaluat-ing mt in the continuous space model frame-work. IEEE/ACM Trans. Audio, Speech andLang. Proc., 23(3):472–482.
Jacob Cohen. 1968. Weighted kappa: Nominalscale agreement with provision for scaled dis-agreement or partial credit. Psychological Bul-letin, 70(4):213 – 220.
Fabien Cromieres, Raj Dabre, Toshiaki Nakazawa,and Sadao Kurohashi. 2017. Kyoto universityparticipation to wat 2017. In Proceedings of the4th Workshop on Asian Translation (WAT2017),pages 146–153, Taipei, Taiwan. Asian Federa-tion of Natural Language Processing.
Terumasa Ehara. 2017. Smt reranked nmt. In Pro-ceedings of the 4th Workshop on Asian Trans-lation (WAT2017), pages 119–126, Taipei, Tai-wan. Asian Federation of Natural Language Pro-cessing.
J.L. Fleiss et al. 1971. Measuring nominal scaleagreement among many raters. PsychologicalBulletin, 76(5):378–382.
Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H.Clark, and Philipp Koehn. 2013. Scalable mod-ified kneser-ney language model estimation. InProceedings of the 51st Annual Meeting of theAssociation for Computational Linguistics (Vol-ume 2: Short Papers), pages 690–696, Sofia,Bulgaria. Association for Computational Lin-guistics.
Hieu Hoang, Philipp Koehn, and Adam Lopez.2009. A unified framework for phrase-based,hierarchical, and syntax-based statistical ma-chine translation. In Proceedings of the Inter-national Workshop on Spoken Language Trans-lation, pages 152–159.
Kenji Imamura and Eiichiro Sumita. 2017. En-semble and reranking: Using multiple models inthe nict-2 neural machine translation system atwat2017. In Proceedings of the 4th Workshopon Asian Translation (WAT2017), pages 127–134, Taipei, Taiwan. Asian Federation of Natu-ral Language Processing.
Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Kat-suhito Sudoh, and Hajime Tsukada. 2010. Au-tomatic evaluation of translation quality for dis-tant language pairs. In Proceedings of the 2010Conference on Empirical Methods in NaturalLanguage Processing, EMNLP ’10, pages 944–952, Stroudsburg, PA, USA. Association forComputational Linguistics.
Satoshi Kinoshita, Tadaaki Oshio, and TomoharuMitsuhashi. 2017. Comparison of smt and nmttrained with large patent corpora: Japio at
wat2017. In Proceedings of the 4th Workshopon Asian Translation (WAT2017), pages 140–145, Taipei, Taiwan. Asian Federation of Natu-ral Language Processing.
Tom Kocmi, Dušan Variš, and Ondřej Bojar. 2017.Cuni nmt system for wat 2017 translation tasks.In Proceedings of the 4th Workshop on AsianTranslation (WAT2017), pages 154–159, Taipei,Taiwan. Asian Federation of Natural LanguageProcessing.
Philipp Koehn. 2004. Statistical significance testsfor machine translation evaluation. In Proceed-ings of EMNLP 2004, pages 388–395, Barcelona,Spain. Association for Computational Linguis-tics.
Philipp Koehn, Hieu Hoang, Alexandra Birch,Chris Callison-Burch, Marcello Federico, NicolaBertoldi, Brooke Cowan, Wade Shen, ChristineMoran, Richard Zens, Chris Dyer, Ondrej Bojar,Alexandra Constantin, and Evan Herbst. 2007.Moses: Open source toolkit for statistical ma-chine translation. In Annual Meeting of the As-sociation for Computational Linguistics (ACL),demonstration session.
T. Kudo. 2005. MeCab: Yet another part-of-speech and morphological analyzer. http://mecab.sourceforge.net/.
Sadao Kurohashi, Toshihisa Nakamura, Yuji Matsumoto, and Makoto Nagao. 1994. Improvements of Japanese morphological analyzer JUMAN. In Proceedings of The International Workshop on Sharable Natural Language, pages 22–28.
Zi Long, Ryuichiro Kimura, Takehito Utsuro, Tomoharu Mitsuhashi, and Mikio Yamamoto. 2017. Patent NMT integrated with large vocabulary phrase translation by SMT at WAT 2017. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 110–118, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Yukio Matsumura and Mamoru Komachi. 2017. Tokyo Metropolitan University neural machine translation system for WAT 2017. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 160–166, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Makoto Morishita, Jun Suzuki, and Masaaki Nagata. 2017. NTT neural machine translation systems at WAT 2017. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 89–94, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Toshiaki Nakazawa, Chenchen Ding, Hideya Mino, Isao Goto, Graham Neubig, and Sadao Kurohashi. 2016. Overview of the 3rd Workshop on Asian Translation. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 1–46, Osaka, Japan. The COLING 2016 Organizing Committee.
Toshiaki Nakazawa, Hideya Mino, Isao Goto, Sadao Kurohashi, and Eiichiro Sumita. 2014. Overview of the 1st Workshop on Asian Translation. In Proceedings of the 1st Workshop on Asian Translation (WAT2014), pages 1–19, Tokyo, Japan.
Toshiaki Nakazawa, Hideya Mino, Isao Goto, Graham Neubig, Sadao Kurohashi, and Eiichiro Sumita. 2015. Overview of the 2nd Workshop on Asian Translation. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), pages 1–28, Kyoto, Japan.
Masato Neishi, Jin Sakuma, Satoshi Tohda, Shonosuke Ishiwatari, Naoki Yoshinaga, and Masashi Toyoda. 2017. A bag of useful tricks for practical neural machine translation: Embedding layer initialization and large batch size. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 99–109, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Graham Neubig, Yosuke Nakata, and Shinsuke Mori. 2011. Pointwise prediction for robust, adaptable Japanese morphological analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 529–533, Stroudsburg, PA, USA. Association for Computational Linguistics.
Yusuke Oda, Katsuhito Sudoh, Satoshi Nakamura, Masao Utiyama, and Eiichiro Sumita. 2017. A simple and strong baseline: NAIST-NICT neural machine translation system for WAT2017 English-Japanese translation task. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 135–139, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318.
Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 433–440, Sydney, Australia. Association for Computational Linguistics.
Sandhya Singh, Ritesh Panjwani, Anoop Kunchukuttan, and Pushpak Bhattacharyya. 2017. Comparing recurrent and convolutional architectures for English-Hindi neural machine translation. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 167–170, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Huihsin Tseng. 2005. A conditional random field word segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing.
Masao Utiyama and Hitoshi Isahara. 2007. A Japanese-English patent parallel corpus. In MT Summit XI, pages 475–482.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.
Boli Wang, Zhixing Tan, Jinming Hu, Yidong Chen, and Xiaodong Shi. 2017. XMU neural machine translation systems for WAT 2017. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 95–98, Taipei, Taiwan. Asian Federation of Natural Language Processing.