Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora
using Alignment Model and Translation Lexicon
Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Graduate School of Informatics, Kyoto University
IJCNLP2013 (2013/10/17)
Outline
• Background
• Related Work
• Proposed Method
• Experiments
• Conclusion
Bilingual Corpora [Fung+ 2004]
Type | Definition | Example
Parallel | Sentence-aligned bilingual corpora | Europarl
Noisy Parallel | Bilingual translations of documents | Patent family
Comparable | Topic-aligned bilingual documents | Wikipedia
Quasi-Comparable | Very-non-parallel bilingual documents | this study
• Lack of parallel corpora
• Parallel sentences can be extracted from noisy and comparable corpora
• Quasi-comparable corpora are more available; however, few parallel sentences exist in them
Parallel Fragments
• In quasi-comparable corpora, there could be parallel fragments in comparable sentences
• Parallel fragments are also helpful for SMT
• We aim to accurately extract parallel fragments from comparable sentences
Zh: 应用 / 铅 / 离子 / 选择 / 电极 / 电位 / 滴定 / 法 / 测定 / 甘草 / 及 / 其 / 制品 / 中 / 的 / 甘草 / 酸 (Applying lead ion selective electrode potentiometric titration method to determine licorice and its products 's glycyrrhizic acid)
Ja: < / 原 / 報 / > / 鉛 / イオン / 選択 / 性 / 電極を / 用いる / 混合 / 試料 / 中 / の / … / と / 電位 / 差 / 滴定 / 法 / の / 比較 (<Original Report> lead ion selective electrode used mixed sample 's … and potentiometric titration method 's comparison)
Parallel Sub-sentential Fragment Extraction [Munteanu+ 2006]
1. Extract a translation lexicon from a parallel corpus
2. Apply a lexicon filter to comparable sentences in two directions independently
   – Assign initial scores according to the lexicon
   – Smooth the scores to gain new knowledge that does not exist in the lexicon
3. Extract sub-sentential (not exactly parallel) fragments
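Step 2 above can be sketched for a single direction as follows. This is only an illustration under stated assumptions, not the implementation of [Munteanu+ 2006]: the lexicon is a toy dictionary, a word with no lexicon match gets a hypothetical score of -1, and smoothing is modeled as a simple moving average with a hypothetical window of 3. The function names are invented for this sketch.

```python
def initial_scores(words, other_words, lexicon, miss=-1.0):
    """Score each word by its best lexicon match in the other sentence.

    lexicon maps (word, other_word) -> positive score (toy values here).
    A word with no match gets `miss` (an assumed penalty).
    """
    scores = []
    for w in words:
        best = max((lexicon.get((w, o), 0.0) for o in other_words), default=0.0)
        scores.append(best if best > 0 else miss)
    return scores


def moving_average(scores, window=3):
    """Smooth scores with a centered moving average (assumed filter)."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


# Toy example: "c" is missing from the lexicon but sits between matched words.
toy_lexicon = {("a", "x"): 0.8, ("b", "y"): 0.8, ("d", "z"): 0.8, ("e", "w"): 0.8}
init = initial_scores(["a", "b", "c", "d", "e"], ["x", "y", "z", "w"], toy_lexicon)
smooth = moving_average(init)
```

In this toy run the unmatched word starts with a negative score and becomes positive after smoothing, which is the sense in which smoothing can recover words the lexicon does not cover. Because each direction is filtered independently and smoothing spreads scores to neighbors, the resulting fragments need not be exactly parallel.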
Lexicon Filter on Ja-to-Zh Direction
[Figure: initial scores and smoothed scores, on a scale from -1.5 to 1.5, plotted over the words of the example Zh-Ja sentence pair]
Lexicon Filter on Zh-to-Ja Direction
[Figure: initial scores and smoothed scores, on a scale from -1.5 to 1.5, plotted over the words of the example Zh-Ja sentence pair]
System Overview
[Figure: system diagram with steps numbered (1) to (5). Components shown: source corpora, target corpora, parallel corpus, SMT, translated sentences, (2) IR: top N results, classifier, comparable sentences, alignment, parallel fragment candidates, lexicon filter, parallel fragments]
Use an alignment model to locate the source and target fragment candidates simultaneously
Use a more accurate lexicon filter
Parallel Fragment Candidate Detection by Alignment
Candidates: the longest monotonic, non-NULL aligned fragments of more than 3 tokens
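The candidate criterion can be illustrated with a small sketch. It makes simplifying assumptions that the slides do not state: the alignment is one-to-one and given as a list where position i holds the aligned target index (or None for a NULL alignment), and "monotonic" is read as the target index advancing by exactly one per source token. The function name `candidate_fragments` is hypothetical.

```python
def candidate_fragments(alignment, min_len=4):
    """Find maximal monotonic, non-NULL aligned runs.

    alignment[i] is the target index aligned to source token i, or None.
    Returns (src_start, src_end, tgt_start, tgt_end) tuples, ends inclusive.
    min_len=4 encodes "more than 3 tokens".
    """
    runs = []
    start = None
    for i, j in enumerate(alignment):
        # A run continues only if this token is aligned and the target
        # position follows the previous one directly (assumed monotonicity).
        if start is not None and j is not None and j == alignment[i - 1] + 1:
            continue
        if start is not None and i - start >= min_len:
            runs.append((start, i - 1, alignment[start], alignment[i - 1]))
        start = i if j is not None else None
    n = len(alignment)
    if start is not None and n - start >= min_len:
        runs.append((start, n - 1, alignment[start], alignment[n - 1]))
    return runs


# Source tokens 1..4 align monotonically to target tokens 3..6.
example = candidate_fragments([None, 3, 4, 5, 6, None, 0, 1])
```

Because each run is closed only when it can no longer be extended, the returned fragments are the longest ones that satisfy the criterion; the short run at the end of the example is dropped for having fewer than four tokens.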
Lexicon Filter − Assign Initial Scores
Assign scores in two directions to aligned word pairs in the candidates according to translation lexicon
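A minimal sketch of two-direction scoring, under assumptions not given on the slide: each lexicon is a dictionary from a word pair to a probability, and a pair missing from a lexicon receives a hypothetical score of -1. The word pairs come from the example sentence earlier in the talk; the probabilities are made up for illustration.

```python
def assign_initial_scores(pairs, lex_s2t, lex_t2s, miss=-1.0):
    """Score aligned (src, tgt) word pairs in both directions.

    lex_s2t maps (src, tgt) -> score; lex_t2s maps (tgt, src) -> score.
    Pairs absent from a lexicon get `miss` (an assumed penalty).
    """
    s2t = [lex_s2t.get((s, t), miss) for s, t in pairs]
    t2s = [lex_t2s.get((t, s), miss) for s, t in pairs]
    return s2t, t2s


# Toy lexicons with invented probabilities.
pairs = [("铅", "鉛"), ("离子", "イオン")]
lex_zh_ja = {("铅", "鉛"): 0.5}
lex_ja_zh = {("鉛", "铅"): 0.25, ("イオン", "离子"): 0.5}
zh_ja_scores, ja_zh_scores = assign_initial_scores(pairs, lex_zh_ja, lex_ja_zh)
```

Scoring only the word pairs that the alignment already links is what distinguishes this step from filtering each sentence independently.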
Lexicon Filter − Score Smoothing
Only smooth a word with a negative score when both the left and right words around it have positive scores
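The smoothing condition can be sketched as below. The slide states when a word is smoothed but not what value it receives, so replacing the score with the average of its two neighbors is an assumption made for this example, as is the function name.

```python
def smooth_scores(scores):
    """Smooth a negative score only if both neighbors are positive.

    The replacement value (mean of the two neighbors) is an assumption;
    the condition itself follows the slide.
    """
    smoothed = list(scores)
    for i in range(1, len(scores) - 1):
        left, right = scores[i - 1], scores[i + 1]
        if scores[i] < 0 and left > 0 and right > 0:
            smoothed[i] = (left + right) / 2
    return smoothed


# The isolated negative at index 1 is smoothed; the pair at 3 and 4 is not.
result = smooth_scores([0.5, -1.0, 0.25, -1.0, -1.0, 0.5])
```

Neighbors are read from the original scores rather than the smoothed ones, so a smoothed word never triggers smoothing of the next word. Words at the sentence edges and runs of two or more negative scores are left untouched.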
Fragment Extraction
Extract fragments of more than 3 tokens with continuous positive scores in both directions
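A sketch of this extraction rule, assuming the two score lists are indexed by position within one fragment candidate. The function name and the half-open span convention are choices made for this example.

```python
def extract_fragments(scores_a, scores_b, min_len=4):
    """Return (start, end) spans, end exclusive, where both directions
    have positive scores for at least `min_len` consecutive positions.

    min_len=4 encodes "more than 3 tokens".
    """
    spans = []
    start = None
    n = min(len(scores_a), len(scores_b))
    for i in range(n):
        if scores_a[i] > 0 and scores_b[i] > 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and n - start >= min_len:
        spans.append((start, n))
    return spans


# Positions 0..3 are positive in both directions; position 5 is too short.
spans = extract_fragments(
    [0.5, 0.5, 0.5, 0.5, -1.0, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, -1.0],
)
```

Requiring positive scores in both directions at once is stricter than filtering each direction separately, which fits the goal of exact rather than approximate fragments.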
Outline
• Background
• Related Work
• Proposed Method
• Experiments
  – Parallel Fragment Extraction
  – Translation
• Conclusion
Experimental settings (Parallel Fragment Extraction 1/2)
• Parallel corpus: Zh-Ja abstract corpus (680k sentences, scientific domain)
• Quasi-Comparable Corpora
  – Chinese corpora: CNKI (90k articles, 420k sentences, chemistry domain)
  – Japanese corpora: CiNii (880k articles, 5M sentences, scientific domain)
• Comparable sentences: 30k chemistry domain sentences were extracted
Experimental settings (Parallel Fragment Extraction 2/2)
• Alignment: GIZA++ with symmetrization heuristics
  – Only: use only the extracted comparable sentences
  – External: together with the 11k chemistry domain data in the parallel corpus
• Translation lexicon
  – IBM Model 1 [Brown+ 1993]
  – Log-Likelihood-Ratio (LLR) [Munteanu+ 2006]
  – Sub-corpora sampling lexicon (SampLEX) [Vulic+ 2012]
• Compare with [Munteanu+ 2006]
Results
Method | # fragments | Avg size (Zh/Ja) | Accuracy
[Munteanu+ 2006] | 28.4k | 20.36/21.39 | (1%)
Only (IBM Model 1) | 18.9k | 4.03/4.14 | 80%
Only (LLR) | 18.3k | 4.00/4.14 | 89%
Only (SampLEX) | 18.4k | 3.96/4.05 | 87%
External (IBM Model 1) | 28.7k | 4.18/4.33 | 81%
External (LLR) | 26.9k | 4.17/4.33 | 85%
External (SampLEX) | 28.0k | 4.11/4.23 | 82%
※ Accuracy: 100 fragments were manually evaluated based on exact match
Experimental Settings (Translation)
• Baseline: Zh-Ja paper abstract corpus (680k with 11k chemistry domain sentences)
• Tuning: 368 sentences of chemistry domain
• Testing: 367 sentences of chemistry domain
• Decoder: Moses
• Language model: 5-gram language model on the Ja side of the parallel corpus using SRILM
• Compare MT performance by appending the extracted fragments to the baseline training data
BLEU-4 for Different Systems
[Figure: BLEU-4 scores, on a scale from 38 to 40, for Baseline, +Sentence, +Munteanu+ 2006, +Only (IBM Model 1), +Only (LLR), +Only (SampLEX), +External (IBM Model 1), +External (LLR), and +External (SampLEX)]
※ "*" denotes that the result is significantly better than "Baseline" at p < 0.05
Conclusion
• We proposed an accurate parallel fragment extraction system using an alignment model and a translation lexicon
• Future Work
  – A method to deal with ordering
  – Parallel corpus independent method
  – Try other language pairs and domains
Thank you for your attention!
Examples of Extracted Fragment Pairs
ID | Zh Fragment | Ja Fragment
1 | 直接甲醇燃料电池 | 直接メタノール燃料電池
2 | X射线光电子能谱(XPS) | X線光電子分光法(XPS)
3 | (OH)24(H2O)12] | (OH)24(H2O)12]
4 | 的原生质体融合 | のプロトプラスト融合
5 | 分子动力学(MD)模拟了 | 分子動力学(MD)シミュレーションを
6 | 扫描电子显微镜(SEM)、透射电子显微镜(TEM) | 型電子顕微鏡(SEM),透過型電子顕微鏡(TEM)
7 | 证明了本算法的 | から本アルゴリズムの
8 | X射线粉末衍射 | X線回折分析
※ Noise is written in red font
• Most noise is due to the noisy translation lexicon (Examples 5-7)
• Score smoothing also produces some noise (Example 8)