Graph-based Bilingual Phrase Sense Disambiguation for
Statistical Machine Translation
Mamoru Komachi
2008-06-04
Background
• Success of supervised ML methods depends on annotated corpora
  • Hard to maintain
• Weakly supervised methods require only a small amount of tagged data
  • Can reduce the amount of human effort
Remaining problems
• WSD is crucial to weakly supervised methods
  • Cf. semantic drift
• Parallel corpora (and dictionaries) may help disambiguate word senses
  • WSD models in SMT systems have gained much attention (Carpuat and Wu, 2007)
WSD and SMT
• Improving Statistical Machine Translation using Word Sense Disambiguation (Carpuat and Wu, EMNLP 2007)
  • SMT is known to suffer from inaccurate lexical choice (based on Senseval-style sense inventory)
• Domain adaptation problem
  • Input is typically a word
  • Limited contextual features
Phrase-based WSD models for SMT
• Sense annotations are derived from the phrase alignments learned during SMT training
  • WSD senses come from the SMT phrasal translation lexicon (the "phrase table")
• Not only words but also phrases are disambiguated
• Supervised WSD (an ensemble of naïve Bayes, ME, boosting, and a kernel PCA-based model)
Phrase table is highly ambiguous
• Phrase table constructed from the NTCIR-7 J-E parallel corpus
  • 0.5~1GB (in gzip format)
  • 2.53 candidates per phrase (3.24 candidates per phrase for phrases shorter than 5 words)
  • Includes function words as well
Plant ||| 工場            (factory)
Plant ||| 植物            (plant, vegetation)
Plant ||| 設備            (equipment)
Plant ||| 発電 プラント   (power plant)
Plant ||| 工場 内         (inside the factory)
Plant ||| 制御 対象       (controlled object)
Plant ||| 動植物          (animals and plants)
Plant ||| 供給 プラント   (supply plant)
Plant has 120 translations in the phrase table!
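The ambiguity figures quoted on this slide (candidates per phrase) can be reproduced from a phrase table with a simple count. The following is a minimal sketch, assuming a Moses-style layout in which fields are separated by `|||` and the first two fields are the source and target phrases; the function names are illustrative and only the eight sample entries from the slide are used as input.

```python
from collections import defaultdict

# Sample entries copied from the slide. A real phrase table usually carries
# further "|||"-separated fields (scores, alignments) on each line.
SAMPLE_PHRASE_TABLE = """\
Plant ||| 工場
Plant ||| 植物
Plant ||| 設備
Plant ||| 発電 プラント
Plant ||| 工場 内
Plant ||| 制御 対象
Plant ||| 動植物
Plant ||| 供給 プラント
"""


def candidates_per_phrase(lines):
    """Map each source phrase to the set of its target candidates."""
    table = defaultdict(set)
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        table[fields[0]].add(fields[1])
    return table


def mean_ambiguity(table):
    """Average number of target candidates per source phrase."""
    return sum(len(targets) for targets in table.values()) / len(table)


table = candidates_per_phrase(SAMPLE_PHRASE_TABLE.splitlines())
print(len(table["Plant"]), mean_ambiguity(table))
```

On the full NTCIR-7 phrase table the same count would be run over the gzip-compressed file line by line rather than over an in-memory string.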
Motivation
• Propose a novel graph-based approach to phrase sense disambiguation
  • Can exploit bilingual contextual patterns
• Evaluate phrase sense disambiguation within an SMT framework
Monolingual bootstrapping
• Pioneered by Yarowsky (1995)
  • Learn decision lists from a small set of seed instances (input: instance I, output: classifier)
  • "One sense per discourse" constraint
23/04/18 9
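Bootstrapping
• Iteratively conduct pattern induction and instance extraction, starting from seed instances
  • Can grow a small set of seed instances
[Figure: instances and contextual patterns drawn from a query log (corpus). The seed instance "vaio" matches the query "Compare vaio laptop", which induces the contextual pattern "Compare # laptop" (#: slot). That pattern matches "Compare toshiba satellite laptop" and "Compare HP xb3000 laptop", extracting the new instances "Toshiba satellite" and "HP xb3000".]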
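The query-log example on this slide can be traced in a few lines of code. This is a minimal sketch, not the system used in the talk: it assumes patterns are formed by replacing a known instance with the slot marker `#`, does no scoring or filtering, and uses illustrative function names.

```python
import re

# The three queries shown on the slide.
QUERY_LOG = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]


def induce_patterns(instances, corpus):
    """Replace each known instance found in a query with the slot marker '#'."""
    patterns = set()
    for query in corpus:
        for inst in instances:
            match = re.search(r"\b" + re.escape(inst) + r"\b", query, re.IGNORECASE)
            if match:
                patterns.add(query[:match.start()] + "#" + query[match.end():])
    return patterns


def extract_instances(patterns, corpus):
    """Collect whatever fills the slot of each pattern in each query."""
    found = set()
    for pattern in patterns:
        prefix, suffix = pattern.split("#", 1)
        regex = re.compile(
            re.escape(prefix) + r"(.+)" + re.escape(suffix) + r"$", re.IGNORECASE
        )
        for query in corpus:
            match = regex.match(query)
            if match:
                found.add(match.group(1).lower())
    return found


def bootstrap(seeds, corpus, iterations=2):
    """Alternate pattern induction and instance extraction."""
    instances = {seed.lower() for seed in seeds}
    patterns = set()
    for _ in range(iterations):
        patterns = induce_patterns(instances, corpus)
        instances |= extract_instances(patterns, corpus)
    return instances, patterns


instances, patterns = bootstrap(["vaio"], QUERY_LOG)
print(sorted(instances), sorted(patterns))
```

Because every extracted string is accepted without scoring, a generic pattern would pull in unrelated instances, which is the semantic drift problem mentioned in the Background slide.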
Bilingual bootstrapping
Word Translation Disambiguation Using Bilingual Bootstrapping (Li and Li, ACL-2002)
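[Figure: English words (… mill, plant, vegetable …) and Japanese words (… 工場, 植物 …), with an English corpus and a Japanese corpus (コーパス) both feeding a WSD classifier.]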
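Formalization of bootstrapping
• Score vector of seed instances:
  \[ i_{0} = (0, \ldots, 1, \ldots, 0) \]
• Pattern-instance matrix P; the (p,i) element of the matrix is the co-occurrence of pattern p and instance i:
  \[ P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \end{pmatrix} \]
• Iterate:
  \[ p_{n} = P\, i_{n}, \qquad i_{n+1} = P^{T} p_{n} \]
• Letting $A = P^{T}P$ be the instance similarity matrix, applying this step recursively gives
  \[ i_{n} = A^{n} i_{0} \]
• Output ranked instances when the stopping criterion is met (instances are output in order of final score)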
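The two-step update p_n = P i_n, i_{n+1} = P^T p_n and its closed form i_n = A^n i_0 with A = P^T P can be checked numerically. This is a small sketch using NumPy and the example matrix P from this slide; the choice of the first instance as the seed and of three iterations is arbitrary, and no normalization is applied.

```python
import numpy as np

# Pattern-instance matrix from the slide: rows are patterns, columns are instances.
P = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 1, 1, 1],
    ],
    dtype=float,
)


def bootstrap_scores(P, i0, n_iter):
    """Run n_iter rounds of pattern scoring followed by instance scoring."""
    i_n = i0.copy()
    for _ in range(n_iter):
        p_n = P @ i_n      # pattern scores from instance scores
        i_n = P.T @ p_n    # instance scores from pattern scores
    return i_n


i0 = np.array([1.0, 0.0, 0.0, 0.0])  # seed vector: only the first instance is a seed
scores = bootstrap_scores(P, i0, n_iter=3)

A = P.T @ P  # instance similarity matrix
closed_form = np.linalg.matrix_power(A, 3) @ i0

ranking = np.argsort(-scores)  # instance indices, highest score first
print(scores, ranking)
```

As n grows, A^n i_0 is dominated by the principal eigenvector of A, so the ranking depends less and less on the seed. That is one way to read the semantic drift problem, and it motivates using a different kernel on the same graph.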
Bilingual phrase sense disambiguation
• The (p,i) element of the pattern-instance matrix P is the co-occurrence between pattern p and instance i
  • p: contextual features from both language sides, with phrase alignment from GIZA++
  • i: candidate (monolingual) phrase to disambiguate
• $A = P^{T}P$
  • Similarity is given by the regularized Laplacian kernel
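Regularized Laplacian kernel
• Predict the final sense by k-NN given the target instance
• Laplacian L of the graph G:
  \[ L = D - A \]
• i-th diagonal element of the diagonal degree matrix D:
  \[ D(i,i) = \sum_{j} A(i,j) \]
• Neumann kernel matrix:
  \[ K_{\beta} = A \sum_{n=0}^{\infty} \beta^{n} A^{n} \]
• Regularized Laplacian matrix $R_{\beta}$, obtained by using $-L$ in place of $A$ in the Neumann kernel and dropping the leading $A$ on the right-hand side:
  \[ R_{\beta} = \sum_{n=0}^{\infty} \beta^{n} (-L)^{n} = (I + \beta L)^{-1} \]
• A: adjacency matrix; β: diffusion factor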
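The closed form R_β = (I + βL)^{-1} and the k-NN prediction step can be sketched with NumPy. This is a sketch under stated assumptions rather than the implementation behind the talk: the similarity matrix is the A = P^T P of the earlier example matrix, β = 0.05 is an arbitrary value chosen so that the series converges (it needs β times the largest eigenvalue of L to be below 1), and `knn_predict` together with its `labeled` argument is a hypothetical helper.

```python
import numpy as np


def regularized_laplacian_kernel(A, beta):
    """R_beta = (I + beta * L)^{-1}, with L = D - A and D(i,i) = sum_j A(i,j)."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    return np.linalg.inv(np.eye(A.shape[0]) + beta * L)


def knn_predict(R, target, labeled, k=1):
    """Majority sense among the k labeled instances most similar to `target`.

    `labeled` maps an instance index to its sense label (hypothetical format).
    """
    neighbors = sorted(labeled, key=lambda j: R[target, j], reverse=True)[:k]
    votes = {}
    for j in neighbors:
        votes[labeled[j]] = votes.get(labeled[j], 0) + 1
    return max(votes, key=votes.get)


# A = P^T P for the 4x4 example pattern-instance matrix P.
A = np.array(
    [
        [2, 1, 0, 1],
        [1, 3, 1, 2],
        [0, 1, 1, 1],
        [1, 2, 1, 2],
    ],
    dtype=float,
)
beta = 0.05  # assumed value, small enough for the series to converge here
R = regularized_laplacian_kernel(A, beta)

# Truncated series sum_n beta^n (-L)^n, to compare with the closed form.
L = np.diag(A.sum(axis=1)) - A
series = sum(np.linalg.matrix_power(-beta * L, n) for n in range(200))
print(np.allclose(R, series))
```

For prediction, `knn_predict(R, target, labeled, k)` would be called with the index of the phrase to disambiguate and the indices of instances whose sense is already known, such as the seed instances.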
NTCIR-7 Patent Translation Task
• Large-scale Japanese-English parallel corpus
  • 2M sentences (comparable to A-E, C-E MT)
  • Mainly technical documents
• Timeline
  • 2008.01: dry run
  • 2008.05: formal run
  • 2008.12: final meeting
NAIST-NTT at NTCIR-7
• Bilingual dictionary extracted (solely) from Wikipedia
  • Used langlinks from the Wikipedia DB
  • 1:n translations are expanded to n pairs of bilingual phrases (en, ja)
  • Extracted ~200,000 pairs
• 12,000 pairs appear in the training corpus (8.8%)
• 44.7% of word tokens are covered by the automatically constructed bilingual lexicon (GIZA++)
• Learned 1,193 (0.6%) novel translations
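The expansion of 1:n translations into n (en, ja) pairs is a simple flattening step. The sketch below assumes the langlinks have already been loaded into a dictionary from an English title to a list of Japanese titles; that input format, the function name, and the toy entries are illustrative and are not taken from the Wikipedia dump.

```python
def expand_translations(langlinks):
    """Flatten {en: [ja_1, ..., ja_n]} into a list of (en, ja) pairs."""
    pairs = []
    for en, ja_list in langlinks.items():
        for ja in ja_list:
            pairs.append((en, ja))
    return pairs


# Toy input for illustration only.
toy_langlinks = {
    "plant": ["植物", "工場"],
    "mill": ["工場"],
}
pairs = expand_translations(toy_langlinks)
print(pairs)
```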
Proposed method (but not yet finished training…)
• Extract translation pairs relevant to the given domain (patent translation task)
• Construct a pattern-instance matrix P
  • Pattern features: bag-of-words features and link features extracted from Wikipedia ja-en abstracts
  • Instance: translation pair (en, ja)
  • Seed instances: 40 translation pairs from the target domain
• Apply the Laplacian kernel
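Constructing the pattern-instance matrix P from per-instance features can be sketched as follows. The feature names (`bow:` for bag-of-words, `link:` for link features) and the three toy translation pairs are hypothetical placeholders, and binary co-occurrence is an assumption; the slides do not say how entries are weighted.

```python
import numpy as np


def build_pattern_instance_matrix(instance_features):
    """Build a binary matrix P whose rows are patterns and columns are instances.

    `instance_features` maps an instance, here an (en, ja) pair, to its features.
    """
    instances = sorted(instance_features)
    patterns = sorted({f for feats in instance_features.values() for f in feats})
    row = {pattern: r for r, pattern in enumerate(patterns)}
    P = np.zeros((len(patterns), len(instances)))
    for col, inst in enumerate(instances):
        for feature in instance_features[inst]:
            P[row[feature], col] = 1.0
    return P, patterns, instances


# Toy features for illustration only.
toy_features = {
    ("plant", "工場"): {"bow:factory", "bow:manufacturing", "link:Industry"},
    ("plant", "植物"): {"bow:leaf", "bow:species", "link:Botany"},
    ("mill", "工場"): {"bow:factory", "bow:manufacturing"},
}
P, patterns, instances = build_pattern_instance_matrix(toy_features)
A = P.T @ P  # entry (i, j) counts the features shared by instances i and j
print(P.shape, A)
```

In this toy input the two pairs that translate to 工場 share two features and neither shares any with the 植物 pair, which is the kind of structure the kernel is meant to exploit.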
Evaluation (BLEU score)
                                 -Wikipedia   +Wikipedia
Fmlrun-int.en.out                28.24        27.28
Fmlrun-int.ja.out                26.39        26.48
Fmlrun-int.ja.out.recased        25.34        25.47
Fmlrun-int.ja-out.detokenized    20.38        20.52
• E-J translation (en) gets worse with the Wikipedia dictionary
• J-E translation (ja) performs slightly better with the Wikipedia dictionary than without
Future work
Implement bilingual phrase sense disambiguation
Evaluate this method against IWSLT 2006 J-E/E-J and NTCIR-7 J-E/E-J datasets
Future work(2)
Automatic extraction of biomedical lexicon starting from life-science dictionary (mining from MedLine, etc…)
Summarization (Harendra’s work)…