graph-based bilingual phrase sense disambiguation for statistical machine translation mamoru komachi...

Graph-based Bilingual Phrase Sense Disambiguation for Statistical Machine Translation

Mamoru Komachi

2008-06-04

Background

• The success of supervised ML methods depends on annotated corpora
  • Hard to maintain
• Weakly supervised methods require only a small amount of tagged data
  • Can reduce the amount of human effort

Remaining problems

• WSD is crucial to weakly supervised methods
  • Cf. semantic drift
• Parallel corpora (and dictionaries) may help disambiguate word senses
  • WSD models in SMT systems are gaining much attention (Carpuat and Wu, 2007)

WSD and SMT

• Improving Statistical Machine Translation using Word Sense Disambiguation (Carpuat and Wu, EMNLP 2007)
• SMT is known to suffer from inaccurate lexical choice (based on a Senseval-style sense inventory)
  • Domain adaptation problem
  • Input is typically a word
  • Limited contextual features

Phrase-based WSD models for SMT

• Sense annotations are derived from the phrase alignment learned during SMT training
  • WSD senses come from the SMT phrasal translation lexicon (the "phrase table")
• Not only words but also phrases are to be disambiguated
• Supervised WSD (an ensemble of naïve Bayes, maximum entropy, boosting, and kernel PCA)

Phrase table is highly ambiguous

• Phrase table constructed from the NTCIR-7 J-E parallel corpus
  • 0.5 to 1 GB (in gzip format)
  • 2.53 candidates per phrase (3.24 candidates per phrase for phrases shorter than 5 words)
  • Includes function words as well

Plant ||| 工場
Plant ||| 植物
Plant ||| 設備
Plant ||| 発電プラント
Plant ||| 工場内
Plant ||| 制御対象
Plant ||| 動植物
Plant ||| 供給プラント

Plant has 120 translations in the phrase table!
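To make the "candidates per phrase" statistic concrete, here is a minimal sketch that counts translation candidates in a phrase table. It assumes only the `source ||| target` layout of the excerpt; the names `PHRASE_TABLE`, `load_candidates`, and `mean_candidates` are hypothetical, and a real phrase table would carry extra score fields and be far larger than this toy excerpt.

```python
from collections import defaultdict

# Toy excerpt in the "source ||| target" layout shown on the slide.
# Any further |||-separated fields (scores, alignments) are ignored here.
PHRASE_TABLE = """\
Plant ||| 工場
Plant ||| 植物
Plant ||| 設備
Plant ||| 発電プラント
Plant ||| 工場内
Plant ||| 制御対象
Plant ||| 動植物
Plant ||| 供給プラント
"""

def load_candidates(lines):
    """Map each source phrase to the set of its target-side candidates."""
    table = defaultdict(set)
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        table[fields[0]].add(fields[1])
    return table

def mean_candidates(table):
    """Average number of translation candidates per source phrase."""
    return sum(len(targets) for targets in table.values()) / len(table)

candidates = load_candidates(PHRASE_TABLE.splitlines())
print(len(candidates["Plant"]))     # candidates for one phrase
print(mean_candidates(candidates))  # average over all source phrases
```

On the excerpt this reports 8 candidates for "Plant"; the figure of 120 quoted on the slide refers to the full table.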

Motivation

• Propose a novel graph-based approach to phrase sense disambiguation
  • Can exploit bilingual contextual patterns
• Evaluate phrase sense disambiguation within an SMT framework

Monolingual bootstrapping

• Pioneered by Yarowsky (1995)
• Learn decision lists from a small set of seed instances (input: instance i, output: classifier)
• "One sense per discourse" constraint


Bootstrapping

• Iteratively conduct pattern induction and instance extraction, starting from seed instances
• Can expand a small set of seed instances

[Figure: bootstrapping over a query log (corpus). The seed instance "vaio" matches the query "Compare vaio laptop", which yields the contextual pattern "Compare # laptop" (#: slot). That pattern matches "Compare toshiba satellite laptop" and "Compare HP xb3000 laptop", yielding the new instances "Toshiba satellite" and "HP xb3000".]
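One round of pattern induction and instance extraction on the query-log example can be sketched as follows. This is a toy illustration under simplifying assumptions (exact string matching, one slot per pattern, no scoring), and the function names `induce_patterns` and `extract_instances` are hypothetical rather than taken from any existing system.

```python
import re

# The three queries from the slide's query-log example.
QUERY_LOG = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]
SLOT = "#"

def induce_patterns(instances, corpus):
    """Replace each whole-token occurrence of a known instance with the slot."""
    patterns = set()
    for query in corpus:
        for inst in instances:
            match = re.search(r"(?<!\S)" + re.escape(inst) + r"(?!\S)", query)
            if match:
                patterns.add(query[:match.start()] + SLOT + query[match.end():])
    return patterns

def extract_instances(patterns, corpus):
    """Collect whatever fills the slot in queries that match a pattern."""
    found = set()
    for pattern in patterns:
        left, right = pattern.split(SLOT, 1)
        regex = re.compile(re.escape(left) + r"(.+)" + re.escape(right))
        for query in corpus:
            match = regex.fullmatch(query)
            if match:
                found.add(match.group(1))
    return found

seeds = {"vaio"}
patterns = induce_patterns(seeds, QUERY_LOG)        # pattern induction
instances = extract_instances(patterns, QUERY_LOG)  # instance extraction
print(patterns)
print(instances)
```

Repeating the two calls with `instances` as the new seed set gives further rounds. Without any scoring, a generic pattern would pull in unrelated instances, which is the semantic drift problem mentioned earlier.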

Bilingual bootstrapping

Word Translation Disambiguation Using Bilingual Bootstrapping (Li and Li, ACL-2002)

[Figure: two corpora, English (… Mill, Plant, Vegetable …) and Japanese (… 工場, 植物 …), feeding a WSD classifier.]


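Formalization of bootstrapping

• Score vector of the seed instance:

$$i_0 = (0, \ldots, 1, \ldots, 0)$$

• Pattern-instance matrix $P$; its $(p, i)$ element is the co-occurrence of pattern $p$ and instance $i$:

$$P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \end{pmatrix}$$

• Iterate:

$$p_n = P i_n, \qquad i_{n+1} = P^T p_n$$

• With the instance similarity matrix $A = P^T P$, applying this step recursively gives

$$i_n = A^n i_0$$

• Output the instances ranked by final score when the stopping criterion is met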
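The iteration can be checked numerically with a short NumPy sketch using the 4×4 matrix from the slide. The seed index (instance 0) and the fixed number of rounds are assumptions made for illustration; the slide only says to stop when a stopping criterion is met, and `bootstrap` is a hypothetical name.

```python
import numpy as np

# Pattern-instance matrix from the slide: rows are patterns, columns are instances.
P = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
])

def bootstrap(P, seed, n_iter):
    """Run n_iter rounds of p_n = P i_n, i_{n+1} = P^T p_n from a one-hot seed."""
    i_n = np.zeros(P.shape[1], dtype=P.dtype)
    i_n[seed] = 1                  # i_0 = (0, ..., 1, ..., 0)
    for _ in range(n_iter):
        p_n = P @ i_n              # pattern scores
        i_n = P.T @ p_n            # instance scores
    return i_n

A = P.T @ P                                    # instance similarity matrix
scores = bootstrap(P, seed=0, n_iter=2)        # assumed seed and round count
ranking = np.argsort(-scores, kind="stable")   # instances by descending score

print(scores)    # equals A^2 i_0
print(ranking)
```

After two rounds the scores are (6, 7, 2, 6), so instance 1, which co-occurs with the most patterns, already outranks the seed. The scores are left unnormalized here, so they grow with every round.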

Bilingual phrase sense disambiguation

• The $(p, i)$ element of the pattern-instance matrix $P$ is the co-occurrence of pattern $p$ and instance $i$
  • $p$: contextual features from both language sides, with phrase alignment from GIZA++
  • $i$: candidate (monolingual) phrase to disambiguate
• $A = P^T P$
  • Similarity is given by the regularized Laplacian kernel


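Regularized Laplacian kernel

• Predict the final sense by k-NN given the target instance
• Laplacian $L$ of graph $G$, where $A$ is the adjacency matrix:

$$L = D - A$$

• $i$-th diagonal element of the diagonal degree matrix $D$:

$$D(i,i) = \sum_j A(i,j)$$

• Neumann kernel matrix, where $\beta$ is the diffusion coefficient:

$$K_\beta = A \sum_{n=0}^{\infty} \beta^n A^n$$

• Regularized Laplacian matrix $R_\beta$: use $-L$ in place of $A$ in the Neumann kernel and drop the leading factor $A$ on the right-hand side:

$$R_\beta = \sum_{n=0}^{\infty} \beta^n (-L)^n = (I + \beta L)^{-1}$$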
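A minimal NumPy sketch of the kernel and the k-NN prediction step follows. The toy adjacency matrix, the labelled instances, and the values of β and k are all assumptions chosen for illustration; in the approach described here A would be $P^T P$. The sketch uses the closed form $(I + \beta L)^{-1}$, which is defined for any β ≥ 0, whereas the series form converges only when β times the largest eigenvalue of L is below 1.

```python
from collections import Counter

import numpy as np

def regularized_laplacian(A, beta):
    """R_beta = (I + beta * L)^(-1) with L = D - A and D(i,i) = sum_j A(i,j)."""
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))
    L = D - A
    return np.linalg.inv(np.eye(len(A)) + beta * L)

def knn_sense(R, target, labels, k):
    """Majority sense among the k labelled instances most similar to target."""
    neighbours = sorted(labels, key=lambda j: R[target, j], reverse=True)[:k]
    votes = Counter(labels[j] for j in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy graph: a path 0 - 1 - 2 - 3 whose middle edge is weak,
# so instances {0, 1} and {2, 3} form two loosely connected clusters.
A = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
labels = {0: "工場", 3: "植物"}   # assumed seed senses for instances 0 and 3

R = regularized_laplacian(A, beta=0.1)
print(knn_sense(R, target=1, labels=labels, k=1))
print(knn_sense(R, target=2, labels=labels, k=1))
```

Each unlabelled instance takes the sense of the labelled instance in its own cluster. Because every row of L sums to zero, every row of $R_\beta$ sums to one, which is a quick sanity check on an implementation.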

NTCIR-7 Patent Translation Task

• Large-scale Japanese-English parallel corpus
  • 2M sentences (comparable to A-E, C-E MT)
  • Mainly technical documents
• Timeline
  • 2008.01: dry run
  • 2008.05: formal run
  • 2008.12: final meeting

NAIST-NTT at NTCIR-7

• Bilingual dictionary extracted (solely) from Wikipedia
  • Used langlinks from the Wikipedia DB
  • 1:n translations are expanded to n bilingual phrase pairs (en, ja)
  • Extracted ~200,000 pairs
• 12,000 pairs appear in the training corpus (8.8%)
• 44.7% of word tokens are covered by the automatically constructed bilingual lexicon (GIZA++)
• Learned 1,193 (0.6%) novel translations
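The expansion of 1:n translations into n pairs, and the check for pairs that occur in the training corpus, might look like the sketch below. The `LANGLINKS` dictionary and the one-sentence `TRAINING` bitext are invented toy data, and the substring test in `pairs_in_corpus` is only one assumed definition of a pair "appearing" in the corpus; the slides do not say how the 12,000 figure was counted.

```python
# Hypothetical toy data: English title -> Japanese titles linked from it.
LANGLINKS = {
    "plant": ["植物", "工場"],
    "patent": ["特許"],
}
# Hypothetical one-sentence bitext standing in for the training corpus.
TRAINING = [("The plant is shut down.", "工場は停止している。")]

def expand_pairs(langlinks):
    """Expand each 1:n entry into n separate (en, ja) pairs."""
    return [(en, ja) for en, ja_titles in langlinks.items() for ja in ja_titles]

def pairs_in_corpus(pairs, bitext):
    """Keep pairs whose two sides occur in the same sentence pair (assumed criterion)."""
    return [
        (en, ja)
        for en, ja in pairs
        if any(en in src.lower() and ja in tgt for src, tgt in bitext)
    ]

pairs = expand_pairs(LANGLINKS)
seen = pairs_in_corpus(pairs, TRAINING)
print(pairs)
print(seen)
```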

Proposed method (but not yet finished training…)

• Extract translation pairs relevant to the given domain (patent translation task)
• Construct a pattern-instance matrix P
  • Pattern features: bag-of-words features and link features extracted from Wikipedia ja-en abstracts
  • Instances: translation pairs (en, ja)
  • Seed instances: 40 translation pairs from the target domain
• Apply the Laplacian kernel

Evaluation (BLEU score)

Output file                      -Wikipedia   +Wikipedia
Fmlrun-int.en.out                28.24        27.28
Fmlrun-int.ja.out                26.39        26.48
Fmlrun-int.ja.out.recased        25.34        25.47
Fmlrun-int.ja-out.detokenized    20.38        20.52

• E-J translation (en) gets worse with the Wikipedia dictionary
• J-E translation (ja) performs slightly better with the Wikipedia dictionary than without

Future work

Implement bilingual phrase sense disambiguation

Evaluate this method against IWSLT 2006 J-E/E-J and NTCIR-7 J-E/E-J datasets

Future work(2)

Automatic extraction of biomedical lexicon starting from life-science dictionary (mining from MedLine, etc…)

Summarization (Harendra’s work)…
