Page 1: Graph-based Bilingual Phrase Sense Disambiguation for Statistical Machine Translation Mamoru Komachi mamoru-k@is.naist.jp 2008-06-04

Graph-based Bilingual Phrase Sense Disambiguation for Statistical Machine Translation

Mamoru Komachi
mamoru-k@is.naist.jp

2008-06-04

Page 2

Background

• The success of supervised ML methods depends on annotated corpora
  • Annotated corpora are hard to maintain
• Weakly supervised methods require only a small amount of tagged data
  • This can reduce the amount of human effort

Page 3

Remaining problems

• WSD is crucial to weakly supervised methods
  • Cf. semantic drift
• Parallel corpora (and dictionaries) may help disambiguate word senses
  • WSD models in SMT systems are gaining much attention (Carpuat and Wu, 2007)

Page 4

WSD and SMT

• Improving Statistical Machine Translation using Word Sense Disambiguation (Carpuat and Wu, EMNLP 2007)
  • SMT is known to suffer from inaccurate lexical choice (based on a Senseval-style sense inventory)
• Domain adaptation problem
  • Input is typically a word
  • Limited contextual features

Page 5

Phrase-based WSD models for SMT

• Sense annotations are derived from the phrase alignments learned during SMT training
  • WSD senses come from the SMT phrasal translation lexicon (the "phrase table")
  • Not only words but also phrases are disambiguated
• Supervised WSD (an ensemble of naïve Bayes, maximum entropy, boosting, and kernel PCA models)

Page 6

Phrase table is highly ambiguous

• Phrase table constructed from the NTCIR-7 J-E parallel corpus
  • 0.5~1 GB (in gzip format)
  • 2.53 candidates per phrase (3.24 candidates per phrase for phrases shorter than 5 words)
  • Includes function words as well

Example entries for "Plant" (English glosses in parentheses):

Plant ||| 工場  (factory)
Plant ||| 植物  (plant, vegetation)
Plant ||| 設備  (equipment)
Plant ||| 発電 プラント  (power generation plant)
Plant ||| 工場 内  (inside the factory)
Plant ||| 制御 対象  (controlled object)
Plant ||| 動植物  (animals and plants)
Plant ||| 供給 プラント  (supply plant)

• "Plant" has 120 translations in the phrase table!
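The candidates-per-phrase figure can be illustrated with a short sketch. It assumes a Moses-style text format in which fields are separated by `|||` and only reads the first two fields (source and target); real phrase tables carry further fields such as scores, which are ignored here. The function name `count_candidates` and the four-line sample are illustrative, not taken from the actual NTCIR-7 table.

```python
from collections import defaultdict


def count_candidates(lines):
    """Map each source phrase to the set of its target-side candidates."""
    table = defaultdict(set)
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip malformed lines
        table[fields[0]].add(fields[1])
    return table


# Toy sample: four of the "Plant" entries shown on the slide.
sample = [
    "Plant ||| 工場",
    "Plant ||| 植物",
    "Plant ||| 設備",
    "Plant ||| 発電 プラント",
]

table = count_candidates(sample)
# Average number of translation candidates per source phrase.
avg = sum(len(targets) for targets in table.values()) / len(table)
print(len(table["Plant"]), avg)
```

On the full table, the same average would be computed over all source phrases, optionally restricted by source length to reproduce the "shorter than 5 words" figure.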

Page 7

Motivation

• Propose a novel graph-based approach to phrase sense disambiguation
  • Can exploit bilingual contextual patterns
• Evaluate phrase sense disambiguation within an SMT framework

Page 8

Monolingual bootstrapping

• Pioneered by Yarowsky (1995)
  • Learns decision lists from a small set of seed instances (input: instances I, output: classifier)
  • "One sense per discourse" constraint

Page 9

Bootstrapping

• Iteratively conduct pattern induction and instance extraction, starting from seed instances
• Can grow a small set of seed instances

[Figure: bootstrapping over a query log (corpus)]
Instances: vaio, Toshiba satellite, HP xb3000
Contextual pattern: Compare # laptop (#: slot)
Query log (corpus): Compare vaio laptop; Compare toshiba satellite laptop; Compare HP xb3000 laptop
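One round of the loop in the figure can be sketched as follows. This is a minimal illustration under simplifying assumptions: patterns are whole queries with the instance replaced by the slot `#`, matching is case-insensitive, and no scoring or filtering of patterns is done. The function names are hypothetical.

```python
import re


def induce_patterns(instances, corpus):
    """Pattern induction: replace a known instance in a query with the slot '#'."""
    patterns = set()
    for query in corpus:
        for inst in instances:
            if inst.lower() in query.lower():
                patterns.add(
                    re.sub(re.escape(inst), "#", query, flags=re.IGNORECASE)
                )
    return patterns


def extract_instances(patterns, corpus):
    """Instance extraction: collect whatever fills the slot of a matching pattern."""
    found = set()
    for pattern in patterns:
        prefix, suffix = pattern.split("#", 1)
        regex = re.compile(
            re.escape(prefix) + "(.+)" + re.escape(suffix), re.IGNORECASE
        )
        for query in corpus:
            match = regex.fullmatch(query)
            if match:
                found.add(match.group(1))
    return found


corpus = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]
seeds = {"vaio"}

patterns = induce_patterns(seeds, corpus)
instances = extract_instances(patterns, corpus)
print(patterns)
print(instances)
```

Starting from the single seed "vaio", the round induces the pattern "Compare # laptop" and then extracts the other two product names from the query log.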

Page 10

Bilingual bootstrapping

• Word Translation Disambiguation Using Bilingual Bootstrapping (Li and Li, ACL 2002)

[Figure: English corpus (… Mill, Plant, Vegetable …), Japanese corpus (… 工場 "factory", 植物 "plant" …), WSD classifier]

Page 11

Formalization of bootstrapping

• Score vector of the seed instance:

  $$i_0 = (0, \dots, 1, \dots, 0)$$

• Pattern-instance matrix $P$; its $(p, i)$ element is the co-occurrence of pattern $p$ and instance $i$:

  $$P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \end{pmatrix}$$

• Iterate:

  $$p_n = P\, i_n, \qquad i_{n+1} = P^T p_n$$

  With the instance similarity matrix $A = P^T P$, applying this step recursively gives

  $$i_n = A^n i_0$$

• Output the instances ranked by final score when the stopping criterion is met
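The iteration above can be checked numerically with the example matrix from the slide. The sketch below assumes the seed is the first instance and stops after two iterations; both choices are arbitrary and only serve the illustration.

```python
import numpy as np

# Pattern-instance matrix from the slide: rows are patterns, columns are instances.
P = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
])


def bootstrap(P, seed, steps):
    """Alternate pattern scoring (p = P i) and instance scoring (i = P^T p)."""
    i = seed.copy()
    for _ in range(steps):
        p = P @ i
        i = P.T @ p
    return i


i0 = np.array([1, 0, 0, 0])  # assumed seed: the first instance
i2 = bootstrap(P, i0, steps=2)

# Closed form: i_n = A^n i_0 with A = P^T P.
A = P.T @ P
closed_form = np.linalg.matrix_power(A, 2) @ i0

# Instances ordered by descending score (ties keep their original order).
ranking = np.argsort(-i2, kind="stable")
print(i2, closed_form, ranking)
```

The scores are left unnormalized, so they grow with each iteration; only their relative order is used for the final ranking.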

Page 12

Bilingual phrase sense disambiguation

• The $(p, i)$ element of the pattern-instance matrix $P$ is the co-occurrence of pattern $p$ and instance $i$
  • $p$: contextual features from both language sides, with phrase alignment from GIZA++
  • $i$: candidate (monolingual) phrase to disambiguate
• $A = P^T P$
  • Similarity is given by the regularized Laplacian kernel

Page 13

Regularized Laplacian kernel

• Laplacian $L$ of the graph $G$, where $A$ is the adjacency matrix:

  $$L = D - A$$

• $i$-th diagonal element of the diagonal degree matrix $D$:

  $$D(i, i) = \sum_j A(i, j)$$

• Neumann kernel matrix, where $\beta$ is the diffusion factor:

  $$K_\beta = A \sum_{n=0}^{\infty} \beta^n A^n$$

• Regularized Laplacian matrix $R_\beta$, obtained by using $-L$ in place of $A$ in the Neumann kernel and dropping the leading factor $A$ on the right-hand side:

  $$R_\beta = \sum_{n=0}^{\infty} \beta^n (-L)^n = (I + \beta L)^{-1}$$

• Predict the final sense by k-NN given the target instance
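The kernel and the k-NN step can be sketched in a few lines of NumPy. The adjacency matrix below is $A = P^T P$ for the example matrix from the bootstrapping slide; the value $\beta = 0.1$, the sense labels, and the helper names are assumptions made for illustration, and the voting rule (simple majority among the k most similar labelled instances) is one plausible reading of "k-NN", not necessarily the author's.

```python
from collections import Counter

import numpy as np


def regularized_laplacian(A, beta):
    """Return R_beta = (I + beta * L)^(-1) with L = D - A."""
    D = np.diag(A.sum(axis=1))  # degree matrix: D(i, i) = sum_j A(i, j)
    L = D - A
    return np.linalg.inv(np.eye(A.shape[0]) + beta * L)


def knn_sense(R, target, labels, k):
    """Majority vote among the k labelled instances most similar to `target`."""
    ranked = sorted(labels, key=lambda j: R[target, j], reverse=True)
    votes = Counter(labels[j] for j in ranked[:k])
    return votes.most_common(1)[0][0]


A = np.array([
    [2, 1, 0, 1],
    [1, 3, 1, 2],
    [0, 1, 1, 1],
    [1, 2, 1, 2],
], dtype=float)
beta = 0.1  # assumed diffusion factor, small enough for the series to converge

R = regularized_laplacian(A, beta)

# Truncated power series sum_n beta^n (-L)^n, to compare with the closed form.
L = np.diag(A.sum(axis=1)) - A
series = sum(np.linalg.matrix_power(-beta * L, n) for n in range(200))

# Hypothetical sense labels for three instances; instance 1 is the target.
labels = {0: "factory", 2: "vegetation", 3: "vegetation"}
predicted = knn_sense(R, target=1, labels=labels, k=1)
print(np.allclose(R, series), predicted)
```

The series only converges when $\beta$ times the largest eigenvalue of $L$ is below 1, whereas the inverse $(I + \beta L)^{-1}$ exists for any $\beta \ge 0$, which is why the sketch uses the closed form for the kernel itself.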

Page 14

NTCIR-7 Patent Translation Task

• Large-scale Japanese-English parallel corpus
  • 2M sentences (comparable to A-E and C-E MT)
  • Mainly technical documents
• Timeline
  • 2008.01: dry run
  • 2008.05: formal run
  • 2008.12: final meeting

Page 15

NAIST-NTT at NTCIR-7

• Bilingual dictionary extracted (solely) from Wikipedia
  • Used langlinks from the Wikipedia DB
  • 1:n translations are expanded to n bilingual phrase pairs (en, ja)
  • Extracted ~200,000 pairs
• 12,000 pairs appear in the training corpus (8.8%)
• 44.7% of words (tokens) are covered by the automatically constructed bilingual lexicon (GIZA++)
• Learned 1,193 (0.6%) novel translations
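The expansion of 1:n translations into n pairs, and the check of which pairs occur in the training corpus, can be sketched as below. The input mapping, the one-sentence corpus, and the substring-based matching are toy assumptions; they do not reproduce the actual Wikipedia langlinks data or the matching procedure used for the reported figures.

```python
def expand_langlinks(langlinks):
    """Expand each 1:n entry {en: [ja, ...]} into n separate (en, ja) pairs."""
    pairs = []
    for en_title, ja_titles in langlinks.items():
        for ja_title in ja_titles:
            pairs.append((en_title, ja_title))
    return pairs


def pairs_in_corpus(pairs, corpus):
    """Keep the pairs whose two sides co-occur in at least one sentence pair."""
    hits = []
    for en, ja in pairs:
        if any(en.lower() in en_sent.lower() and ja in ja_sent
               for en_sent, ja_sent in corpus):
            hits.append((en, ja))
    return hits


# Hypothetical dictionary entries and a one-sentence parallel corpus.
langlinks = {"Plant": ["植物", "工場"], "Patent": ["特許"]}
corpus = [("The plant was closed", "工場 は 閉鎖 さ れ た")]

pairs = expand_langlinks(langlinks)
hits = pairs_in_corpus(pairs, corpus)
print(pairs)
print(f"{len(hits)} of {len(pairs)} pairs appear in the corpus")
```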

Page 16

Proposed method (training not yet finished…)

• Extract translation pairs relevant to the given domain (patent translation task)
• Construct a pattern-instance matrix P
  • Pattern features: bag-of-words features and link features extracted from Wikipedia ja-en abstracts
  • Instance: translation pair (en, ja)
  • Seed instances: 40 translation pairs from the target domain
• Apply the Laplacian kernel
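Constructing the pattern-instance matrix from bag-of-words features might look like the sketch below. The three translation pairs and their feature words are invented for illustration (they are not drawn from Wikipedia abstracts), link features are omitted, and the function name is hypothetical.

```python
import numpy as np


def build_pattern_instance_matrix(instance_features):
    """Build P with one row per feature (pattern) and one column per instance."""
    instances = list(instance_features)
    patterns = sorted({f for feats in instance_features.values() for f in feats})
    row_of = {f: r for r, f in enumerate(patterns)}
    P = np.zeros((len(patterns), len(instances)))
    for col, inst in enumerate(instances):
        for feature in instance_features[inst]:
            P[row_of[feature], col] += 1  # co-occurrence count
    return P, patterns, instances


# Hypothetical instances: translation pairs (en, ja) with toy bag-of-words features.
features = {
    ("plant", "工場"): ["factory", "production"],
    ("plant", "植物"): ["leaf", "species"],
    ("mill", "工場"): ["factory", "grain"],
}

P, patterns, instances = build_pattern_instance_matrix(features)
A = P.T @ P  # instance similarity: number of shared features

# Seed score vector: 1 for seed instances, 0 elsewhere.
seeds = {("plant", "工場")}
i0 = np.array([1.0 if inst in seeds else 0.0 for inst in instances])
print(patterns)
print(A)
```

In this toy example the two pairs that translate to 工場 share the feature "factory" and so receive a non-zero similarity, while the 植物 pair is disconnected from them.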

Page 17

Evaluation (BLEU score)

Output file                      -Wikipedia   +Wikipedia
Fmlrun-int.en.out                28.24        27.28
Fmlrun-int.ja.out                26.39        26.48
Fmlrun-int.ja.out.recased        25.34        25.47
Fmlrun-int.ja-out.detokenized    20.38        20.52

• E-J translation (en) gets worse with the Wikipedia dictionary (-0.96 BLEU)
• J-E translation (ja) performs slightly better with the Wikipedia dictionary than without (+0.09 to +0.14 BLEU)

Page 18

Future work

• Implement bilingual phrase sense disambiguation
• Evaluate this method on the IWSLT 2006 J-E/E-J and NTCIR-7 J-E/E-J datasets

Page 19

Future work (2)

• Automatic extraction of a biomedical lexicon, starting from a life-science dictionary (mining from MedLine, etc.)
• Summarization (Harendra's work)…