TRANSCRIPT
Cross-Lingual LSA for Language Modeling ICASSP : May 21, 2004 - p. 1/12
Cross-Lingual Latent Semantic Analysis for Language Modeling
May 21, 2004
Woosung Kim and Sanjeev Khudanpur
Center for Language and Speech Processing
Dept. of Computer Science
The Johns Hopkins University
Baltimore, MD 21218, USA
Introduction
Cross-Lingual LM for ASR
Model Estimation
LSA for CLIR
LSA for MT
Corpora
Experimental Results
Conclusions
References
Introduction
Motivation: success of statistical modeling techniques
- Development of modeling and automatic learning techniques
- A large amount of data for training is available
- Most resources are in English, French and German
How to construct stochastic models in resource-deficient languages?
➔ Bootstrap from other languages, e.g.
- Universal phone-set for Automatic Speech Recognition (ASR) [Schultz & Waibel, 98; Byrne et al., 00]
- Exploit parallel texts to project morphological analyzers, POS taggers, etc. [Yarowsky, Ngai & Wicentowski, 01]
- Cross-lingual language modeling for ASR [Khudanpur & Kim, 04]
Cross-Lingual LSA for Language Modeling ICASSP : May 21, 2004 - p. 3/12
Introduction
Overview of [Khudanpur & Kim, 04]:
- An approach to sharpen an LM in a resource-deficient language based on comparable texts from resource-rich languages
- Story-specific language modeling from contemporaneous text
- Integration of machine translation (MT), cross-language information retrieval (CLIR), and language modeling (LM)
- A sentence-aligned parallel corpus is needed to build an MT dictionary
  ⇒ expensive to obtain (esp. in resource-deficient languages)
We present a method to use Latent Semantic Analysis (LSA) for CLIR and MT:
- A document-aligned parallel corpus is enough
- No explicit MT dictionary is needed for CLIR
Cross-Lingual LM for ASR
[Figure: System architecture. A Mandarin story d_i^C is transcribed by ASR using the baseline Chinese acoustic model, dictionary (vocabulary) and language model. CLIR selects a contemporaneous English article d_i^E aligned with the Mandarin story; statistical MT translation lexicons P_T(c|e), combined with P̂(e|d_i^E), yield the cross-language unigram model P_CL-unigram(c|d_i^E) applied to the automatic transcription.]
Model Estimation
Finding document correspondence between d_i^C ↔ d_j^E
➔ by CLIR (e.g. based on cosine similarity)
Translation dictionary P_T(c|e) ➔ (e.g. by GIZA++)
Given the document correspondence and P_T(c|e),

P_CL-unigram(c|d_i^E) = Σ_{e∈E} P_T(c|e) P̂(e|d_i^E), ∀c ∈ C   (1)

Cross-language LM construction:
- Build story-specific cross-language LMs
- Linear interpolation with the baseline trigram LM:

P_CL-interpolated(c_k|c_{k−1}, c_{k−2}, d_i^E)
  = λ P_CL-unigram(c_k|d_i^E) + (1 − λ) P(c_k|c_{k−1}, c_{k−2})   (2)
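As a sketch, Eqs. (1) and (2) amount to a matrix-vector product followed by a linear interpolation. All the numbers below (translation table, word frequencies, interpolation weight) are invented for illustration, not the paper's values:

```python
import numpy as np

# Toy translation table P_T(c|e): rows index Chinese words c, columns index
# English words e; each column sums to 1. Values are invented for the sketch.
PT = np.array([[0.7, 0.2],
               [0.3, 0.8]])

# P̂(e|d_E): relative frequencies of the English words in the aligned article.
p_e = np.array([0.5, 0.5])

# Eq. (1): P_CL-unigram(c|d_E) = sum_e P_T(c|e) * P̂(e|d_E)
p_cl_unigram = PT @ p_e

# Eq. (2): interpolate the story-specific unigram with the baseline trigram
# probability P(c_k|c_{k-1}, c_{k-2}); lambda would be tuned on held-out data.
lam = 0.3
p_trigram = np.array([0.2, 0.8])   # placeholder baseline trigram probabilities
p_interp = lam * p_cl_unigram + (1.0 - lam) * p_trigram
```

Because the columns of PT and the vector p_e each sum to one, p_cl_unigram is itself a proper distribution over the Chinese words, and the interpolation in Eq. (2) preserves normalization.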
Latent Semantic Analysis for CLIR
Singular Value Decomposition (SVD) of the parallel corpus
[Figure: W (M × N) = U (M × R) · S (R × R) · V^T (R × N); each column j of W stacks the word counts of an aligned document pair d_j^C and d_j^E, for j = 1, …, N.]
Input: word-document frequency matrix W
Reduce the dimension into a smaller but adequate subspace ➔ Singular Value Decomposition: U, V, and S
S: diagonal matrix with diagonal entries σ_1, …, σ_k where σ_1 ≥ σ_2 ≥ … ≥ σ_k (k ≥ R)
Remove noisy entries by setting σ_i = 0 for i > R
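The truncation step can be sketched with NumPy's SVD; the random count matrix and the choice R = 5 are placeholders, not the paper's dimensions:

```python
import numpy as np

# Hypothetical bilingual word-document count matrix W: rows would be Chinese
# plus English words, columns aligned document pairs (sizes are made up).
rng = np.random.default_rng(0)
W = rng.poisson(1.0, size=(50, 20)).astype(float)

# Full SVD: W = U S V^T, with singular values returned in decreasing order.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep the R largest singular values and set the rest to zero ("remove noise").
R = 5
s_trunc = np.where(np.arange(len(s)) < R, s, 0.0)
W_R = (U * s_trunc) @ Vt   # rank-R approximation of W
```

By the Eckart-Young theorem, W_R is the closest rank-R matrix to W in Frobenius norm, and the approximation error is exactly the root-sum-square of the discarded singular values.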
Latent Semantic Analysis for CLIR
Folding-in a monolingual corpus
[Figure: Folding in P monolingual English documents d_1^E, …, d_P^E: the new word-document matrix W (M × P) has zeros in its Chinese rows and is projected with the same U (M × R) and S (R × R), yielding a new V^T (R × P).]
Given a monolingual corpus W in either language:
- Use the same matrices U, S
- Project into the low-dimensional space: V^T = S^{−1} U^T W
- Compare a query and a document in the reduced-dimensional space
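A minimal sketch of folding-in, reusing the same kind of toy matrices as the SVD step (dimensions and data are invented):

```python
import numpy as np

# Train-time SVD on a hypothetical word-document matrix (sizes invented).
rng = np.random.default_rng(1)
M, N, R = 50, 20, 5
W_train = rng.poisson(1.0, size=(M, N)).astype(float)
U, s, Vt = np.linalg.svd(W_train, full_matrices=False)
U_R, S_R = U[:, :R], np.diag(s[:R])

# Fold in new monolingual documents: reuse U and S from training and project
# each new count column into the R-dimensional space via V^T = S^{-1} U^T W.
W_new = rng.poisson(1.0, size=(M, 3)).astype(float)
V_new_T = np.linalg.inv(S_R) @ U_R.T @ W_new   # shape (R, 3)

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare a folded-in query against a training document in latent space.
sim = cosine(V_new_T[:, 0], Vt[:R, 0])
```

A sanity check on the design: folding a training document back in through S^{−1} U^T recovers exactly its row of V^T from the original decomposition, so new and old documents are directly comparable.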
LSA for the Translation Table P_T(c|e)
Word vs. word comparison

[Figure: In W = U S V^T, each row of W corresponds to a word, either an English word e_i or a Chinese word c_j, so word vectors from both languages live in the same latent space.]

Each row in W corresponds to a word (either e_i or c_j)
Compare c_j ∈ C and e_i ∈ E to find the most similar entries
Estimation of the translation probability ➔ similarity:

P_LSA(c|e) = Sim(c, e)^γ / Σ_{c′∈C} Sim(c′, e)^γ, where γ ≫ 1   (3)
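Eq. (3) sharpens the similarities into a distribution. A minimal sketch, assuming positive similarity values and an illustrative γ = 10 (both assumptions, not the paper's settings):

```python
import numpy as np

def lsa_translation_probs(sims, gamma=10.0):
    """Eq. (3): raise the similarities Sim(c, e) over all Chinese words c to a
    large power gamma and normalize, so that probability mass concentrates on
    the most similar words. Assumes the similarities are positive."""
    scores = np.power(sims, gamma)
    return scores / scores.sum()

# Hypothetical similarities of three Chinese words to one English word e.
sims = np.array([0.9, 0.5, 0.1])
p = lsa_translation_probs(sims)   # heavily favors the first word
```

The large exponent acts like a low-temperature softmax over similarity scores: with γ ≫ 1 almost all of P_LSA(c|e) falls on the best-matching Chinese word.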
Training and Test Corpora
A parallel text for SVD and GIZA++ training:
- Hong Kong News (16K document pairs)
Acoustic model training:
- HUB4-NE Mandarin training data (96K words), ~10 hours
Chinese monolingual language model training:
- XINHUA: 13M words
- HUB4-NE: 96K words
ASR test set: NIST HUB4-NE test data (only F0 portion)
- 1263 sentences, 9.8K words (1997–1998)
English CLIR corpus: NAB-TDT
- NAB (1997 LA, WP) + TDT-2 (1998 APW, NYT)
- 45K docs, 30M words
ASR Experimental Results
Vocab: 51K for Chinese
300-best list rescoring
Oracle best/worst WER: 33.4%/94.4% for Xinhua and 39.7%/95.5% for HUB4-NE

Language Model         Perp   WER     CER     p-value
XINHUA
  Trigram              426    49.9%   28.8%   –
  LSA-interpolated     364    49.3%   28.9%   0.043
  Trigger+LSA-intpl    351    49.0%   28.7%   0.002
  CL-interpolated      346    48.8%   28.4%   < 0.001
HUB4-NE
  Trigram              1195   60.1%   44.1%   –
  LSA-interpolated     695    58.6%   43.1%   < 0.001
  Trigger+LSA-intpl    686    58.7%   43.2%   < 0.001
  CL-interpolated      630    58.8%   43.1%   < 0.001

Table 1: Word-perplexity and ASR WER comparisons
Conclusions
A new approach for cross-lingual language modeling, based on Latent Semantic Analysis:
- A document-aligned corpus suffices rather than a sentence-aligned corpus
- No MT translation dictionary is needed for CLIR
Statistically significant improvements in ASR WER
Statistically similar results to GIZA++-based results
➔ based on p-values of 0.08 (Xinhua) and 0.58 (HUB4-NE), measured between CL-interpolated and LSA-interpolated
Future work:
- Maximum entropy models for combining cross-lingual LMs
- Application to new tasks (e.g. statistical machine translation)
References
T. Schultz and A. Waibel. Language independent and language adaptive large vocabulary speech recognition. Proc. ICSLP 1998, 5:1819-1822, Sydney, Australia.
W. Byrne, P. Beyerlein, J. Huerta, S. Khudanpur, B. Marathi, J. Morgan, N. Peterek, J. Picone, D. Vergyri and W. Wang. Towards language independent acoustic modeling. Proc. ICASSP 2000, 2:1029-1032, Istanbul, Turkey.
D. Yarowsky, G. Ngai and R. Wicentowski. Inducing multilingual text analysis tools via robust projection across aligned corpora. Proc. HLT 2001, pages 109-116, Santa Monica, CA.
S. Khudanpur and W. Kim. Contemporaneous text as side-information in statistical language modeling. Computer Speech and Language, Vol. 18/2, pages 143-162, 2004.
Hong Kong News parallel text corpus, 2000. Available through the Linguistic Data Consortium (LDC2000T46). http://www.ldc.upenn.edu/Catalog/LDC2000T46.html
Matched Pairs Sentence-Segment Word Error (MAPSSWE) Test. http://www.nist.gov/speech/tests/sigtests/mapsswe.htm