TRANSCRIPT
Cross-Lingual LSA for Language Modeling ICASSP : May 21, 2004 - p. 1/12
Cross-Lingual Latent Semantic Analysis for Language Modeling
May 21, 2004
Woosung Kim and Sanjeev Khudanpur
Center for Language and Speech Processing
Dept. of Computer Science
The Johns Hopkins University
Baltimore, MD 21218, USA
Introduction
Cross-Lingual LM for ASR
Model Estimation
LSA for CLIR
LSA for MT
Corpora
Experimental Results
Conclusions
References
Introduction
Motivation: success of statistical modeling techniques
- Development of modeling and automatic learning techniques
- A large amount of data for training is available
- Most resources are in English, French and German
How to construct stochastic models in resource-deficient languages?
➔ Bootstrap from other languages, e.g.
- Universal phone-set for Automatic Speech Recognition (ASR) [Schultz & Waibel, 98; Byrne et al., 00]
- Exploit parallel texts to project morphological analyzers, POS taggers, etc. [Yarowsky, Ngai & Wicentowski, 01]
- Cross-lingual language modeling for ASR [Khudanpur & Kim, 04]
Cross-Lingual LSA for Language Modeling ICASSP : May 21, 2004 - p. 3/12
Introduction
Overview of [Khudanpur & Kim, 04]:
- An approach to sharpen an LM in a resource-deficient language based on comparable texts from resource-rich languages
- Story-specific language modeling from contemporaneous text
- Integration of machine translation (MT), cross-language information retrieval (CLIR), and language modeling (LM)
- A sentence-aligned parallel corpus is needed to build an MT dictionary
  ⇒ expensive to obtain (esp. in resource-deficient languages)
We present a method to use Latent Semantic Analysis (LSA) for CLIR and MT:
- A document-aligned parallel corpus is enough
- No explicit MT dictionary is needed for CLIR
Cross-Lingual LM for ASR
[Figure: System architecture. A Mandarin story d_i^C is transcribed by ASR using the baseline Chinese acoustic model, dictionary (vocabulary) and language model. CLIR selects a contemporaneous English article d_i^E aligned with the Mandarin story; statistical MT translation lexicons P_T(c|e), combined with P̂(e|d_i^E), yield the cross-language unigram model P_CL-unigram(c|d_i^E) applied to the automatic transcription.]
Model Estimation
Finding document correspondence between d_i^C ↔ d_j^E
➔ by CLIR (e.g. based on cosine similarity)
Translation dictionary P_T(c|e) ➔ (e.g. by GIZA++)
Given the document correspondence and P_T(c|e),

P_CL-unigram(c|d_i^E) = Σ_{e∈E} P_T(c|e) P̂(e|d_i^E), ∀c ∈ C   (1)

Cross-language LM construction:
- Build story-specific cross-language LMs
- Linear interpolation with the baseline trigram LM:

P_CL-interpolated(c_k|c_{k−1}, c_{k−2}, d_i^E)
  = λ P_CL-unigram(c_k|d_i^E) + (1 − λ) P(c_k|c_{k−1}, c_{k−2})   (2)
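As a sketch, Eqs. (1) and (2) amount to a matrix-vector product followed by a linear interpolation. All the numbers below (translation table, word frequencies, interpolation weight) are invented for illustration, not the paper's values:

```python
import numpy as np

# Toy translation table P_T(c|e): rows index Chinese words c, columns index
# English words e; each column sums to 1. Values are invented for the sketch.
PT = np.array([[0.7, 0.2],
               [0.3, 0.8]])

# P̂(e|d_E): relative frequencies of the English words in the aligned article.
p_e = np.array([0.5, 0.5])

# Eq. (1): P_CL-unigram(c|d_E) = sum_e P_T(c|e) * P̂(e|d_E)
p_cl_unigram = PT @ p_e

# Eq. (2): interpolate the story-specific unigram with the baseline trigram
# probability P(c_k|c_{k-1}, c_{k-2}); lambda would be tuned on held-out data.
lam = 0.3
p_trigram = np.array([0.2, 0.8])   # placeholder baseline trigram probabilities
p_interp = lam * p_cl_unigram + (1.0 - lam) * p_trigram
```

Because the columns of PT and the vector p_e each sum to one, p_cl_unigram is itself a proper distribution over the Chinese words, and the interpolation in Eq. (2) preserves normalization.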
Latent Semantic Analysis for CLIR
Singular Value Decomposition (SVD) of the parallel corpus
[Figure: W (M × N) = U (M × R) · S (R × R) · V^T (R × N); each column j of W stacks the word counts of an aligned document pair d_j^C and d_j^E, for j = 1, …, N.]
Input: word-document frequency matrix W
Reduce the dimension into a smaller but adequate subspace ➔ Singular Value Decomposition: U, V, and S
S: diagonal matrix with diagonal entries σ_1, …, σ_k where σ_1 ≥ σ_2 ≥ … ≥ σ_k (k ≥ R)
Remove noisy entries by setting σ_i = 0 for i > R
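The truncation step can be sketched with NumPy's SVD; the random count matrix and the choice R = 5 are placeholders, not the paper's dimensions:

```python
import numpy as np

# Hypothetical bilingual word-document count matrix W: rows would be Chinese
# plus English words, columns aligned document pairs (sizes are made up).
rng = np.random.default_rng(0)
W = rng.poisson(1.0, size=(50, 20)).astype(float)

# Full SVD: W = U S V^T, with singular values returned in decreasing order.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep the R largest singular values and set the rest to zero ("remove noise").
R = 5
s_trunc = np.where(np.arange(len(s)) < R, s, 0.0)
W_R = (U * s_trunc) @ Vt   # rank-R approximation of W
```

By the Eckart-Young theorem, W_R is the closest rank-R matrix to W in Frobenius norm, and the approximation error is exactly the root-sum-square of the discarded singular values.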
Latent Semantic Analysis for CLIR
Folding-in a monolingual corpus
[Figure: Folding in P monolingual English documents d_1^E, …, d_P^E: the new word-document matrix W (M × P) has zeros in its Chinese rows and is projected with the same U (M × R) and S (R × R), yielding a new V^T (R × P).]
Given a monolingual corpus W in either language:
- Use the same matrices U, S
- Project into the low-dimensional space: V^T = S^{−1} U^T W
- Compare a query and a document in the reduced-dimensional space
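A minimal sketch of folding-in, reusing the same kind of toy matrices as the SVD step (dimensions and data are invented):

```python
import numpy as np

# Train-time SVD on a hypothetical word-document matrix (sizes invented).
rng = np.random.default_rng(1)
M, N, R = 50, 20, 5
W_train = rng.poisson(1.0, size=(M, N)).astype(float)
U, s, Vt = np.linalg.svd(W_train, full_matrices=False)
U_R, S_R = U[:, :R], np.diag(s[:R])

# Fold in new monolingual documents: reuse U and S from training and project
# each new count column into the R-dimensional space via V^T = S^{-1} U^T W.
W_new = rng.poisson(1.0, size=(M, 3)).astype(float)
V_new_T = np.linalg.inv(S_R) @ U_R.T @ W_new   # shape (R, 3)

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare a folded-in query against a training document in latent space.
sim = cosine(V_new_T[:, 0], Vt[:R, 0])
```

A sanity check on the design: folding a training document back in through S^{−1} U^T recovers exactly its row of V^T from the original decomposition, so new and old documents are directly comparable.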
LSA for the Translation Table P_T(c|e)
Word vs. word comparison

[Figure: In W = U S V^T, each row of W corresponds to a word, either an English word e_i or a Chinese word c_j, so word vectors from both languages live in the same latent space.]

Each row in W corresponds to a word (either e_i or c_j)
Compare c_j ∈ C and e_i ∈ E to find the most similar entries
Estimation of the translation probability ➔ similarity:

P_LSA(c|e) = Sim(c, e)^γ / Σ_{c′∈C} Sim(c′, e)^γ, where γ ≫ 1   (3)
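Eq. (3) sharpens the similarities into a distribution. A minimal sketch, assuming positive similarity values and an illustrative γ = 10 (both assumptions, not the paper's settings):

```python
import numpy as np

def lsa_translation_probs(sims, gamma=10.0):
    """Eq. (3): raise the similarities Sim(c, e) over all Chinese words c to a
    large power gamma and normalize, so that probability mass concentrates on
    the most similar words. Assumes the similarities are positive."""
    scores = np.power(sims, gamma)
    return scores / scores.sum()

# Hypothetical similarities of three Chinese words to one English word e.
sims = np.array([0.9, 0.5, 0.1])
p = lsa_translation_probs(sims)   # heavily favors the first word
```

The large exponent acts like a low-temperature softmax over similarity scores: with γ ≫ 1 almost all of P_LSA(c|e) falls on the best-matching Chinese word.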
Training and Test Corpora
A parallel text for SVD and GIZA++ training:
- Hong Kong News (16K document pairs)
Acoustic model training:
- HUB4-NE Mandarin training data (96K words), ~10 hours
Chinese monolingual language model training:
- XINHUA: 13M words
- HUB4-NE: 96K words
ASR test set: NIST HUB4-NE test data (only F0 portion)
- 1263 sentences, 9.8K words (1997–1998)
English CLIR corpus: NAB-TDT
- NAB (1997 LA, WP) + TDT-2 (1998 APW, NYT)
- 45K docs, 30M words
ASR Experimental Results
Vocab: 51K for Chinese
300-best list rescoring
Oracle best/worst WER: 33.4%/94.4% for Xinhua and 39.7%/95.5% for HUB4-NE

Language Model         Perp   WER     CER     p-value
XINHUA
  Trigram              426    49.9%   28.8%   –
  LSA-interpolated     364    49.3%   28.9%   0.043
  Trigger+LSA-intpl    351    49.0%   28.7%   0.002
  CL-interpolated      346    48.8%   28.4%   < 0.001
HUB4-NE
  Trigram              1195   60.1%   44.1%   –
  LSA-interpolated     695    58.6%   43.1%   < 0.001
  Trigger+LSA-intpl    686    58.7%   43.2%   < 0.001
  CL-interpolated      630    58.8%   43.1%   < 0.001

Table 1: Word-perplexity and ASR WER comparisons
Conclusions
A new approach for cross-lingual language modeling, based on Latent Semantic Analysis:
- A document-aligned corpus suffices rather than a sentence-aligned corpus
- No MT translation dictionary is needed for CLIR
Statistically significant improvements in ASR WER
Statistically similar results to GIZA++-based results
➔ based on p-values of 0.08 (Xinhua) and 0.58 (HUB4-NE), measured between CL-interpolated and LSA-interpolated
Future work:
- Maximum entropy models for combining cross-lingual LMs
- Application to new tasks (e.g. statistical machine translation)
References
T. Schultz and A. Waibel. Language independent and language adaptive large vocabulary speech recognition. Proc. ICSLP 1998, 5:1819-1822, Sydney, Australia.
W. Byrne, P. Beyerlein, J. Huerta, S. Khudanpur, B. Marathi, J. Morgan, N. Peterek, J. Picone, D. Vergyri and W. Wang. Towards language independent acoustic modeling. Proc. ICASSP 2000, 2:1029-1032, Istanbul, Turkey.
D. Yarowsky, G. Ngai and R. Wicentowski. Inducing multilingual text analysis tools via robust projection across aligned corpora. Proc. HLT 2001, pages 109-116, Santa Monica, CA.
S. Khudanpur and W. Kim. Contemporaneous text as side-information in statistical language modeling. Computer Speech and Language, Vol. 18/2, pages 143-162, 2004.
Hong Kong News parallel text corpus, 2000. Available through the Linguistic Data Consortium (LDC2000T46). http://www.ldc.upenn.edu/Catalog/LDC2000T46.html
Matched Pairs Sentence-Segment Word Error (MAPSSWE) Test. http://www.nist.gov/speech/tests/sigtests/mapsswe.htm