Topic-independent Speaking-Style Transformation of Language Model for Spontaneous Speech Recognition
TRANSCRIPT
Topic-independent Speaking-Style Transformation of Language model for
Spontaneous Speech Recognition
Yuya Akita, Tatsuya Kawahara
Introduction
• Spoken style vs. written style
  – Combination of document and spontaneous corpora
• Irrelevant linguistic expressions
  – Model transformation
    • Simulated spoken-style text by randomly inserting fillers
    • Weighted finite-state transducer framework
    • Statistical machine translation framework
• Problem with model transformation methods
  – Small corpus, data sparseness
  – One solution: POS tags
Statistical Transformation of Language model
• Posterior probability:

    P(Y|X) = P(X|Y) P(Y) / P(X)

  – X: source language model (document style)
  – Y: target language model (spoken language)
• So,

    P(Y) = P(Y|X) P(X) / P(X|Y)

  – P(X|Y) and P(Y|X) are the transformation models
• Transformation models can be estimated using a parallel corpus
  – n-gram count:

    N_LM(y) = N_LM(x) · P(y|x) / P(x|y)
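The count transformation above can be sketched in a few lines. This is a toy illustration, not the authors' implementation; the n-gram counts and probabilities are invented values.

```python
# Sketch of the count transformation N_LM(y) = N_LM(x) * P(y|x) / P(x|y).
# All numbers below are toy values, not taken from the paper.

def transform_count(n_x, p_y_given_x, p_x_given_y):
    """Derive a spoken-style n-gram count from a document-style one."""
    return n_x * p_y_given_x / p_x_given_y

# A document-style n-gram "x" observed 1000 times; the transformation model
# maps it to spoken-style "y" with probability 0.3, and the reverse
# direction has probability 0.6.
n_y = transform_count(1000, 0.3, 0.6)
print(n_y)  # 500.0
```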
Statistical Transformation of Language model (cont.)
• Data sparseness problem for the parallel corpus
  – POS information
    • Linear interpolation
    • Maximum entropy
Training
• Use aligned corpus
  – Word-based transformation probability:

    P_word(y|x) = N(x, y) / N(x)

  – POS-based transformation probability:

    P_POS(y|x) = N_POS(x, y) / N_POS(x)

  – P_word(x|y) and P_POS(x|y) are estimated accordingly
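A minimal sketch of the relative-frequency estimation from aligned word pairs. The corpus of aligned (document-style, spoken-style) pairs below is invented for illustration.

```python
from collections import Counter

# Sketch: relative-frequency estimate P_word(y|x) = N(x, y) / N(x)
# from an aligned corpus. The word pairs are toy examples.

aligned_pairs = [("is", "is"), ("is", "'s"), ("is", "'s"), ("not", "n't")]

pair_counts = Counter(aligned_pairs)
source_counts = Counter(x for x, _ in aligned_pairs)

def p_word(y, x):
    """Probability that document-style word x surfaces as spoken-style y."""
    return pair_counts[(x, y)] / source_counts[x]

print(p_word("'s", "is"))  # 2/3
```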
Training (cont.)
• Back-off scheme:

    P(y|x) = P_word(y|x)   if the word pair (x, y) exists
           = P_POS(y|x)    else if the POS pair exists

• Linear interpolation scheme:

    P(y|x) = λ P_word(y|x) + (1 - λ) P_POS(y|x)

• Maximum entropy scheme:

    P(y|x) = (1/Z) exp( Σ_i λ_i f_i(x, y) )

  – The ME model is applied to every n-gram entry of the document-style model
  – A spoken-style n-gram is generated if the transformation probability is larger than a threshold
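The back-off and linear-interpolation schemes can be sketched as follows. The probability tables, POS tags, and the interpolation weight are assumed toy values, not trained parameters.

```python
# Sketch of combining word-based and POS-based transformation probabilities.
# p_word / p_pos are toy lookup tables; lam is an assumed interpolation weight.

p_word = {("is", "'s"): 0.4}       # word-pair estimates
p_pos = {("VBZ", "VBZ"): 0.1}      # POS-pair estimates
pos_of = {"is": "VBZ", "'s": "VBZ"}

def backoff(x, y):
    """Use the word estimate if the pair was seen, else back off to POS."""
    if (x, y) in p_word:
        return p_word[(x, y)]
    return p_pos.get((pos_of[x], pos_of[y]), 0.0)

def interpolate(x, y, lam=0.7):
    """Linear interpolation: lam * P_word + (1 - lam) * P_POS."""
    return lam * p_word.get((x, y), 0.0) + (1 - lam) * p_pos.get((pos_of[x], pos_of[y]), 0.0)

print(backoff("is", "'s"))      # 0.4  (word pair exists)
print(backoff("'s", "is"))      # 0.1  (falls back to the POS pair)
print(interpolate("is", "'s"))  # 0.7 * 0.4 + 0.3 * 0.1 = 0.31
```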
Experiments
• Training corpora:
  – Baseline corpus: National Congress of Japan, 71M words
  – Parallel corpus: Budget Committee meetings in 2003, 666K words
  – Corpus of Spontaneous Japanese, 2.9M words
• Test corpus:
  – Another Budget Committee meeting in 2003, 63K words
Experiments (cont.)
• Evaluation of the generality of the transformation model
• LM (results shown in a figure on the slide)
Experiments (cont.)
(results shown in a figure on the slide)
Conclusions
• Proposed a novel statistical transformation approach for language models
Non-stationary n-gram model
Concept
• Probability of a sentence
  – n-gram LM:

    P(s) = Π_{i=1}^{n} P(w_i | w_{i-n+1}, ..., w_{i-1})

• Actually,

    P(s) = Π_{i=1}^{n} P(w_i | w_1, pl_1, ..., w_{i-1}, pl_{i-1})
         ≈ Π_{i=1}^{n} P(w_i | w_{i-n+1}, pl_{i-n+1}, ..., w_{i-1}, pl_{i-1})
         ≈ Π_{i=1}^{n} P(w_i | w_{i-n+1}, ..., w_{i-1})

  where pl_i denotes the position of word w_i
• Long-distance and word-position information is lost when the Markov assumption is applied
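The position-blindness of a stationary n-gram can be sketched directly: the same word pair gets the same probability wherever it occurs in the sentence. The bigram table is a toy value.

```python
# Sketch: a stationary bigram assigns the same probability to a word pair
# regardless of where it occurs, illustrating the information the Markov
# assumption discards. The probability is a toy value.

bigram = {("I", "think"): 0.2}

def p_stationary(prev, w, position):
    """Stationary bigram lookup: the position argument is simply ignored."""
    return bigram.get((prev, w), 0.0)

# Same value at sentence start and in the middle of the sentence:
print(p_stationary("I", "think", position=1))  # 0.2
print(p_stationary("I", "think", position=9))  # 0.2
```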
Concept (cont.)
• Condition the n-gram on the word position t:

    P(s) = Π_{i=1}^{n} P(w_i | w_1, pl_1, ..., w_{i-1}, pl_{i-1})
         ≈ Π_{i=1}^{n} P(w_i | w_{i-n+1}, ..., w_{i-1}, t)
Training (cont.)
• ML estimation:

    p(w_i | w_{i-n+1}, ..., w_{i-1}, t) = C(w_{i-n+1}, ..., w_i, t) / C(w_{i-n+1}, ..., w_{i-1}, t)

• Smoothing
  – Use lower order
  – Use small bins
  – Transform with smoothed normal n-gram
• Combination
  – Linear interpolation
  – Back-off
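The ML estimate is just a ratio of position-tagged counts, which can be sketched as below. The tiny event list (bigram case, with t as a position bin) is invented for illustration.

```python
from collections import Counter

# Sketch of the ML estimate p(w_i | w_{i-1}, t) = C(w_{i-1}, w_i, t) / C(w_{i-1}, t)
# where t is a position bin. The tiny corpus below is invented.

# (previous word, word, position bin) events
events = [("<s>", "I", 0), ("I", "think", 0), ("<s>", "I", 0), ("I", "am", 0)]

tri = Counter(events)                       # C(w_{i-1}, w_i, t)
ctx = Counter((h, t) for h, _, t in events)  # C(w_{i-1}, t)

def p_ns(w, h, t):
    """Non-stationary bigram probability from counts."""
    return tri[(h, w, t)] / ctx[(h, t)]

print(p_ns("think", "I", 0))  # 0.5
print(p_ns("I", "<s>", 0))    # 1.0
```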
Smoothing with lower order (cont.)
• Additive smoothing:

    P(w_i | w_{i-1}, t) = (1 + C(w_{i-1}, w_i, t)) / (V + C(w_{i-1}, t))

• Back-off smoothing:

    P(w_i | w_{i-1}, t) = P_GT(w_i | w_{i-1}, t)     if C(w_{i-1}, w_i, t) > 0
                        = α(w_{i-1}, t) P(w_i | t)   otherwise

• Linear interpolation:

    P̂(w_i | w_{i-1}, t) = λ_t P(w_i | w_{i-1}, t) + (1 - λ_t) P(w_i | t)
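Additive (add-one) smoothing of the position-dependent bigram can be sketched as follows; the corpus and the vocabulary size V are toy assumptions.

```python
from collections import Counter

# Sketch of add-one smoothing for the position-dependent bigram:
# P(w_i | w_{i-1}, t) = (1 + C(w_{i-1}, w_i, t)) / (V + C(w_{i-1}, t)).
# Corpus and vocabulary size are toy assumptions.

V = 4  # assumed vocabulary size
events = [("I", "think", 0), ("I", "think", 0), ("I", "am", 0)]
tri = Counter(events)
ctx = Counter((h, t) for h, _, t in events)

def p_add1(w, h, t):
    return (1 + tri[(h, w, t)]) / (V + ctx[(h, t)])

print(p_add1("think", "I", 0))  # (1 + 2) / (4 + 3) = 3/7
print(p_add1("you", "I", 0))    # unseen pair still gets 1/7
```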
Smoothing with small bins (k=1) (cont.)
• Back-off smoothing:

    P(w_i | w_{i-1}, t) = P_GT(w_i | w_{i-1}, t)          if C(w_{i-1}, w_i, t) > 0
                        = β(w_{i-1}, t) P̃(w_i | w_{i-1})  otherwise

    with P̃(w_i | w_{i-1}) = P_GT(w_i | w_{i-1})

• Linear interpolation:

    P̂(w_i | w_{i-1}, t) = λ_t P(w_i | w_{i-1}, t) + (1 - λ_t) P̃(w_i | w_{i-1})

    with P̃(w_i | w_{i-1}) = λ P(w_i | w_{i-1}) + (1 - λ) P(w_i)

• Hybrid smoothing:

    P̂(w_i | w_{i-1}, t) = λ_t P(w_i | w_{i-1}, t) + (1 - λ_t) P̃(w_i | w_{i-1})
Transformation with smoothed n-gram

• Novel method:

    P(w_i | w_{i-1}, t) = (1/Z) exp( -(t - Mean(w_i))² / Var(w_i) ) P_SMOOTHED(w_i | w_{i-1})

  – If |t - Mean(w)| decreases, the word is more important at that position
  – Var(w) is used to balance the Mean(w) term for active words
  – Active word: a word that can appear at any position in a sentence
• Combined with back-off smoothing & linear interpolation
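The position-weighting term can be sketched on its own; the per-word means and variances below are toy values (a high variance stands in for an "active" word).

```python
import math

# Sketch of the position weight exp(-(t - Mean(w))**2 / Var(w)) that scales a
# smoothed bigram before renormalisation by Z. Means/variances are toy values.

mean = {"however": 1.0, "the": 5.0}
var = {"however": 2.0, "the": 50.0}  # large variance ~ "active" word

def weight(w, t):
    return math.exp(-((t - mean[w]) ** 2) / var[w])

# A position-sensitive word is strongly preferred near its mean position...
print(weight("however", 1) > weight("however", 8))  # True
# ...while a high-variance (active) word is barely position-sensitive:
print(weight("the", 1) / weight("the", 8))
```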
Experiments
Observation: Marginal position & middle position
Experiments (cont.)
• NS bigram
Experiments (cont.)
• Comparison with three smoothing techniques
Experiments (cont.)
• Error rate with different bins
Conclusions
• The traditional n-gram model is enhanced by relaxing its stationarity assumption and exploiting word-position information in language modeling
Two-way Poisson Mixture model
Essential
• Poisson distribution:

    P(n | λ) = e^{-λ} λ^n / n!

• Poisson mixture model:

    P(X = x | Y = k) = Σ_{r=1}^{R_k} π_{kr} Π_{j=1}^{p} φ(x_j | λ_{krj})

    where φ(x_j | λ_{krj}) = e^{-λ_{krj}} λ_{krj}^{x_j} / x_j!
(Diagram: a document x is modeled as a p-dimensional multivariate Poisson vector, p = lexicon size; for class k, R_k Poisson components with weights π_k1, ..., π_kR_k are summed to give the class likelihood)

* Word clustering: reduce the Poisson dimension => two-way mixtures
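The class likelihood of the Poisson mixture can be sketched directly from the formula. The mixture weights, rate vectors, and word-count vector below are toy values, not trained parameters.

```python
import math

# Sketch of the Poisson mixture class likelihood:
# P(X = x | Y = k) = sum_r pi_kr * prod_j Poisson(x_j | lambda_krj).
# All parameters below are toy values.

def poisson(n, lam):
    return math.exp(-lam) * lam ** n / math.factorial(n)

def class_likelihood(x, weights, lambdas):
    """weights: pi_kr per component; lambdas: per-component rate vectors."""
    return sum(
        pi * math.prod(poisson(xj, lj) for xj, lj in zip(x, lam_r))
        for pi, lam_r in zip(weights, lambdas)
    )

x = [2, 0, 1]                      # word counts over a 3-word lexicon
weights = [0.6, 0.4]               # two components for this class
lambdas = [[1.0, 0.2, 0.5], [2.0, 0.1, 1.0]]
print(class_likelihood(x, weights, lambdas))
```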