Personalising speech-to-speech translation


Page 1: Personalising speech-to-speech translation

Presenter: Behzad Ghorbani, 2013/12/8

IRIB University

Personalising speech-to-speech translation

Page 2: Personalising speech-to-speech translation

- Abstract
- Introduction
- Speech-to-speech translation
- Speaker adaptation for HMM-based TTS

Outline

Page 3: Personalising speech-to-speech translation

In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation, in which we employ an HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches, as well as an end-to-end speaker-adaptive speech-to-speech translation system.

Abstract

Page 4: Personalising speech-to-speech translation

Speech-to-speech translation:
- convert speech in the input language into text in the input language (ASR)
- convert text in the input language into text in the output language (machine translation)
- convert text in the output language into speech in the output language (TTS); the full chain is sketched below
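A minimal sketch of the three-stage pipeline above, assuming hypothetical recognise/translate/synthesise components (placeholder names, not the paper's actual modules):

    # Sketch of the speech-to-speech translation pipeline.
    # The three component functions are stubs: a real system plugs in
    # an ASR decoder, a machine translator and a TTS synthesiser.

    def recognise(speech_in, input_lang):
        """ASR: speech in the input language -> text in the input language."""
        ...

    def translate(text_in, input_lang, output_lang):
        """MT: text in the input language -> text in the output language."""
        ...

    def synthesise(text_out, output_lang):
        """TTS: text in the output language -> speech in the output language."""
        ...

    def speech_to_speech(speech_in, input_lang, output_lang):
        text_in = recognise(speech_in, input_lang)
        text_out = translate(text_in, input_lang, output_lang)
        return synthesise(text_out, output_lang)

Personalisation enters at the first and last stages: ASR adapts to the user's voice, and the same information is reused to make the TTS output sound like the user.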

Page 5: Personalising speech-to-speech translation

ASR converts speech in the input language into text in the input language.

Automatic Speech Recognition

Page 6: Personalising speech-to-speech translation

TTS converts text in the output language into speech in the output language.

Text-to-Speech

Page 7: Personalising speech-to-speech translation

The major focus of our work is on the personalisation of speech-to-speech translation using HMM-based ASR and TTS, which involves the development of unifying techniques for ASR and TTS, as well as the investigation of methods for unsupervised and cross-lingual modelling and adaptation for TTS.

SST with hidden Markov models

Page 8: Personalising speech-to-speech translation

Acoustic features
- ASR: features based on low-dimensional short-term spectra.
- TTS: acoustic feature extraction includes the mel-cepstrum (a feature-extraction sketch follows below).

Acoustic modelling
- ASR: a basic HMM topology using phonetic decision trees.
- TTS: acoustic models use multiple-stream, single-Gaussian state emission pdfs with decision trees.

HMM-based ASR and TTS
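As an illustration only: a minimal feature-extraction sketch using librosa. MFCCs stand in for the low-dimensional short-term spectral features used for ASR; the paper's TTS front-end (mel-cepstrum extraction) is not reproduced here. The analysis settings follow the figures quoted later in the deck (16 kHz sampling, 25 ms Hamming window, 10 ms shift); 'speech.wav' is a hypothetical input file.

    # MFCC extraction as a stand-in for low-dimensional short-term
    # spectral features (illustrative; not the paper's front-end).
    import librosa

    y, sr = librosa.load("speech.wav", sr=16000)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr,
        n_mfcc=13,        # low-dimensional cepstral features
        n_fft=400,        # 25 ms window at 16 kHz
        hop_length=160,   # 10 ms shift
        window="hamming",
    )
    print(mfcc.shape)     # (13, number_of_frames)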

Page 9: Personalising speech-to-speech translation

Pipeline translation framework

Page 10: Personalising speech-to-speech translation

- modules are used independently of one another
- ASR is necessary to extract text for the machine translator
- TTS models are adapted to the user's voice characteristics
- drawbacks: a large degree of redundancy in the system and complex integration of components, though this need not involve any compromises to performance

Pipeline translation framework

Page 11: Personalising speech-to-speech translation

Unified translation framework

Page 12: Personalising speech-to-speech translation

- use common modules for both ASR and TTS
- conceptually simpler with a minimum of redundancy
- TTS front-end is not required on the input language side
- use of common feature extraction and acoustic modelling techniques for ASR and TTS

Unified translation framework

Page 13: Personalising speech-to-speech translation

Ideally: collect a large amount of speech data from the target speaker. But such recordings are extremely time-consuming and expensive.

Speaker adaptation has been proposed instead: an average voice model set is trained and then transformed to that of the target speaker using utterances read by the particular speaker (an illustrative transform is given below).

Speaker adaptation for HMM-based TTS
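As an illustration only (the deck gives no adaptation formulas): speaker adaptation for HMM-based models is commonly realised with linear transforms of the average-voice parameters, e.g. MLLR-style mean adaptation. For each state, the average-voice mean \mu is mapped towards the target speaker as

    \hat{\mu} = A\mu + b = W\xi, \qquad \xi = [\mu^{\top}, 1]^{\top}

where the transform W = [A, b] is estimated by maximum likelihood from the target speaker's adaptation utterances. The paper's exact scheme (e.g. constrained/structured MLLR variants) may differ.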

Page 14: Personalising speech-to-speech translation

Speaker adaptation plays two key roles in SST:
- on the ASR side: it increases recognition accuracy
- on the TTS side: it is used to personalise the speech synthesised in the output language

The main challenges are unsupervised adaptation and cross-lingual adaptation of TTS.

Speaker adaptation for HMM-based TTS

Page 15: Personalising speech-to-speech translation

- Using TTS front-end
- Two-pass decision tree construction
- Decision tree marginalisation
- Differences between two-pass decision tree construction and decision tree marginalisation

Unsupervised adaptation

Page 16: Personalising speech-to-speech translation

A straightforward approach: a combination of a word-level continuous speech recogniser and conventional speaker adaptation for HMM-based TTS.

Using TTS front-end

Page 17: Personalising speech-to-speech translation

Two-pass decision tree construction

Page 18: Personalising speech-to-speech translation

Pass 1: uses only questions relating to the left, right and central phonemes to construct a phonetic decision tree.

Pass 2: extends the decision tree constructed in Pass 1 by introducing additional questions relating to suprasegmental information (a sketch follows below).

Two-pass decision tree construction
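A hedged sketch of the two-pass construction, assuming a toy variance-reduction split criterion and simplified data structures (the paper's clustering uses a maximum-likelihood criterion over HMM state statistics):

    # Two-pass decision tree construction, sketched with a toy gain.
    from dataclasses import dataclass
    from typing import Callable, List, Optional
    import statistics

    @dataclass
    class State:
        context: dict      # e.g. {"L": "a", "C": "t", "R": "i", "stress": 1}
        stat: float        # toy sufficient statistic used by the split gain

    @dataclass
    class Node:
        question: Optional[Callable[[State], bool]] = None
        yes: Optional["Node"] = None
        no: Optional["Node"] = None
        states: Optional[List[State]] = None   # set at leaves

    def gain(states, q):
        """Toy variance-reduction gain for splitting 'states' with q."""
        yes = [s.stat for s in states if q(s)]
        no = [s.stat for s in states if not q(s)]
        if len(yes) < 2 or len(no) < 2:
            return 0.0
        tot = lambda xs: statistics.pvariance(xs) * len(xs)
        return tot([s.stat for s in states]) - tot(yes) - tot(no)

    def grow(states, questions, min_gain=1e-6):
        """Greedily split a pool of states with the given question set."""
        best = max(questions, key=lambda q: gain(states, q), default=None)
        if best is None or gain(states, best) <= min_gain:
            return Node(states=states)
        yes = [s for s in states if best(s)]
        no = [s for s in states if not best(s)]
        return Node(best, grow(yes, questions, min_gain),
                          grow(no, questions, min_gain))

    def leaves(node):
        if node.states is not None:
            return [node]
        return leaves(node.yes) + leaves(node.no)

    def two_pass_tree(states, phonetic_qs, supraseg_qs):
        # Pass 1: phonetic questions only (left/right/central phoneme).
        tree = grow(states, phonetic_qs)
        # Pass 2: extend each Pass-1 leaf with suprasegmental questions.
        for leaf in leaves(tree):
            sub = grow(leaf.states, supraseg_qs)
            leaf.question, leaf.yes, leaf.no, leaf.states = (
                sub.question, sub.yes, sub.no, sub.states)
        return tree

The intent, as the deck's unsupervised-adaptation context suggests, is that the upper part of the tree depends only on phonetic context (which is available from ASR output), with suprasegmental questions confined to the Pass-2 subtrees.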

Page 19: Personalising speech-to-speech translation

State-mapping based approaches to cross-lingual adaptation:
- KLD-based state mapping
- Probabilistic state mapping

Cross-lingual adaptation

Page 20: Personalising speech-to-speech translation

Establishing state-level mapping rules consists of two steps:
- two average voice models are trained, one in each of the two languages
- each HMM state Ω^s in language s is associated with an HMM state Ω^g in language g

In the transform mapping approach, intra-lingual adaptation is first carried out in the input language.

State-mapping based approaches to cross-lingual adaptation

Page 21: Personalising speech-to-speech translation

Based on the above KLD measurement, the nearest state model Ω^s_k in the output language for each state model Ω^g_j in the input language is calculated as

    Ω^s_k = argmin over Ω^s of D_KLD(Ω^g_j, Ω^s).

Finally, we map all the state models in the input language to the state models in the output language, which can be formulated as the mapping M(Ω^g_j) = Ω^s_k for every input-language state j.

This establishes a state mapping from the model space of an input language to that of an output language; note that the mapping direction can also be reversed.

KLD-based state mapping

Page 22: Personalising speech-to-speech translation

The state mapping is learned by searching for pairs of states that have minimum KLD between input- and output-language HMMs (a numeric sketch follows below).

KLD-based state mapping
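An illustration of the idea (not the paper's implementation): for single-Gaussian, diagonal-covariance states, the symmetric KLD has a closed form, and each input-language state is mapped to the output-language state minimising it. Toy random "models" stand in for the trained average-voice states:

    # KLD-based state mapping for single-Gaussian, diagonal-covariance
    # HMM states. Each state is a (mean, variance) pair of shape (d,).
    import numpy as np

    def sym_kld(mu1, var1, mu2, var2):
        """Symmetric KLD between two diagonal Gaussians."""
        d = mu1.shape[0]
        diff2 = (mu1 - mu2) ** 2
        kl12 = 0.5 * (np.sum(np.log(var2 / var1))
                      + np.sum((var1 + diff2) / var2) - d)
        kl21 = 0.5 * (np.sum(np.log(var1 / var2))
                      + np.sum((var2 + diff2) / var1) - d)
        return kl12 + kl21

    def build_mapping(input_states, output_states):
        """Map each input-language state to its nearest output-language state."""
        mapping = {}
        for j, (mu_g, var_g) in enumerate(input_states):
            klds = [sym_kld(mu_g, var_g, mu_s, var_s)
                    for mu_s, var_s in output_states]
            mapping[j] = int(np.argmin(klds))   # k minimising D_KLD
        return mapping

    rng = np.random.default_rng(0)
    d = 5
    in_states = [(rng.normal(size=d), rng.uniform(0.5, 2.0, size=d))
                 for _ in range(8)]
    out_states = [(rng.normal(size=d), rng.uniform(0.5, 2.0, size=d))
                  for _ in range(10)]
    print(build_mapping(in_states, out_states))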

Page 23: Personalising speech-to-speech translation

The approaches previously described generate a deterministic mapping.

An alternative is to derive a stochastic mapping, which could take the form of a mapping between states, or from states directly to the adaptation data.

The simplest way: perform ASR on the adaptation data using an acoustic model of the output language.

The resulting sequence of recognised phonemes provides the mapping from data in the input language to states in the output language, even though the phoneme sequence itself is meaningless (a toy sketch follows below).

Probabilistic state mapping
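A toy sketch of the simplest variant described above, where adaptation data are associated directly with output-language states. Random scores stand in for the state log-likelihoods a real decoder would produce; the argmax per frame is a crude 1-best decode:

    # Associating input-language adaptation frames with
    # output-language states via a frame-level 1-best decode.
    import numpy as np

    rng = np.random.default_rng(1)
    n_frames, n_out_states = 12, 6

    # Hypothetical log-likelihood of each output-language state
    # for each input-language adaptation frame.
    loglik = rng.normal(size=(n_frames, n_out_states))

    # Each frame is mapped to its best-scoring output-language state.
    state_seq = np.argmax(loglik, axis=1)
    print(state_seq)

The resulting state sequence plays the role of a transcription for conventional speaker adaptation, even though, as noted above, it is linguistically meaningless.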

Page 24: Personalising speech-to-speech translation

This section describes independent studies that investigate SST in the context of different combinations of the above approaches.

Experimental studies

Page 25: Personalising speech-to-speech translation

The aim in presenting these studies is to show the range of techniques evaluated to date, to discuss their relative benefits and disadvantages, and to characterise the effectiveness of the approaches with respect to three main criteria, using conventional objective and subjective metrics:
- generality across languages
- supervised versus unsupervised adaptation
- cross-lingual versus intra-lingual adaptation

Experimental studies

Page 26: Personalising speech-to-speech translation

Study 1: Finnish–English
- used a simple unsupervised probabilistic mapping technique with two-pass decision tree construction
- speaker-adaptive training on the Wall Street Journal (WSJ) SI84 dataset
- English and Finnish versions of this dataset were recorded in identical acoustic environments by a native Finnish speaker also competent in English

Experimental studies

Page 27: Personalising speech-to-speech translation

- System A: average voice
- System B: unsupervised cross-lingual adapted
- System C: unsupervised intra-lingual adapted
- System D: supervised intra-lingual adapted
- System E: vocoded natural speech

Experimental studies

Page 28: Personalising speech-to-speech translation

- listeners judged the naturalness of an initial set of synthesised utterances and the similarity of synthesised utterances to the target speaker's speech
- judgements used a five-point psychometric response scale
- 24 native English and 16 native Finnish speakers took part in the evaluation

Experimental studies

Page 29: Personalising speech-to-speech translation


Experimental studies

Page 30: Personalising speech-to-speech translation

Study 2: Chinese–English
- compares different cross-lingual speaker adaptation schemes in supervised and unsupervised modes

Setup
- uses the Mandarin Chinese–English language pair
- data sets from the corpora SpeeCon (Mandarin) and WSJ (English)
- two Chinese students (H and Z) who also spoke English
- Mandarin and English were defined as the input (L1) and output (L2) languages, respectively
- four different cross-lingual adaptation schemes were evaluated

Experimental studies

Page 31: Personalising speech-to-speech translation

Results
- system performance was evaluated using objective metrics
- the listening test consisted of two sections: naturalness and speaker similarity
- twenty listeners took part in the test

Experimental studies

Page 32: Personalising speech-to-speech translation


Experimental studies

Page 33: Personalising speech-to-speech translation

Study 3: English–Japanese
- also performed within a speech-to-speech translation system

Setup
- experiments on unsupervised English-to-Japanese speaker adaptation for HMM-based ASR and TTS
- a speaker-independent model for ASR; an average voice model for TTS
- English: 84 speakers included in the "short term" subset (15 hours of speech)
- Japanese: 86 speakers from the JNAS database (19 hours of speech)
- the target was one female American English speaker not included in the training set

Experimental studies

Page 34: Personalising speech-to-speech translation

Setup (continued)
- speech sampled at a rate of 16 kHz, windowed by a 25 ms Hamming window with a 10 ms shift

Results
- synthetic speech was generated from 7 models
- 10 native Japanese listeners participated in the listening test
- listeners heard first an original utterance from the database and then synthetic speech, and gave an opinion score for the second sample relative to the first (DMOS)

Experimental studies
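For illustration only: DMOS results are conventionally summarised as a mean rating per model, often with a confidence interval. A toy aggregation over hypothetical ratings (not the study's data):

    # Toy DMOS aggregation: rows are listeners, columns are models.
    # Ratings are hypothetical 5-point scores, NOT the study's data.
    import numpy as np

    rng = np.random.default_rng(2)
    ratings = rng.integers(1, 6, size=(10, 7))   # 10 listeners x 7 models

    dmos = ratings.mean(axis=0)
    sem = ratings.std(axis=0, ddof=1) / np.sqrt(ratings.shape[0])
    for i, (m, s) in enumerate(zip(dmos, sem), start=1):
        print(f"Model {i}: DMOS = {m:.2f} +/- {1.96 * s:.2f}")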

Page 35: Personalising speech-to-speech translation

Experimental studies

Page 36: Personalising speech-to-speech translation

Experimental studies