Speech Processing 15-492/18-492 Multilinguality SPICE: making it easier


Page 1: Speech Processing 15-492/18-492 · Speech Processing 15-492/18-492 Multilinguality SPICE: ... Grant 10/2004, 3 years • Principle Investigators Tanja Schultz and Alan Black

Speech Processing 15-492/18-492

Multilinguality
SPICE: making it easier


Dealing with *all* Languages

• Over 6000 languages
  • Maybe not all commercially interesting ... now
• Major languages (economic)
  • Cell phone manufacturers list 46 languages
  • But even those are not all covered

Motivation

• Computerization: Speech is key technology
  • Mobile devices, ubiquitous information access
• Globalization: Multilinguality
  • More than 6000 languages in the world
  • Multiple official languages
    • Europe has 20+ official languages
    • South Africa has 11 official languages
⇒ Speech processing in multiple languages
  • Cross-cultural human-human interaction
  • Human-machine interface in mother tongue


Challenges

• Algorithms are language independent but require data
  • Dozens of hours of audio recordings and corresponding transcriptions
  • Pronunciation dictionaries for large vocabularies (>100,000 words)
  • Millions of words of written text corpora in the various domains in question
  • Bilingual aligned text corpora
• BUT: Such data is only available in very few languages
  • Audio data ≤ 40 languages; transcriptions take up to 40x real time
  • Large vocabulary pronunciation dictionaries ≤ 20 languages
  • Small text corpora ≤ 100 languages, large corpora ≤ 30 languages
  • Bilingual corpora in very few language pairs, pivot mostly English
• Additional complications:
  • Combinatorial explosion (domain, speaking style, accent, dialect, ...)
  • Few native speakers at hand for minority (endangered) languages
  • Languages without writing systems


Solution: Learning Systems

⇒ Systems that learn a language from the user
• Efficient learning algorithms for speech processing
  • Learning:
    • Interactive learning with user in the loop
    • Statistical modeling approaches
  • Efficiency:
    • Reduce amount of data (save time and costs): by a factor of 10
    • Speed up development cycles: days rather than months
⇒ Rapid language adaptation from universal models
• Bridge the gap: language and technology experts
  • Technology experts do not speak all languages in question
  • Native users are not in control of the technology


Sharing data between modules

[Figure: Speech-to-Speech Translation between L_source and L_target (Input L_s, Input L_t, Output L_s). Module labels: AM_s and Dict_s (word → phone sequence), LM_s (N-grams), Lex_st and Lex_ts (word_s ↔ word_t), LM_t (N-grams), AM_t and Dict_t (word → phone sequence). The same Dict and LM labels appear in more than one module.]


SPICE

Speech Processing: Interactive Creation and Evaluation toolkit

• National Science Foundation, Grant 10/2004, 3 years

• Principal Investigators: Tanja Schultz and Alan Black

• Bridge the gap between technology experts → language experts

• Automatic Speech Recognition (ASR),

• Machine Translation (MT),

• Text-to-Speech (TTS)

• Develop web-based intelligent systems

• Interactive Learning with user in the loop

• Rapid Adaptation of universal models to unseen languages

• SPICE webpage http://cmuspice.org


Spice Project Page


Speech Processing Systems

[Figure: system diagram. Input: Speech; components AM, Lex, LM, NLP / MT, TTS; Output: Speech & Text ("Hello"). Resources shown: phone set & speech data; pronunciation rules (hi /h/ /ai/, you /j/ /u/, we /w/ /i/); text data ("hi you", "you are", "I am").]


Rapid Portability: Data

[Figure: Speech Processing Systems diagram (Input: Speech; AM, Lex, LM; NLP / MT; TTS; Output: Speech & Text) with the label "Phone set & Speech data".]


Finding “Nice” Prompts

• From very large text databases
• Find “nice” sentences:
  • Containing only high frequency words
  • 5-15 words
• Find grapheme/phoneme balanced set
  • Select sentences with best triphone/graph
• 500-1000 sentences
• Collect for ASR and TTS acoustic modeling

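The selection of "nice" prompts can be sketched roughly as follows. This is only an illustration under stated assumptions, not the SPICE implementation: the function name `select_prompts` and its parameters are hypothetical, and letter trigrams are used as a stand-in for triphone coverage because no pronunciation dictionary is assumed to exist yet.

```python
from collections import Counter

def select_prompts(sentences, vocab_size=1000, min_len=5, max_len=15, n_prompts=500):
    """Greedily pick short, high-frequency-word sentences that add the
    most unseen letter trigrams (an assumed stand-in for triphones)."""
    tokenized = [s.lower().split() for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    common = {w for w, _ in freq.most_common(vocab_size)}

    # Keep only "nice" sentences: right length, only frequent words.
    candidates = []
    for sent, words in zip(sentences, tokenized):
        if min_len <= len(words) <= max_len and all(w in common for w in words):
            text = " ".join(words)
            units = {text[i:i + 3] for i in range(len(text) - 2)}
            candidates.append((sent, units))

    selected, covered = [], set()
    while candidates and len(selected) < n_prompts:
        # Pick the sentence that contributes the most uncovered units.
        best = max(candidates, key=lambda c: len(c[1] - covered))
        if not best[1] - covered:
            break  # no remaining sentence adds anything new
        selected.append(best[0])
        covered |= best[1]
        candidates.remove(best)
    return selected
```

A greedy pass like this is cheap and tends to give a reasonably balanced set, but its output would still need the manual check by a native speaker that the slides call for.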

Prompt Selection Issues

• Need good text
  • De-htmlify, well-written, no misspelling
• Need word segmentation
  • Japanese, Chinese, Thai
• Natural text is often mixed language
  • Hindi newspaper text has lots of English words
• Automatic selection has errors
  • Need speaker to do further selection
  • E.g. lots of telephone numbers, formatting commands
• CMU Arctic used similar methods


Recording Prompts


GlobalPhone

Multilingual Database
• Widespread languages
• Native speakers
• Uniform data
• Broad domain
• Large text resources
  • Internet, newspaper

Corpus
• 19 languages … counting
• ≥ 1800 native speakers
• ≥ 400 hrs audio data
• Read speech
• Filled pauses annotated

Languages: Arabic, Ch-Mandarin, Ch-Shanghai, German, French, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Czech, Turkish, + Thai, + Creole, + Polish, + Bulgarian, + ... ???

Now available from ELRA!


Speech Recognition in 17 Languages

[Figure: bar chart of Word Error Rate [%] (axis 0-40) for Japanese, German, English, Thai, Korean, Ch-Mandarin, Turkish, French, Portuguese, Croatian, Spanish, Bulgarian, Russian, Afrikaans, Chinese, Arabic, Iraqi. Bar values range from 10 to 33.5.]


Rapid Portability: Acoustic Models

[Figure: Speech Processing Systems diagram (Input: Speech; AM, Lex, LM; NLP / MT; TTS; Output: Speech & Text) with the label "Phone set & Speech data".]


Universal Sound Inventory

Speech production is independent of language ⇒ IPA
1) IPA-based universal sound inventory
2) Each sound class is trained by data sharing
  • Reduction from 485 to 162 sound classes
  • m, n, s, l appear in all 12 languages
  • p, b, t, d, k, g, f and i, u, e, a, o in almost all

Problem: Contexts of sounds are language specific. Context-dependent models for new languages?
Solution:
1) Multilingual decision context trees
2) Specialize decision tree by adaptation

[Figure: decision tree over contexts of /k/ in the German words Blaukraut, Brautkleid, Brotkorb, Weinkarte. The questions "-1=Plosiv?" (is the preceding phone a plosive?) and "+2=Vokal?" (is the phone two positions ahead a vowel?), with N/J (no/yes) branches, split the contexts lau-k-ra, ut-k-le, ot-k-or, in-k-ar across the nodes k(0), k(1), k(2).]
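The idea of pooling language-specific phones into shared IPA sound classes can be illustrated with a small sketch. The phone sets below are hypothetical, deliberately incomplete toy subsets chosen for illustration, not the inventories behind the 485 → 162 figure, and `universal_inventory` is an assumed helper name.

```python
from collections import Counter

# Hypothetical toy phone sets in IPA; real inventories are much larger.
PHONE_SETS = {
    "German":  {"m", "n", "s", "l", "p", "t", "k", "a", "i", "u", "ç", "ʏ"},
    "Spanish": {"m", "n", "s", "l", "p", "t", "k", "a", "i", "u", "ɲ", "θ"},
    "Turkish": {"m", "n", "s", "l", "p", "t", "k", "a", "i", "u", "ɯ", "ʃ"},
}

def universal_inventory(phone_sets):
    """Merge per-language phone sets into shared IPA sound classes."""
    # How many languages contain each IPA symbol.
    counts = Counter(p for phones in phone_sets.values() for p in phones)
    # Number of language-specific phone models before sharing.
    total = sum(len(phones) for phones in phone_sets.values())
    # Sound classes present in every language.
    shared_by_all = {p for p, c in counts.items() if c == len(phone_sets)}
    return counts, total, shared_by_all

counts, total, shared = universal_inventory(PHONE_SETS)
print(f"{total} language-specific phones -> {len(counts)} shared sound classes")
print("in all languages:", " ".join(sorted(shared)))
```

Each shared class can then be trained on the pooled data from every language that contains it, which is what makes the merged inventory useful for a new language with little data.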


Choosing Phonemes


Rapid Portability: Acoustic Model

[Figure: bar chart of Word Error Rate [%] (axis 0-100). Bar values: 69.1, 57.1, 49.9, 40.6, 32.8, 28.9, 19.6, 19. X-axis labels: 0, 0:15, 0:15, 0:25, 0:25, 0:25, 1:30, 16:30. Legend: Ø Tree, ML-Tree, Po-Tree, PDTS.]


Rapid Portability: Pronunciation Dictionary

[Figure: Speech Processing Systems diagram (Input: Speech; AM, Lex, LM; NLP / MT; TTS; Output: Speech & Text) with the label "Pronunciation rules".]

Text data:
• "adios" → /a/ /d/ /i/ /o/ /s/
• "Hallo" → /h/ /a/ /l/ /o/
• "Phydough" → ???


Phoneme- vs Grapheme based ASR

[Figure: bar chart of Word Error Rate [%] (axis 0-50) for English, Spanish, German, Russian, Thai, comparing Phoneme, Grapheme, and Grapheme (FTT) systems. Bar values in extraction order: 11.5, 19.2, 18.4, 24.5, 26.8, 15.6, 14, 12.7, 33, 36.4, 32.8, 16, 26.4, 18.3.]

Problem:
• 1 grapheme ≠ 1 phoneme

Flexible Tree Tying (FTT): one decision tree
• Improved parameter tying
• Less over-specification
• Fewer inconsistencies

[Figure: a single decision tree with questions "0=vowel?", "0=obstruent?", "0=begin-state?", "-1=syllabic?", "0=mid-state?", "-1=obstruent?", "0=end-state?" and leaves such as AX-m, IX-m, AX-b.]
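What makes the grapheme-based approach attractive for a new language is that its "pronunciation dictionary" needs no linguistic knowledge: each word is simply spelled out as its sequence of written symbols. A minimal sketch follows; the function name `grapheme_dictionary` is hypothetical, and treating every lowercase character as one unit is a simplifying assumption.

```python
def grapheme_dictionary(words):
    """Map each word to its grapheme sequence (here: lowercase characters)."""
    return {w: list(w.lower()) for w in words}

lexicon = grapheme_dictionary(["Hallo", "adios"])
for word, units in lexicon.items():
    # Each line has the usual dictionary layout: word, then its units.
    print(word, " ".join(units))
```

The price is that one grapheme may stand for several sounds, which is the mismatch the slide's "1 grapheme ≠ 1 phoneme" point refers to.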


Dictionary: Interactive Learning

* Follow the work of Davel & Barnard
* Word list: extract from text
* Update after each w_i → more effective training
* Kominek & Black
* G-2-P (grapheme context)
  - explicit mapping rules
  - neural networks
  - decision trees
  - instance learning

[Figure: flowchart of the interactive loop between User, word list W, G-2-P, TTS and Lex. Select the best word w_i from W (i := best select); generate pronunciation P(w_i) with G-2-P; play it through TTS; ask "P(w_i) okay?". Yes: delete w_i. No: improve P(w_i), update G-2-P, delete w_i. A "Skip" option is also shown.]
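The loop in the flowchart can be sketched as below. This is a toy illustration under strong assumptions rather than the Lex Learner itself: the G-2-P model is reduced to a one-letter-to-one-phone table, "best select" is approximated by picking the word with the fewest unknown letters, and `learn_lexicon` and `ask_user` are hypothetical names.

```python
def learn_lexicon(word_list, g2p, ask_user):
    """Interactive loop: predict, let the user confirm or correct, update.

    g2p:      dict mapping one grapheme to one phone (toy rule set)
    ask_user: callback(word, predicted) -> final phone list,
              or None when the user skips the word
    """
    lexicon = {}
    pending = list(word_list)
    while pending:
        # Simplified "best select": the word with the fewest unknown letters.
        word = min(pending, key=lambda w: sum(ch not in g2p for ch in w))
        pending.remove(word)  # delete w_i from the word list
        predicted = [g2p.get(ch, "?") for ch in word]
        answer = ask_user(word, predicted)
        if answer is None:
            continue  # skipped: nothing is stored or learned
        lexicon[word] = answer
        # Update the rules after every word so later predictions improve.
        if len(answer) == len(word):
            for ch, phone in zip(word, answer):
                g2p[ch] = phone
    return lexicon

# Simulated user who knows two pronunciations and skips everything else.
reference = {"sol": ["s", "o", "l"], "los": ["l", "o", "s"]}
lexicon = learn_lexicon(["sol", "los", "xyz"], {}, lambda w, p: reference.get(w))
print(lexicon)
```

Updating after every word means the second word is already predicted correctly from what the first one taught, which is the "more effective training" point on the slide.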


Spice: Lex Learner


Spice: Lex Learner


Issues and Challenges

• How to make best use of the human?
  • Definition of successful completion
  • Which words to present in what order
  • How to be robust against mistakes
  • Feedback that keeps users motivated to continue
• How many words?
  • G2P complexity is language dependent
  • 80% coverage: hundred (SP) to thousands (EN)
  • G2P rule system perplexity

Language     Perplexity
English      50.11
Italian      3.52
Spanish      1.21
Afrikaans    11.48
German       16.70
Dutch        16.80
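To give a feel for what a G2P perplexity expresses, the sketch below computes 2^H(phone | letter) from letter-phone pairs that are assumed to be already aligned. This is one common way to quantify how ambiguous a spelling system is; it is not necessarily the measure behind the numbers in the table, and `g2p_perplexity` is a hypothetical name.

```python
import math
from collections import Counter, defaultdict

def g2p_perplexity(aligned_pairs):
    """Perplexity 2**H(phone | letter) over aligned (letter, phone) pairs."""
    by_letter = defaultdict(Counter)
    for letter, phone in aligned_pairs:
        by_letter[letter][phone] += 1
    total = sum(sum(c.values()) for c in by_letter.values())

    entropy = 0.0
    for phones in by_letter.values():
        n = sum(phones.values())
        for count in phones.values():
            # Accumulate -p(letter, phone) * log2 p(phone | letter).
            entropy -= (count / total) * math.log2(count / n)
    return 2 ** entropy

# A fully regular toy spelling: every letter always maps to the same phone.
regular = [("a", "a"), ("b", "b"), ("a", "a")]
# An ambiguous toy spelling: "c" is /k/ half the time and /s/ half the time.
ambiguous = [("c", "k"), ("c", "s")]
print(g2p_perplexity(regular), g2p_perplexity(ambiguous))
```

A value of 1 means the spelling fully determines the pronunciation; larger values mean more words must be shown to the user before the rules become reliable.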


Rapid Portability: LM

[Figure: Speech Processing Systems diagram (Input: Speech; AM, Lex, LM; NLP / MT; TTS; Output: Speech & Text). Text data from Internet / TV, with the labels "Inquiry" and "Automatic Extraction", feeds the LM.]

Bridge Languages: Resource rich languages ↔ Resource low languages
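Once text has been collected, the N-gram LM is estimated from word-sequence counts. The sketch below trains a bigram model on the example phrases from the system diagram; add-one smoothing is used only as a simple stand-in for the smoothing a real toolkit would apply, and `train_bigram_lm` is a hypothetical name.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Return prob(word, prev) estimated from bigram counts."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])  # counts of each word used as history
        bigrams.update(zip(tokens, tokens[1:]))
    # Words that can be predicted: all seen words plus the end marker.
    vocab = {w for sent in sentences for w in sent.lower().split()} | {"</s>"}

    def prob(word, prev):
        # Add-one smoothing (an assumption made here for simplicity).
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

    return prob

prob = train_bigram_lm(["hi you", "you are", "I am"])
print(prob("you", "hi"), prob("are", "hi"))
```

With so little text almost every bigram is unseen, which is why the amount and domain match of the collected text matter so much for a new language.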


Parametric TTS

• Text-to-speech for G2P learning:
  • Technique: phoneme-by-phoneme concatenation, speech not natural but understandable (Marelie Davel)
  • Units are based on IPA phoneme examples
    • PRO: covers languages through simple adaptation
    • CONS: not good enough for speech applications
• Text-to-speech for applications:
  • Statistical parametric systems: clustergen
  • Clusters representing context-dependent allophones
    • PRO: can work with little speech (10 minutes)
    • PRO: robust to erroneous data
    • CONS: speech sounds buzzy, lacks natural prosody


SPICE: Afrikaans - English

• Goal: Build Afrikaans - English S2S using SPICE
  • Cooperation with University Stellenbosch and ARMSCOR
  • Bilingual PhD visited CMU for 3 months (Herman Engelbrecht)
  • Afrikaans: related to Dutch and English, g-2-p very close, regular grammar, simple morphology
• SPICE, all components apply statistical modeling paradigm
  • ASR: HMMs, N-gram LM (JRTk-ISL)
  • MT: Statistical MT (SMT-ISL)
  • TTS: Unit-Selection (Festival)
  • Dictionary: G-2-P rules using CART decision trees
• Text: 39 hansard; 680k words; 43k bilingual aligned sentence pairs
• Audio: 6 hours read speech; 10k utterances


SPICE: Time effort

• Results: ASR 20% WER; MT A-E (E-A) Bleu 34.1 (34.7), Nist 7.6 (7.9)
• Shared pronunciation dictionaries (for ASR+TTS) and LM (for ASR+MT)
• Most time consuming process: data preparation → reduce amount of data!
• Still too much expert knowledge required (e.g. ASR parameter tuning!)

[Figure: bar chart of effort in days (axis 0-25) per phase (Data, Training, Tuning, Evaluation, Prototype), broken down by component: AM (ASR), Lex, LM (ASR, MT), TM (MT), TTS, S-2-S.]
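The ASR result above is reported as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference, divided by the number of reference words. A minimal sketch of that standard computation follows; the function name is hypothetical and a non-empty reference is assumed.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning ref[:i] with an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)  # assumes the reference is not empty

print(word_error_rate("hi you are here", "hi you were"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.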


Current Tests

• 11 students in CMU class
  • Hindi (2), Vietnamese (2), French, German (2), Bulgarian, Telugu, Cantonese, Mandarin
• Build complete S2S system
  • Teams of 2 for translation on small domain
  • Translation is simple phrase-based
• Purpose:
  • Have students get full experience
  • Find bugs/limitations in the system
  • Evaluate resulting systems for development time and accuracy


HW2: TTS

• Due 3:30pm Monday October 20th
• Install Festival and Festvox
• Find 10 errors in each of two different synthesizers
• Build a voice
  • A Talking Clock
  • A general voice
  • (or both)
