introduction to spoken language systems

Introduction to Spoken Language Systems

Staszek Pasko ([email protected])

WHO WE ARE

We are a team of scientists and developers working on audio, speech and language solutions that will revolutionize how customers interact with products and services.

Speech@Amazon Customers

HOW DOES SLU WORK?

Speech User Interface Flow (diagram)

Components: User → ASR → NLU → Skills → TTS → User
Data passed along: Speech → Words → Intents → Actions → Speech Output

Component: Automatic Speech Recognition (ASR)
  Input: Speech
  Output: Text (1-best or top alternatives)
  Example: “Play Two Steps Behind by Def Leppard”

Component: Natural Language Understanding (NLU)
  Input: Text
  Output: Intent Type and “Slots”
  Example: Intent: PlayMusicIntent; Slots: Artist = Def Leppard, Song = Two Steps Behind

Component: Skills – internal and external services
  Input: Intent & Slots
  Output: Text and/or Actions
  Example: Play <URL>; Say “Playing Two Steps Behind by Def Leppard”

Component: Text-to-Speech
  Input: Text
  Output: Speech
  Example: “Playing Two Steps Behind by Def Leppard”
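The component chain can be sketched in a few lines of Python. This is only an illustration of the data passed between stages, not how a production system is built: the function names, the Interpretation class and the "Play <song> by <artist>" rule are assumptions made for this example, and the ASR and TTS stages are stand-ins that do no audio processing.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Interpretation:
    """NLU output: an intent type plus its slots (hypothetical structure)."""
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


def asr(audio: bytes) -> str:
    # Stand-in for a recognizer: always returns the example 1-best text.
    return "Play Two Steps Behind by Def Leppard"


def nlu(text: str) -> Interpretation:
    # Toy rule for "Play <song> by <artist>"; real NLU uses statistical models.
    body = text[len("Play "):]
    song, artist = body.rsplit(" by ", 1)
    return Interpretation("PlayMusicIntent", {"Song": song, "Artist": artist})


def skill(interp: Interpretation) -> Tuple[str, str]:
    # Returns (action, text to say); "<URL>" is left as a placeholder.
    say = f"Playing {interp.slots['Song']} by {interp.slots['Artist']}"
    return "Play <URL>", say


def tts(text: str) -> bytes:
    # Stand-in for waveform generation: just encodes the text.
    return text.encode("utf-8")


def handle(audio: bytes):
    """Run one utterance through ASR, NLU, the skill and TTS."""
    text = asr(audio)
    interp = nlu(text)
    action, say = skill(interp)
    return interp, action, tts(say)


interp, action, speech = handle(b"")
print(interp.intent, interp.slots, action)
```

Each stage only sees the output of the previous one, which is why an ASR error propagates into NLU and the skill.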

How did we get here
• 1930s: Bell Labs vocoder work, VODER
• 1952: Bell Labs single-digit ASR
• 1950s: OVE and PAT (formant synthesis)
• 1960s: single-vowel/phoneme recognition
• 1960s: ASY (articulatory synthesis)
• 1969: Bell Labs de-funds ASR

(Modern) History of SLU

TEXT-TO-SPEECH (TTS)

Human-like TTS
• 1982: SAM, first software synthesis program
• 199x: First diphone synthesis
• 1990s: Unit Selection
• 2005: IVONA TTS
• 2000s: New HMM systems
• 2010+: DNN based TTS

TTS evolution (chart; axes: Naturalness, Controllability)

Systems shown:
• SPSS TTS (HMM-based)
• Unit Selection
• Formant TTS
• Diphone TTS
• Hybrid TTS (US guided)
• SPSS TTS (Deep Modeling-based)
• Unit Selection (unlimited units in the cloud?)
• WaveNet
• Hybrid TTS (blending)
• Articulatory TTS

TTS development
• Goal: Convert text into intelligible, accurate and natural speech
• Challenges
  – Homographs: words written identically that have different pronunciation
    • “I live in Seattle” vs. “Reporting live from Seattle”
  – Text normalization: disambiguation of abbreviations, acronyms, units
    • ‘m’ expanded as ‘minutes’ or ‘miles’ or ‘meters’ or even ‘medium’
  – Prosody requires understanding of semantics
  – Foreign words, proper names, slang etc.

TTS operation

Text → Text normalization → Grapheme-to-phoneme conversion → Waveform generation → Speech

Text: She has 20$ in her pocket.
After text normalization: she has twenty dollars in her pocket
After grapheme-to-phoneme conversion: ˈʃi ˈhæz ˈtwɛn.ti ˈdɑ.ɫəɹz ˈɪn ˈhɝɹ ˈpɑ.kət
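The first two steps of this example can be sketched as follows. The rule set is an assumption made for illustration: it only expands "<number>$" for amounts 0 to 99 and strips punctuation, and the grapheme-to-phoneme step is a plain lookup in a seven-word lexicon filled with the transcriptions from the example. Real front ends use much larger rule sets, lexicons and trained models.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Tiny illustrative lexicon; transcriptions taken from the example sentence.
LEXICON = {
    "she": "ˈʃi", "has": "ˈhæz", "twenty": "ˈtwɛn.ti",
    "dollars": "ˈdɑ.ɫəɹz", "in": "ˈɪn", "her": "ˈhɝɹ", "pocket": "ˈpɑ.kət",
}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text: str) -> str:
    """Expand '<number>$', lowercase, and drop punctuation."""
    def expand(match):
        n = int(match.group(1))
        unit = "dollar" if n == 1 else "dollars"
        return f"{number_to_words(n)} {unit}"

    text = re.sub(r"(\d+)\$", expand, text)
    text = re.sub(r"[^\w\s']", "", text.lower())
    return " ".join(text.split())


def g2p(normalized: str) -> str:
    """Look up each word in the lexicon (raises KeyError for unknown words)."""
    return " ".join(LEXICON[word] for word in normalized.split())


normalized = normalize("She has 20$ in her pocket.")
phonemes = g2p(normalized)
print(normalized)
print(phonemes)
```

A lookup-only G2P fails on any word missing from the lexicon, which is where the challenges listed earlier (foreign words, proper names, slang) come in.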

TTS Backend (diagram)

Input phoneme sequence: p|l|iy1|z| ae1|d| …
• Parameters prediction (HMM, DNN) → Speech Generation
• Speech Inventory → Unit Selection (Viterbi Search) → Speech Concatenation
• Hybrid TTS

Unit Selection – Viterbi search (diagram)

Target sequence: # → #-t → t-uw → uw-# → #
Candidate units:
• #-t1, #-t2, #-t3
• #-uw1, #-uw2, #-uw3, #-uw4
• uw-#1, uw-#2, uw-#3
Costs: Target cost, Concatenation cost
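A minimal sketch of the search, assuming each candidate unit is described by a single pitch value: the target cost is the distance to the desired pitch at that position, and the concatenation cost is the pitch jump between neighbouring units. The unit names, pitch values and both cost functions are invented for this example; real systems combine many acoustic and linguistic features.

```python
def select_units(candidates, target_cost, concat_cost):
    """Viterbi search: choose one candidate per position with minimum total cost."""
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j], backpointer)
    best = [[(target_cost(0, unit), -1) for unit in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, unit), back))
        best.append(row)

    # Start from the cheapest final unit and follow the backpointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Hypothetical data: (unit name, pitch in Hz) per target diphone.
targets = [100, 110, 105]  # desired pitch for #-t, t-uw, uw-#
candidates = [
    [("#-t1", 90), ("#-t2", 102), ("#-t3", 120)],
    [("t-uw1", 111), ("t-uw2", 95), ("t-uw3", 130), ("t-uw4", 108)],
    [("uw-#1", 100), ("uw-#2", 107), ("uw-#3", 140)],
]

path, total = select_units(
    candidates,
    target_cost=lambda i, unit: abs(unit[1] - targets[i]),
    concat_cost=lambda a, b: abs(a[1] - b[1]),
)
print([name for name, _ in path], total)  # ['#-t2', 't-uw4', 'uw-#2'] 13
```

Note that t-uw1 has the lowest target cost at its position, yet t-uw4 is chosen because it joins more smoothly with its neighbours. That trade-off is the reason for searching over whole paths instead of picking the best unit per position.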

• an error occurred while searching for your route
• because snaps weren't all so obedient anymore,
• now we say apple again. and we say apple,
• general electric soars today. information on general electric
• quick breads, zucchini, holiday, crockpot, cake,
• so are you still keeping tabs on your old team,
• that weighs more than four tons, disrupts the…

An apple a day, keeps …

Building a TTS system
• Text normalization, handling non-words: rules
• TTS language model: lexicons
• Text analysis, POS-tagging, prosody: NLP
• Dealing with ambiguous inputs
• SSML processing, PLS lexicons
• Voices!

AUTOMATIC SPEECH RECOGNITION (ASR)

Adaptive ASR
• 1986/92: Sphinx / Sphinx II – HMM + n-grams
• 1990s: commercial ASR systems (e.g. Dragon)
• 2000s: HMM + neural net
• 2010s: HMM + DNN/LSTM net

ASR development
• Goal: Convert spoken audio into text
• Challenges
  – Noisy environment, e.g. in far-field recognition we have room reverberation, ambient noise, background speech
  – Large vocabulary, high perplexity domains, e.g. music
  – Difficult to predict spoken forms for catalog entries and their associated pronunciations, e.g. artist names such as U2, P!nk
  – Acoustically confusable strings (“open the pod bay doors” / “open the pot bait oars”)

ASR operation

Speech → Spectrum analysis → Phoneme sequence → De-normalization → Text

Phoneme sequence: ˈʃi ˈhæz ˈtwɛn.ti ˈdɑ.ɫəɹz ˈɪn ˈhɝɹ ˈpɑ.kət
Recognized words: she has twenty dollars in her pocket
After de-normalization: She has 20$ in her pocket.

Where’s my Kindle?

Feature vectors:
25 17 6 24 … 41
31 14 11 15 … 38
32 11 13 14 … 26
21 15 14 8 … 19
Etc.

Phonemes: W EH R Z M AY …

Lexicon:
where      W EH R
where’s    W EH R Z
werewolf   W EH R W UH L F
Aardvark   AX R D V AX R K
Kindle     K IH N D UL
my         M AY

Word hypotheses: where, where’s, is, my, Mike, in, dull, kin, Kindle
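The lexicon lookup in this example can be sketched as a search for every way to cover a phoneme sequence with lexicon words. This is a simplified illustration: it assumes an error-free phoneme sequence, the phoneme boundaries inside each entry (e.g. K IH N D UL) are a guess, and the function name is made up. A real decoder works on scored, uncertain phoneme hypotheses instead of one clean sequence.

```python
from functools import lru_cache

# Pronunciation lexicon from the example (phoneme boundaries are assumed).
LEXICON = {
    "where": ("W", "EH", "R"),
    "where's": ("W", "EH", "R", "Z"),
    "werewolf": ("W", "EH", "R", "W", "UH", "L", "F"),
    "Aardvark": ("AX", "R", "D", "V", "AX", "R", "K"),
    "Kindle": ("K", "IH", "N", "D", "UL"),
    "my": ("M", "AY"),
}


def segmentations(phonemes):
    """Return every split of the phoneme sequence into lexicon words."""
    phonemes = tuple(phonemes)

    @lru_cache(maxsize=None)
    def from_pos(i):
        if i == len(phonemes):
            return [[]]  # one way to cover nothing: the empty word list
        results = []
        for word, pron in LEXICON.items():
            if phonemes[i:i + len(pron)] == pron:
                results.extend([word] + rest for rest in from_pos(i + len(pron)))
        return results

    return from_pos(0)


words = segmentations("W EH R Z M AY K IH N D UL".split())
print(words)  # [["where's", 'my', 'Kindle']]
```

The shorter entry "where" also matches at the start, but it leaves a "Z" that no word can begin with, so only one full segmentation survives.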

Statistical Conversational ASR
• Use acoustic and linguistic ML models
• Input is audio, potentially from many microphones
• Initial source-specific models/processing
• Intermediate output is a sequence of potential phonemes / diphones / triphones
• Final output is a set of potential text transcriptions
• Requires lots of memory
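One common way to combine the two kinds of model is to add their log-probabilities for each candidate transcription and rank by the sum. The sketch below assumes that scheme; the scores are invented numbers chosen so that the two confusable strings from the ASR challenges sound almost the same to the acoustic model, and the field names and lm_weight parameter are hypothetical.

```python
def rescore(hypotheses, lm_weight=1.0):
    """Rank hypotheses by acoustic log-prob plus weighted language-model log-prob."""
    return sorted(
        hypotheses,
        key=lambda h: h["acoustic_logp"] + lm_weight * h["lm_logp"],
        reverse=True,
    )


# Made-up scores: acoustically close, but very different under the language model.
hypotheses = [
    {"text": "open the pot bait oars", "acoustic_logp": -10.0, "lm_logp": -25.0},
    {"text": "open the pod bay doors", "acoustic_logp": -10.5, "lm_logp": -12.0},
]

best = rescore(hypotheses)[0]
acoustic_only = rescore(hypotheses, lm_weight=0.0)[0]
print(best["text"])           # open the pod bay doors
print(acoustic_only["text"])  # open the pot bait oars
```

With the language model switched off, the slightly better acoustic match wins. With it on, the more plausible word sequence is ranked first.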

Wake Word Engine
• Low-power, continuously listening device
• A trigger word (‘Alexa’)
• Lack of context – prone to noise
• Need to run on local device
• Needs highly optimized code & real-time processing

Building an ASR engine
• Statistical ASR models – build & combine
• Personalization
• Constrained vs free-form input (dictation)
• Text de-normalization, handling entity names
• Domain-specific recognition
• Identify errors, use as feedback

NATURAL LANGUAGE UNDERSTANDING (NLU)

Understanding the language
• 1950: Turing test defined
• 1964-72: STUDENT, ELIZA, PARRY
• 1976: Colossal Cave, Zork – interactive fiction
• 1990s: Statistical ML models
• 2006: Watson (first bot to win Jeopardy!, in 2011)
• 2011: Siri
• 2014: Alexa, Cortana

NLU Development
• Goal: understand the spoken intent and associated entities
• Challenges
  – Semantic representation for language
  – Cross-domain intent recognition
    • e.g. “Play remind me” vs. “Remind me to go to the play”
  – Robustness to ASR errors and ambiguity
    • “Play Rolling Stone” (Bob Dylan) vs. “Play Rolling Stones”
  – User correction in context, “No, the rolling stones”
  – Need to get top choice correct since there is no display

NLP language parsing

“Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas”

Approach for NLU (diagram)

Input: Text
• Finite Pattern Matching
• Named Entity Recognition (NER), using NER Models
• Intent Classification (IC), using IC Models
• Entity Resolution, using Catalogs
• Ranking Models
Output: Interpretations
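A toy version of the pattern-matching and entity-resolution parts might look like the sketch below. Everything in it is an assumption for illustration: the two patterns, the intent names other than PlayMusicIntent, the three-entry artist catalog, and the use of string similarity (Python's difflib) in place of trained NER, IC and ranking models.

```python
import difflib
import re

# Finite patterns: (compiled regex with named slots, intent name).
PATTERNS = [
    (re.compile(r"^play (?P<Query>.+)$", re.IGNORECASE), "PlayMusicIntent"),
    (re.compile(r"^remind me to (?P<Task>.+)$", re.IGNORECASE), "SetReminderIntent"),
]

# Hypothetical catalog used for entity resolution.
ARTIST_CATALOG = ["The Rolling Stones", "Def Leppard", "Bob Dylan"]


def resolve(mention, catalog=ARTIST_CATALOG, cutoff=0.6):
    """Map a spoken mention to the closest catalog entry, or None."""
    lowered = {name.lower(): name for name in catalog}
    matches = difflib.get_close_matches(
        mention.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


def understand(text):
    """Return {'intent': ..., 'slots': ...} for the first matching pattern."""
    for pattern, intent in PATTERNS:
        match = pattern.match(text.strip())
        if match:
            slots = match.groupdict()
            if intent == "PlayMusicIntent":
                slots["Artist"] = resolve(slots["Query"])
            return {"intent": intent, "slots": slots}
    return None


print(understand("Play Rolling Stones"))
print(understand("Play remind me"))
print(understand("Remind me to go to the play"))
```

Anchoring the patterns at the start of the utterance is what separates “Play remind me” from “Remind me to go to the play” here. String similarity alone also resolves “Rolling Stone” to the same catalog entry as “Rolling Stones”, which hints at why the diagram includes ranking models on top of resolution.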

Personal Assistant components
• Language model
• Skills/Intents catalogue
• Entity catalogs & ontology
• Knowledge database(s)
• External data sources integration
• Personalization

…and some others
• NLG – Natural Language Generation
• NLP – Natural Language Processing
• Data Mining – building the knowledge base
• Compression techniques
• Audio processing and media streaming
• Distributed systems
• …

AI. The final frontier?
• Deep learning everywhere
  – Wake word
  – Speech Recognition
  – Language Understanding
  – Text-to-Speech
• Requires a lot of Data to train deep neural networks (DNNs) and other ML models


THANK YOU!

How to start?

Courses:
https://www.coursera.org/learn/machine-learning
https://www.coursera.org/learn/nlp

Tools:
http://cmusphinx.sourceforge.net/
http://festvox.org/
http://kaldi-asr.org/
http://mallet.cs.umass.edu/
https://deeplearning4j.org/
https://www.tensorflow.org/

http://gdansk-amazon.com
