Building a State-of-the-Art ASR System with Kaldi
![Page 1: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/1.jpg)
Building Speech Recognition Systems with the Kaldi Toolkit
Sanjeev Khudanpur, Dan Povey and Jan Trmal
Johns Hopkins University
Center for Language and Speech Processing
June 13, 2016
![Page 2: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/2.jpg)
In the beginning, there was nothing
• Then Kaldi was born in Baltimore, MD, in 2009.
![Page 3: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/3.jpg)
Kaldi then grew up & became …
[Figure: Postings to Discussion List, plotted from Jan-12 to May-14 on a scale of 0 to 250]
60+ Contributors
Icon from http://thumbs.gograph.com
![Page 4: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/4.jpg)
Meanwhile, Speech Search went from “Solved” to “Unsolved” … Again
• NIST TREC SDR (1998)
– Spoken “document” retrieval from STT output as good as retrieval from reference transcripts
– Speech search was declared a solved problem!
• NIST STD Pilot (2006)
– STT was found to be inadequate for spoken “term” detection in conversational telephone speech
• Limited language diversity in CTS corpora
– English Switchboard, CallHome and Fisher
– Arabic and Mandarin Chinese CallHome
![Page 5: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/5.jpg)
In 2012, IARPA launched BABEL
One month after Dan Povey returned to Kaldi’s birthplace
• Automatic transcription of conversational telephone speech was still the core challenge.
• But with a few subtle, crucial changes
– Focused attention on low-resource conditions
– Required concurrent progress in multiple languages
  • PY1: Cantonese, Tagalog, Pashto, Turkish and Vietnamese
  • PY2: Assamese, Bengali, Haitian Creole, Lao, Zulu and Tamil
– Reduced system development time from year to year
– Used keyword search metrics to measure progress
![Page 6: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/6.jpg)
Kaldi Today
A Community of Researchers Cooperatively Advancing STT
• C++ library, command-line tools, STT “recipes”
– Freely available via GitHub (Apache 2.0 license)
• Top STT performance in open benchmark tests
– E.g. NIST OpenKWS (2014) and IARPA ASpIRE (2015)
• Widely adopted in academia and industry
– 300+ citations in 2014 (based on Google Scholar data)
– 400+ citations in 2015 (based on Google Scholar data)
– Used by several US and non-US companies
• Main “trunk” maintained by Johns Hopkins
– Forks contain specializations by JHU and others
![Page 7: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/7.jpg)
Co-PI’s, PhD Students and Sponsors
• Sanjeev Khudanpur
• Daniel Povey
• Jan Trmal
• Guoguo Chen
• Pegah Ghahremani
• Vimal Manohar
• Vijayaditya Peddinti
• Hainan Xu
• Xiaohui Zhang
• and several others
![Page 8: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/8.jpg)
Building an STT System with Kaldi
• Data preparation
– Acoustic model training data
– Pronunciation lexicon
– Language model training data
• Basic GMM system building
– Acoustic model training
– Language model training
• Basic Decoding
– Creating a static decoding graph
– Lattice rescoring
• Basic DNN system building
• Going beyond the basics
![Page 9: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/9.jpg)
Setting up Paths, Queue Commands, …
![Page 10: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/10.jpg)
![Page 11: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/11.jpg)
Preparing Acoustic Training Data
![Page 12: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/12.jpg)
data/train/text
![Page 13: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/13.jpg)
data/train/wav.scp
![Page 14: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/14.jpg)
data/train/(utt2spk|spk2utt)
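As a rough illustration of how the speaker map files relate to each other, the sketch below inverts a utterance-to-speaker map into a speaker-to-utterance map. It assumes the usual Kaldi data-directory conventions (one `<utt-id> <spk-id>` pair per line, one `<spk-id> <utt-id> ...` list per line, entries sorted); the IDs and the helper name `make_spk2utt` are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def make_spk2utt(utt2spk_lines):
    """Invert "<utt-id> <spk-id>" lines into "<spk-id> <utt-id> ..." lines."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        utt_id, spk_id = line.split()
        spk2utt[spk_id].append(utt_id)
    # Sort speakers and utterances so the output order is deterministic.
    return [f"{spk} {' '.join(sorted(utts))}" for spk, utts in sorted(spk2utt.items())]


# Hypothetical utterance IDs; prefixing them with the speaker ID keeps both
# files in a consistent sort order.
utt2spk = ["spk1-utt1 spk1", "spk1-utt2 spk1", "spk2-utt1 spk2"]
for line in make_spk2utt(utt2spk):
    print(line)
```

In a real recipe this inversion is normally done by a script shipped with the toolkit; the function above only shows the relationship between the two files.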
![Page 15: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/15.jpg)
data/train/(cmvn.scp|feats.scp)
![Page 16: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/16.jpg)
![Page 17: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/17.jpg)
Preparing the Pronunciation Lexicon
![Page 18: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/18.jpg)
data/local/dict/lexicon.txt
![Page 19: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/19.jpg)
data/local/dict/*silence*.txt
![Page 20: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/20.jpg)
data/local/lang
![Page 21: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/21.jpg)
Word Boundary Tags
![Page 22: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/22.jpg)
Disambiguation Symbols
![Page 23: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/23.jpg)
data/lang
![Page 24: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/24.jpg)
data/lang/(phones|words).txt
![Page 25: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/25.jpg)
data/lang/topo
![Page 26: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/26.jpg)
data/lang/phones/roots.txt
![Page 27: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/27.jpg)
data/lang/phones/extra_questions.txt
![Page 28: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/28.jpg)
![Page 29: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/29.jpg)
Preparing the Language Model
![Page 30: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/30.jpg)
local/train_lms_srilm.sh
![Page 31: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/31.jpg)
local/train_lms_srilm.sh (cont’d)
![Page 32: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/32.jpg)
Interpolated Language Models
![Page 33: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/33.jpg)
local/arpa2G.sh
![Page 34: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/34.jpg)
![Page 35: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/35.jpg)
GMM Training (1)
![Page 36: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/36.jpg)
GMM Training (2)
![Page 37: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/37.jpg)
cluster-phones, compile-questions, build-tree
![Page 38: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/38.jpg)
GMM Training (4)
![Page 39: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/39.jpg)
GMM Training (5)
![Page 40: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/40.jpg)
![Page 41: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/41.jpg)
Building HCLG (1)
![Page 42: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/42.jpg)
Building HCLG (2)
![Page 43: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/43.jpg)
Building HCLG (3)
![Page 44: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/44.jpg)
Building HCLG (4)
![Page 45: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/45.jpg)
Decoding and Lattice Rescoring
![Page 46: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/46.jpg)
steps/decode_sgmm2.sh
![Page 47: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/47.jpg)
![Page 48: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/48.jpg)
steps/lmrescore_const_arpa.sh
![Page 49: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/49.jpg)
![Page 50: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/50.jpg)
local/nnet3/run_ivector_common.sh
![Page 51: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/51.jpg)
steps/nnet3/tdnn/make_configs.py
![Page 52: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/52.jpg)
steps/nnet3/train_dnn.py
![Page 53: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/53.jpg)
![Page 54: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/54.jpg)
Advanced Methods: Staying Ahead in the STT Game
• STT technology is advancing very rapidly
– Amazon, Apple, Baidu, Facebook, Google, Microsoft
• Kaldi leads and keeps up with major innovations
– From SGMMs to DNN (2012)
– From “English” to low-resource languages (2013)
– From CPUs to GPUs (2014)
– From close-talking to far-field microphones (2015)
– From well-curated to “wild type” corpora (2016)
• A preview of some upcoming developments
![Page 55: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/55.jpg)
![Page 56: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/56.jpg)
Deep Neural Networks for STT
![Page 57: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/57.jpg)
DNN Acoustic Models for the Masses
• Nontrivial to get the DNN models to work well
– Design decisions: # layers, # nodes, # outputs, type of nonlinearity, training criterion
– Training art: learning rates, regularization, update stability (max change), data randomization, # epochs
– Computational art: matrix libraries, memory mgmt
• Kaldi recipes provide a robust starting point

| Corpus | Training Speech | SGMM WER | DNN WER |
| --- | --- | --- | --- |
| BABEL Pashto | 10 hours | 69.2% | 67.6% |
| BABEL Pashto | 80 hours | 50.2% | 42.3% |
| Fisher English | 2000 hours | 15.4% | 10.3% |
![Page 58: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/58.jpg)
![Page 59: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/59.jpg)
Low-Resource STT for the Masses
• Kaldi provides language-independent recipes
– Typical BABEL Full LP condition
  • 80 hours of transcribed speech, 800K words of LM text, 20K word pronunciation lexicon
– Typical BABEL Limited LP condition
  • 10 hours of transcribed speech, 100K words of LM text, 6K word pronunciation lexicon

| Language | Speech | CER/WER | ATWV |
| --- | --- | --- | --- |
| Cantonese | 80h | 48.5% | 0.47 |
| Cantonese | 10h | 61.2% | 0.26 |
| Tagalog | 80h | 46.3% | 0.56 |
| Tagalog | 10h | 61.9% | 0.28 |
| Pashto | 80h | 50.7% | 0.46 |
| Pashto | 10h | 63.0% | 0.25 |
| Turkish | 80h | 51.3% | 0.52 |
| Turkish | 10h | 65.3% | 0.25 |
![Page 60: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/60.jpg)
![Page 61: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/61.jpg)
Parallel (GPU-based) Training
• Original neural network training algorithms were inherently sequential (e.g. SGD)
• Scaling up to “big data” becomes a challenge
• Several solutions have emerged recently
– 2009: Delayed SGD (Yahoo!)
– 2011: Lock-free SGD (Hogwild!, U Wisconsin)
– 2012: Gradient averaging (DistBelief, Google)
– 2014: Model averaging (NG-SGD, Kaldi)
![Page 62: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/62.jpg)
![Page 63: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/63.jpg)
Model Averaging with NG-SGD
• Train DNNs with large amount of data
– Utilize a cluster of CPUs or GPUs
– Minimize network traffic (esp. for CPUs)
• Solution: exploit data parallelization
– Update model in parallel over many mini-batches
– Infrequently average models (parameters)
• Use “Natural-Gradient” SGD for model updating
– Approximates conditioning via inverse Fisher matrix
– Improves convergence even without parallelization
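The data-parallel scheme on this slide can be sketched on a toy problem. The example below is only an illustration under stated assumptions: it fits a single weight with plain SGD (not the natural-gradient update used in Kaldi), runs the "workers" sequentially rather than on separate machines, and the names `sgd_steps`, `average_models` and `train` are hypothetical.

```python
def sgd_steps(w, shard, lr=0.1):
    """Run plain SGD over one worker's shard for the loss 0.5 * (w * x - y) ** 2."""
    for x, y in shard:
        grad = (w * x - y) * x
        w -= lr * grad
    return w


def average_models(weights):
    """Average the parameters produced by the workers."""
    return sum(weights) / len(weights)


def train(data_shards, outer_iters=20, w0=0.0):
    """Alternate local training on each shard with infrequent model averaging."""
    w = w0
    for _ in range(outer_iters):
        # Every worker starts from the same averaged model and sees only its shard.
        local_models = [sgd_steps(w, shard) for shard in data_shards]
        w = average_models(local_models)
    return w


# Toy data generated from y = 2 * x, split across two hypothetical workers.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (0.5, 1.0)]]
print(train(shards))
```

Only the averaged parameters cross worker boundaries, once per outer iteration, which is what keeps network traffic low in this style of training.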
![Page 64: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/64.jpg)
Parallelization Matters!
• Typically, a GPU is 10x faster than a 16 core CPU
• Linear speed-up till ca 4 GPUs, then diminishing
![Page 65: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/65.jpg)
![Page 66: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/66.jpg)
IARPA’s Open Challenge
• Automatic speech recognition software that works in a variety of acoustic environments and recording scenarios is a holy grail of the speech research community.
• IARPA’s Automatic Speech recognition In Reverberant Environments (ASpIRE) Challenge is seeking that grail.
![Page 67: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/67.jpg)
Rules of the ASpIRE Challenge
• 15 hours of speech data were posted on the IARPA website
– Multi-microphone recordings of conversational English
– 5h development set (dev), 10h development-test set (dev-test)
– Transcriptions provided for dev, only scoring for dev-test output
– For training data selection, system development and tuning
• 12 hours of new speech data during the evaluation period
– Far-field speech (eval) from noisy, reverberant rooms
– Single-microphone or multi-microphone conditions
• Word error rate is the measure of performance
– Single-microphone submissions were due on 02/18/2015
– Results were officially announced on 09/10/2015
![Page 68: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/68.jpg)
Examples of ASpIRE Audio
• Typical sample
– Suggested by Dr. Mary Harper
• Almost manageable
– Easy for humans, 26% errors for ASR
• Somewhat hard
– Easy for humans, 41% errors for ASR
• Much harder
– Not easy for humans, 60% errors for ASR
• *#@!!#%#%^^
– Very hard for humans, no ASR output
![Page 69: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/69.jpg)
Kaldi ASR Improvements for ASpIRE
• Time delay neural networks (TDNN)
– A way to deal with long acoustic-phonetic context
– A structured alternative to deep/recurrent neural nets
• Data augmentation with simulated reverberations
– A way to mitigate channel distortions not seen in training
– A form of multi-condition training of ASR models
• i-vector based speaker & environment adaptation
– A way to deal with speaker & channel variability
– Adapted [with a twist] from Speaker ID systems
![Page 70: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/70.jpg)
Kaldi ASR Improvements, ASpIRE++
• Pronunciation and inter-word silence modeling
– Inspired by pronunciation-prosody interactions
– A simple context-dependent model of inter-word silence
• Recurrent neural network language models (RNNLM)
– A (known) way to model long-range word dependencies
– Incorporated post-submission into JHU ASpIRE system
• Ongoing Kaldi investigations that hold promise
– Semi-supervised discriminative training of (T)DNNs
– Long short-term memory (LSTM) acoustic models
– Connectionist temporal classification (CTC) models
![Page 71: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/71.jpg)
Time Delay Neural Networks
(See our paper at INTERSPEECH 2015 for details)
![Page 72: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/72.jpg)
A 28 Year Old Idea, Resurrected
Alex Waibel, Kevin Lang, et al (1987)
Our TDNN Architecture (2015)
[Figure: TDNN with sub-sampled splicing. Layer 1 uses input context -2 to +2; Layer 2 splices offsets -1 and +2; Layer 3 splices offsets -3 and +3; Layer 4 splices offsets -7 and +2. The overall input context runs from t-13 to t+9.]
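The overall context of t-13 to t+9 shown in the figure follows from adding up the per-layer splicing offsets (Layer 1: -2 to +2, Layer 2: -1 and +2, Layer 3: -3 and +3, Layer 4: -7 and +2). A minimal sketch of that arithmetic, with the hypothetical helper name `total_context`:

```python
def total_context(layer_offsets):
    """Sum the leftmost and rightmost splicing offsets over all layers."""
    left = sum(-min(offsets) for offsets in layer_offsets)
    right = sum(max(offsets) for offsets in layer_offsets)
    return left, right


# Extreme offsets per layer, as read from the figure.
layers = [(-2, 2), (-1, 2), (-3, 3), (-7, 2)]
left, right = total_context(layers)
print(f"input context: t-{left} to t+{right}")
```

Only the extreme offsets of each layer matter for the width of the context; how many frames are spliced in between affects the amount of computation, not the span.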
![Page 73: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/73.jpg)
Improved ASR on Several Data Sets
• Consistent 5-10% reduction in word error rate (WER) over DNNs on most data sets, including conversational speech.
• TDNN training speeds are on par with DNN, and nearly an order of magnitude faster than RNN

| Standard ASR Test Sets | Size | DNN | TDNN | Rel. Δ |
| --- | --- | --- | --- | --- |
| Wall Street Journal | 80 hrs | 6.6% | 6.2% | 5% |
| TED-LIUM | 118 hrs | 19.3% | 17.9% | 7% |
| Switchboard | 300 hrs | 15.5% | 14.0% | 10% |
| LibriSpeech | 960 hrs | 5.2% | 4.8% | 7% |
| Fisher English | 1800 hrs | 22.2% | 21.0% | 5% |
| ASpIRE (Fisher Training) | 1800 hrs | 47.7% | 47.6% | |
![Page 74: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/74.jpg)
Data Augmentation for ASR Training
(See our paper at INTERSPEECH 2015 for details)
![Page 75: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/75.jpg)
Simulating Reverberant Speech for Multi-condition (T)DNN Training
• Simulate ca 5500 hours of reverberant, noisy data from 1800 hours of the Fisher English CTS corpus
– Replicate each of the ca 21,000 conversation sides 3 times
– Randomly change the sampling rate [up to ±10%]
– Convolve each conversation side with one of 320 real-life room impulse responses (RIR) chosen at random
– Add noise to the signal (when available with the RIR)
• Generate (T)DNN training labels from clean speech
– Align “pre-reverb” speech to ca 7500 CD-HMM states
• Train DNN and TDNN acoustic models
– Cross-entropy training followed by sequence training
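The convolve-and-add-noise step can be sketched in a few lines. This is a toy illustration, not the recipe's implementation: it uses pure-Python lists instead of real audio, omits the sampling-rate perturbation, and the names `convolve`, `reverberate`, `augment` and the `noise_gain` parameter are assumptions made for the example.

```python
import random


def convolve(signal, rir):
    """Direct-form convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def reverberate(signal, rir, noise=None, noise_gain=1.0):
    """Convolve with the RIR, keep the original length, optionally add noise."""
    wet = convolve(signal, rir)[: len(signal)]
    if noise:
        # Loop the noise recording if it is shorter than the signal.
        wet = [w + noise_gain * noise[k % len(noise)] for k, w in enumerate(wet)]
    return wet


def augment(signal, rir_bank, rng):
    """Pick one (rir, noise_or_None) pair at random and apply it."""
    rir, noise = rng.choice(rir_bank)
    return reverberate(signal, rir, noise)


# Hypothetical three-sample "utterance" and a bank with two impulse responses.
bank = [([1.0, 0.5], None), ([1.0, 0.3, 0.1], [0.01, -0.01])]
print(augment([1.0, 0.0, 0.0], bank, random.Random(0)))
```

Because the output is truncated to the input length, frame-level labels obtained by aligning the clean speech can be reused for the reverberated copy.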
![Page 76: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/76.jpg)
Result of Data Augmentation

| Acoustic Model | Data Augmentation | Dev WER |
| --- | --- | --- |
| TDNN A (230 ms) | None (1800h, clean speech) | 47.6% |
| TDNN A (230 ms) | + 3x (reverberation + noise) | 31.7% |
| TDNN B (290 ms) | + 3x (reverberation + noise) | 30.8% |
| TDNN A (230 ms) | + sampling rate perturbation | 31.0% |
| TDNN B (290 ms) | + sampling rate perturbation | 31.1% |

• Data augmentation with simulated reverberation is beneficial
– Likely to be a very important reason for relatively good performance
• Sampling rate perturbation didn’t help much on ASpIRE data
• Sequence training helped reduce WER on the dev set
– Required modifying the sMBR training criterion to realize gains
– But the gains did not carry over to dev-test set
![Page 77: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/77.jpg)
i-vectors for Speaker Compensation
(See our paper at INTERSPEECH 2015 for details)
![Page 78: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/78.jpg)
Using i-vectors Instead of fMLLR
and using unnormalized MFCCs to compute i-vectors
• 100-dim i-vectors are appended to MFCC inputs of the TDNN
– i-vectors are computed from raw MFCCs (i.e. no mean subtraction etc)
– UBM posteriors however use MFCCs normalized over a 6 sec window
• i-vectors are computed for each training utterance
– Increases speaker- and channel variability seen in training data
– May model transient distortions? e.g. moving speakers, passing cars
• i-vectors are calculated for every ca 60 sec of test audio
– UBM prior is weighted 10:1 to prevent overcompensation
– Weight of test statistics is capped at 75:1 relative to UBM statistics

| Speaker Compensation Method | Dev WER |
| --- | --- |
| TDNN without i-vectors | 34.8% |
| + i-vectors (from all frames) | 33.8% |
| + i-vectors (from reliable speech frames) | 30.8% |
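Appending an i-vector to the MFCC input amounts to concatenating the same vector onto every frame of the stretch of audio it was estimated from. The sketch below shows only that concatenation, assuming the i-vectors have already been estimated (one per chunk of test audio); the function name `append_ivectors` and the parameter `frames_per_chunk` are hypothetical.

```python
def append_ivectors(mfcc_frames, chunk_ivectors, frames_per_chunk):
    """Append to every MFCC frame the i-vector of the chunk the frame falls in."""
    spliced = []
    for idx, frame in enumerate(mfcc_frames):
        # Frames past the last full chunk reuse the final i-vector.
        chunk = min(idx // frames_per_chunk, len(chunk_ivectors) - 1)
        spliced.append(list(frame) + list(chunk_ivectors[chunk]))
    return spliced


# Toy dimensions: 2-dim "MFCCs" and 1-dim "i-vectors" instead of 100-dim ones.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ivectors = [[9.0], [8.0]]
print(append_ivectors(frames, ivectors, frames_per_chunk=2))
```

With a 10 ms frame shift (an assumption, not stated on the slide), a chunk of ca 60 sec would correspond to `frames_per_chunk` of about 6000.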
![Page 79: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/79.jpg)
Pronunciation and Silence Probabilities
(See our paper at INTERSPEECH 2015 for details)
![Page 80: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/80.jpg)
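Trigram-like Inter-word Silence Model

$$P(s \mid a\_b) = P(s \mid a\_)\, F(s \mid \_b)$$

$$F(s \mid \_b) = \frac{c(s\,b) + \lambda_3}{\sum_{\tilde{a}} c(\tilde{a} * b)\, P(s \mid \tilde{a}\_) + \lambda_3}$$

$$P(s \mid a\_) = \frac{c(a\,s) + \lambda_2 P(s)}{c(a) + \lambda_2}$$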
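The three formulas can be evaluated directly from counts. The sketch below is a hedged illustration: the count dictionaries, the smoothing values `lam2` and `lam3`, and all function names are assumptions for the example, and the counts are taken to come from forced alignments of the training data.

```python
def p_sil_after(a, c_word, c_word_sil, p_sil, lam2=2.0):
    """P(s | a_): smoothed probability that silence follows word a."""
    return (c_word_sil.get(a, 0) + lam2 * p_sil) / (c_word.get(a, 0) + lam2)


def f_sil_before(b, c_sil_word, c_pairs, c_word, c_word_sil, p_sil, lam2=2.0, lam3=2.0):
    """F(s | _b): observed silence-before-b count over the count expected from P(s | a_)."""
    expected = sum(
        count * p_sil_after(a, c_word, c_word_sil, p_sil, lam2)
        for (a, nxt), count in c_pairs.items()
        if nxt == b
    )
    return (c_sil_word.get(b, 0) + lam3) / (expected + lam3)


def p_sil_between(a, b, c_sil_word, c_pairs, c_word, c_word_sil, p_sil, lam2=2.0, lam3=2.0):
    """P(s | a_b) = P(s | a_) * F(s | _b)."""
    return p_sil_after(a, c_word, c_word_sil, p_sil, lam2) * f_sil_before(
        b, c_sil_word, c_pairs, c_word, c_word_sil, p_sil, lam2, lam3
    )


# Hypothetical counts: word "a" seen 10 times, 4 of them followed by silence;
# "a" followed by "b" 6 times (with or without silence); silence before "b" 3 times.
c_word, c_word_sil = {"a": 10}, {"a": 4}
c_pairs, c_sil_word = {("a", "b"): 6}, {"b": 3}
print(p_sil_between("a", "b", c_sil_word, c_pairs, c_word, c_word_sil, p_sil=0.5))
```

Here `c_pairs[(a, b)]` plays the role of $c(\tilde{a} * b)$, the count of $b$ following $\tilde{a}$ regardless of whether silence intervened.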
![Page 81: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/81.jpg)
Is “Prosody” Finally Helping STT?

| Task | Test Set | Baseline | + Sil/Pron |
| --- | --- | --- | --- |
| WSJ | Eval 92 | 4.1 | 3.9 |
| Switchboard | Eval 2000 | 20.5 | 20.0 |
| TED-LIUM | Test | 18.1 | 17.9 |
| LibriSpeech | Test Clean | 6.6 | 6.6 |
| LibriSpeech | Test Other | 22.9 | 22.5 |

• Modeling pronunciation and silence probabilities yields modest but consistent improvement on many large vocabulary ASR tasks

| Pronunciation/Silence Probabilities | Dev WER |
| --- | --- |
| No probabilities in the lexicon | 32.1% |
| + pronunciation probabilities | 31.6% |
| + inter-word silence probabilities | 30.8% |
![Page 82: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/82.jpg)
Recurrent Neural Network based Language Models
(See our paper at INTERSPEECH 2010 for the first “convincing” results)
![Page 83: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/83.jpg)
RNNLM on ASpIRE Data

| Language Model and Rescoring Method | Dev WER |
| --- | --- |
| 4-gram LM and lattice rescoring | 30.8% |
| RNN-LM and 100-best rescoring | 30.2% |
| RNN-LM and 1000-best rescoring | 29.9% |
| RNN-LM (4-gram approximation) lattice rescoring | 29.9% |
| RNN-LM (6-gram approximation) lattice rescoring | 29.8% |

• An RNNLM consistently outperforms the N-gram LM
• The Kaldi lattice rescoring appears to cause no loss in performance
– Approximation entails not “expanding” the lattice to represent each unique history separately
– When two paths merge in an N-gram lattice, only one s(t) is chosen at random and propagated forward
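For comparison with lattice rescoring, N-best rescoring is simple to sketch: each hypothesis keeps its acoustic score and receives a new language model score, and the list is re-ranked. The code below is a toy illustration; `toy_lm` is a hypothetical stand-in for an RNNLM (it is just a unigram table), and `lm_weight` is an assumed scaling parameter.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=1.0):
    """Re-rank (words, acoustic_logprob) hypotheses with a new LM score, best first."""
    scored = [(acoustic + lm_weight * lm_logprob(words), words) for words, acoustic in nbest]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def toy_lm(words):
    """Hypothetical stand-in for an RNNLM: sums unigram log-probabilities."""
    table = {"the": -1.0, "cat": -2.0, "sat": -2.0, "hat": -4.0}
    return sum(table.get(word, -10.0) for word in words)


nbest = [(["the", "hat", "sat"], -10.0), (["the", "cat", "sat"], -10.5)]
best_score, best_words = rescore_nbest(nbest, toy_lm)[0]
print(best_score, " ".join(best_words))
```

An N-best list scores every hypothesis with its full word history, which is why larger lists approach what the approximate lattice rescoring achieves in the table.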
![Page 84: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/84.jpg)
The IARPA ASpIRE Leader Board

| Rank | Participant | Dev WER | System Type |
| --- | --- | --- | --- |
| 1 | tsakilidis | 27.2% | Combination |
| 2 | rhsiao | 27.5% | Combination |
| 3 | vijaypeddinti | 27.7% | Single System |
![Page 85: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/85.jpg)
http://www.dni.gov/index.php/newsroom/press-releases/210-press-releases-2015/1252-iarpa-announces-winners-of-its-aspire-challenge
![Page 86: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/86.jpg)
![Page 87: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/87.jpg)
Performance on Evaluation Data

| Acoustic Model | Language Model | Dev WER | Test WER | Eval WER |
| --- | --- | --- | --- | --- |
| TDNN B (CE training) | 4-gram | 30.8% | 27.7% | 44.3% |
| TDNN B (sMBR training) | 4-gram | 29.1% | 28.9% | 43.9% |
| TDNN B (CE training) | RNN | 29.8% | 26.5% | 43.4% |
| TDNN B (sMBR training) | RNN | 28.3% | 28.2% | 43.4% |

| Participant | Test WER | System Type |
| --- | --- | --- |
| Kaldi | 44.3% | Single System |
| BBN (and others) | 44.3% | Combination |
| I2R (Singapore) | 44.8% | Combination |
![Page 88: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/88.jpg)
Keys to Good Performance on ASpIRE
• Time delay neural networks (TDNN)
– Deal well with long reverberation times
• i-vector based adaptation compensation
– Deals with speaker & channel variability
• Data augmentation with simulated reverberations
– Deals with channel distortions not seen in training
• Pronunciation and inter-word silence probabilities
– Helpful in adverse acoustic conditions
![Page 89: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/89.jpg)
The JHU ASpIRE System
(See our ASRU 2015 paper for details)
![Page 90: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/90.jpg)
Semi-supervised MMI Training
(See our paper at INTERSPEECH 2015 for details)
![Page 91: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/91.jpg)
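Discriminative (MMI) Training: a hand-waving, mostly correct introduction

Cross-entropy training, which corresponds to minimizing $\mathrm{KL}(\hat{P} \,\|\, P_\theta)$:

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \sum_{t=1}^{T} \log P(O_t \mid W_t; \theta)$$

Sequence training, which corresponds to maximizing $I(W \wedge O; \theta)$:

$$\hat{\theta}_{\mathrm{MMI}} = \arg\max_{\theta} \sum_{t=1}^{T} \log \left\{ \frac{P(O_t \mid W_t; \theta)}{\sum_{W'_t} P(O_t \mid W'_t; \theta)\, P(W'_t)} \right\}$$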
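To make the MMI criterion concrete, here is a minimal sketch in plain Python. It assumes each utterance comes with a small explicit list of competing word sequences; the dictionary layout and the names `mmi_objective` and `log_sum_exp` are hypothetical and are not Kaldi code. Real systems sum over a lattice or a denominator graph rather than a short list.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mmi_objective(utterances):
    """Sum over utterances of log [ P(O|W_ref) / sum_W' P(O|W') P(W') ].

    Each utterance is a dict (hypothetical layout) with:
      "ref":    the reference word sequence (a key of the two dicts below)
      "loglik": {word_sequence: log P(O | word_sequence)}
      "prior":  {word_sequence: P(word_sequence)}
    """
    total = 0.0
    for utt in utterances:
        numerator = utt["loglik"][utt["ref"]]
        denominator = log_sum_exp(
            [utt["loglik"][w] + math.log(utt["prior"][w]) for w in utt["loglik"]]
        )
        total += numerator - denominator
    return total

# Toy utterance with two competing hypotheses (made-up numbers).
toy = [{
    "ref": "a",
    "loglik": {"a": math.log(0.2), "b": math.log(0.1)},
    "prior": {"a": 0.5, "b": 0.5},
}]
print(mmi_objective(toy))
```

Raising the likelihood of the reference or lowering that of the competitor both increase the value, which is what separates this criterion from plain maximum likelihood.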
![Page 92: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/92.jpg)
Semi-Supervised Sequence Training
• Sequence training improves substantially over basic cross-entropy training of DNN acoustic models
• Semi-supervised cross-entropy training – by adding unlabeled data – also improves substantially over basic cross-entropy training on labeled data
• But semi-supervised sequence training is “tricky”
– Sensitivity to incorrect transcription seems greater
– Confidence-based filtering or weighting must be applied
– Empirical results are not very satisfactory
![Page 93: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/93.jpg)
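Semi-supervised Sequence Training: without committing to a single transcription
• View MMI training as minimizing a conditional entropy

$$I(W \wedge O; \theta) = \frac{1}{T} \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{P(O_t; \theta)} = \frac{1}{T} \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{\sum_{W'} P(O_t \mid W'; \theta)\, P(W')}$$

$$I(W \wedge O; \theta) = H(W) - H(W \mid O; \theta) = H(W) - \frac{1}{T} \sum_{t=1}^{T} H(W \mid O_t; \theta)$$

• The latter does not require committing to a single $W_t$
– Well suited for unlabeled speech
– Entails computing a sum over all W’s in the lattice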
![Page 94: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/94.jpg)
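Computing Lattice Entropy Using Expectation Semi-rings
• How to efficiently compute

$$-H(W \mid O_t; \theta) = \sum_{\pi \in L} P(\pi) \log P(\pi)$$

• Take inspiration from the computation of

$$Z(O_t; \theta) = \sum_{\pi \in L} P(\pi)$$

• Replace arc-probabilities $p_i$ with the pair $(p_i,\ p_i \log p_i)$

| Semi-ring Element & Operators | $(p,\ p \log p)$ |
|---|---|
| $(p_1,\ p_1 \log p_1) + (p_2,\ p_2 \log p_2)$ | $(p_1 + p_2,\ p_1 \log p_1 + p_2 \log p_2)$ |
| $(p_1,\ p_1 \log p_1) \times (p_2,\ p_2 \log p_2)$ | $(p_1 p_2,\ p_1 p_2 \log p_2 + p_2 p_1 \log p_1)$ |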
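The semi-ring operations above can be tried on a toy lattice. The sketch below is plain Python under stated assumptions: the lattice is an acyclic list of arcs sorted by source state, states are numbered in topological order, and the names `s_plus`, `s_times`, `arc_weight`, `lattice_totals` and `ARCS` are hypothetical. It is an illustration of the idea, not the Kaldi implementation. One forward pass yields both $Z$ and $\sum_{\pi} P(\pi) \log P(\pi)$.

```python
import math
from collections import defaultdict

def s_plus(a, b):
    """Semi-ring addition: componentwise sum of (p, r) pairs."""
    return (a[0] + b[0], a[1] + b[1])

def s_times(a, b):
    """Semi-ring multiplication: (p1*p2, p1*r2 + p2*r1)."""
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])

def arc_weight(p):
    """Lift an arc probability p to the pair (p, p log p)."""
    return (p, p * math.log(p))

def lattice_totals(arcs, start, final):
    """Forward pass over an acyclic lattice.

    arcs: list of (src, dst, prob), sorted so that every arc entering a
    state is processed before any arc leaving it.
    Returns (Z, sum over paths of P(path) * log P(path)).
    """
    alpha = defaultdict(lambda: (0.0, 0.0))  # semi-ring zero
    alpha[start] = (1.0, 0.0)                # semi-ring one
    for src, dst, p in arcs:
        alpha[dst] = s_plus(alpha[dst], s_times(alpha[src], arc_weight(p)))
    return alpha[final]

# Hypothetical lattice with three paths of probability 0.3, 0.3 and 0.4.
ARCS = [(0, 1, 0.6), (0, 2, 0.4), (1, 2, 0.5), (1, 3, 0.5), (2, 3, 1.0)]
Z, sum_p_log_p = lattice_totals(ARCS, start=0, final=3)
print(Z, sum_p_log_p)
```

Because the product of two lifted weights is again of the form $(P, P \log P)$ for the path probability $P$, the second component accumulates the entropy term without enumerating paths.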
![Page 95: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/95.jpg)
Semi-supervised Sequence Training: Key Details Needed to Make it Work
• View training criterion as MCE instead of MMI
– i.e. argmin H(W|O;θ) instead of argmax I(W∧O;θ)
– Efficiently compute H(W|O;θ) for the lattice, and its gradient, via Baum-Welch with special semi-rings
• Use separate output (soft-max) layers in the DNN for labeled and unlabeled data
– Inspired by multilingual DNN training methods
• Use a slightly different “prior” for converting DNN posterior probabilities to acoustic likelihoods
![Page 96: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/96.jpg)
Results for Semi-Supervised MMI on Fisher English CTS

| DNN Training Method (hours of speech) | Dev WER | Test WER |
|---|---|---|
| Cross-Entropy Training (100h labeled) | 32.0 | 31.2 |
| CE (100h labeled + 250h self-labeled) | 30.6 | 29.8 |
| CE (100h labeled + 250h weighted) | 30.5 | 29.8 |
| Sequence Training (100h labeled) | 29.6 | 28.5 |
| Seq Training (100h labeled + 250h weighted) | 29.9 | 28.8 |
| Seq Training (100h labeled + 250h MCE) | 29.4 | 28.1 |
| Sequence Training (350h labeled) | 28.5 | 27.5 |

(Slide annotations on the table: “Known use of unlabeled data” and “Better use of unlabeled data”)

• Recovers about 40% of the supervised training gain
– Investigation underway for 2000h of unlabeled speech
• Repeatable results on BABEL data sets with 10h supervised training + 50-70h unsupervised
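The "about 40%" figure can be checked against the Test WER column. This is a small arithmetic sketch; the function name `recovered_fraction` is hypothetical. The baseline is sequence training on 100h labeled data, the semi-supervised system adds 250h with MCE, and the upper bound is sequence training on 350h labeled data.

```python
def recovered_fraction(baseline_wer, semi_supervised_wer, fully_supervised_wer):
    """Share of the supervised gain that the semi-supervised system recovers."""
    return (baseline_wer - semi_supervised_wer) / (baseline_wer - fully_supervised_wer)

# Test WER values from the slide: 28.5 (100h), 28.1 (100h + 250h MCE), 27.5 (350h).
print(recovered_fraction(28.5, 28.1, 27.5))
```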
![Page 97: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/97.jpg)
Advanced Methods: Staying Ahead in the STT Game
• STT technology is advancing very rapidly
– Amazon, Apple, Baidu, Facebook, Google, Microsoft
• Kaldi leads and keeps up with major innovations
– From SGMMs to DNN (2012)
– From “English” to low-resource languages (2013)
– From CPUs to GPUs (2014)
– From close-talking to far-field microphones (2015)
– From well-curated to “wild type” corpora (2016)
• A preview of some upcoming developments
![Page 98: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/98.jpg)
Heterogeneous Training Corpora
• Transcribed speech from different collections is not easy to merge for STT training
– Genre and speaking style differences
– Different channel conditions
– Slightly different transcription conventions
• Typical result: the corpus matched to test data gives best STT results; others don’t help, sometimes hurt!
• SCALE 2015 case study with Pashto CTS
– Collected in country, and transcribed, by same vendor
– Roughly 80 hours each in the Appen LILA corpus and IARPA BABEL corpus
– Pronunciation lexicon to cover transcripts; same phone set
![Page 99: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/99.jpg)
A Study in Pashto (A manuscript is in preparation for future publication)
![Page 100: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/100.jpg)
A Study in Pashto
• Transcriptions require extensive cross-corpus normalization
• Even after that, language models don’t benefit much from corpus pooling
• Simple corpus pooling doesn’t improve acoustic modeling either
• DNNs with shared “inner” layers and corpus-specific input and output layers work best

| Training Data | Single Model | Interpolation Weight LM A | Interpolation Weight LM B | Interpolation Weight LM T | Interpolated Model |
|---|---|---|---|---|---|
| Text A | 99.2 | 0.8 | 0.2 | 0.0 | 92.9 |
| Text B | 141.9 | 0.1 | 0.8 | 0.1 | 140.0 |
| Text T | 86.7 | 0.0 | 0.0 | 1.0 | 86.7 |

STT Word Error Rates:

| Multi-corpus (A+B) Training Strategy | Test A | Test B | Test T |
|---|---|---|---|
| Shared DNN layers (except 1) | 53.2% | 47.4% | 27.0% |
| Shared DNN layers (except 2) | 51.2% | 45.0% | 25.4% |
| + Optimized Language Model | 50.8% | 44.8% | 25.4% |
| + Duration Modeling | 50.4% | 44.3% | 24.8% |

STT Word Error Rates:

| DNN Training Data | Test A | Test B | Test T |
|---|---|---|---|
| Single corpus (matched) | 55.4% | 46.8% | 24.8% |
| Two corpora (Pashto A+B) | 51.9% | 48.2% | 52.6% |
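The language-model rows combine per-corpus models by linear interpolation. The sketch below assumes the figures in that table are perplexities (the slide does not label them) and uses toy unigram models with made-up probabilities; `interpolate` and `perplexity` are hypothetical helper names, not a toolkit API.

```python
import math

def interpolate(models, weights):
    """Linear interpolation: P(w) = sum_i weight_i * P_i(w)."""
    vocab = set().union(*models)
    return {w: sum(lam * m.get(w, 0.0) for lam, m in zip(weights, models))
            for w in vocab}

def perplexity(model, text):
    """exp of the average negative log-probability per word."""
    return math.exp(-sum(math.log(model[w]) for w in text) / len(text))

# Toy unigram models (made-up numbers) standing in for LM A and LM B.
lm_a = {"x": 0.5, "y": 0.5}
lm_b = {"x": 0.9, "y": 0.1}
held_out = ["x", "x", "x", "y"]

for weights in [(1.0, 0.0), (0.8, 0.2), (0.0, 1.0)]:
    print(weights, perplexity(interpolate([lm_a, lm_b], weights), held_out))
```

With weight 1.0 on a single model the interpolated result equals that model's own perplexity, which matches the Text T row, where all the weight goes to LM T and both columns read 86.7.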
![Page 101: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/101.jpg)
Advanced Methods: Staying Ahead in the STT Game
• STT technology is advancing very rapidly
– Amazon, Apple, Baidu, Facebook, Google, Microsoft
• Kaldi leads and keeps up with major innovations
– From SGMMs to DNN (2012)
– From “English” to low-resource languages (2013)
– From CPUs to GPUs (2014)
– From close-talking to far-field microphones (2015)
– From well-curated to “wild type” corpora (2016)
• A preview of some upcoming developments
![Page 102: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/102.jpg)
Other Additions and Innovations
• Semi-supervised (MMI) training
– Using unlabeled speech to augment a limited transcribed speech corpus
• Multilingual acoustic model training
– Using other-language speech to augment a limited transcribed speech corpus
• Removing reliance on pronunciation lexicons
– Grapheme based models and acoustically aided G2P
• Chain models
– 10% more accurate STT, plus
– 3x faster decoding, and 5x-10x faster training
![Page 103: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/103.jpg)
The Genesis of Chain Models
• Connectionist Temporal Classification
– The latest shiny toy in neural network-based acoustic modeling for STT (ICASSP and InterSpeech 2015)
– Nice STT improvements shown on Google data sets
– We haven’t seen STT gains on our data sets
• Chain Models
– Inspired by (but quite different from) CTC
– Sequence training of NNs without CE pre-training
– Nice STT improvements over previous best systems
– 3x decoding time speed-up; 5x-10x training speed-up
![Page 104: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/104.jpg)
2006: A New Kid on the NNet Block
![Page 105: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/105.jpg)
2015: The New Kid Comes of Age
![Page 106: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/106.jpg)
2015: The New Kid Comes of Age
![Page 107: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/107.jpg)
CTC, Explained… in Pictures
Figure from Graves et al., ICML 2006
dh ax s aw n d
![Page 108: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/108.jpg)
CTC, Explained… in Pictures
Figure from Graves et al., ICML 2006
dh ax s aw n d
(Figure labels: blank symbols β shown around the phone labels)
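Since the figure itself is not reproduced here, a small sketch of the CTC collapsing map may help. It follows the definition in Graves et al. (2006): merge runs of repeated labels, then delete blanks. It assumes β on the slide denotes the blank symbol; the function name `ctc_collapse` and the frame sequence are illustrative only.

```python
from itertools import groupby

BLANK = "β"  # assumed to be the blank symbol drawn on the slide

def ctc_collapse(frame_labels, blank=BLANK):
    """Map a frame-level label sequence to its output label sequence."""
    merged = [label for label, _ in groupby(frame_labels)]  # merge repeats
    return [label for label in merged if label != blank]    # drop blanks

# One of many frame-level paths that collapse to the phones on the slide.
frames = ["β", "dh", "dh", "β", "ax", "β", "s", "s", "aw", "β", "n", "d", "β"]
print(ctc_collapse(frames))
```

A blank between two identical labels keeps them distinct, so `["a", "β", "a"]` collapses to two labels while `["a", "a"]` collapses to one.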
![Page 109: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/109.jpg)
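DNN versus CTC: STT Performance
Figures and Tables from Sak et al., ICASSP 2015

| DNN | Target | CE | sMBR |
|---|---|---|---|
| LSTM | Senone | 10.0% | 8.9% |
| BLSTM | Senone | 9.7% | 9.1% |

| CTC | Target | CE | sMBR |
|---|---|---|---|
| LSTM | Phone | 10.5% | 9.4% |
| BLSTM | Phone | 9.5% | 8.5% |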
![Page 110: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/110.jpg)
First, the Bad News…
• We haven’t been able to get CTC models to give us any noticeable improvement over our best (TDNN or LSTM-RNN) models on our data
– It appears to be easier to get them to work when one has several 1000 hours of labeled speech
– But we care about lower-resource scenarios
![Page 111: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/111.jpg)
…and then the Good News
• We are able to get similar improvements using a different model, which is inspired by ideas from the CTC papers
– Use simple “1-state” HMMs for each CD phone
– Reduce frame rate from 100Hz to 33Hz
– Permit slack in the frame-to-state alignment
![Page 112: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/112.jpg)
Chain Models and LF-MMI Training
• A new class of acoustic models for hybrid STT
– “1-state” HMM for each context-dependent phone
– LSTM/TDNNs compute state posterior probabilities
    • MFCCs are down-sampled from 100Hz to 33Hz
– Inspired by CTC
• A new lattice-free MMI training method
– Improved parallelization, sequence training on GPUs
    • Larger mini-batches, smaller I/O bandwidth
– Does not require CE training before MMI training
– Uses “flexible label alignment” inspired by CTC
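Going from 100Hz to 33Hz amounts to keeping roughly one frame in three. The following is a minimal sketch of that idea in plain Python; the function name `subsample_frames` and its `factor` and `offset` parameters are hypothetical, and the slides do not say how Kaldi applies the subsampling internally.

```python
def subsample_frames(frames, factor=3, offset=0):
    """Keep every `factor`-th frame, starting at index `offset`."""
    return frames[offset::factor]

# One second of speech at 100 frames per second, frames stood in for by indices.
one_second = list(range(100))
kept = subsample_frames(one_second)
print(len(kept), kept[:4])
```

Fewer frames per second means fewer network evaluations at decoding time, which is in line with the faster decoding reported for chain models.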
![Page 113: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/113.jpg)
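Discriminative (MMI) Training: a hand-waving, mostly correct introduction

Corresponds to minimizing $\mathrm{KL}(\hat{P} \,\|\, P_\theta)$:

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \sum_{t=1}^{T} \log P(O_t \mid W_t; \theta)$$

Corresponds to maximizing $I(W \wedge O; \theta)$:

$$\hat{\theta}_{\mathrm{MMI}} = \arg\max_{\theta} \sum_{t=1}^{T} \log \left\{ \frac{P(O_t \mid W_t; \theta)}{\sum_{W'_t} P(O_t \mid W'_t; \theta)\, P(W'_t)} \right\}$$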
![Page 114: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/114.jpg)
Lattice-Free MMI Training
• Denominator (phone) graph creation
– Use a phone 4-gram language model, L
– Compose H, C and L to obtain denominator graph
    • This FSA is the same for all utterances; suits GPU training
    • Use (heuristic) sentence-specific initial probabilities
• Numerator graph creation
– Generate a phone graph using transcripts
    • This FSA encodes frame-by-frame alignment of HMM states
– Permit some alignment “slack” for each frame/label
– Intersect slackened FSA with the denominator FSA
![Page 115: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/115.jpg)
Lattice-free MMI Training (cont’d)
• LSTM-RNNs trained with this MMI training procedure are highly susceptible to over-fitting
• Essential to regularize the NN training process
– A second output layer for CE training
– Output L2 regularization
– Use a leaky HMM

Regularization settings and Hub-5 ‘00 Word Error Rate:

| Cross Entropy | L2 Norm | Leaky HMM | Total | SWBD |
|---|---|---|---|---|
| N | N | N | 16.8% | 11.1% |
| Y | N | N | 15.9% | 10.5% |
| N | Y | N | 15.9% | 10.4% |
| N | N | Y | 16.4% | 10.9% |
| Y | Y | N | 15.7% | 10.3% |
| Y | N | Y | 15.7% | 10.3% |
| N | Y | Y | 15.8% | 10.4% |
| Y | Y | Y | 15.6% | 10.4% |
![Page 116: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/116.jpg)
STT Results for Chain Models
300 hours of SWBD Training Speech; Hub-5 ‘00 Evaluation Set

| Training Objective | Model (Size) | Total WER | SWBD WER |
|---|---|---|---|
| Cross-Entropy | TDNN A (16.6M) | 18.2% | 12.5% |
| CE + sMBR | TDNN A (16.6M) | 16.9% | 11.4% |
| Lattice-free MMI | TDNN A (9.8M) | 16.1% | 10.7% |
| Lattice-free MMI | TDNN B (9.9M) | 15.6% | 10.4% |
| Lattice-free MMI | TDNN C (11.2M) | 15.5% | 10.2% |
| LF-MMI + sMBR | TDNN C (11.2M) | 15.1% | 10.0% |

• LF-MMI reduces WER by ca. 10%-15% relative
• LF-MMI is better than standard CE + sMBR training (ca. 8%)
• LF-MMI improves very slightly with additional sMBR training
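The relative figures quoted in these bullets can be reproduced from the Total WER column. This is a short arithmetic sketch; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(old_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (old_wer - new_wer) / old_wer

# Total WER on Hub-5 '00, taken from the slide.
print(round(relative_reduction(18.2, 16.1), 1))  # Cross-Entropy vs LF-MMI, TDNN A
print(round(relative_reduction(18.2, 15.5), 1))  # Cross-Entropy vs LF-MMI, TDNN C
print(round(relative_reduction(16.9, 15.5), 1))  # CE + sMBR vs LF-MMI, TDNN C
```

The first two values fall in the 10%-15% range cited for LF-MMI over cross-entropy, and the third is the roughly 8% gain over CE + sMBR.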
![Page 117: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/117.jpg)
Chain Models and LF-MMI Training
STT Performance on a Variety of Corpora

| Corpus and Audio Type | Training Speech | CE + sMBR Error Rate | LF-MMI Error Rate |
|---|---|---|---|
| AMI IHM | 80 hours | 23.8% | 22.4% |
| AMI SDM | 80 hours | 48.9% | 46.1% |
| TED-LIUM | 118 hours | 11.3% | 12.8% |
| Switchboard | 300 hours | 16.9% | 15.5% |
| Fisher + SWBD | 2100 hours | 15.0% | 13.3% |

• Chain models with LF-MMI reduce WER by 6%-11% (relative)
• LF-MMI improves a bit further with additional sMBR training
• LF-MMI is 5x-10x faster to train, 3x faster to decode
![Page 118: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/118.jpg)
A Recap of Chain Models
• A new class of acoustic models for hybrid STT
– “1-state” HMM for context-dependent phones
– LSTM-RNN acoustic models (TDNN also compatible)
• A new lattice-free MMI training method
– Better suited to using GPUs for parallelization
– Does not require CE training before MMI training
• Improved speed and STT performance
– 6%-8% relative WER reduction over previous best
– 5-10x improvement in training time; 3x decoding time
![Page 119: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/119.jpg)
Summary of Advanced Methods: Staying Ahead in the STT Game
• STT technology is advancing very rapidly
– Amazon, Apple, Baidu, Facebook, Google, Microsoft
• Kaldi leads and keeps up with major innovations
– From SGMMs to DNN (2012)
– From “English” to low-resource languages (2013)
– From CPUs to GPUs (2014)
– From close-talking to far-field microphones (2015)
– From well-curated to “wild type” corpora (2016)
– Chain models for better STT, faster decoding (2017)
• and the list goes on ...
![Page 120: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/120.jpg)
Team Kaldi @ Johns Hopkins
• Sanjeev Khudanpur
• Daniel Povey
• Jan Trmal
• Guoguo Chen
• Pegah Ghahremani
• Vimal Manohar
• Vijayaditya Peddinti
• Hainan Xu
• Xiaohui Zhang
• …and several others
![Page 121: Building a State-of-the-Art ASR System with Kaldi](https://reader030.vdocuments.us/reader030/viewer/2022013123/5867c5d61a28ab15408be737/html5/thumbnails/121.jpg)
Kaldi Points-of-Contact
• Kaldi mailing list
– [email protected]
• Daniel Povey
– [email protected]
• Jan “Yenda” Trmal
– [email protected]
• Sanjeev Khudanpur
– [email protected]
– 410-516-7024