"automatic speech recognition for mobile applications in yandex" — fran campillo,...
DESCRIPTION
This talk describes the work developed by the Yandex Speech Group in the last two years. Starting from scratch, large amounts of voice recordings were collected from the field of application, and the most popular open-source speech projects were studied to get a thorough understanding of the problem and to gather ideas to build our own technology. This talk will present key experiments and their results, as well as our latest achievements in automatic speech recognition in Russian. Currently, the Yandex Speech Group provides three different services in Russian: maps, navigation, and general search, with a performance that is comparable to competitor products.

TRANSCRIPT
Automatic speech recognition for mobile applications in Yandex
Fran Campillo
Outline
● Motivation.
● Road map.
● Automatic speech recognition.
● Data collection.
● Experiments.
● Results.
Motivation
Road map
● Sep-2011: study of open-source tools and data collection.
  – HTK, Sphinx, RASR, Kaldi, ...
  – Service provided by a 3rd party.
● Jan-2012: development of in-house technology.
● Jan-2013: launch of our own services.
Automatic speech recognition

ASR: complexity

                      Easier      Harder
Style                 Planned     Spontaneous
Audio quality         CD          Telephone
Vocabulary size       Hundreds    Hundreds of thousands
Number of speakers    One         Many
Recognition rate      Better      Worse
Complexity            Smaller     Bigger
Word pronunciations
● ASR: sounds => words.
● How is a word pronounced?
  – Line => /'laɪn/.
  – Linear => /'lɪnɪɘʳ/.
● We need a mapping from spelling to phonemes: G2P (grapheme-to-phoneme).
Word pronunciations: dictionary

а               a
аб              a tc p
абад            a dc b a tc t
абаза           a dc b a z a
абакан          a dc b a tc k ax n
абакана         a dc b a tc k a n a
абакане         a dc b a tc k a nj e
абаканская      a dc b a tc k a n s tc k ax j a
абаканский      a dc b a tc k a n s tc kj I j
абакумова       a dc b a tc k u m ax v a
абанский        a dc b a n s tc kj I j
абганеровская   a dc b dc g ax nj I r ax f s tc k ax j a
абдулино        a dc b dc dK& u lj i n a
абельмановская  a dc bj e lj m ax n ax f s tc k ax j a
абзаково        a dc b z a tc k o v a
абзелиловский   a dc b zj i lj i l ax f s tc kj I j
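A dictionary in this format can be loaded as a simple word-to-phonemes mapping. A minimal Python sketch (the loader and the entry format are assumptions based on the excerpt above; a production lexicon also handles multiple pronunciations per word):

```python
# Minimal sketch: load a pronunciation dictionary whose lines hold a word
# followed by its phoneme sequence, and look up a word's pronunciation.

def load_lexicon(lines):
    lexicon = {}
    for line in lines:
        parts = line.split()
        if parts:
            lexicon[parts[0]] = parts[1:]
    return lexicon

entries = [
    "абакан a dc b a tc k ax n",
    "абаза a dc b a z a",
]
lexicon = load_lexicon(entries)
print(lexicon["абакан"])  # ['a', 'dc', 'b', 'a', 'tc', 'k', 'ax', 'n']
```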
Speech parametrization

[Figure: feature distributions for phone /a/ and phone /i/]
ASR: the problem
● We have a sequence of observations:
  – O = {o_1, o_2, …, o_T}
  – o_i is a feature vector representing a speech frame.
● Goal: find the likeliest sequence of words w_i for O:

  argmax_i P(w_i | O)
ASR: the problem (II)
● We cannot compute P(w_i | O) directly.
● Bayes: P(w_i | O) = P(O | w_i) P(w_i) / P(O)

  argmax_i P(w_i | O) = argmax_i { P(O | w_i) · P(w_i) }

  where P(O | w_i) is the acoustic model and P(w_i) is the language model.
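Concretely, a recognizer scores each hypothesis by the sum of acoustic and language-model log probabilities and keeps the best one. A toy sketch (the hypothesis list and all probabilities below are made-up values; a real decoder searches a lattice, not an explicit list):

```python
import math

# Toy illustration of argmax_i { P(O|w_i) * P(w_i) }, computed in the
# log domain to avoid underflow.
hypotheses = {
    "we will rock you":   {"p_acoustic": 1e-9, "p_lm": 1e-4},
    "will will rock you": {"p_acoustic": 2e-9, "p_lm": 1e-7},
}

def score(h):
    # log P(O | w) + log P(w)
    return math.log(h["p_acoustic"]) + math.log(h["p_lm"])

best = max(hypotheses, key=lambda w: score(hypotheses[w]))
print(best)  # "we will rock you": its weaker acoustic score is outweighed by the LM
```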
Language model
● Probability of sequences of words:
  – "We will rock you" => P_1.
  – "Will will rock you" => P_2.
● Trained on large corpora.
● The closer to the application domain, the better.
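A minimal sketch of such a model, as a bigram estimated from a toy corpus (the corpus and the unsmoothed maximum-likelihood estimate are simplifications for illustration; real models use huge corpora plus smoothing for unseen n-grams):

```python
from collections import Counter

# Toy bigram language model: P(w2 | w1) estimated by counting.
corpus = [
    "<s> we will rock you </s>",
    "<s> we will win </s>",
    "<s> you will see </s>",
]

bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def p_bigram(w1, w2):
    # Maximum-likelihood estimate P(w2 | w1)
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def p_sentence(sentence):
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= p_bigram(w1, w2)
    return p

print(p_sentence("<s> we will rock you </s>"))  # 2/3 · 1 · 1/3 · 1 · 1/2 ≈ 0.111
```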
Acoustic model: Hidden Markov Models
● First-order HMM: a sequence of states in which each state depends only on the previous one, and states are associated with events we can observe.
● Typical layout for ASR: three left-to-right states Q1, Q2, Q3, with self-loops a_11, a_22, a_33, forward transitions a_12, a_23, and emission probabilities b_1(o), b_2(o), b_3(o).
● a_ij: transition probabilities.
● b_j(o): probability of observation o in state j.
Acoustic model: HMM and speech
● Each state models a part of the phoneme:
  – 1st: beginning of the phoneme.
  – 2nd: stationary part.
  – 3rd: end of the phoneme.
● a_ij: model the duration of each part.
● b_j(o): probability of producing a feature vector o in state j.
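Given trained a_ij and b_j(o), the best state sequence for an observation sequence is found with the Viterbi algorithm. A minimal sketch over the three-state left-to-right layout above (the transition and observation probabilities are made-up toy values, and the observations are reduced to two discrete symbols; real systems emit continuous feature vectors):

```python
import math

# Viterbi decoding over a 3-state left-to-right HMM (toy numbers).
log = math.log
states = [0, 1, 2]                 # Q1, Q2, Q3
# a[i][j]: transition probabilities a_ij (self-loop or move forward)
a = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
# b[j][o]: probability of discrete observation o in state j
b = [[0.9, 0.1],
     [0.5, 0.5],
     [0.1, 0.9]]

def viterbi(obs):
    # delta[j]: best log score of any path ending in state j; start in Q1
    delta = [log(b[0][obs[0]])] + [float("-inf")] * 2
    back = []
    for o in obs[1:]:
        prev = delta[:]
        step = []
        for j in states:
            cands = [(prev[i] + log(a[i][j]) if a[i][j] > 0 else float("-inf"), i)
                     for i in states]
            best_score, best_i = max(cands)
            delta[j] = best_score + log(b[j][o])
            step.append(best_i)
        back.append(step)
    # Backtrack from the final state Q3
    j = 2
    path = [j]
    for step in reversed(back):
        j = step[j]
        path.append(j)
    return path[::-1]

print(viterbi([0, 0, 1, 1]))  # [0, 1, 2, 2]
```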
Modeling the probability of observation
● Gaussian mixtures:

  b_j(x) = Σ_m c_jm · N(x; μ_jm, Σ_jm)

  – c_jm => weight of the mth Gaussian of state j.
  – μ_jm => mean (vector) of the mth Gaussian of state j.
  – Σ_jm => covariance matrix of the mth Gaussian of state j.
● Neural networks.
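A minimal sketch of evaluating b_j(x) for a diagonal-covariance mixture (the weights, means, and variances are toy values rather than trained parameters; real models also compute this in the log domain):

```python
import math

# b_j(x) = sum_m c_jm * N(x; mu_jm, Sigma_jm), with diagonal covariances.

def gauss_diag(x, mean, var):
    # N(x; mu, Sigma) for a diagonal covariance: product over dimensions
    p = 1.0
    for xi, mu, v in zip(x, mean, var):
        p *= math.exp(-((xi - mu) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
    return p

def b_j(x, weights, means, variances):
    return sum(c * gauss_diag(x, mu, v)
               for c, mu, v in zip(weights, means, variances))

weights = [0.6, 0.4]                   # c_j1, c_j2 (sum to 1)
means = [[0.0, 0.0], [2.0, 2.0]]       # mu_j1, mu_j2
variances = [[1.0, 1.0], [1.0, 1.0]]   # diagonals of Sigma_j1, Sigma_j2
print(b_j([0.0, 0.0], weights, means, variances))
```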
Waveform, phonemes, frames, and states

[Figure: waveform of phone /o/ cut into frames o_1 … o_10 and aligned to the three HMM states:
  Q1 => o_1, o_2                  (μ_1m, Σ_1m, c_1m)
  Q2 => o_3, o_4, o_5, o_6, o_7   (μ_2m, Σ_2m, c_2m)
  Q3 => o_8, o_9, o_10            (μ_3m, Σ_3m, c_3m)]
Block diagram for training

[Flow: Prototype HMM => Initialization (initial μ_jm, Σ_jm, c_jm for the GMMs) => Baum-Welch over the training sentences (alignments of observations to states) => HMM parameter update (new estimates for μ_jm, Σ_jm, c_jm) => convergence check: if not converged, repeat Baum-Welch; otherwise output the trained models.]
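The same initialize / re-estimate / check-convergence loop can be illustrated with a toy EM fit of a two-Gaussian mixture in one dimension (plain Python, invented data; real acoustic-model training runs Baum-Welch over HMM states and vastly more data):

```python
import math

# Toy EM loop mirroring the block diagram: initialize, re-estimate,
# check convergence on the log-likelihood.

data = [0.1, -0.2, 0.3, 0.0, 4.0, 4.2, 3.8, 4.1]

def gauss(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Initialization (prototype parameters)
w = [0.5, 0.5]
mu = [0.0, 1.0]
var = [1.0, 1.0]

prev_ll = float("-inf")
for iteration in range(100):
    # E-step: soft alignment of observations to mixture components
    gamma = []
    for x in data:
        p = [w[k] * gauss(x, mu[k], var[k]) for k in range(2)]
        s = sum(p)
        gamma.append([pk / s for pk in p])
    # M-step: parameter update
    for k in range(2):
        nk = sum(g[k] for g in gamma)
        w[k] = nk / len(data)
        mu[k] = sum(g[k] * x for g, x in zip(gamma, data)) / nk
        var[k] = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, data)) / nk
        var[k] = max(var[k], 1e-6)  # variance floor to avoid collapse
    # Convergence check
    ll = sum(math.log(sum(w[k] * gauss(x, mu[k], var[k]) for k in range(2)))
             for x in data)
    if ll - prev_ll < 1e-9:
        break
    prev_ll = ll

print(round(mu[0], 2), round(mu[1], 2))  # means settle near the two clusters
```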
Decoding
● Lexicon: the words that can be recognized.
● Decoder: dynamic programming, with the constraints imposed by the lexicon, the acoustic models, and the language model.

[Diagram: speech signal => parametrization => decoder (inputs: lexicon, acoustic models, language model) => words]
Our decoder
● Based on weighted finite-state transducers (WFSTs).
● The lexicon, the language model, and the acoustic model are composed into a single structure (HCLG).
  – Same information, but more efficient.
Composition of WFST: example

[Diagram: a lexicon transducer mapping the phoneme sequences B-ah-b and l-ay-k-s to the words "Bob" and "likes", composed with a language-model acceptor.]
Data collection
● Speech samples taken from the field.
● Manual transcriptions:
  – Speaker features: gender, native speaker, ...
  – Anomalies in the pronunciation.
  – Noises in the recording.
Manual transcriptions
● 600k recordings.
● Uncompressed format: 8 kHz and 16 kHz.
● 286,020 different speakers.

          Percentage (%)
Native    87.7
Male      83.3
Female    8.5
Child     8.2
Manual transcriptions
● Percentage of recordings without anomalies: 7.4%

Anomaly                    Percentage (%)
side_speech                14.4
speech-in-noise            71.5
Indistinguishable          3.7
mouth_noise                3.6
breath_noise               6.3
Irregular pronunciations   5.3
Hesitations                0.5
Fragments                  5.5
Transient noise            14.0
Foreign words              0.1
Manual transcriptions: examples
● марциальные воды — male, native
● *трёx#пруд#ньій* — male, native, speech-in-noise
● [side_speech] чкалова — male, native, speech-in-noise, bad-audio
VisualQA
Experiments

Grapheme-to-phoneme
● Sequitur:
  – Based on joint sequence models.
  – Accuracy => 2.09% phoneme error rate.
● Phonetisaurus:
  – WFST-based.
  – Accuracy => 1.04% phoneme error rate.
● Special treatment for Latin-script words:
  – G2P trained on transliterated versions of the Russian pronunciation (for example: whatsapp => уотсап).
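Phoneme error rate, as quoted above, is conventionally the edit distance between hypothesis and reference phoneme sequences divided by the reference length; a minimal sketch (the example pronunciations below are invented for illustration):

```python
# Edit-distance-based phoneme error rate (PER): substitutions, insertions
# and deletions between hypothesis and reference, over reference length.

def edit_distance(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)]

def per(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

ref = "a dc b a tc k ax n".split()
hyp = "a dc b a k a n".split()
print(per(ref, hyp))  # 2 errors / 8 phonemes = 0.25
```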
Noise models

Experiments: acoustic model vs. language model

Experiments: number of Gaussians
Results

Users: Navigator
Results: relative word error rate
● Results relative to our WER in each experiment (negative values indicate experiments in which our system is outperformed):

              Maps    Navigation   General search
Yandex-GMM    1       1            1
3rd party     44.6%   31.8%        37.3%
Competitor    1.9%    -9.7%        -23.4%

              General search
Yandex-DNN    1
Competitor    6.6%
Thanks for your attention!

Fran Campillo
Senior Software Engineer, PhD
Yandex Speech Group
[email protected]