"automatic speech recognition for mobile applications in yandex" — fran campillo,...

Automatic speech recognition for mobile applications in Yandex

Fran Campillo

Outline

● Motivation.
● Road map.
● Automatic speech recognition.
● Data collection.
● Experiments.
● Results.

Motivation

Road map

● Sep-2011: study of open source tools and data collection.
– HTK, Sphinx, Rasr, Kaldi,...
– Service provided by 3rd party.
● Jan-2012: development of in-house technology.
● Jan-2013: launching of own services.

Automatic speech recognition

ASR: complexity

Style               Planned    Spontaneous
Audio quality       CD         Telephone
Vocabulary size     Hundreds   Hundreds of thousands
Number of speakers  One        Many
Recognition rate    Better     Worse
Complexity          Smaller    Bigger

Word pronunciations

● ASR: sounds => words.
● How is a word pronounced?
– Line => /ˈlaɪn/.
– Linear => /ˈlɪnɪəʳ/.
● Need a mapping from writing to phonemes: grapheme-to-phoneme (G2P).

Word pronunciations: dictionary

а a
аб a tc p
абад a dc b a tc t
абаза a dc b a z a
абакан a dc b a tc k ax n
абакана a dc b a tc k a n a
абакане a dc b a tc k a nj e
абаканская a dc b a tc k a n s tc k ax j a
абаканский a dc b a tc k a n s tc kj I j
абакумова a dc b a tc k u m ax v a
абанский a dc b a n s tc kj I j
абганеровская a dc b dc g ax nj I r ax f s tc k ax j a
абдулино a dc b dc dK& u lj i n a
абельмановская a dc bj e lj m ax n ax f s tc k ax j a
абзаково a dc b z a tc k o v a
абзелиловский a dc b zj i lj i l ax f s tc kj I j
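A dictionary in this layout (one word followed by its phone symbols, separated by spaces) can be loaded with a few lines of code. The following is a minimal sketch, not the loader used in the talk: the function name `parse_lexicon` and the choice to keep several pronunciations per word are assumptions made for illustration.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to a list of pronunciations (each a list of phone symbols)."""
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phones = fields[0], fields[1:]
        lexicon[word].append(phones)
    return dict(lexicon)


# Three entries copied from the dictionary excerpt above.
sample = [
    "абад a dc b a tc t",
    "абаза a dc b a z a",
    "абакан a dc b a tc k ax n",
]
lexicon = parse_lexicon(sample)
print(lexicon["абаза"])
```

Keeping a list of pronunciations per word leaves room for words with more than one valid pronunciation.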

Speech parametrization

[Figure panels: Phone /a/, Phone /i/]

ASR: the problem

● We have a sequence of observations:
– $O = \{o_1, o_2, \dots, o_T\}$.
– $o_i$ is a feature vector representing a speech frame.
● Goal: finding the likeliest sequence of words $w_i$ for $O$:

$$\arg\max_i P(w_i / O)$$

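ASR: the problem (II)

● We cannot compute $P(w_i / O)$ directly.
● Bayes: $P(w_i / O) = \dfrac{P(O / w_i)\,P(w_i)}{P(O)}$
● $P(O)$ does not depend on $w_i$, so it drops out of the maximization:

$$\arg\max_i P(w_i / O) = \arg\max_i \{P(O / w_i)\,P(w_i)\}$$

● $P(O / w_i)$: acoustic model.
● $P(w_i)$: language model.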
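The decision rule can be illustrated with a toy example. This is only a sketch: the two candidate sentences and all probability values are invented for illustration, and the scores are combined in the log domain, which is the usual way to avoid numerical underflow.

```python
import math


def best_hypothesis(acoustic_logp, lm_logp):
    """Return the word sequence w that maximizes log P(O/w) + log P(w)."""
    return max(acoustic_logp, key=lambda w: acoustic_logp[w] + lm_logp[w])


# Hypothetical scores for two competing transcriptions of the same audio.
acoustic_logp = {
    "we will rock you": math.log(1e-5),
    "will will rock you": math.log(2e-5),
}
lm_logp = {
    "we will rock you": math.log(1e-3),
    "will will rock you": math.log(1e-6),
}

best = best_hypothesis(acoustic_logp, lm_logp)
print(best)
```

In this made-up case the acoustic model slightly prefers the second sentence, but the language model penalizes it enough for the first one to win.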

Language model

● Probability of sequences of words:
– “We will rock you” => $P_1$.
– “Will will rock you” => $P_2$.
● Trained on large corpora.
● The closer to the application domain, the better.
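As an illustration of how such probabilities can be estimated, here is a minimal bigram model with add-one smoothing. It is a sketch under stated assumptions: the talk does not say which kind of language model was used, and the three-sentence corpus is invented.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their histories; return the counts and the vocabulary size."""
    histories, bigrams = Counter(), Counter()
    vocabulary = {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocabulary.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return histories, bigrams, len(vocabulary)


def log_prob(sentence, model):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    histories, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(h, w)] + 1) / (histories[h] + vocab_size))
        for h, w in zip(tokens[:-1], tokens[1:])
    )


# Invented toy corpus; a real model is trained on large in-domain corpora.
corpus = ["we will rock you", "we will see you", "you will rock"]
model = train_bigram(corpus)
p1 = log_prob("we will rock you", model)
p2 = log_prob("will will rock you", model)
print(p1 > p2)
```

The sentence whose word pairs were seen in the corpus receives the higher score, while the unseen pair "will will" is penalized by the smoothing.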

Acoustic model: Hidden Markov Models

● First-order HMM: a sequence of states, each depending only on the previous state and associated with events we can observe.
● Typical layout for ASR:

[Figure: HMM states with observation probabilities $b_1(o)$, $b_2(o)$, $b_3(o)$]

● $a_{ij}$: transition probabilities (from state $i$ to state $j$).
● $b_j(o)$: probability of observation $o$ in state $j$.
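Given the transition probabilities and the observation probabilities, the likelihood of an observation sequence under an HMM can be computed with the forward algorithm. The sketch below assumes a three-state model that always starts in the first state and only moves forward; all numbers are invented, and the observation probabilities are given directly as a table instead of being computed from feature vectors.

```python
def forward_likelihood(initial, transitions, obs_probs):
    """Total probability of the observations, summed over all state sequences.

    initial[j]        : probability of starting in state j
    transitions[i][j] : a_ij, probability of moving from state i to state j
    obs_probs[t][j]   : b_j(o_t), probability of frame t in state j
    """
    n = len(initial)
    alpha = [initial[j] * obs_probs[0][j] for j in range(n)]
    for frame in obs_probs[1:]:
        alpha = [
            sum(alpha[i] * transitions[i][j] for i in range(n)) * frame[j]
            for j in range(n)
        ]
    return sum(alpha)


# Hypothetical three-state model and three frames.
initial = [1.0, 0.0, 0.0]
transitions = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
obs_probs = [
    [0.9, 0.1, 0.1],
    [0.2, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]
likelihood = forward_likelihood(initial, transitions, obs_probs)
print(likelihood)
```

This version sums over every final state; toolkits often also require the path to end in the last state of the model.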

Acoustic model: HMM and speech

● Each state models a part of the phoneme:
– 1st: beginning of the phoneme.
– 2nd: stationary part.
– 3rd: end of the phoneme.
● $a_{ij}$: model the duration of each part.
● $b_j(o)$: probability of producing a feature vector $o$ in state $j$.
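One way to see how the transition probabilities encode duration, assuming each state has a self-loop with probability $a_{ii}$ (a common layout, although the slide does not spell it out): the number of frames $d$ spent in state $i$ follows a geometric distribution.

```latex
P(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad
E[d] = \sum_{d=1}^{\infty} d\, a_{ii}^{\,d-1}\,(1 - a_{ii}) = \frac{1}{1 - a_{ii}}
```

For example, $a_{ii} = 0.8$ gives an expected stay of 5 frames in that state.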

Modeling probability of observation

● Gaussian mixtures:

$$b_j(x) = \sum_m c_{jm}\, N(x, \mu_{jm}, \Sigma_{jm})$$

– $c_{jm}$ => weight of the mth Gaussian of state j.
– $\mu_{jm}$ => mean vector of the mth Gaussian of state j.
– $\Sigma_{jm}$ => covariance matrix of the mth Gaussian of state j.
● Neural networks.
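The mixture formula can be evaluated directly. The sketch below assumes diagonal covariance matrices, a common simplification that the slides do not confirm, so each covariance is passed as a list of per-dimension variances. The function names are illustrative.

```python
import math


def gaussian_diag(x, mean, variances):
    """Density at x of a Gaussian with a diagonal covariance matrix."""
    log_density = 0.0
    for xi, mi, vi in zip(x, mean, variances):
        log_density += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_density)


def gmm_likelihood(x, weights, means, variances):
    """b_j(x): weighted sum of the Gaussian densities of one state."""
    return sum(
        c * gaussian_diag(x, mu, var)
        for c, mu, var in zip(weights, means, variances)
    )


# Hypothetical one-dimensional state with two identical components.
value = gmm_likelihood([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
print(value)
```

With two identical standard Gaussians and weights that sum to one, the result is the standard normal density at zero.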

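Waveform, phonemes, frames, and states

[Figure: waveform of the phoneme /o/ divided into frames; the frames are assigned to states Q1, Q2, and Q3, each with its own Gaussian parameters ($\mu_{1m}, \Sigma_{1m}$; $\mu_{2m}, \Sigma_{2m}$; $\mu_{3m}, \Sigma_{3m}$)]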

Block diagram for training

● Inputs: prototype HMM and training sentences.
● Initialization: initial means, covariances, and weights ($\mu_{jm}$, $\Sigma_{jm}$, $c_{jm}$) for the GMMs.
● Baum-Welch: alignments of the training sentences (observations to states).
● HMM parameters update: new estimations for $\mu_{jm}$, $\Sigma_{jm}$.
● Convergence: if not reached, go back to Baum-Welch; otherwise, output the trained models.
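The initialize, re-estimate, and check-convergence loop can be sketched on a much simpler problem: fitting a one-dimensional Gaussian mixture with EM. This is a deliberate simplification and not the training procedure of the talk; full Baum-Welch also computes state occupancies over time, which are omitted here. The data, initial values, and tolerance are invented.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_step(data, weights, means, variances):
    """One re-estimation pass; also returns the log-likelihood of the old parameters."""
    m = len(weights)
    occupancy = [0.0] * m   # soft counts per Gaussian
    first = [0.0] * m       # weighted sums of x
    second = [0.0] * m      # weighted sums of x squared
    log_likelihood = 0.0
    for x in data:
        joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(m)]
        total = sum(joint)
        log_likelihood += math.log(total)
        for k in range(m):
            gamma = joint[k] / total
            occupancy[k] += gamma
            first[k] += gamma * x
            second[k] += gamma * x * x
    new_weights = [occupancy[k] / len(data) for k in range(m)]
    new_means = [first[k] / occupancy[k] for k in range(m)]
    # Floor the variances so that a Gaussian cannot collapse onto one point.
    new_vars = [max(second[k] / occupancy[k] - new_means[k] ** 2, 1e-6) for k in range(m)]
    return new_weights, new_means, new_vars, log_likelihood


def train(data, weights, means, variances, tol=1e-6, max_iter=100):
    """Repeat the update until the log-likelihood stops improving."""
    previous = -math.inf
    for _ in range(max_iter):
        weights, means, variances, ll = em_step(data, weights, means, variances)
        if ll - previous < tol:
            break
        previous = ll
    return weights, means, variances


data = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]
weights, means, variances = train(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(weights, means)
```

The stopping test plays the role of the "Convergence" box: training ends when one more pass no longer improves the likelihood of the training data.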

Decoding

● Lexicon: words that can be recognized.
● Decoder: dynamic programming, with the constraints imposed by the lexicon, the acoustic models, and the language model.

[Diagram: Speech signal → Parametrize → Decoder (using the lexicon, the acoustic models, and the language model) → Words]
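The dynamic programming at the heart of a decoder can be illustrated with the Viterbi algorithm over the states of a single HMM. This is a sketch only: a real decoder searches over word sequences under the lexicon and language model constraints, and the three-state model and probabilities below are invented.

```python
def viterbi(initial, transitions, obs_probs):
    """Return the most likely state path and its probability.

    initial[j]        : probability of starting in state j
    transitions[i][j] : probability of moving from state i to state j
    obs_probs[t][j]   : probability of frame t in state j
    """
    n = len(initial)
    score = [initial[j] * obs_probs[0][j] for j in range(n)]
    backpointers = []
    for frame in obs_probs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best predecessor of state j at this frame.
            best_i = max(range(n), key=lambda i: score[i] * transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] * transitions[best_i][j] * frame[j])
        score = new_score
        backpointers.append(pointers)
    state = max(range(n), key=lambda j: score[j])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score


initial = [1.0, 0.0, 0.0]
transitions = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
obs_probs = [
    [0.9, 0.1, 0.1],
    [0.2, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]
path, best_score = viterbi(initial, transitions, obs_probs)
print(path, best_score)
```

Production decoders work with log probabilities and prune unlikely paths; plain probabilities are used here only to keep the example short.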

Our decoder

● Based on weighted finite-state transducers (WFST).
● The lexicon, the language model, and the acoustic model are composed into a single structure.
– Same information, but more efficient.

[Diagram: lexicon, acoustic models, and language model combined into a single structure]

Composition of WFST: example

[Figure: lexicon and language model transducers; the arc labels that survive are B:Bob, ah:, b:, l:likes, ay:, k:]
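The idea of composition can be shown with a toy implementation for transducers without epsilon labels. This is a sketch under strong assumptions: the dictionary-based representation is invented for illustration, weights are treated as costs that add along a path, and each word is reduced to a single hypothetical input symbol. Real lexicon transducers consume several phonemes per word and need epsilon handling, which is omitted here.

```python
def compose(t1, t2):
    """Compose two epsilon-free weighted transducers.

    A transducer is a dict with "start", "finals" (a set) and "arcs",
    a list of (source, input, output, cost, destination) tuples.
    An arc of t1 pairs with an arc of t2 when t1's output equals t2's input.
    """
    start = (t1["start"], t2["start"])
    arcs, finals = [], set()
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        q1, q2 = state
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals.add(state)
        for s1, i1, o1, w1, d1 in t1["arcs"]:
            if s1 != q1:
                continue
            for s2, i2, o2, w2, d2 in t2["arcs"]:
                if s2 == q2 and i2 == o1:
                    nxt = (d1, d2)
                    arcs.append((state, i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


# Toy "lexicon": one hypothetical symbol per word.
lexicon = {
    "start": 0,
    "finals": {0},
    "arcs": [(0, "b", "Bob", 0.0, 0), (0, "l", "likes", 0.0, 0)],
}
# Toy "language model": accepts only "Bob likes", with invented costs.
grammar = {
    "start": 0,
    "finals": {2},
    "arcs": [(0, "Bob", "Bob", 0.5, 1), (1, "likes", "likes", 1.0, 2)],
}
composed = compose(lexicon, grammar)
print(composed["arcs"])
```

The composed machine reads the lexicon's input symbols and emits the grammar's words, so a single structure carries both sets of constraints.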

Data collection

● Speech samples taken from the field.
● Manual transcriptions:
– Speaker features: gender, native,...
– Anomalies in the pronunciation.
– Noises in the recording.

Manual transcriptions

● 600k recordings.
● Uncompressed format: 8 kHz and 16 kHz.
● 286020 different speakers.

Speaker feature   Percentage (%)
Native            87.7
Male              83.3
Female            8.5
Child             8.2

Manual transcriptions

● Percentage of records without anomalies: 7.4%

Anomalies                  Percentage (%)
side_speech                14.4
speech-in-noise            71.5
Indistinguishable          3.7
mouth_noise                3.6
breath_noise               6.3
Irregular pronunciations   5.3
Hesitations                0.5
Fragments                  5.5
Transient noise            14.0
Foreign words              0.1

Manual transcriptions: examples

● марциальные воды male, native
● *трёx#пруд#ньій* male, native, speech-in-noise
● [side_speech] чкалова male, native, speech-in-noise, bad-audio тр

VisualQA

Experiments

Grapheme-2-phoneme

● Sequitur:
– Based on joint-sequence models.
– Accuracy => 2.09% phoneme error rate.
● Phonetisaurus:
– Based on WFSTs.
– Accuracy => 1.04% phoneme error rate.
● Special treatment for words in Latin script:
– G2P trained on a transliterated version of the Russian pronunciation (for example: whatsapp => уотсап).
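A phoneme error rate of this kind is commonly computed as the edit distance between the reference and the predicted phoneme sequences, divided by the number of reference phonemes. The sketch below follows that common definition, which is an assumption since the slides do not state how the figures were obtained. The test pairs are invented, using the phone symbols of the dictionary excerpt.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions turning ref into hyp."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def phoneme_error_rate(pairs):
    """Percentage of phoneme errors over all (reference, hypothesis) pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / total


# Invented G2P outputs: one substitution in the first word, none in the second.
pairs = [
    ("a dc b a z a".split(), "a dc b a s a".split()),
    ("a tc p".split(), "a tc p".split()),
]
per = phoneme_error_rate(pairs)
print(round(per, 2))
```

The same distance computed over words instead of phonemes gives the word error rate used to evaluate the recognizer.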

Noise models

Experiments: acoustic model vs. language model

Experiments: number of Gaussians

Results

Users: Navigator

● Results are given relative to our word error rate (WER) in each experiment (in red, the experiments in which our system is outperformed):

Results: relative word error rate

              Maps     Navigation   General search
Yandex-GMM    1        1            1
3rd Party     44.6%    31.8%        37.3%
Competitor    1.9%     -9.7%        -23.4%

              General search
Yandex-DNN    1
Competitor    6.6%
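The slides do not give the formula behind these percentages. One plausible reading, stated here as an assumption, is the relative difference between another system's WER and the Yandex WER, so that a positive value means the other system makes more errors and a negative value means it makes fewer. The absolute WER values in the example are invented, since the talk reports only relative figures.

```python
def relative_wer(other_wer, own_wer):
    """Relative WER difference in percent, taking own_wer as the reference."""
    return 100.0 * (other_wer - own_wer) / own_wer


# Hypothetical absolute WERs.
worse_system = relative_wer(28.92, 20.0)
better_system = relative_wer(18.0, 20.0)
print(round(worse_system, 1), round(better_system, 1))
```

Under this reading, the negative entries in the table correspond to experiments in which the competitor outperforms the Yandex system.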

Thanks for your attention!

Fran Campillo, PhD
Senior Software Engineer
Yandex Speech Group
francampillo@yandex-team.ru
