"automatic speech recognition for mobile applications in yandex" — fran campillo,...

Post on 05-Dec-2014


Category:

Technology



DESCRIPTION

This talk describes the work developed by the Yandex Speech Group in the last two years. Beginning from scratch, large amounts of voice recordings were collected from the field of application, and the most popular open source speech projects were studied to get a thorough understanding of the problem and to gather ideas to build our own technology. This talk will present key experiments and their results, as well as our latest achievements in automatic speech recognition in Russian. Currently, the Yandex Speech Group provides three different services in Russian: maps, navigation, and general search, with a performance that is comparable to competitor products.

TRANSCRIPT

1

2

Automatic speech recognition for mobile applications in Yandex

Fran Campillo

3

Outline

● Motivation.
● Road map.
● Automatic speech recognition.
● Data collection.
● Experiments.
● Results.

4

Motivation

5

Motivation

6

Road map

7

Road map

● Sep-2011: study of open source tools and data collection.
– HTK, Sphinx, Rasr, Kaldi,...
– Service provided by 3rd party.
● Jan-2012: development of in-house technology.
● Jan-2013: launching of own services.

8

Automatic speech recognition

9

ASR: complexity

Style               Planned    Spontaneous
Audio quality       CD         Telephone
Vocabulary size     Hundreds   Hundreds of thousands
Number of speakers  One        Many
Recognition rate    Better     Worse
Complexity          Smaller    Bigger

10

Word pronunciations

● ASR: sounds => words.
● How is a word pronounced?
– Line => /'laɪn/.
– Linear => /'lɪnɪəʳ/.
● Need a mapping from writing to phonemes: grapheme-to-phoneme conversion (G2P).

11

Word pronunciations: dictionary

а a
аб a tc p
абад a dc b a tc t
абаза a dc b a z a
абакан a dc b a tc k ax n
абакана a dc b a tc k a n a
абакане a dc b a tc k a nj e
абаканская a dc b a tc k a n s tc k ax j a
абаканский a dc b a tc k a n s tc kj I j
абакумова a dc b a tc k u m ax v a
абанский a dc b a n s tc kj I j
абганеровская a dc b dc g ax nj I r ax f s tc k ax j a
абдулино a dc b dc dK& u lj i n a
абельмановская a dc bj e lj m ax n ax f s tc k ax j a
абзаково a dc b z a tc k o v a
абзелиловский a dc b zj i lj i l ax f s tc kj I j
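Each dictionary entry above is a word followed by its phoneme sequence, separated by whitespace. As a minimal sketch of how such a file could be loaded (the function name `parse_lexicon` and the one-entry-per-line layout are assumptions for illustration, not the format actually used by the Yandex Speech Group):

```python
def parse_lexicon(text):
    """Parse 'word phoneme phoneme ...' lines into {word: [pronunciation, ...]}.

    A word may appear on several lines, so each word maps to a list of
    pronunciations (each pronunciation is a list of phoneme symbols).
    """
    lexicon = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phonemes = fields[0], fields[1:]
        lexicon.setdefault(word, []).append(phonemes)
    return lexicon


# Three entries copied from the slide.
SAMPLE = """\
а a
аб a tc p
абад a dc b a tc t
"""

lexicon = parse_lexicon(SAMPLE)
print(lexicon["абад"])
```

Keeping a list of pronunciations per word leaves room for pronunciation variants, which a real lexicon usually needs.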

12

Speech parametrization

[Figure: parametrization of phone /a/ and phone /i/.]

13

ASR: the problem

● We have a sequence of observations:
– $O = \{o_1, o_2, \ldots, o_T\}$
– $o_i$ is a feature vector representing a speech frame.
● Goal: finding the likeliest sequence of words $w_i$ for $O$:

$$\arg\max_i P(w_i \mid O)$$

14

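ASR: the problem (II)

● We cannot compute $P(w_i \mid O)$ directly.
● Bayes:

$$P(w_i \mid O) = \frac{P(O \mid w_i)\, P(w_i)}{P(O)}$$

● $P(O)$ does not depend on $w_i$, so it can be dropped from the maximization:

$$\arg\max_i P(w_i \mid O) = \arg\max_i \{P(O \mid w_i)\, P(w_i)\}$$

– $P(O \mid w_i)$: acoustic model.
– $P(w_i)$: language model.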
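The decision rule above can be illustrated with a few lines of code. This is only a sketch: the two hypotheses and all scores are made-up numbers, and working in the log domain (so the product becomes a sum) is a common practical choice assumed here, not something stated on the slide.

```python
import math


def best_hypothesis(hypotheses):
    """Return the word sequence maximizing log P(O|w) + log P(w)."""
    best = max(hypotheses, key=lambda h: h["log_acoustic"] + h["log_lm"])
    return best["words"]


# Hypothetical scores: the second hypothesis fits the audio slightly better,
# but the language model considers it far less likely.
hypotheses = [
    {"words": "we will rock you", "log_acoustic": -120.0, "log_lm": math.log(1e-6)},
    {"words": "will will rock you", "log_acoustic": -119.5, "log_lm": math.log(1e-9)},
]

print(best_hypothesis(hypotheses))
```

The example shows why both models matter: the acoustic score alone would pick the second hypothesis, while the combined score picks the first.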

15

Language model

● Probability of sequences of words:
– “We will rock you” => P1.
– “Will will rock you” => P2.
● Trained on large corpora.
● The closer to the application domain, the better.
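To make the idea of assigning probabilities to word sequences concrete, here is a minimal sketch of a bigram model with add-one smoothing. The tiny corpus, the bigram order, and the smoothing method are all assumptions chosen for brevity; the talk does not say which kind of language model was used.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and bigrams, with <s> and </s> as sentence boundaries."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams


def sentence_prob(sentence, contexts, bigrams):
    """Probability of a sentence under the bigram model with add-one smoothing."""
    vocab_size = len(set(contexts) | {"</s>"})
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
    return prob


# Hypothetical three-sentence training corpus.
corpus = ["we will rock you", "we will win", "you will rock"]
contexts, bigrams = train_bigram(corpus)

p1 = sentence_prob("We will rock you", contexts, bigrams)
p2 = sentence_prob("Will will rock you", contexts, bigrams)
print(p1 > p2)
```

Even with this toy corpus, the sentence that resembles the training data gets the higher probability, which is the effect the slide relies on when it recommends training close to the application domain.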

16

Acoustic model: Hidden Markov Models

● HMM of first order: sequence of states that depend only on the state before, and are associated with events we can observe.
● Typical layout for ASR: three states, left to right.

Q1 => Q2 => Q3
Self-loops: a11, a22, a33. Forward transitions: a12, a23. Observation probabilities: b1(o), b2(o), b3(o).

● aij: transition probabilities.
● bj(o): probability of observation o in state j.

17

Acoustic model: HMM and speech

● Each state models a part of the phoneme:
– 1st: beginning of the phoneme.
– 2nd: stationary part.
– 3rd: end of the phoneme.
● aij: model the duration of each part.
● bj(o): probability of producing a vector of features o in state j.
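The link between transition probabilities and duration can be made explicit: with a self-loop probability a_ii, the number of frames spent in state i follows a geometric distribution with mean 1 / (1 - a_ii). The sketch below computes that mean for a three-state left-to-right model; the transition values and the 10 ms frame shift are assumptions for illustration only.

```python
def expected_duration_frames(self_loop_prob):
    """Mean number of frames spent in a state with the given self-loop probability."""
    if not 0.0 <= self_loop_prob < 1.0:
        raise ValueError("self-loop probability must be in [0, 1)")
    return 1.0 / (1.0 - self_loop_prob)


# Hypothetical transition matrix for states Q1, Q2, Q3 (rows: from, columns: to).
# The missing mass in the last row is the probability of leaving the phoneme.
TRANSITIONS = [
    [0.5, 0.5, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 0.6],
]
FRAME_SHIFT_MS = 10  # assumed frame shift

for state, row in enumerate(TRANSITIONS):
    frames = expected_duration_frames(row[state])
    print(f"Q{state + 1}: {frames:.1f} frames, about {frames * FRAME_SHIFT_MS:.0f} ms")
```

In this example the stationary middle state has the largest self-loop probability and therefore the longest expected duration.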

18

Modeling probability of observation

● Gaussian mixtures:

$$b_j(x) = \sum_m c_{jm}\, \mathcal{N}(x, \mu_{jm}, \Sigma_{jm})$$

– $c_{jm}$: weight of the $m$th Gaussian of state $j$.
– $\mu_{jm}$: mean (vector) of the $m$th Gaussian of state $j$.
– $\Sigma_{jm}$: covariance matrix of the $m$th Gaussian of state $j$.
● Neural networks.
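The Gaussian mixture formula can be evaluated directly. The following is a minimal sketch that assumes diagonal covariance matrices (a common simplification, not confirmed by the slides), so each covariance is given as a list of per-dimension variances; all parameter values are made up.

```python
import math


def log_gaussian_diag(x, mean, variances):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, variances)
    )


def gmm_likelihood(x, weights, means, variances):
    """b_j(x) = sum over m of c_jm * N(x, mu_jm, Sigma_jm) for one state j."""
    return sum(
        c * math.exp(log_gaussian_diag(x, mu, var))
        for c, mu, var in zip(weights, means, variances)
    )


# Hypothetical two-component mixture over 2-dimensional feature vectors.
weights = [0.7, 0.3]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]

print(gmm_likelihood([0.1, -0.2], weights, means, variances))
```

A production system would usually stay in the log domain and combine the components with a log-sum-exp to avoid underflow on high-dimensional features.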

19

Waveform, phonemes, frames, and states

[Figure: waveform of the phoneme /o/ over time t, split into frames o1 to o10 and aligned to the HMM states Q1, Q2, Q3.]

Q1 => o1, o2 (GMM parameters μ1m, Σ1m, c1m)
Q2 => o3, o4, o5, o6, o7 (GMM parameters μ2m, Σ2m, c2m)
Q3 => o8, o9, o10 (GMM parameters μ3m, Σ3m, c3m)

20

Block diagram for training

● Prototype HMM => Initialization: initial μjm, Σjm, cjm for the GMMs.
● Training sentences => Baum-Welch: alignments of the training sentences (observations to states).
● HMM parameters update: new estimations for μjm, Σjm, cjm.
● Convergence? No => back to Baum-Welch. Yes => trained models.

21

Decoding

● Lexicon: words that can be recognized.
● Decoder: dynamic programming, with the constraints imposed by the lexicon, the acoustic models, and the language model.

[Diagram: Speech signal => Parametrize => Decoder => Words. The decoder takes the lexicon, the acoustic models, and the language model as inputs.]
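The dynamic programming at the heart of decoding can be sketched with the Viterbi algorithm on a single HMM. This is a deliberately reduced illustration: it finds the best state sequence for one three-state model and ignores the lexicon and the language model, and all probabilities are made-up values. It is not the decoder described in the talk.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi(trans, emit, init):
    """Most likely state path given transition, per-frame emission and initial probs."""
    n_frames, n_states = len(emit), len(init)
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    for j in range(n_states):
        score[0][j] = safe_log(init[j]) + safe_log(emit[0][j])
    for t in range(1, n_frames):
        for j in range(n_states):
            # Best predecessor state for being in state j at frame t.
            prev = max(range(n_states),
                       key=lambda i: score[t - 1][i] + safe_log(trans[i][j]))
            score[t][j] = (score[t - 1][prev] + safe_log(trans[prev][j])
                           + safe_log(emit[t][j]))
            back[t][j] = prev
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: score[-1][j])
    path = [state]
    for t in range(n_frames - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]


# Hypothetical left-to-right model and emission probabilities for 5 frames.
trans = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
init = [1.0, 0.0, 0.0]
emit = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.70, 0.20],
    [0.05, 0.05, 0.90],
]

print(viterbi(trans, emit, init))  # state indices 0, 1, 2 stand for Q1, Q2, Q3
```

The returned path assigns each frame to a state, which is the same kind of frame-to-state alignment shown earlier for the phoneme /o/.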

22

Our decoder

● Based on Weighted Finite State Transducers (WFST).
● The lexicon, the language model, and the acoustic model are composed into a single structure.
– Same information, but more efficient.

[Diagram: Lexicon + Acoustic models + Language model => HCLG.]

23

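Composition of WFST: example

[Figure: a lexicon transducer and a language model transducer. The lexicon transducer has states 0 to 6 and arcs labelled input:output: B:Bob, ah:, b:, l:likes, ay:, k:, s:. Only the first phoneme of each word carries an output word.]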

24

Data collection

25

Data collection

● Speech samples taken from the field.
● Manual transcriptions:
– Speaker features: gender, native,...
– Anomalies in the pronunciation.
– Noises in the recording.

26

Manual transcriptions

● 600k recordings.
● Uncompressed format: 8 kHz and 16 kHz.
● 286020 different speakers.

Speaker feature   Percentage (%)
Native            87.7
Male              83.3
Female            8.5
Child             8.2

27

Manual transcriptions

● Percentage of records without anomalies: 7.4%

Anomalies                 Percentage (%)
side_speech               14.4
speech-in-noise           71.5
Indistinguishable         3.7
mouth_noise               3.6
breath_noise              6.3
Irregular pronunciations  5.3
Hesitations               0.5
Fragments                 5.5
Transient noise           14.0
Foreign words             0.1

28

Manual transcriptions: examples

● марциальные воды male, native
● *трёx#пруд#ньій* male, native, speech-in-noise
● [side_speech] чкалова male, native, speech-in-noise, bad-audio тр

29

VisualQA

30

Experiments

31

Grapheme-2-phoneme

● Sequitur:
– Based on joint-sequence models.
– Accuracy => 2.09% phoneme error rate.
● Phonetisaurus:
– WFST.
– Accuracy => 1.04% phoneme error rate.
● Special treatment for words in Latin script:
– G2P trained on a transliterated version of the Russian pronunciation (for example: whatsapp => уотсап).
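The phoneme error rates quoted for the two G2P tools are typically obtained by comparing predicted phoneme sequences with reference pronunciations. The sketch below assumes the usual definition (edit distance divided by the number of reference phonemes); the talk does not spell out its exact formula, and the predictions here are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def phoneme_error_rate(pairs):
    """Percentage of phoneme errors over (reference, hypothesis) pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / total


# References taken from the dictionary slide; hypotheses are hypothetical G2P output.
pairs = [
    ("a dc b a tc t".split(), "a dc b a tc p".split()),  # one substitution
    ("a tc p".split(), "a tc p".split()),                # exact match
]

print(f"{phoneme_error_rate(pairs):.2f}% phoneme error rate")
```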

32

Noise models

33

Experiments: acoustic model vs. language model

34

Experiments: number of Gaussians

35

Results

36

Users: Navigator

37

Results: relative word error rate

● Results relative to our WER in each experiment (in red on the slide, experiments in which our system is outperformed):

             Maps    Navigation  General search
Yandex-GMM   1       1           1
3rd Party    44.6%   31.8%       37.3%
Competitor   1.9%    -9.7%       -23.4%

             General search
Yandex-DNN   1
Competitor   6.6%
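One plausible reading of these tables is that each percentage is the relative difference between another system's WER and the Yandex WER on the same test set, so that a negative value means the other system makes fewer errors. The slide does not give the formula, so the sketch below is an assumption, and the absolute WER values in it are invented.

```python
def relative_wer_difference(wer_other, wer_ours):
    """Relative WER difference in percent; positive means the other system is worse."""
    return 100.0 * (wer_other - wer_ours) / wer_ours


# Hypothetical absolute WERs (in %), not figures from the talk.
print(relative_wer_difference(30.0, 20.0))  # other system has 50% more errors
print(relative_wer_difference(18.0, 20.0))  # other system has 10% fewer errors
```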

38

Thanks for your attention!

39

Fran Campillo, PhD
Senior Software Engineer
Yandex Speech Group
francampillo@yandex-team.ru
