"automatic speech recognition for mobile applications in yandex" — fran campillo,...


Upload: yandex

Post on 05-Dec-2014


DESCRIPTION

This talk describes the work done by the Yandex Speech Group over the last two years. Starting from scratch, the group collected large amounts of voice recordings from the application domain and studied the most popular open source speech projects, both to gain a thorough understanding of the problem and to gather ideas for building our own technology. The talk presents key experiments and their results, as well as our latest achievements in automatic speech recognition for Russian. Currently, the Yandex Speech Group provides three services in Russian (maps, navigation, and general search) with performance comparable to competitor products.

TRANSCRIPT

Page 1: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс


Outline

● Motivation.
● Road map.
● Automatic speech recognition.
● Data collection.
● Experiments.
● Results.

Motivation

Road map

● Sep-2011: study of open source tools and data collection.
– HTK, Sphinx, RASR, Kaldi,...
– Service provided by 3rd party.
● Jan-2012: development of in-house technology.
● Jan-2013: launch of our own services.

Automatic speech recognition

ASR: complexity

Style               Planned    Spontaneous
Audio quality       CD         Telephone
Vocabulary size     Hundreds   Hundreds of thousands
Number of speakers  One        Many
Recognition rate    Better     Worse
Complexity          Smaller    Bigger

Word pronunciations

● ASR: sounds => words.
● How is a word pronounced?
– Line => /ˈlaɪn/.
– Linear => /ˈlɪnɪəʳ/.
● Need a mapping from writing to phonemes: grapheme-to-phoneme conversion (G2P).

Word pronunciations: dictionary

а a
аб a tc p
абад a dc b a tc t
абаза a dc b a z a
абакан a dc b a tc k ax n
абакана a dc b a tc k a n a
абакане a dc b a tc k a nj e
абаканская a dc b a tc k a n s tc k ax j a
абаканский a dc b a tc k a n s tc kj I j
абакумова a dc b a tc k u m ax v a
абанский a dc b a n s tc kj I j
абганеровская a dc b dc g ax nj I r ax f s tc k ax j a
абдулино a dc b dc dK& u lj i n a
абельмановская a dc bj e lj m ax n ax f s tc k ax j a
абзаково a dc b z a tc k o v a
абзелиловский a dc b zj i lj i l ax f s tc kj I j
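A dictionary in this layout (a word followed by its space-separated phone symbols) is easy to load into a lookup table. The following is a minimal sketch, assuming one entry per line and allowing several pronunciations per word; the function name `parse_lexicon` is hypothetical and not part of the talk.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to a list of pronunciations (each a list of phone symbols)."""
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, phones = fields[0], fields[1:]
        lexicon[word].append(phones)
    return dict(lexicon)


# Entries copied from the dictionary slide.
sample = [
    "а a",
    "аб a tc p",
    "абад a dc b a tc t",
    "абаза a dc b a z a",
]
lexicon = parse_lexicon(sample)
print(lexicon["абад"])
```

Keeping a list of pronunciations per word leaves room for pronunciation variants, which a real lexicon usually contains.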

Speech parametrization

Figure panels: Phone /a/, Phone /i/.

ASR: the problem

● We have a sequence of observations:
– $O = \{o_1, o_2, \ldots, o_T\}$
– $o_i$ is a feature vector representing a speech frame.
● Goal: find the likeliest sequence of words $w_i$ for $O$:

$$\arg\max_i P(w_i / O)$$

ASR: the problem (II)

● We cannot compute $P(w_i / O)$ directly.
● Bayes:

$$P(w_i / O) = \frac{P(O / w_i)\,P(w_i)}{P(O)}$$

● $P(O)$ does not depend on $w_i$, so it can be dropped from the maximization:

$$\arg\max_i P(w_i / O) = \arg\max_i \{P(O / w_i)\,P(w_i)\}$$

● $P(O / w_i)$ is the acoustic model; $P(w_i)$ is the language model.
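The decision rule can be illustrated with a small sketch that combines the two scores in the log domain, where the product becomes a sum. The hypotheses and all probability values below are hypothetical, chosen only to show that a weak language model score can outweigh a slightly better acoustic score.

```python
import math


def best_hypothesis(acoustic_logp, lm_logp):
    """Return the hypothesis w maximizing log P(O/w) + log P(w)."""
    return max(acoustic_logp, key=lambda w: acoustic_logp[w] + lm_logp[w])


# Hypothetical scores for illustration only (not from the talk).
acoustic_logp = {
    "we will rock you": math.log(0.020),
    "will will rock you": math.log(0.025),
}
lm_logp = {
    "we will rock you": math.log(0.010),
    "will will rock you": math.log(0.0001),
}

best = best_hypothesis(acoustic_logp, lm_logp)
print(best)
```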

Language model

● Probability of sequences of words:
– “We will rock you” => $P_1$.
– “Will will rock you” => $P_2$.
● Trained on large corpora.
● The closer to the application domain, the better.
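As a rough illustration of how such probabilities can be estimated from text, here is a sketch of a bigram model with add-one smoothing. The talk does not say which kind of language model is used, so the bigram order, the smoothing, and the three-sentence corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams, with <s> and </s> as sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])                   # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))   # (previous, current) counts
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Hypothetical toy corpus; a real model is trained on large in-domain corpora.
corpus = ["we will rock you", "we will win", "you will rock"]
unigrams, bigrams = train_bigram(corpus)
vocab = {w for s in corpus for w in s.split()} | {"</s>"}

p1 = log_prob("we will rock you", unigrams, bigrams, len(vocab))
p2 = log_prob("will will rock you", unigrams, bigrams, len(vocab))
print(p1 > p2)
```

With this corpus the first sentence scores higher than the second, because "we" often starts a sentence and "will will" never occurs.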

Acoustic model: Hidden Markov Models

● First-order HMM: a sequence of states in which each state depends only on the previous one, and each state is associated with events we can observe.
● Typical layout for ASR: three states Q1 → Q2 → Q3, left to right, with self-loops $a_{11}$, $a_{22}$, $a_{33}$, forward transitions $a_{12}$, $a_{23}$, and observation probabilities $b_1(o)$, $b_2(o)$, $b_3(o)$.
● $a_{ij}$: transition probabilities.
● $b_j(o)$: probability of observation $o$ in state $j$.
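To make the roles of the transition and observation probabilities concrete, the sketch below computes the likelihood of a short observation sequence with the forward algorithm on a three-state left-to-right model. The numeric values are hypothetical, and the observation probabilities are given already evaluated per frame, so the example does not depend on how they are modeled.

```python
def forward_likelihood(trans, emis):
    """Forward algorithm for P(O / model).

    trans[i][j] is a_ij; emis[t][j] is b_j(o_t), already evaluated per frame.
    The model is assumed to start in the first state and end in the last one.
    """
    n_states = len(trans)
    alpha = [0.0] * n_states
    alpha[0] = emis[0][0]  # start in Q1
    for t in range(1, len(emis)):
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emis[t][j]
            for j in range(n_states)
        ]
    return alpha[-1]  # end in Q3


# Left-to-right layout: only a11, a12, a22, a23, a33 are non-zero.
# All values are hypothetical.
trans = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
emis = [
    [0.9, 0.1, 0.1],
    [0.2, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]
likelihood = forward_likelihood(trans, emis)
print(likelihood)
```

With three frames and three states only the path Q1, Q2, Q3 is possible, so the result equals 0.9 · 0.4 · 0.8 · 0.3 · 0.7.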

Acoustic model: HMM and speech

● Each state models a part of the phoneme:
– 1st: beginning of the phoneme.
– 2nd: stationary part.
– 3rd: end of the phoneme.
● $a_{ij}$: duration of each part.
● $b_j(o)$: probability of producing a feature vector $o$ in state $j$.
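The link between transition probabilities and duration can be made explicit. In a left-to-right HMM the model stays in state $j$ with probability $a_{jj}$ at each frame, so the number of frames $d$ spent in that state follows a geometric distribution:

```latex
P(d) = a_{jj}^{\,d-1}\,(1 - a_{jj}), \qquad d = 1, 2, \ldots
\qquad\Longrightarrow\qquad
E[d] = \frac{1}{1 - a_{jj}}
```

For example, a hypothetical value $a_{22} = 0.8$ gives an expected stay of $1 / (1 - 0.8) = 5$ frames in the stationary part.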

Modeling probability of observation

● Gaussian mixtures:

$$b_j(x) = \sum_m c_{jm}\,N(x, \mu_{jm}, \Sigma_{jm})$$

– $c_{jm}$: weight of the $m$th Gaussian of state $j$.
– $\mu_{jm}$: mean vector of the $m$th Gaussian of state $j$.
– $\Sigma_{jm}$: covariance matrix of the $m$th Gaussian of state $j$.
● Neural networks.
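The mixture formula can be evaluated directly. The sketch below assumes diagonal covariance matrices, a common simplification that the talk does not confirm, and uses hypothetical parameter values for a two-component mixture.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (var holds the diagonal)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_likelihood(x, weights, means, variances):
    """b_j(x) = sum_m c_jm * N(x, mu_jm, Sigma_jm) for one state j."""
    return sum(
        c * math.exp(log_gaussian_diag(x, mu, var))
        for c, mu, var in zip(weights, means, variances)
    )


# Hypothetical two-component mixture over 2-dimensional features.
weights = [0.4, 0.6]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

b = gmm_likelihood([0.0, 0.0], weights, means, variances)
print(b)
```

Production systems usually keep the whole computation in the log domain to avoid underflow with high-dimensional features; the exponentiation here is only to mirror the formula.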

Waveform, phonemes, frames, and states

Figure: the waveform of phoneme /o/ is split along the time axis $t$ into frames $o_1, \ldots, o_{10}$, which are aligned to the three HMM states:
– Q1 => $o_1, o_2$ (GMM parameters $\mu_{1m}$, $\Sigma_{1m}$, $c_{1m}$).
– Q2 => $o_3, o_4, o_5, o_6, o_7$ (GMM parameters $\mu_{2m}$, $\Sigma_{2m}$, $c_{2m}$).
– Q3 => $o_8, o_9, o_{10}$ (GMM parameters $\mu_{3m}$, $\Sigma_{3m}$, $c_{3m}$).

Block diagram for training

● Inputs: prototype HMM and training sentences.
● Initialization: initial $\mu^0_{jm}$, $\Sigma^0_{jm}$, $c^0_{jm}$ for the GMMs.
● Baum-Welch: alignments of the training sentences (observations to states).
● HMM parameters update: new estimations for $\mu_{jm}$, $\Sigma_{jm}$, $c_{jm}$.
● Convergence?
– No: back to Baum-Welch.
– Yes: trained models.
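The shape of this loop (initialize, align softly, re-estimate, test convergence) can be sketched on a much smaller problem. The code below is not Baum-Welch: it assumes the frames have already been assigned to a single state and that features are scalars, and it re-estimates only that state's GMM with EM. All data and initial values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def train_gmm(frames, means, variances, weights, tol=1e-6, max_iter=200):
    """EM re-estimation of the weights, means, and variances of one state's GMM."""
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: posterior probability of each Gaussian for each frame.
        posts, ll = [], 0.0
        for x in frames:
            joint = [c * normal_pdf(x, m, v) for c, m, v in zip(weights, means, variances)]
            total = sum(joint)
            ll += math.log(total)
            posts.append([p / total for p in joint])
        # Convergence test on the log-likelihood of the training data.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: new estimates for the weights, means, and variances.
        for m in range(len(weights)):
            occ = sum(post[m] for post in posts)
            weights[m] = occ / len(frames)
            means[m] = sum(post[m] * x for post, x in zip(posts, frames)) / occ
            variances[m] = max(
                sum(post[m] * (x - means[m]) ** 2 for post, x in zip(posts, frames)) / occ,
                1e-3,  # variance floor to avoid collapse
            )
    return weights, means, variances


# Hypothetical scalar "frames" grouped around two values.
frames = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
weights, means, variances = train_gmm(frames, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(means)
```

In full HMM training the posteriors in the E-step are computed jointly over states and Gaussians by the forward-backward recursions, but the update formulas for the means, covariances, and weights have the same weighted-average form.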

Decoding

● Lexicon: words that can be recognized.
● Decoder: dynamic programming, with the constraints imposed by the lexicon, the acoustic models, and the language model.

Diagram: Speech signal → Parametrize → Decoder → Words. The decoder takes the lexicon, the acoustic models, and the language model as inputs.
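The dynamic programming at the core of a decoder is typically the Viterbi algorithm. The sketch below applies it to a single three-state left-to-right HMM to recover the best frame-to-state alignment; a real decoder runs the same recursion over a much larger search space built from the lexicon and the language model. All probabilities are hypothetical.

```python
import math


def viterbi(log_trans, log_emis):
    """Best state path through an HMM, starting in state 0 and ending in the last state."""
    n = len(log_trans)
    score = [-math.inf] * n
    score[0] = log_emis[0][0]
    backptr = []
    for t in range(1, len(log_emis)):
        new_score, pointers = [], []
        for j in range(n):
            # Best predecessor of state j at time t.
            best_i = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emis[t][j])
            pointers.append(best_i)
        score = new_score
        backptr.append(pointers)
    # Trace back from the final state.
    path = [n - 1]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    return list(reversed(path)), score[-1]


def log_matrix(rows):
    return [[math.log(p) if p > 0 else -math.inf for p in row] for row in rows]


# Hypothetical transition and per-frame observation probabilities.
trans = log_matrix([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
emis = log_matrix([
    [0.9, 0.1, 0.1],
    [0.8, 0.2, 0.1],
    [0.1, 0.8, 0.2],
    [0.1, 0.7, 0.3],
    [0.1, 0.1, 0.9],
])
path, log_score = viterbi(trans, emis)
print(path)
```

The returned path assigns each of the five frames to a state index, which is the same kind of alignment used when training on observations-to-states alignments.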

Our decoder

● Based on weighted finite-state transducers (WFST).
● The lexicon, the language model, and the acoustic model are composed into a single structure (HCLG).
– Same information, but more efficient.

Diagram: Lexicon + Acoustic models + Language model → HCLG.

Composition of WFST: example

Lexicon (figure): states 0 to 6, with arcs labeled input:output: B:Bob, ah:, b:, l:likes, ay:, k:, s:.

Language model (figure).

Data collection

● Speech samples taken from the field.
● Manual transcriptions:
– Speaker features: gender, native,...
– Anomalies in the pronunciation.
– Noises in the recording.

Manual transcriptions

● 600k recordings.
● Uncompressed format: 8 kHz and 16 kHz.
● 286020 different speakers.

Speaker feature  Percentage (%)
Native           87.7
Male             83.3
Female           8.5
Child            8.2

Manual transcriptions

● Percentage of records without anomalies: 7.4%

Anomalies                 Percentage (%)
side_speech               14.4
speech-in-noise           71.5
Indistinguishable         3.7
mouth_noise               3.6
breath_noise              6.3
Irregular pronunciations  5.3
Hesitations               0.5
Fragments                 5.5
Transient noise           14.0
Foreign words             0.1

Manual transcriptions: examples

● марциальные воды: male, native
● *трёx#пруд#ньій*: male, native, speech-in-noise
● [side_speech] чкалова: male, native, speech-in-noise, bad-audio тр

VisualQA

Experiments

Grapheme-2-phoneme

● Sequitur:
– Based on joint-sequence models.
– Accuracy => 2.09% phoneme error rate.
● Phonetisaurus:
– WFST.
– Accuracy => 1.04% phoneme error rate.
● Special treatment for words written in Latin script:
– G2P trained on a transliterated version of the Russian pronunciation (for example: whatsapp => уотсап).
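Phoneme error rate is usually computed as the edit distance between predicted and reference phoneme sequences, divided by the total reference length. The sketch below follows that common definition; the talk does not state the exact scoring procedure, and the hypothesis pronunciations are made up for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def phoneme_error_rate(pairs):
    """Total edit distance divided by total reference length, in percent."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    length = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / length


# References taken from the dictionary slide; the hypotheses are made up.
pairs = [
    ("a dc b a z a".split(), "a dc b a z a".split()),
    ("a dc b a tc k ax n".split(), "a dc b a tc k a n".split()),
]
per = phoneme_error_rate(pairs)
print(per)
```

Here one substitution over 14 reference phonemes gives a rate of about 7.1%.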

Noise models

Experiments: acoustic model vs. language model

Experiments: number of Gaussians

Results

Users: Navigator

Results: relative word error rate

● Results relative to our WER in each experiment (in red, experiments in which our system is outperformed):

             Maps    Navigation  General search
Yandex-GMM   1       1           1
3rd Party    44.6%   31.8%       37.3%
Competitor   1.9%    -9.7%       -23.4%

             General search
Yandex-DNN   1
Competitor   6.6%
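One plausible reading of these figures, stated here as an assumption because the slide does not give the formula, is the relative difference between another system's WER and ours: positive when the other system makes more errors, negative when it outperforms ours. A minimal sketch with hypothetical absolute WERs:

```python
def relative_wer_change(wer_other, wer_ours):
    """Relative WER difference of another system with respect to ours, in percent."""
    return 100.0 * (wer_other - wer_ours) / wer_ours


# Hypothetical absolute WERs; the talk reports only relative figures.
print(relative_wer_change(wer_other=0.30, wer_ours=0.20))
print(relative_wer_change(wer_other=0.18, wer_ours=0.20))
```

Under this reading, a value such as 44.6% would mean the other system's WER is 44.6% higher than ours on that task.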

Thanks for your attention!

Fran Campillo
Senior Software Engineer
Yandex Speech Group

[email protected]@yandex-team.ru

PhD