Automatic Speech Recognition (ASR) intro Sokolov Artem


Post on 22-Aug-2020

Automatic Speech Recognition (ASR) intro

Sokolov Artem


Motivation

• Streaming speech

• Voice commands/search

• Keyword Spotting (KWS)

Neighbor technologies:

• TTS

• NLP/NLU


Families

• Hybrid (DNN+HMM)

• E2E


Conventional ASR pipeline

Feature Extraction -> Acoustic Model -> Pronunciation Model -> Language Model -> Text

1. AM: MFCC -> o3 o7 o7 o1 o9 o9 o9 o5

2. HMM: o3 o7 o7 o1 o9 o9 o9 o5 -> k ae t

3. PM (Lexicon): k ae t -> cat

4. LM: the cat -> the cat
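The hand-off between the stages above (frame-level observations, then phonemes "k ae t", then the word "cat") can be illustrated with a toy sketch. The state-to-phoneme table is hypothetical and chosen only to reproduce the slide's example; a real hybrid system would run Viterbi decoding over HMM states rather than a plain lookup.

```python
from itertools import groupby

# Hypothetical mapping from frame-level state IDs to phonemes (an assumption,
# picked so that o3 o7 o7 o1 o9 o9 o9 o5 yields "k ae t").
STATE_TO_PHONE = {3: "k", 7: "k", 1: "ae", 9: "ae", 5: "t"}

# Tiny lexicon: phoneme sequence -> word.
LEXICON = {("k", "ae", "t"): "cat"}


def states_to_phones(states):
    """Map each frame's state ID to a phoneme and merge consecutive repeats."""
    phones = [STATE_TO_PHONE[s] for s in states]
    return [phone for phone, _ in groupby(phones)]


def phones_to_word(phones):
    """Look up a phoneme sequence in the lexicon; returns None if unknown."""
    return LEXICON.get(tuple(phones))


frames = [3, 7, 7, 1, 9, 9, 9, 5]  # o3 o7 o7 o1 o9 o9 o9 o5
phones = states_to_phones(frames)
print(phones, "->", phones_to_word(phones))
```

Merging repeats this way would also merge a phoneme that genuinely occurs twice in a row, which is one reason real decoders track HMM states instead.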


ASR basics

- A phoneme is the smallest unit of speech

- Biphones and triphones are context-dependent phonemes; tied triphone states are called senones

- A frame is a window of ~10-50 ms; consecutive frames overlap

- Used features:

- MFCC

- PLP

- LPCC

- FBANK

- MELSPEC

- ETSI AFE

- PNCC

- …
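The framing step mentioned above (windows of ~10-50 ms with overlap) can be sketched as follows. The 25 ms window and 10 ms hop are common choices assumed here for illustration, not values from the slides; feature extraction (MFCC, FBANK, and so on) would then run on each frame.

```python
def frame_signal(samples, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Split a list of samples into overlapping frames.

    win_ms is the frame length and hop_ms the step between frame starts,
    so consecutive frames overlap by (win_ms - hop_ms) milliseconds.
    """
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must be at least one sample long")
    frames = []
    start = 0
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += hop
    return frames


signal = [0.0] * 16000              # one second of silence at 16 kHz
frames = frame_signal(signal, 16000)
print(len(frames), len(frames[0]))  # number of frames, samples per frame
```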


Conventional ASR pipeline

• All components are trained separately.

• The language model provides prior information about the language.

• The pronunciation model (lexicon) maps words to phoneme sequences for the particular language.

Feature Extraction -> Acoustic Model -> Pronunciation Model -> Language Model -> Text


Acoustic and pronunciation models

PM example:

LAY     L EY
PLACE   P L EY S
SET     S EH T
RED     R EH D
GREEN   G R IY N
BLUE    B L UW
WHITE   W AY T

TDNN/BLSTM-based acoustic models are the SOTA for conventional (hybrid) systems.

Image credit: Park H. et al. A Fast-Converged Acoustic Modeling for Korean Speech Recognition: A Preliminary Study on Time Delay Neural Network // arXiv preprint arXiv:1807.05855. – 2018.


Language model

N-grams, Weighted Finite State Transducers (WFST)

• [Mohri et al.] M. Mohri, F. Pereira, M. Riley, “Weighted Finite-State Transducers in Speech Recognition”

• Used by most ASR frameworks

RNN

• [Mikolov et al., 2010] T. Mikolov, M. Karafiat, L. Burget, “Recurrent neural network based language model”

• [Kannan et al., 2017] A. Kannan, Y. Wu, P. Nguyen, “An analysis of incorporating an external language model into a sequence-to-sequence model”

• ESPnet, EESEN end-to-end ASR frameworks

Transformers

• [Li et al., 2019] J. Li, V. Lavrukhin, B. Ginsburg, “Jasper: An End-to-End Convolutional Neural Acoustic Model”, April 2019
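To make the N-gram option concrete, here is a minimal bigram language model with add-one smoothing. The three-sentence corpus and the smoothing choice are assumptions for illustration; production systems use much larger corpora and better smoothing, and usually compile the model into a WFST.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenised text."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, vocab


def bigram_prob(prev, word, model):
    """Add-one smoothed estimate of P(word | prev)."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def sentence_prob(sentence, model):
    """Probability of a whole sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(prev, word, model)
    return prob


model = train_bigram(["the cat sat", "the cat ran", "the dog sat"])
# A word order seen in training should score higher than a scrambled one.
print(sentence_prob("the cat sat", model) > sentence_prob("cat the sat", model))
```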


Weighted Finite State Transducers

Finite State Acceptor:

0 --HUAWEI--> 1 --is--> 2 --(AI | cool)--> 3

Weighted Finite State Acceptor:

0 --HUAWEI--> 1 --is--> 2 --(AI/0.3 | cool/0.7)--> 3

Finite State Transducer (states 0-13; arcs are input:output, "-" means no output):

H:HUAWEI U:- A:- W:- E:- I:- i:is s:- followed by either A:AI I:- or c:cool o:- o:- l:-

Weighted Finite State Transducer:

Same arcs as the Finite State Transducer, with weights on the branching arcs: A:AI/0.3 and c:cool/0.7
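The weighted acceptor from this slide ("HUAWEI is AI" with weight 0.3, "HUAWEI is cool" with weight 0.7) can be sketched as a small arc table. Two assumptions are made here: unweighted arcs get weight 1.0, and weights are treated as probabilities multiplied along the path. Decoding toolkits typically store negative log weights and add them instead.

```python
# state -> list of (label, next_state, weight); weights of 1.0 are assumed
# for the arcs that carry no weight on the slide.
ARCS = {
    0: [("HUAWEI", 1, 1.0)],
    1: [("is", 2, 1.0)],
    2: [("AI", 3, 0.3), ("cool", 3, 0.7)],
}
FINAL_STATES = {3}


def path_weight(words, start=0):
    """Return the weight of the path that accepts `words`, or 0.0 if rejected."""
    state, weight = start, 1.0
    for word in words:
        for label, next_state, arc_weight in ARCS.get(state, []):
            if label == word:
                state, weight = next_state, weight * arc_weight
                break
        else:
            return 0.0  # no outgoing arc matches this word
    return weight if state in FINAL_STATES else 0.0


print(path_weight(["HUAWEI", "is", "cool"]))
print(path_weight(["HUAWEI", "is", "AI"]))
print(path_weight(["HUAWEI", "is"]))  # stops before a final state
```

A transducer would add an output label to each arc, as in the letter-to-word arcs such as H:HUAWEI.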


Accuracy measure

- The commonly used accuracy measure for spontaneous speech is Word Error Rate (WER)

- For commands: the relative number of correctly recognised utterances, (N - S) / N, where

S - substitutions (wrongly recognised commands)

N - total number of decoded commands
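Both measures can be sketched in a few lines. WER is computed here with its standard definition, (substitutions + deletions + insertions) divided by the number of reference words, using word-level edit distance. The command accuracy follows the S and N definitions above. Function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions (Levenshtein)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate of a hypothesis against a reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def command_accuracy(n_total, n_wrong):
    """Share of correctly recognised commands: (N - S) / N."""
    return (n_total - n_wrong) / n_total


print(wer("the cat sat on the mat", "the cat sit on mat"))
print(command_accuracy(100, 5))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.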


Datasets*. Data augmentation

Corpus        Description                             Size
Librispeech   Audio books                             ~1000h
WSJ           Reading of Wall Street Journal texts    80h
Fisher        Telephone speech                        2000h
Switchboard   Telephone speech                        300h

• * https://www.ldc.upenn.edu/

Amodei D. et al. Deep speech 2: End-to-end speech recognition in english and mandarin // International conference on machine learning. – 2016. – P. 173-182.

Augmentation:

• Speed change

• Natural and synthetic noise

• …

• SpecAugment**

• ** https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html
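As an illustration of the masking part of SpecAugment, the sketch below zeroes one random frequency band and one random time span of a spectrogram stored as a list of frames. The mask widths and the zero fill value are assumptions for illustration, and the time-warping step of SpecAugment is omitted.

```python
import random


def mask_spectrogram(spec, max_freq_width=2, max_time_width=3, rng=None):
    """Return a copy of `spec` with one frequency band and one time span zeroed.

    `spec` is a list of time frames, each a list of frequency-bin values.
    """
    rng = rng or random.Random()
    n_time, n_freq = len(spec), len(spec[0])
    out = [frame[:] for frame in spec]  # copy so the input stays untouched

    # Frequency mask: the same band of bins is zeroed in every frame.
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f0 = rng.randint(0, n_freq - f_width)
    for frame in out:
        for f in range(f0, f0 + f_width):
            frame[f] = 0.0

    # Time mask: a run of consecutive frames is zeroed completely.
    t_width = rng.randint(0, min(max_time_width, n_time))
    t0 = rng.randint(0, n_time - t_width)
    for t in range(t0, t0 + t_width):
        out[t] = [0.0] * n_freq
    return out


spec = [[1.0] * 8 for _ in range(10)]  # 10 frames x 8 frequency bins
masked = mask_spectrogram(spec, rng=random.Random(0))
print(sum(value == 0.0 for frame in masked for value in frame), "bins masked")
```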


Frameworks & Vendors

• Kaldi

• TensorFlow (TF)

• PyTorch

• Google

• Amazon

• Facebook

• Microsoft

• Baidu

• Yandex

• Speech Technology Center

• Nuance

• …


E2E models

An end-to-end (E2E) model is a system that directly maps a sequence of input acoustic features into a sequence of graphemes or words.

Image credits: http://ericbattenberg.com/

Amodei D. et al. Deep speech 2: End-to-end speech recognition in english and mandarin // International conference on machine learning. – 2016. – P. 173-182.


End-To-End Models Overview

AM                           Submission   Architecture                                  N of layers   LibriSpeech* test-clean WER%
DeepSpeech (Baidu/Mozilla)   2014         3 FC + BiLSTM + FC -> CTC                     5             5.66
DeepSpeech2                  2015         2 CNN + 7 BiRNN + 2 FC -> CTC                 11            5.33
Wav2Letter                   2016         12 x CNN -> ASG/CTC                           12            7.2
ESPnet                       2018         CNN (VGG) + N BiLSTM -> CTC/CTC+Attention     ~16+8         4.0
Wav2Letter+                  2018         Residual CNN + 19 x CNN -> CTC                19            3.44
Jasper                       2019         Residual N x (CNN+BN+ReLU+Dropout) -> CTC     54            2.95
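Every architecture in this table ends in a CTC output layer. A minimal sketch of how a greedy CTC decoder turns per-frame labels into text is shown below: merge consecutive repeats, then drop the blank symbol. The "_" blank symbol is an arbitrary choice here, and a real decoder would first take the argmax over the network's per-frame output distribution (or run a beam search with a language model).

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this sketch


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence: merge repeats, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


print(ctc_collapse("cc_aa_t__"))
print(ctc_collapse("hh_e_ll_ll_oo"))  # the blank keeps the double "l" apart
```

The blank is what allows repeated characters in the output: without a blank between them, two consecutive "l" frames would merge into one.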


Research developments on end-to-end models towards productionisation

• Attention - pushing the limit of attention-based end-to-end models

• Online models - streaming models for real-world applications

• RNN-T

• Neural Transducers


Questions?


GNA


Hardware acceleration. GNA

Intel® GNA - Gaussian Mixture Models and Neural Networks Accelerator.

GNA is an ASIC in Intel® CPUs:

- Gemini Lake

- Cannon Lake

- Ice Lake

- Elkhart Lake

The GNA driver and library provide the API:

- 1.0 Gold

- 2.0 Pre-Alpha

The IE GNA plugin provides compatibility with popular frameworks (Kaldi, TF):

- Gold????


IE GNA plugin + lib + HW place

Image credit: https://www.techrepublic.com/article/how-we-learned-to-talk-to-computers/

GNA input: feature vectors

GNA output: senone likelihoods (in lattices)


GNA library and driver

- Two versions of the GNA library exist; GNA 2.0 is in pre-alpha status

- Several modes of acceleration are supported:

  - HW

  - SW generic

  - SW specific (SSE4_2, AVX1, AVX2)

  - SW exact

- The model is loaded into pinned RAM

Image credit: GNA 2.0 API official documentation


GNA plugin routine

• Convolutional layer

• Recurrent layer

• Diagonal layer

• Affine layer

• Copy layer

• PWL (piecewise-linear activation)
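The PWL primitive approximates an activation function with straight-line segments. The sketch below builds such an approximation for the sigmoid and measures its error. The uniform breakpoints, the range [-8, 8] and the 16 segments are assumptions for illustration; the slides do not describe how the GNA plugin actually picks its segments.

```python
import math
from bisect import bisect_right


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_pwl(fn, lo, hi, n_segments):
    """Sample `fn` at uniformly spaced breakpoints between lo and hi."""
    xs = [lo + (hi - lo) * i / n_segments for i in range(n_segments + 1)]
    ys = [fn(x) for x in xs]
    return xs, ys


def eval_pwl(xs, ys, x):
    """Evaluate the piecewise-linear function; clamp outside the covered range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1  # index of the segment containing x
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (x - xs[i])


xs, ys = build_pwl(sigmoid, -8.0, 8.0, 16)
grid = [i / 100 for i in range(-1000, 1001)]
max_err = max(abs(eval_pwl(xs, ys, x) - sigmoid(x)) for x in grid)
print("max abs error:", max_err)
```

Fixed-point hardware would store the segment slopes and offsets as integers, so the approximation error adds to the quantisation error.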


Supported layers

Layers

Activation-Clamp

Activation-Leaky ReLU

Activation-ReLU

Activation-Sigmoid/Logistic

Activation-TanH

Concat

Convolution-Ordinary

Eltwise-Mul

Eltwise-Sum

FullyConnected, Diagonal

Memory

Permute

Pooling(AVG,MAX)

Power

Split/Slice

Reshape

ROIPooling

ScaleShift


Supported precisions

Level                     GNA Plugin
Model format              FP32, I16
Computational precision   FP32*, I16, I8
Output                    FP32, I16


Quantisation. Scaling. Mixed precision for weights and biases

Quantisation is a hint to the GNA plugin regarding the preferred target weight resolution for all layers.

Quantisation modes:

• static

• dynamic (not implemented, but potentially supported)

• user-defined

[Diagram: fp32 weights are quantised to I16 or I8; the I8 path uses compound biases]
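Static quantisation with a scale factor can be sketched as follows: choose a scale so that the largest absolute weight maps to the top of the integer range, round, and clamp. This is a generic symmetric scheme assumed for illustration, not necessarily the GNA plugin's exact procedure.

```python
def scale_factor(values, target_max=32767):
    """Scale that maps the largest absolute value onto target_max."""
    peak = max(abs(v) for v in values)
    return target_max / peak if peak > 0 else 1.0


def quantize(values, scale, lo=-32768, hi=32767):
    """Round value * scale to the nearest integer and clamp to [lo, hi]."""
    return [max(lo, min(hi, round(v * scale))) for v in values]


def dequantize(qvalues, scale):
    """Map integers back to approximate floating-point values."""
    return [q / scale for q in qvalues]


weights = [0.5, -1.25, 0.03125, 2.0]
scale = scale_factor(weights)  # I16 target by default
q16 = quantize(weights, scale)
restored = dequantize(q16, scale)
print(q16)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # rounding error
```

For an I8 target the same functions can be called with target_max=127, lo=-128 and hi=127; the coarser grid is what motivates the extra per-row multiplier of compound biases.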


Compound biases (I8)

typedef struct
{
    int32_t bias;        // 32-bit signed bias
    uint8_t multiplier;  // 8-bit unsigned multiplier
    char reserved[3];    // padding to 8 bytes
} intel_compound_bias_t;
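The layout of this struct can be checked from Python with `ctypes`. The `row_output` function below is an assumption about how the fields are used (a per-row multiplier that rescales the 8-bit weights, plus a 32-bit bias added to the sum). The slides show only the struct, so treat the arithmetic as illustrative rather than as the library's documented behaviour.

```python
import ctypes


class CompoundBias(ctypes.Structure):
    """Python mirror of intel_compound_bias_t."""
    _fields_ = [
        ("bias", ctypes.c_int32),
        ("multiplier", ctypes.c_uint8),
        ("reserved", ctypes.c_char * 3),
    ]


def row_output(weights_i8, inputs, cb):
    """Assumed semantics: multiplier * sum(w * x) + bias for one output row."""
    acc = sum(w * x for w, x in zip(weights_i8, inputs))
    return cb.multiplier * acc + cb.bias


entry = CompoundBias(bias=-1200, multiplier=17)
print(ctypes.sizeof(CompoundBias))  # size of one entry in bytes
print(row_output([1, -2, 3], [10, 20, 30], entry))
```

The three reserved bytes pad each entry to 8 bytes, so an array of entries stays aligned on 4-byte boundaries.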


LSTM


Tests

Engineering:

• Unit tests (functions, model parts; i8, i16, fp32, + fp16 input)

• Functional tests (models, compared with MKLDNN on i8, i16 precisions)

• Behaviour tests (behaviour cases where failures are expected)

QA:

• WER

• E2E

• Sample

Additionally: we have layer dumping and weight similarity measurement mechanisms for debugging purposes.


IE GNA plugin vs IE MKLDNN plugin I

[Bar chart: rm_lstm, 10 utterances; time in seconds (axis from 0 to 13.75); series: GNA HW i8 (single), GNA SW i8 (single), MKLDNN fp32 (multi)]

Intel® Pentium® Silver J5005 CPU @ 1.50GHz (Gemini Lake)

Network topology:

1. LSTMProjectedStream with 43 inputs, 512 cells, and 200 outputs

2. LSTMProjectedStream with 200 inputs, 512 cells, and 200 outputs

3. AffineTransform with 200 inputs and 1494 outputs

4. Softmax.

The network outputs are senone class likelihoods.

The model is provided with class counts so that the softmax layer may be removed (with no drop in accuracy).
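One way to see why the softmax can be dropped: a hybrid decoder typically works with scaled log-likelihoods, log p(s|x) - log p(s), where the priors p(s) come from the class counts. The softmax only subtracts the same normaliser from every class within a frame, so skipping it shifts all scores of that frame by one constant and leaves their ranking unchanged. The sketch below checks this numerically; the logits and counts are made-up values, and this reading of the slide is an assumption.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax over one frame."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def log_priors(class_counts):
    """Log prior of each class estimated from its training frame count."""
    total = sum(class_counts)
    return [math.log(c / total) for c in class_counts]


def scores_with_softmax(logits, counts):
    return [lp - pr for lp, pr in zip(log_softmax(logits), log_priors(counts))]


def scores_without_softmax(logits, counts):
    return [x - pr for x, pr in zip(logits, log_priors(counts))]


logits = [2.0, -1.0, 0.5, 3.5]  # made-up network outputs for one frame
counts = [400, 100, 250, 250]   # made-up per-senone frame counts
a = scores_with_softmax(logits, counts)
b = scores_without_softmax(logits, counts)
offsets = [y - x for x, y in zip(a, b)]
print(max(offsets) - min(offsets))  # close to 0: one shared offset per frame
```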


IE GNA plugin vs IE MKLDNN plugin

[Bar chart: Infer time and Per Frame Correlation for DNN, CNN and LSTM models (axis from 0 to 1); series: MKLDNN fp32 (multi), GNA i8 SW (multi*)]

Measurements made on Intel® Core™ i7-6770K

Single utterance with 8192 frames, batch_size=1 for CNN and RNN; 262144 frames, batch_size=8 for DNN.

*New changes that add support for async requests to the GNA plugin. Not released yet.

Multithreading specifics: MKLDNN uses all 4 physical cores; GNA runs 8 async requests in parallel.


Questions?