Source: nnov.hse.ru
TRANSCRIPT
Automatic Speech
Recognition (ASR) intro
Sokolov Artem
Motivation
• Streaming speech
• Voice commands/search
• Keyword Spotting (KWS)
Neighbor technologies:
• TTS
• NLP/NLU
Families
• Hybrid (DNN+HMM)
• E2E
Conventional ASR pipeline
1. AM (Acoustic Model):
MFCC → o3 o7 o7 o1 o9 o9 o9 o5
2. HMM:
o3 o7 o7 o1 o9 o9 o9 o5 → k ae t
3. PM (Lexicon):
k ae t → cat
4. LM:
the cat → the cat
[Diagram: Feature Extraction → Acoustic Model → Pronunciation Model → Language Model → Text]
ASR basics
- A phoneme is the minimal unit of speech
- Biphone, triphone (senone): context-dependent phonemes
- Frame window is ~10-50 ms, with overlap
- Used features:
- MFCC
- PLP
- LPCC
- FBANK
- MELSPEC
- ETSI - AFE
- PNCC
- …
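The 10-50 ms overlapping framing above can be sketched in plain Python. This is a hypothetical helper (`frame_signal` and its parameter names are not from any toolkit); real front ends also apply pre-emphasis and a window function before computing MFCC/FBANK features.

```python
# Sketch: split a waveform into overlapping analysis frames,
# the first step before MFCC/FBANK extraction.

def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split `samples` into frames of `frame_ms` taken every `hop_ms`."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

# One second of 16 kHz audio yields 98 full frames of 25 ms with a 10 ms hop.
```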
Conventional ASR pipeline
• All components are trained separately.
• The Language Model encodes prior information about the language.
• The Pronunciation Model (lexicon) defines the phoneme sequences of words for
the particular language.
Acoustic and pronunciation models
PM example:
LAY    L EY
PLACE  P L EY S
SET    S EH T
RED    R EH D
GREEN  G R IY N
BLUE   B L UW
WHITE  W AY T
TDNN/BLSTM-based models are the SOTA for conventional (hybrid)
systems.
Image credit: Park H. et al. A Fast-Converged Acoustic Modeling for Korean Speech Recognition: A Preliminary Study on Time Delay Neural Network // arXiv preprint arXiv:1807.05855. – 2018.
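The lexicon above amounts to a word-to-phoneme-sequence mapping. A toy sketch (the `LEXICON` dict and `phonemize` helper are illustrative, not from any toolkit):

```python
# Toy pronunciation model (lexicon): each word maps to its phoneme
# sequence, mirroring the PM example above (ARPAbet-style symbols).
LEXICON = {
    "LAY":   ["L", "EY"],
    "PLACE": ["P", "L", "EY", "S"],
    "SET":   ["S", "EH", "T"],
    "RED":   ["R", "EH", "D"],
    "GREEN": ["G", "R", "IY", "N"],
    "BLUE":  ["B", "L", "UW"],
    "WHITE": ["W", "AY", "T"],
}

def phonemize(words):
    """Expand a word sequence into the phoneme sequence the AM must match."""
    return [p for w in words for p in LEXICON[w.upper()]]
```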
Language model
N-grams, Weighted Finite State Transducers (WFST)
• [Mohri et al.] M. Mohri, F. Pereira, M. Riley, "Weighted Finite-State Transducers in Speech Recognition"
• Used in most ASR frameworks
RNN
• [Mikolov et al., 2010] T. Mikolov, M. Karafiat, L. Burget, "Recurrent neural network based language model"
• [Kannan et al., 2017] A. Kannan, Y. Wu, P. Nguyen, "An analysis of incorporating an external language model into a sequence-to-sequence model"
• ESPnet, EESEN end-to-end ASR frameworks
Transformers
• [Li et al., 2019] J. Li, V. Lavrukhin, B. Ginsburg, "Jasper: An End-to-End Convolutional Neural Acoustic Model", April 2019
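The n-gram idea above can be sketched with a toy bigram model estimated from raw counts. This is a minimal illustration (hypothetical `train_bigram` helper, no smoothing); production LMs use Kneser-Ney or neural models.

```python
from collections import defaultdict

# Minimal bigram language model sketch: P(w2 | w1) from raw counts.
def train_bigram(sentences):
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            unigram[w1] += 1
            bigram[(w1, w2)] += 1
    # Conditional probability estimate; 0.0 for unseen histories.
    return lambda w1, w2: bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0
```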
Weighted Finite State Transducers
[Figures: four automata over the phrases "HUAWEI is cool" / "HUAWEI is AI":
- Finite State Acceptor (states 0-3, word-labelled arcs)
- Weighted Finite State Acceptor (same, with arc weights cool/0.7, AI/0.3)
- Finite State Transducer (letter-to-word arcs, e.g. H:HUAWEI, i:is, c:cool, A:AI)
- Weighted Finite State Transducer (same, with weights c:cool/0.7, A:AI/0.3)]
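The weighted-acceptor figure can be sketched as a transition table plus a scoring loop. This is a toy illustration (dict-based arcs, sum of weights along the path); real toolkits such as OpenFst work over general semirings.

```python
# Toy weighted finite-state acceptor over "HUAWEI is cool" / "HUAWEI is AI".
# Arcs: (src_state, label) -> (dst_state, weight).
ARCS = {
    (0, "HUAWEI"): (1, 0.0),
    (1, "is"):     (2, 0.0),
    (2, "cool"):   (3, 0.7),
    (2, "AI"):     (3, 0.3),
}
FINAL = {3}

def score(tokens):
    """Return the path weight if the sequence is accepted, else None."""
    state, total = 0, 0.0
    for t in tokens:
        if (state, t) not in ARCS:
            return None          # no matching arc: sequence rejected
        state, w = ARCS[(state, t)]
        total += w
    return total if state in FINAL else None
```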
Accuracy measure
- The commonly used accuracy measure for spontaneous speech is Word Error Rate (WER): the number of word substitutions, deletions, and insertions relative to the number of reference words
- For commands, the relative number of correctly recognised utterances is used: (N - S) / N
S - substitutions (wrongly recognised commands)
N - total number of decoded commands
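WER is computed with a Levenshtein (edit) distance over word sequences. A minimal sketch (hypothetical `wer` helper):

```python
# Word Error Rate via edit distance: (S + D + I) / reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                     # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j                     # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# e.g. wer("the cat sat", "the cat sit") -> 1/3 (one substitution of three words)
```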
Datasets*. Data augmentation
Corpus       Description                            Size
Librispeech  Audio books                            ~1000 h
WSJ          Readings of Wall Street Journal texts  80 h
Fisher       Telephone speech                       2000 h
Switchboard  Telephone speech                       300 h
• * https://www.ldc.upenn.edu/
• ** https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html
Amodei D. et al. Deep speech 2: End-to-end speech recognition in English and Mandarin // International conference on machine learning. – 2016. – pp. 173-182.
Augmentation:
• Speed change
• Natural and synthetic noise
• SpecAugment**
• …
Frameworks & Vendors
• Kaldi
• TF
• Pytorch
• Amazon
• Microsoft
• Baidu
• Yandex
• Speech Technology Center
• Nuance
• …
E2E models
Image credits: http://ericbattenberg.com/
A system that directly maps a sequence of input acoustic features
into a sequence of graphemes or words.
Amodei D. et al. Deep speech 2: End-to-end speech recognition in English and Mandarin // International conference on machine learning. – 2016. – pp. 173-182.
End-To-End Models Overview
AM                          Submission  Architecture                               N of layers  LibriSpeech* test-clean WER%
DeepSpeech (Baidu/Mozilla)  2014        3 FC + BiLSTM + FC → CTC                   5            5.66
DeepSpeech2                 2015        2 CNN + 7 BiRNN + 2 FC → CTC               11           5.33
Wav2Letter                  2016        12x CNN → ASG/CTC                          12           7.2
ESPnet                      2018        CNN (VGG) + N BiLSTM → CTC/CTC+Attention   ~16+8        4.0
Wav2Letter+                 2018        Residual CNN + 19x CNN → CTC               19           3.44
Jasper                      2019        Residual Nx(CNN+BN+ReLU+Dropout) → CTC     54           2.95
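Most models in the table are CTC-trained. At inference time, greedy CTC decoding collapses repeated labels and removes blanks; a toy sketch (hypothetical `ctc_collapse`, with `-` standing in for the blank symbol):

```python
# Greedy CTC decoding sketch: collapse repeats, then drop blanks.
BLANK = "-"

def ctc_collapse(frame_labels):
    """Map per-frame labels to an output string under the CTC rule."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# e.g. ctc_collapse(list("cc-aaa-tt")) -> "cat"
```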
Research developments on end-to-end
models towards productionisation
• Attention - pushing the limits of attention-based end-to-end
models
• Online models - streaming models for real-world applications
• RNN-T
• Neural Transducers
Questions?
Hardware acceleration. GNA
Intel® GNA - Gaussian Mixture Models and Neural Networks Accelerator.
GNA is an ASIC in Intel® CPUs:
- Gemini Lake
- Cannon Lake
- Ice Lake
- Elkhart Lake
GNA driver and library provide the API:
- 1.0 Gold
- 2.0 Pre-Alpha
IE GNA plugin provides compatibility with popular frameworks (Kaldi, TF):
- Gold????
IE GNA plugin + lib + HW place
Image credit: https://www.techrepublic.com/article/how-we-learned-to-talk-to-computers/
GNA input: feature vectors
GNA output: senone likelihoods (in lattices)
GNA library and driver
- Two versions of the GNA library exist. GNA 2.0 is in pre-alpha status
- Several modes of acceleration are supported:
- HW
- SW generic
- SW specific (SSE4_2, AVX1, AVX2)
- SW exact
Image credit: GNA 2.0 API official documentation
Models are loaded into pinned RAM
GNA plugin routine
• Convolutional layer
• Recurrent layer
• Diagonal layer
• Affine layer
• Copy layer
• PWL (piecewise-linear activation)
Supported layers
Layers
Activation-Clamp
Activation-Leaky ReLU
Activation-ReLU
Activation-Sigmoid/Logistic
Activation-TanH
Concat
Convolution-Ordinary
Eltwise-Mul
Eltwise-Sum
FullyConnected, Diagonal
Memory
Permute
Pooling (AVG, MAX)
Power
Split/Slice
Reshape
ROIPooling
ScaleShift
Supported precisions
Level GNA Plugin
Model format FP32, I16
Computational precision FP32*, I16, I8
Output FP32, I16
Quantisation. Scaling. Mixed precision for weights and biases
Quantisation is a hint to the GNA plugin regarding the preferred target weight resolution for all layers.
Quantisation modes:
• static
• dynamic (not implemented, but potentially supported)
• user-defined
[Diagram: fp32 weights are quantised to I16, or to I8 with compound biases]
Compound biases (I8):
typedef struct
{
    int32_t bias;
    uint8_t multiplier;
    char reserved[3];
} intel_compound_bias_t;
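The static mode above picks one scale per tensor. A rough sketch of that idea (hypothetical `quantize_weights` helper; the actual GNA plugin additionally packs per-row multipliers into the compound-bias struct):

```python
# Sketch of static per-tensor weight quantisation to a signed
# fixed-point range, in the spirit of the GNA I8/I16 weight formats.
def quantize_weights(weights, bits=8):
    max_abs = max(abs(w) for w in weights)
    qmax = 2 ** (bits - 1) - 1          # 127 for I8
    scale = qmax / max_abs if max_abs else 1.0
    q = [max(-qmax - 1, min(qmax, round(w * scale))) for w in weights]
    return q, scale                      # dequantise with w ~ q / scale
```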
Tests
Engineering:
• Unit tests (functions, model parts; i8, i16, fp32, + fp16 input)
• Functional tests (models, compared with MKLDNN at i8, i16 precisions)
• Behaviour tests (cases where failures are expected)
Additionally: we have layer-dumping and weight-similarity measurement
mechanisms for debugging purposes.
QA:
• WER
• E2E
• Sample
IE GNA plugin vs IE MKLDNN plugin I
[Chart: decoding time in seconds (0 to 13.75) for rm_lstm, 10 utterances:
GNA HW i8 (single), GNA SW i8 (single), MKLDNN fp32 (multi)]
Intel® Core™ Silver J5005 CPU @ 1.50GHz (GeminiLake)
1. LSTMProjectedStream with 43 inputs, 512 cells, and 200 outputs
2. LSTMProjectedStream with 200 inputs, 512 cells, and 200 outputs
3. AffineTransform with 200 inputs and 1494 outputs
4. Softmax.
The network outputs are senone class likelihoods.
The model is provided with class counts so that the softmax layer may be removed (with no drop in accuracy).
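The class-count trick above works because the decoder only compares scaled likelihoods: dividing posteriors by class priors (estimated from training counts) turns them into likelihoods up to a constant, so the softmax normalisation can be dropped. A sketch (hypothetical helper, assuming raw logits stand in for unnormalised log-posteriors once softmax is removed):

```python
import math

# Convert DNN outputs to scaled senone log-likelihoods:
# log p(x|s) is proportional to log p(s|x) - log p(s).
def scaled_log_likelihoods(logits, class_counts):
    total = sum(class_counts)
    log_priors = [math.log(c / total) for c in class_counts]
    return [z - lp for z, lp in zip(logits, log_priors)]
```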
IE GNA plugin vs IE MKLDNN plugin
[Chart: infer time per frame, correlation (0 to 1) for DNN, CNN, LSTM:
MKLDNN fp32 (multi) vs GNA i8 SW (multi*)]
Measurements made on Intel® Core™ i7-6770K
Single utterance with 8192 frames, batch_size=1 for CNN and RNN; 262144 frames, batch_size=8 for DNN.
* New changes for async-request support by the GNA plugin. Not released yet.
Multithreading-mode specifics: MKLDNN uses all physical cores (4); GNA runs 8 async requests in parallel.
Questions?