![Page 1: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/1.jpg)
Complex-valued Linear Layers for Deep NeuralNetwork-based Acoustic Models for Speech
Recognition
Izhak Shafran
Joint work with: Ehsan Variani, Arun Narayanan, and Tara Sainath
Google Inc
![Page 2: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/2.jpg)
Outline
BackgroundAutomatic Speech RecognitionDeep Neural Networks
Problem DefinitionMotivation: Replace Hand-Crafted Feature ExtractionProblem DefinitionOne Approach: Learnable Time Domain Convolutional Filters
Learnable Complex-valued Frequency Domain FiltersProperties of the ModelEmpirical Evaluations
Summary & Discussion
![Page 3: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/3.jpg)
Automatic Speech Recognition
![Page 4: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/4.jpg)
Automatic Speech Recognition: Language ModelsScores likelihood of utterances in a given language
P(”call home”) > P(”call roam”)
P(w1w2) = P(w1|begin)P(w2|w1)P(end |w1w2)
P(”call home”) = P(”call”|begin) . . .
![Page 5: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/5.jpg)
Automatic Speech Recognition: Feature ExtractionRemove redundancy and irrelevant information
![Page 6: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/6.jpg)
Automatic Speech Recognition: Feature ExtractionRemove redundancy and irrelevant information
![Page 7: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/7.jpg)
Automatic Speech Recognition: Acoustic ModelsScores the conditional likelihood that the hypothesized words were spoken
![Page 8: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/8.jpg)
Acoustic Models Illustration: Vowel classification task
Consider the (mag) FFT of a frame of (vowel) speech
Notice energy peaks (formants)Turns out, their positions are different for different vowels
![Page 9: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/9.jpg)
Acoustic Models Illustration: Vowel classification taskValues of (F1,F2) for vowels in a training set
![Page 10: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/10.jpg)
Acoustic Models Illustration: Vowel classification taskScore the conditional likelihood that a hypothesized vowel was observed
P(X |æ) > P(X |ε)
![Page 11: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/11.jpg)
Automatic Speech Recognition: Decoder
return ”call home” if
P(X |”call home”)P(”call home”) > P(X |”call roam”)P(”call roam”)
Mathematical justification,
W = argmaxW
P(W |X )
= argmaxW
P(X |W )P(W )
P(X )
= argmaxW
P(X |W )P(W )
![Page 12: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/12.jpg)
(Deep) Neural NetworksIllustration with a simple feed forward network
![Page 13: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/13.jpg)
A Node in a Neural Network
![Page 14: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/14.jpg)
Neural NetworksIllustration of Forward Propagation
![Page 15: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/15.jpg)
Neural Networks: LearningIllustration of Back Propagation Algorithm
home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
Multistyle Training (MTR) improves robustness
![Page 16: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/16.jpg)
Deep Neural Networks: Why are they so hot now?
I Universal approximatorSuperposition Theorem of Kolmogorov (1957)A Computable Kolmogorov S. T. (Brattka, 2000)Neural Network with Unbounded Activation Functions is aUniversal Approximator (Sonoda & Murata, 2015)
I Back propagation algorithm(Linnainmaa, 1970; Rummelhart et al, 1986)
I Efficient BackProp (LeCun 1998),Fast Learning Algorithms for Deep Belief Net (Hinton, 2006)
![Page 17: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/17.jpg)
Outline
BackgroundAutomatic Speech RecognitionDeep Neural Networks
Problem DefinitionMotivation: Replace Hand-Crafted Feature ExtractionProblem DefinitionOne Approach: Learnable Time Domain Convolutional Filters
Learnable Complex-valued Frequency Domain FiltersProperties of the ModelEmpirical Evaluations
Summary & Discussion
![Page 18: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/18.jpg)
Motivation
In most ASR systems, features extraction:
I uses 3 decades old hand-crafted rules
I mimics human perception
I may not be optimal for speech recognition
Since neural networks are universal approximators,why not let the network learn the optimal features too !?
![Page 19: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/19.jpg)
Recall: Feature Extraction
I Most problematic stage is Mel-filterbank
I Loss of information
I Log-mel features: useless for dereverberation or beamforming,precludes integration into neural networks
![Page 20: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/20.jpg)
Problem
Can we feed the speech frames directly as input to the neuralnetworks and learn the features jointly with acoustic models?
![Page 21: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/21.jpg)
One Approach: Learnable Time Domain Filters (Hoshen et al, 2015)
I Skip transforming the signal into frequency domain
I Replace mel-filters with a convolutional neural network layer
I Learn the parameters of the layer (filter) from the data
I Implementation inspired by gammatone features
![Page 22: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/22.jpg)
Gammatone Features
All the above operations can be approximated in the neural network
![Page 23: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/23.jpg)
Learnable Time Domain Filters (Hoshen et al, 2015)
![Page 24: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/24.jpg)
Learned Time Domain Filters
![Page 25: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/25.jpg)
Learnable Time Domain Filters (Hoshen et al, 2015)
But:
I convolution layer has several parameters to tune
I computationally expensive
![Page 26: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/26.jpg)
Convolution
![Page 27: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/27.jpg)
Convolutional Layer
I Convolution layers use only the ”valid” outputs
I Trade-off: Longer filters =⇒ fewer valid outputs
![Page 28: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/28.jpg)
Time Domain Filters: Filter Length Dependencies
I For 64ms frame, there are still more settings to sweep!
I Typically, max pooling works better than average pooling
I Info loss, cannot cascade filters in the signal processing sense
![Page 29: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/29.jpg)
Outline
BackgroundAutomatic Speech RecognitionDeep Neural Networks
Problem DefinitionMotivation: Replace Hand-Crafted Feature ExtractionProblem DefinitionOne Approach: Learnable Time Domain Convolutional Filters
Learnable Complex-valued Frequency Domain FiltersProperties of the ModelEmpirical Evaluations
Summary & Discussion
![Page 30: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/30.jpg)
Convolution in Time ≡ Multiplication in Frequency Domain
![Page 31: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/31.jpg)
Multiplication in Frequency Doman: Complex-Valued
X[k] H[k] Y[k]
I X[k] is complex-valued
I H[k] is complex-valued
I Need a linear layer with complex-valued neurons!
![Page 32: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/32.jpg)
Complex-valued Neuron
Let X := Xr + jXi , W := Wr + jWi , B := Br + jBi
Z = g(WX − B)
Zr = g(WrXr −WiXi − Br )
Zi = g(WiXr + WrXi − Bi )
I Easily implemented in a neural net w/ shared weights
I How is this different from using g([WrWi ][XrXi ]T − B)?
![Page 33: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/33.jpg)
Decision Boundaries are Orthogonal!
Easy to see from the norm to the boundary (Tohru Nitta, 2004)
Zr = g(WrXr −WiXi − Br )
Zi = g(WiXr + WrXi − Bi )
Qr =
[∂Zr
∂Xr
∂Zr
∂Xi
]=
[g ′(. . .) g ′(. . .)
] [ Wr
−Wi
]Qi =
[g ′(. . .) g ′(. . .)
] [ Wi
Wr
]QT
r Qi = 0 =⇒ Qr ⊥ Qi
![Page 34: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/34.jpg)
Complex FFT Input & Complex-valued Linear Layers: The Appeal
I Orthogonal boundaries =⇒ higher modeling capacitye.g., XOR with one neuron!
I Gradient descent is faster
I Does not have trade-offs of convolutional layers
I L1 regularization is meaningful, unlike in time domain
I No max-pooling, allows cascading of filterse.g., Dereverberation + Beamforming + Feature extraction
I Possibility to learn hybrid filters that optimize both signalprocessing cost function and WER
![Page 35: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/35.jpg)
Complex-valued Linear Layers: Single Filter
![Page 36: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/36.jpg)
Complex-valued Linear Layers: Set of Filters
![Page 37: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/37.jpg)
A model is a lie that helps you see the truth.Howard Skipper
![Page 38: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/38.jpg)
Summation in the Frequency Domain
LemmaSummation in the frequency domain ≡ weighted average poolingin the time domain.If X is the Fourier transform of the 2N point signal x , then
N∑k=0
Xk =2N−1∑n=0
αnxn
where,
αn =
N + 1 n = 0
coth(j πi2N
)mod (n, 2) = 1
1 mod (n, 2) = 0, n 6= 0
![Page 39: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/39.jpg)
Complex Linear Projection
Proposition
The projection in the frequency domain is equivalent toconvolution followed by a weighted average pooling:
N∑j=0
WijXj ←→2N−1∑j=0
αj (wi ∗ x) [j ]
![Page 40: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/40.jpg)
Complex Linear Projection (CLP)
![Page 41: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/41.jpg)
ASR Experimental Setup
I Training set: 2000 hrs, 3M anonymized voice search queriesI SNR between 0-20dB (≈12 dB)I Reverb T60 between 400ms-900ms (≈ 600ms)I 8 channel linear mic with spacing of 2cmI Noise and target speaker location constant per utterance
I Acoustic modelI A cascade of 3 LSTMs followed by a DNN layerI 13K HMM statesI Learning with asynchronized stochastic gradient descent
I Test set: Voice search w/ matching conditions
I All results reported here are after CE training
![Page 42: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/42.jpg)
Experimental Settings & WER Results
I 25 ms, 32 ms, 64 ms
I 128 filters
System 25 ms 32 ms 64 ms
Log-mel 23.4 22.8 21.8Time-domain conv. 23.7 23.4 22.5CLP 23.2 22.8 22.0
I CLP is at least as good as the alternatives!
![Page 43: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/43.jpg)
Learned CLP Filters: Before and After Applying L1 Regularization
![Page 44: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/44.jpg)
Learned Filters: Histogram of Learned Filter Coefficients
Time-domain filter coefficients (in time & in freq)
CLF filter coefficients (w/o & w/ L1)
CLP filters are sparse!
![Page 45: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/45.jpg)
Complex Linear Projection: Two Channels
![Page 46: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/46.jpg)
Experimental Settings & WER Results
I Two channels from microphones, 14cm apart
I 25 ms, 32 ms, 64 ms
I 256 filters
System 25 ms 32 ms 64 ms
Log-mel 21.8 21.3 20.7Time-domain conv. 21.5 21.2 21.2CLP 21.5 20.9 20.5
I Again, CLP is at least as good as the other alternatives!
![Page 47: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/47.jpg)
Multichannel CLP: Frame Size vs Projection Size
I Frame size: 32ms, 64ms
I Projection size: 128, 256, 512
I Without L1 regularization and stop bands
128 256 512
32ms 21.7 21.4 21.764ms 21.1 21.1 21.5
I Higher projections size =⇒ larger input to LSTM, slowerStep time is about 1.5 × slower going from 256 to 512
I Frame size has no impact for 32ms and 64ms
I Step time is about 2 × slower going from 1 to 8 channels
![Page 48: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/48.jpg)
Learned CLP Filters: Two Channel
![Page 49: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/49.jpg)
Learned CLP Filters: Time-delay between Real and Imaginary
Filter number vs time delay
![Page 50: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/50.jpg)
Histogram of Time-delay between Real and Imaginary
Time delay vs frequency of occurrence
![Page 51: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/51.jpg)
Language and Task Dependency
I English high-pitch voice search (en-us-hp)
I Taiwanese voice search (cmn-hant-tw)
System en-us-hp cmn-hant-tw
Log-mel 16.6 17.2Time-domain conv. 16.3 16.8CLP 16.4 16.6
I CLP as good or better than the alternatives!
![Page 52: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/52.jpg)
Learned Filters: Sorted by Center Frequency
![Page 53: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/53.jpg)
Computational Efficiency
I Time-domain filters ≈ O(Nkp + k2p))k strides, 2N window size, p filters
I CLP ≈ O(pN)
Model Time Filters CLP
num-params 45K 66Knum-add-mult 14.51 M 263.17 K
I Computation cost is a magnitude lower!
![Page 54: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016 · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:](https://reader033.vdocuments.us/reader033/viewer/2022053022/604d36544c2b8a50845a1b57/html5/thumbnails/54.jpg)
Summary & Future Directions
I Introduced complex-valued linear layers for acoustic modeling
I Complex-valued linear projection (CLP) as one instance of it
I Showed a simple multi-channel extension for CLP
I Reported experimental comparisons with log-mel andtime-domain filters
I Demonstrated task-dependent advantage
I Discussed properties of the CLP model
I Future directions:More efficient multi-channel projections, hybrid systems withWER and beamforming-type cost functions, adding multiplecomplex layers with appropriate complex-valued activationfunctions, integrating de-reverberation, . . .