Deep learning for (more than) speech recognition
IndabaX Western Cape, UCT, Apr. 2018
Herman Kamper
E&E Engineering, Stellenbosch University
http://www.kamperh.com/
Success in automatic speech recognition (ASR)
[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]
Talk outline
1. State-of-the-art automatic speech recognition (ASR)
2. Examples of non-ASR speech processing (the first rant)
3. Examples of local work (a second rant)
State-of-the-art speech recognition
Supervised speech recognition
i had to think of some example speech
since speech recognition is really cool
Feature extraction for speech processing
[Figure: a waveform is sliced into 25 ms analysis windows with 10 ms shifts; each frame t yields a D-dimensional feature vector x(t) = (x1, x2, . . . , xD), and the vectors x(1), x(2), x(3), x(4), . . . are stacked into the matrix X.]
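The framing step above can be sketched in NumPy. The function name and defaults below are illustrative (assuming 16 kHz audio), not the talk's exact pipeline:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Slice a waveform into overlapping analysis frames.

    Each 25 ms frame becomes one observation x(t); consecutive frames
    are shifted by 10 ms, so neighbouring frames overlap by 15 ms.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples
    num_frames = 1 + (len(signal) - frame_len) // shift
    return np.stack([signal[i * shift : i * shift + frame_len]
                     for i in range(num_frames)])

# One second of audio at 16 kHz gives roughly 100 frames:
audio = np.random.randn(16000)
X = frame_signal(audio)
print(X.shape)  # (98, 400)
```

Each row of X would then be mapped to a D-dimensional feature vector (e.g. MFCCs or filterbank energies).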
Feature extraction for speech processing
[Figure: spectrograms of the example utterances, frequency 0–8000 Hz vs. time (s).]
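A magnitude spectrogram like the one in the figure can be computed from the same overlapping frames. This NumPy sketch is a minimal illustration, not the talk's exact feature extractor:

```python
import numpy as np

def spectrogram(signal, frame_len=400, shift=160):
    """Magnitude spectrogram: windowed FFT of each overlapping frame."""
    window = np.hamming(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift : i * shift + frame_len] * window
                       for i in range(num_frames)])
    # rfft of a 400-sample frame gives 201 frequency bins; at a 16 kHz
    # sampling rate these span 0-8000 Hz, as on the figure's axis.
    return np.abs(np.fft.rfft(frames, axis=1))

S = spectrogram(np.random.randn(16000))
print(S.shape)  # (98, 201)
```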
Name these networks
Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Name these networks
[Figure: feedforward network mapping input x(i) to output y(i).]
Name these networks
Image: http://deeplearning.net/tutorial/lenet.html
Name these networks
p(x(1), x(2), . . . , x(N) | [ih])
Hidden Markov models (HMMs)
W* = argmax_W P(W = w(1), w(2), . . . , w(M) | X = x(1), x(2), . . . , x(N))
   = argmax_W P(W | X)
   = argmax_W Σ_U P(W, U | X)        [ "without" = /w ih th aw t/ ]
   = argmax_W Σ_U p(W, U, X) / p(X)
   = argmax_W Σ_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | U) P(U | W) P(W)

p(X | U): acoustic model    P(U | W): pronunciation dictionary    P(W): language model
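The final max_U approximation is exactly what Viterbi decoding computes. A minimal sketch for a toy discrete-observation HMM (the probabilities below are illustrative, not from the talk):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """Most likely state sequence U* = argmax_U p(X, U) for an HMM.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log transition probabilities
    log_emit:  (S, V) log emission probabilities over a discrete vocab
    obs:       (N,)   observed symbol indices
    """
    delta = log_init + log_emit[:, obs[0]]
    back = []
    for t in range(1, len(obs)):
        scores = delta[:, None] + log_trans          # (prev, next)
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_emit[:, obs[t]]
    # Trace back the best path from the best final state
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Toy 2-state example: state 0 prefers symbol 0, state 1 prefers symbol 1
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])
log_emit = np.log([[0.9, 0.1], [0.1, 0.9]])
print(viterbi(log_init, log_trans, log_emit, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```

In a real recogniser the states come from the composed decoder graph and the emissions are densities over feature vectors, but the max-product recursion is the same.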
Hidden Markov models (HMMs)

[Figure: an HMM for the phone [ih] assigning p(X | [ih]) to the frame sequence X = (x(1), . . . , x(6)), each frame a D-dimensional feature vector.]

Speech recognition was performed by combining the acoustic model (thousands of HMM states) with the pronunciation dictionary and language model in a (very big) decoder network (finite state machine).
Back to today: End-to-end speech recognition
[Chan et al., arXiv’15]
Why did we talk about HMMs?

• Could we use a standard feedforward deep neural network (DNN) for ASR?
• Idea: Use the HMM to obtain frame alignments for the DNN!
• Hybrid model: DNN-HMM
• Can be seen as representation learning trained jointly with a classifier

[Figure: DNN taking frame x(i) as input and predicting HMM state y(i) ∈ {s1, s2, . . . , s9000}.]
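The hybrid recipe can be summarised in a few lines: the DNN outputs state posteriors P(s | x), which are divided by the state priors P(s) (estimated from the frame alignments) to give scaled likelihoods for the HMM decoder. The numbers here are placeholders, not real model outputs:

```python
import numpy as np

# Hypothetical DNN output for one frame: posteriors over 4 HMM states
posteriors = np.array([0.7, 0.1, 0.1, 0.1])   # P(s | x), sums to 1
priors = np.array([0.4, 0.3, 0.2, 0.1])       # P(s), from alignment counts

# Bayes' rule: p(x | s) ∝ P(s | x) / P(s); the shared constant p(x)
# does not affect the decoder's argmax over word sequences.
scaled_likelihoods = posteriors / priors
print(scaled_likelihoods.argmax())  # 0
```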
What about convolutional neural networks?
[Figure: spectrograms, frequency 0–8000 Hz vs. time (s), viewed as two-dimensional inputs.]
Is end-to-end the best?
• End-to-end models are easier to implement¹
• But do they give state-of-the-art performance?
• What do you think CLDNN-HMM² stands for?

¹ https://github.com/espnet/espnet   ² [Sainath et al., ICASSP’15]
Summary: Speech recognition is important, but. . .
• A very important engineering endeavour: information access, illiteracy, assistance for the disabled
• But it is more: speech and language make us human
• Engineering decisions can tell us something about how we perceive the world: we saw how structure helps in speech recognition models
• And studies of how we perceive the world can tell us something about better engineering decisions
Rant 1: Do we always need/have ASR?
Examples of non-ASR speech processing
What if we do not have supervision?
• Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
• Data: 2000 hours of transcribed speech audio; ∼350M/560M words of text
• Can we do this for all 7000 languages spoken in the world?
• Many of these languages are endangered and unwritten
Example 1: Query-by-example search
Spoken query:

Useful speech system, not requiring any transcribed speech

[Jansen and Van Durme, Interspeech’12]
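Query-by-example search can be sketched with dynamic time warping (DTW), the kind of frame-level alignment such transcription-free systems typically build on. This NumPy implementation is a minimal illustration, not the system of Jansen and Van Durme:

```python
import numpy as np

def dtw_cost(query, utterance):
    """Dynamic time warping alignment cost between two feature sequences.

    Query-by-example search ranks utterances in the collection by this
    cost: no transcriptions are needed, only frame-level features.
    """
    Q, U = len(query), len(utterance)
    dist = np.linalg.norm(query[:, None, :] - utterance[None, :, :], axis=2)
    acc = np.full((Q, U), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(Q):
        for j in range(U):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = dist[i, j] + prev
    return acc[-1, -1] / (Q + U)   # length-normalised

# A query matches a time-stretched copy of itself better than noise:
rng = np.random.default_rng(0)
query = rng.standard_normal((20, 13))
stretched = np.repeat(query, 2, axis=0)      # same content, slower
other = rng.standard_normal((40, 13))
assert dtw_cost(query, stretched) < dtw_cost(query, other)
```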
Example 2: Linguistic and cultural documentation
http://www.stevenbird.net/
Example 3: Teaching robots to understand speech
[Janssens and Renkens, 2014]; [Renkens et al., SLT’14]27 / 40
![Page 62: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/62.jpg)
Rant 2: Taking inspiration from humans
Examples of local work
![Page 63: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/63.jpg)
Supervised speech recognition
i had to think of some example speech
since speech recognition is really cool
Can we acquire language from audio alone?
29 / 40
![Page 65: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/65.jpg)
![Page 66: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/66.jpg)
Full-coverage segmentation and clustering
31 / 40
![Page 69: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/69.jpg)
Unsupervised segmental Bayesian model
[Figure: the speech waveform is converted to acoustic frames y1:M by fa(·); hypothesised segments yt1:t2 are mapped to fixed-dimensional embeddings xi = fe(yt1:t2) by fe(·); a Bayesian Gaussian mixture model clusters the embeddings via p(xi | h−), with acoustic modelling and word segmentation performed jointly]
32 / 40
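The quantity p(xi | h−) on this slide is the posterior predictive of an embedding under a mixture component, given all the other current assignments h−. A minimal sketch for a single spherical Gaussian component with known noise variance and a conjugate Gaussian prior on its mean; this is a deliberate simplification of the full Bayesian GMM, and the function and parameter names are mine:

```python
import numpy as np

def log_post_pred(x, component_data, prior_mean, prior_var, noise_var):
    """log p(x | h-) for a spherical Gaussian component: conjugate
    Normal-Normal update of the mean, then the Gaussian predictive."""
    n = len(component_data)
    if n == 0:
        post_mean, post_var = prior_mean, prior_var
    else:
        post_var = 1.0 / (1.0 / prior_var + n / noise_var)
        post_mean = post_var * (prior_mean / prior_var +
                                component_data.sum(axis=0) / noise_var)
    pred_var = post_var + noise_var  # predictive variance per dimension
    d = len(x)
    diff = x - post_mean
    return -0.5 * (d * np.log(2 * np.pi * pred_var) + (diff @ diff) / pred_var)
```

In a Gibbs sampler, each embedding would be reassigned to a component in proportion to this predictive (times the component's prior assignment probability), which is what drives the joint clustering and segmentation.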
![Page 76: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/76.jpg)
Listen to discovered clusters
• Small-vocabulary cluster 45: Play
• Large-vocabulary English cluster 1214: Play
• Large-vocabulary Xitsonga cluster 629: Play
33 / 40
![Page 77: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/77.jpg)
Arrival
![Page 78: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/78.jpg)
Using images for grounding language
Consider images paired with unlabelled spoken captions:
Play
35 / 40
![Page 80: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/80.jpg)
Map images and speech into common space
[Figure: an image is mapped by a VGG network to yvis; speech input X is mapped by convolutional, max-pooling, and feedforward layers to yspch; the two embeddings are compared with a distance d(yvis, yspch)]
[Harwath et al., NIPS’16]
36 / 40
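Training pushes matched image-speech pairs closer together than mismatched ones in the common space. Below is a hedged sketch of a margin-based ranking objective of the kind used for such joint spaces; the exact loss in [Harwath et al., NIPS’16] may differ, and the function name is mine:

```python
import numpy as np

def margin_rank_loss(y_vis, y_spch, margin=1.0):
    """Hinge ranking loss: the matched pair (row i of each matrix)
    should be closer than any mismatched pair by at least `margin`."""
    n = len(y_vis)
    d = np.linalg.norm(y_vis[:, None, :] - y_spch[None, :, :], axis=-1)
    pos = np.diag(d)  # distances of the matched pairs
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # matched pair i must beat impostor speech j and impostor image j
                loss += max(0.0, margin + pos[i] - d[i, j])
                loss += max(0.0, margin + pos[i] - d[j, i])
    return loss / (2 * n * (n - 1))
```

Once trained, retrieval in either direction is nearest-neighbour search in the shared space.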
![Page 95: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/95.jpg)
Visually grounded keyword spotting

| Keyword | Example of matched utterance | Type |
| --- | --- | --- |
| beach | a boy in a yellow shirt is walking on a beach . . . | correct |
| behind | a surfer does a flip on a wave | mistake |
| bike | a dirt biker flies through the air | variant |
| boys | two children play soccer in the park | semantic |
| large | . . . a rocky cliff overlooking a body of water | semantic |
| play | children playing in a ball pit | variant |
| sitting | two people are seated at a table with drinks | semantic |
| yellow | a tan dog jumping over a red and blue toy | mistake |
| young | a little girl on a kid swing | semantic |

[Kamper et al., Interspeech’17]
37 / 40
![Page 96: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/96.jpg)
Summary and conclusion
![Page 97: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/97.jpg)
What did we chat about today?
• Supervised speech recognition: From HMMs all the way to CLDNNs
• Structure is still important in speech recognition
• Saw three examples of models that do not require ASR
• Looked at local work taking inspiration from humans
39 / 40
![Page 98: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/98.jpg)
What’s next (specifically for us)?
• Still many, many unsolved core machine learning problems in unsupervised and low-resource speech processing
• Building speech search systems for (South) African languages
• Can some of these approaches be used in other machine learning domains? E.g. can vision tell us something about speech?
• What can we learn about language acquisition in humans?
• Language acquisition in robots
• Main take-away: Look at machine learning problems from different perspectives and angles
40 / 40
![Page 105: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/105.jpg)
http://www.kamperh.com/
https://github.com/kamperh
![Page 106: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/106.jpg)
Backup slides
![Page 107: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/107.jpg)
Acoustic word embeddings (AWE)
[Figure: variable-length speech segments Y1 and Y2 are mapped by fe(·) to fixed-dimensional acoustic word embeddings x1 and x2, with x ∈ RD]
[Levin et al., ASRU’13]
43 / 40
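One of the simplest embedding functions fe(·) considered in this line of work is downsampling: keep k uniformly spaced frames of the segment and concatenate them, so a segment of any duration maps to the same fixed dimensionality. A NumPy sketch of that baseline idea (the function and parameter names are mine):

```python
import numpy as np

def downsample_embed(Y, k=10):
    """Map a variable-length segment Y (frames x dims) to a fixed
    k*dims embedding by keeping k uniformly spaced frames."""
    idx = np.linspace(0, len(Y) - 1, k).round().astype(int)
    return Y[idx].flatten()
```

Because the sampling is uniform in time, the same word spoken faster or slower lands near the same point in the embedding space, which is exactly what downstream comparisons rely on.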
![Page 108: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/108.jpg)
Word similarity Siamese CNN
Use the idea of Siamese networks [Bromley et al., PatRec’93]
[Figure: two segments Y1 and Y2 are mapped by a tied network to x1 = f(Y1) and x2 = f(Y2), trained with a distance loss l(x1, x2)]
[Kamper et al., ICASSP’15]
44 / 40
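The distance loss l(x1, x2) can, for instance, be a contrastive hinge loss: embeddings of same-word pairs are pulled together, while different-word pairs are pushed at least a margin apart. A minimal sketch of that generic formulation; the exact loss in [Kamper et al., ICASSP’15] may differ, and the names are mine:

```python
import numpy as np

def contrastive_loss(x1, x2, same, margin=1.0):
    """l(x1, x2): squared distance for same-word pairs, squared hinge
    on (margin - distance) for different-word pairs."""
    d = np.linalg.norm(x1 - x2)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Both branches of the Siamese network share weights, so a single embedding function f(·) is learned from word-pair side information alone.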
![Page 110: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/110.jpg)
Retrieval in common (semantic) space
[Figure: image embeddings yvis and speech embeddings yspch, each y ∈ RD, lie in a common D-dimensional space]
[Harwath et al., NIPS’16]
45 / 40
![Page 111: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/111.jpg)
Word prediction from images and speech
[Figure: a VGG network maps the image to soft word targets yvis (e.g. hat 0.85, man 0.8, shirt 0.9); convolutional, max-pooling, and feedforward layers map speech X to f(X); the two are compared with a loss L]
f(X) ∈ RW is a vector of word probabilities
I.e., a spoken bag-of-words (BoW) classifier
[Kamper et al., Interspeech’17]
46 / 40
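With f(X) ∈ RW a vector of per-word probabilities and the vision network's outputs used as soft targets, the loss L can be a binary cross-entropy summed over the vocabulary, treating each word as an independent binary label. A hedged NumPy sketch of that formulation (the exact loss in [Kamper et al., Interspeech’17] may differ; names are mine):

```python
import numpy as np

def bow_loss(f_X, vgg_probs):
    """Mean binary cross-entropy between the speech network's word
    probabilities f(X) and soft targets from the vision network."""
    eps = 1e-12
    f = np.clip(f_X, eps, 1 - eps)  # guard the logs
    return -np.mean(vgg_probs * np.log(f) +
                    (1 - vgg_probs) * np.log(1 - f))
```

At test time, thresholding f(X) gives keyword spotting without any transcribed speech, since supervision comes entirely from the paired images.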
![Page 115: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/115.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 116: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/116.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 117: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/117.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 118: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/118.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 119: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/119.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40