Deep learning for (more than) speech recognition
IndabaX Western Cape, UCT, Apr. 2018
Herman Kamper
E&E Engineering, Stellenbosch University
http://www.kamperh.com/
Success in automatic speech recognition (ASR)
[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]
Talk outline
1. State-of-the-art automatic speech recognition (ASR)
2. Examples of non-ASR speech processing (the first rant)
3. Examples of local work (a second rant)
State-of-the-art speech recognition
Supervised speech recognition
i had to think of some example speech
since speech recognition is really cool
Feature extraction for speech processing
[Figure: a waveform is sliced into 25 ms analysis windows with 10 ms shifts; each frame t yields a D-dimensional feature vector x(t) = (x1, x2, . . . , xD), and the vectors x(1), x(2), x(3), x(4), . . . are stacked into the matrix X.]
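The framing step above can be sketched in NumPy. The function name and defaults below are illustrative (assuming 16 kHz audio), not the talk's exact pipeline:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Slice a waveform into overlapping analysis frames.

    Each 25 ms frame becomes one observation x(t); consecutive frames
    are shifted by 10 ms, so neighbouring frames overlap by 15 ms.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples
    num_frames = 1 + (len(signal) - frame_len) // shift
    return np.stack([signal[i * shift : i * shift + frame_len]
                     for i in range(num_frames)])

# One second of audio at 16 kHz gives roughly 100 frames:
audio = np.random.randn(16000)
X = frame_signal(audio)
print(X.shape)  # (98, 400)
```

Each row of X would then be mapped to a D-dimensional feature vector (e.g. MFCCs or filterbank energies).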
Feature extraction for speech processing
[Figure: spectrograms of the example utterances, frequency 0–8000 Hz vs. time (s).]
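A magnitude spectrogram like the one in the figure can be computed from the same overlapping frames. This NumPy sketch is a minimal illustration, not the talk's exact feature extractor:

```python
import numpy as np

def spectrogram(signal, frame_len=400, shift=160):
    """Magnitude spectrogram: windowed FFT of each overlapping frame."""
    window = np.hamming(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift : i * shift + frame_len] * window
                       for i in range(num_frames)])
    # rfft of a 400-sample frame gives 201 frequency bins; at a 16 kHz
    # sampling rate these span 0-8000 Hz, as on the figure's axis.
    return np.abs(np.fft.rfft(frames, axis=1))

S = spectrogram(np.random.randn(16000))
print(S.shape)  # (98, 201)
```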
Name these networks
Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Name these networks
[Figure: feedforward network mapping input x(i) to output y(i).]
Name these networks
Image: http://deeplearning.net/tutorial/lenet.html
Name these networks
p(x(1), x(2), . . . , x(N) | [ih])
Hidden Markov models (HMMs)
W* = argmax_W P(W = w(1), w(2), . . . , w(M) | X = x(1), x(2), . . . , x(N))
   = argmax_W P(W | X)
   = argmax_W Σ_U P(W, U | X)        [ "without" = /w ih th aw t/ ]
   = argmax_W Σ_U p(W, U, X) / p(X)
   = argmax_W Σ_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | U) P(U | W) P(W)

p(X | U): acoustic model    P(U | W): pronunciation dictionary    P(W): language model
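The final max_U approximation is exactly what Viterbi decoding computes. A minimal sketch for a toy discrete-observation HMM (the probabilities below are illustrative, not from the talk):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """Most likely state sequence U* = argmax_U p(X, U) for an HMM.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log transition probabilities
    log_emit:  (S, V) log emission probabilities over a discrete vocab
    obs:       (N,)   observed symbol indices
    """
    delta = log_init + log_emit[:, obs[0]]
    back = []
    for t in range(1, len(obs)):
        scores = delta[:, None] + log_trans          # (prev, next)
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_emit[:, obs[t]]
    # Trace back the best path from the best final state
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Toy 2-state example: state 0 prefers symbol 0, state 1 prefers symbol 1
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])
log_emit = np.log([[0.9, 0.1], [0.1, 0.9]])
print(viterbi(log_init, log_trans, log_emit, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```

In a real recogniser the states come from the composed decoder graph and the emissions are densities over feature vectors, but the max-product recursion is the same.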
Hidden Markov models (HMMs)

[Figure: an HMM for the phone [ih] assigning p(X | [ih]) to the frame sequence X = (x(1), . . . , x(6)), each frame a D-dimensional feature vector.]

Speech recognition was performed by combining the acoustic model (thousands of HMM states) with the pronunciation dictionary and language model in a (very big) decoder network (finite state machine).
Back to today: End-to-end speech recognition
[Chan et al., arXiv’15]
Why did we talk about HMMs?

• Could we use a standard feedforward deep neural network (DNN) for ASR?
• Idea: Use the HMM to obtain frame alignments for the DNN!
• Hybrid model: DNN-HMM
• Can be seen as representation learning trained jointly with a classifier

[Figure: DNN taking frame x(i) as input and predicting HMM state y(i) ∈ {s1, s2, . . . , s9000}.]
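The hybrid recipe can be summarised in a few lines: the DNN outputs state posteriors P(s | x), which are divided by the state priors P(s) (estimated from the frame alignments) to give scaled likelihoods for the HMM decoder. The numbers here are placeholders, not real model outputs:

```python
import numpy as np

# Hypothetical DNN output for one frame: posteriors over 4 HMM states
posteriors = np.array([0.7, 0.1, 0.1, 0.1])   # P(s | x), sums to 1
priors = np.array([0.4, 0.3, 0.2, 0.1])       # P(s), from alignment counts

# Bayes' rule: p(x | s) ∝ P(s | x) / P(s); the shared constant p(x)
# does not affect the decoder's argmax over word sequences.
scaled_likelihoods = posteriors / priors
print(scaled_likelihoods.argmax())  # 0
```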
What about convolutional neural networks?
[Figure: spectrograms, frequency 0–8000 Hz vs. time (s), viewed as two-dimensional inputs.]
Is end-to-end the best?
• End-to-end models are easier to implement¹
• But do they give state-of-the-art performance?
• What do you think CLDNN-HMM² stands for?

¹ https://github.com/espnet/espnet   ² [Sainath et al., ICASSP’15]
Summary: Speech recognition is important, but. . .
• A very important engineering endeavour: information access, illiteracy, assistance for the disabled
• But it is more: speech and language make us human
• Engineering decisions can tell us something about how we perceive the world: we saw how structure helps in speech recognition models
• And studies of how we perceive the world can tell us something about better engineering decisions
Rant 1: Do we always need/have ASR?
Examples of non-ASR speech processing
What if we do not have supervision?
• Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
• Data: 2000 hours of transcribed speech audio; ∼350M/560M words of text
• Can we do this for all 7000 languages spoken in the world?
• Many of these languages are endangered and unwritten
Example 1: Query-by-example search
Spoken query:

Useful speech system, not requiring any transcribed speech

[Jansen and Van Durme, Interspeech’12]
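Query-by-example search can be sketched with dynamic time warping (DTW), the kind of frame-level alignment such transcription-free systems typically build on. This NumPy implementation is a minimal illustration, not the system of Jansen and Van Durme:

```python
import numpy as np

def dtw_cost(query, utterance):
    """Dynamic time warping alignment cost between two feature sequences.

    Query-by-example search ranks utterances in the collection by this
    cost: no transcriptions are needed, only frame-level features.
    """
    Q, U = len(query), len(utterance)
    dist = np.linalg.norm(query[:, None, :] - utterance[None, :, :], axis=2)
    acc = np.full((Q, U), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(Q):
        for j in range(U):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = dist[i, j] + prev
    return acc[-1, -1] / (Q + U)   # length-normalised

# A query matches a time-stretched copy of itself better than noise:
rng = np.random.default_rng(0)
query = rng.standard_normal((20, 13))
stretched = np.repeat(query, 2, axis=0)      # same content, slower
other = rng.standard_normal((40, 13))
assert dtw_cost(query, stretched) < dtw_cost(query, other)
```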
Example 2: Linguistic and cultural documentation
http://www.stevenbird.net/
Example 3: Teaching robots to understand speech
[Janssens and Renkens, 2014]; [Renkens et al., SLT’14]27 / 40
![Page 62: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/62.jpg)
Rant 2: Taking inspiration from humans
Examples of local work
![Page 63: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/63.jpg)
Supervised speech recognition
i had to think of some example speech
since speech recognition is really cool
Can we acquire language from audio alone?
29 / 40
![Page 65: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/65.jpg)
![Page 66: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/66.jpg)
Full-coverage segmentation and clustering
31 / 40
![Page 69: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/69.jpg)
Unsupervised segmental Bayesian model
[Figure: the speech waveform is converted to acoustic frames y1:M by fa(·); hypothesised segments yt1:t2 are mapped to fixed-dimensional embeddings xi = fe(yt1:t2) by fe(·); a Bayesian Gaussian mixture model clusters the embeddings via p(xi | h−), with acoustic modelling and word segmentation performed jointly]
32 / 40
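The quantity p(xi | h−) on this slide is the posterior predictive of an embedding under a mixture component, given all the other current assignments h−. A minimal sketch for a single spherical Gaussian component with known noise variance and a conjugate Gaussian prior on its mean; this is a deliberate simplification of the full Bayesian GMM, and the function and parameter names are mine:

```python
import numpy as np

def log_post_pred(x, component_data, prior_mean, prior_var, noise_var):
    """log p(x | h-) for a spherical Gaussian component: conjugate
    Normal-Normal update of the mean, then the Gaussian predictive."""
    n = len(component_data)
    if n == 0:
        post_mean, post_var = prior_mean, prior_var
    else:
        post_var = 1.0 / (1.0 / prior_var + n / noise_var)
        post_mean = post_var * (prior_mean / prior_var +
                                component_data.sum(axis=0) / noise_var)
    pred_var = post_var + noise_var  # predictive variance per dimension
    d = len(x)
    diff = x - post_mean
    return -0.5 * (d * np.log(2 * np.pi * pred_var) + (diff @ diff) / pred_var)
```

In a Gibbs sampler, each embedding would be reassigned to a component in proportion to this predictive (times the component's prior assignment probability), which is what drives the joint clustering and segmentation.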
![Page 76: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/76.jpg)
Listen to discovered clusters
• Small-vocabulary cluster 45: Play
• Large-vocabulary English cluster 1214: Play
• Large-vocabulary Xitsonga cluster 629: Play
33 / 40
![Page 77: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/77.jpg)
Arrival
![Page 78: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/78.jpg)
Using images for grounding language
Consider images paired with unlabelled spoken captions:
Play
35 / 40
![Page 80: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/80.jpg)
Map images and speech into common space
[Figure: an image is mapped by a VGG network to yvis; speech input X is mapped by convolutional, max-pooling, and feedforward layers to yspch; the two embeddings are compared with a distance d(yvis, yspch)]
[Harwath et al., NIPS’16]
36 / 40
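Training pushes matched image-speech pairs closer together than mismatched ones in the common space. Below is a hedged sketch of a margin-based ranking objective of the kind used for such joint spaces; the exact loss in [Harwath et al., NIPS’16] may differ, and the function name is mine:

```python
import numpy as np

def margin_rank_loss(y_vis, y_spch, margin=1.0):
    """Hinge ranking loss: the matched pair (row i of each matrix)
    should be closer than any mismatched pair by at least `margin`."""
    n = len(y_vis)
    d = np.linalg.norm(y_vis[:, None, :] - y_spch[None, :, :], axis=-1)
    pos = np.diag(d)  # distances of the matched pairs
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # matched pair i must beat impostor speech j and impostor image j
                loss += max(0.0, margin + pos[i] - d[i, j])
                loss += max(0.0, margin + pos[i] - d[j, i])
    return loss / (2 * n * (n - 1))
```

Once trained, retrieval in either direction is nearest-neighbour search in the shared space.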
![Page 95: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/95.jpg)
Visually grounded keyword spotting

| Keyword | Example of matched utterance | Type |
| --- | --- | --- |
| beach | a boy in a yellow shirt is walking on a beach . . . | correct |
| behind | a surfer does a flip on a wave | mistake |
| bike | a dirt biker flies through the air | variant |
| boys | two children play soccer in the park | semantic |
| large | . . . a rocky cliff overlooking a body of water | semantic |
| play | children playing in a ball pit | variant |
| sitting | two people are seated at a table with drinks | semantic |
| yellow | a tan dog jumping over a red and blue toy | mistake |
| young | a little girl on a kid swing | semantic |

[Kamper et al., Interspeech’17]
37 / 40
![Page 96: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/96.jpg)
Summary and conclusion
![Page 97: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/97.jpg)
What did we chat about today?
• Supervised speech recognition: From HMMs all the way to CLDNNs
• Structure is still important in speech recognition
• Saw three examples of models that do not require ASR
• Looked at local work taking inspiration from humans
39 / 40
![Page 98: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/98.jpg)
What’s next (specifically for us)?
• Still many, many unsolved core machine learning problems in unsupervised and low-resource speech processing
• Building speech search systems for (South) African languages
• Can some of these approaches be used in other machine learning domains? E.g. can vision tell us something about speech?
• What can we learn about language acquisition in humans?
• Language acquisition in robots
• Main take-away: Look at machine learning problems from different perspectives and angles
40 / 40
![Page 105: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/105.jpg)
http://www.kamperh.com/
https://github.com/kamperh
![Page 106: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/106.jpg)
Backup slides
![Page 107: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/107.jpg)
Acoustic word embeddings (AWE)
[Figure: variable-length speech segments Y1 and Y2 are mapped by fe(·) to fixed-dimensional acoustic word embeddings x1 and x2, with x ∈ RD]
[Levin et al., ASRU’13]
43 / 40
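One of the simplest embedding functions fe(·) considered in this line of work is downsampling: keep k uniformly spaced frames of the segment and concatenate them, so a segment of any duration maps to the same fixed dimensionality. A NumPy sketch of that baseline idea (the function and parameter names are mine):

```python
import numpy as np

def downsample_embed(Y, k=10):
    """Map a variable-length segment Y (frames x dims) to a fixed
    k*dims embedding by keeping k uniformly spaced frames."""
    idx = np.linspace(0, len(Y) - 1, k).round().astype(int)
    return Y[idx].flatten()
```

Because the sampling is uniform in time, the same word spoken faster or slower lands near the same point in the embedding space, which is exactly what downstream comparisons rely on.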
![Page 108: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/108.jpg)
Word similarity Siamese CNN
Use the idea of Siamese networks [Bromley et al., PatRec’93]
[Figure: two segments Y1 and Y2 are mapped by a tied network to x1 = f(Y1) and x2 = f(Y2), trained with a distance loss l(x1, x2)]
[Kamper et al., ICASSP’15]
44 / 40
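The distance loss l(x1, x2) can, for instance, be a contrastive hinge loss: embeddings of same-word pairs are pulled together, while different-word pairs are pushed at least a margin apart. A minimal sketch of that generic formulation; the exact loss in [Kamper et al., ICASSP’15] may differ, and the names are mine:

```python
import numpy as np

def contrastive_loss(x1, x2, same, margin=1.0):
    """l(x1, x2): squared distance for same-word pairs, squared hinge
    on (margin - distance) for different-word pairs."""
    d = np.linalg.norm(x1 - x2)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Both branches of the Siamese network share weights, so a single embedding function f(·) is learned from word-pair side information alone.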
![Page 110: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/110.jpg)
Retrieval in common (semantic) space
[Figure: image embeddings yvis and speech embeddings yspch, each y ∈ RD, lie in a common D-dimensional space]
[Harwath et al., NIPS’16]
45 / 40
![Page 111: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/111.jpg)
Word prediction from images and speech
[Figure: a VGG network maps the image to soft word targets yvis (e.g. hat 0.85, man 0.8, shirt 0.9); convolutional, max-pooling, and feedforward layers map speech X to f(X); the two are compared with a loss L]
f(X) ∈ RW is a vector of word probabilities
I.e., a spoken bag-of-words (BoW) classifier
[Kamper et al., Interspeech’17]
46 / 40
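With f(X) ∈ RW a vector of per-word probabilities and the vision network's outputs used as soft targets, the loss L can be a binary cross-entropy summed over the vocabulary, treating each word as an independent binary label. A hedged NumPy sketch of that formulation (the exact loss in [Kamper et al., Interspeech’17] may differ; names are mine):

```python
import numpy as np

def bow_loss(f_X, vgg_probs):
    """Mean binary cross-entropy between the speech network's word
    probabilities f(X) and soft targets from the vision network."""
    eps = 1e-12
    f = np.clip(f_X, eps, 1 - eps)  # guard the logs
    return -np.mean(vgg_probs * np.log(f) +
                    (1 - vgg_probs) * np.log(1 - f))
```

At test time, thresholding f(X) gives keyword spotting without any transcribed speech, since supervision comes entirely from the paired images.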
![Page 115: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/115.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 116: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/116.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 117: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/117.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 118: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/118.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 119: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.vdocuments.us/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/119.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40