The Use of Context in Large Vocabulary Speech Recognition
Julian James Odell, March 1995
Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy
Presenter: Hsu-Ting Wei
Context

Contents (cont)
Introduction
• The use of context-dependent models introduces two major problems:
  – 1. Sparse and uneven training data
  – 2. An efficient decoding strategy that incorporates context dependencies both within words and across word boundaries
Introduction (cont)

• About problem 1 (Ch. 3):
  – Construct robust and accurate recognizers using decision-tree-based clustering techniques
  – Linguistic knowledge is used
  – The approach allows the construction of models which are dependent upon contextual effects occurring across word boundaries
• About problem 2 (Ch. 4 onwards):
  – The thesis presents a new decoder design which is capable of using these models efficiently
  – The decoder can generate a lattice of word hypotheses with little computational overhead
Ch3: Context dependency in speech

• 3.1 Contextual Variation
  – In order to maximize the accuracy of HMM-based speech recognition systems, it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMMs while minimizing the effects of their weaknesses:
    • Signal parameterisation
    • Model structure
  – Ensure that the between-class variance is higher than the within-class variance
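The between-class vs. within-class criterion above can be made concrete with a small sketch. This is an illustrative helper (the function name and data layout are assumptions, not code from the thesis): a parameterisation is useful when feature vectors for different phone classes separate more than they scatter internally.

```python
import numpy as np

def variance_ratio(features_by_class):
    """Ratio of between-class to within-class variance.

    features_by_class: dict mapping a class label (e.g. a phone)
    to an (N, D) array of feature vectors. A parameterisation that
    discriminates well between classes gives a ratio well above 1.
    """
    all_feats = np.concatenate(list(features_by_class.values()))
    global_mean = all_feats.mean(axis=0)
    between, within = 0.0, 0.0
    for feats in features_by_class.values():
        mean = feats.mean(axis=0)
        # spread of the class mean around the global mean, occupancy-weighted
        between += len(feats) * np.sum((mean - global_mean) ** 2)
        # spread of the vectors around their own class mean
        within += np.sum((feats - mean) ** 2)
    return between / within
```

With two well-separated synthetic "phones" the ratio is large; with overlapping classes it falls toward zero.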
Ch3: Context dependency in speech (cont)

• Most of the variability inherent in speech is due to contextual effects
  – Session effects
    • Speaker effects – major source of variation
    • Environmental effects – controlled by minimizing the background noise and ensuring that the same microphone is used
  – Local effects
    • Utterance – co-articulation, stress, emphasis
• By taking these contextual effects into account, the variability can be reduced and the accuracy of the models increased
Ch3: Context dependency in speech (cont)

• Session effects
  – A speaker-dependent (SD) system is significantly more accurate than a similar speaker-independent (SI) system
  – Speaker effects:
    • Gender and age
    • Dialect
    • Style
  – To make an SI system approach SD performance, we can:
    • Operate recognizers in parallel
    • Adapt the recognizer to match the new speaker
Ch3: Context dependency in speech (cont)

• Session effects (cont)
  – Operating recognizers in parallel
    • Disadvantage: the computational load appears to rise linearly with the number of systems
    • Advantage: one system tends to dominate quickly, so the computational load is high for only the first few seconds of speech
  – (Diagram: speaker type → answer)
Ch3: Context dependency in speech (cont)

• Session effects (cont)
  – Adapting the recognizer to match the new speaker
    • Problem: there is insufficient data to update the model
  – It is possible to make use of both techniques: initially use parallel systems to choose the speaker characteristics, then, once enough data is available, adapt the chosen system to better match the speaker (MAP, MLLR)
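The MAP side of the adaptation idea can be sketched in a few lines. This is a simplified 1-Gaussian illustration of the principle (the function name and the scalar prior weight `tau` are assumptions, not the thesis formulation): with little adaptation data the estimate stays near the speaker-independent prior, and it moves toward the speaker's empirical mean as data accumulates.

```python
import numpy as np

def map_adapt_mean(prior_mean, adaptation_frames, tau=10.0):
    """MAP re-estimation of a Gaussian mean from limited speaker data.

    prior_mean: speaker-independent mean vector.
    adaptation_frames: (N, D) array of the new speaker's feature frames.
    tau: prior weight; roughly how many frames of data it takes before
    the adapted mean is pulled halfway toward the empirical mean.
    """
    n = len(adaptation_frames)
    empirical = np.mean(adaptation_frames, axis=0)
    # interpolate between prior and data, weighted by tau vs. frame count
    return (tau * prior_mean + n * empirical) / (tau + n)
```

MLLR would instead estimate a shared linear transform of the means, which helps when per-Gaussian data is too sparse even for this interpolation.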
Ch3: Context dependency in speech (cont)

• Local effects
  – Co-articulation means that the acoustic realization of a phone in a particular phonetic context is more consistent than the same phone occurring in a variety of contexts
  – Ex: "We were away with William in Sea World" → w iy w er … s iy w er
Ch3: Context dependency in speech (cont)

• Local effects
  – Context Dependent Phonetic Models
    • In LIMSI:
      – 45 monophone contexts (Festival CMU: 41)
        » STEAK = sil s t ey k sil
      – 2071 biphone contexts (Festival CMU: 1364)
        » STEAK = sil sil-s s-t t-ey ey-k sil
      – 95221 triphone contexts
        » STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
  – Word boundaries
    • Word-internal context dependency (intra-word)
      – STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
    • Cross-word context dependency (inter-word) => can increase accuracy
      – STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
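The cross-word expansion above is mechanical enough to sketch directly. The snippet below reproduces the slide's STEAK example in HTK-style `left-phone+right` notation, with `sil` left context-independent as on the slide (a minimal sketch; real dictionaries also handle word boundaries, stress and multiple pronunciations).

```python
def to_triphones(phones):
    """Expand a flat phone sequence (with 'sil' at the edges) into
    cross-word triphone model names in l-p+r notation."""
    out = []
    for i, p in enumerate(phones):
        if p == "sil":
            # silence is kept context-independent, as in the example
            out.append(p)
            continue
        left, right = phones[i - 1], phones[i + 1]
        out.append(f"{left}-{p}+{right}")
    return out
```

Running it on the STEAK sequence gives exactly the triphone string shown above, and across a word boundary (e.g. `ey-k+ae`) the right context comes from the next word, which is what makes decoding with these models hard.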
English dictionary

• Festlex CMU – lexicon (American English) for the Festival Speech System (2003-2006)
  – 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))
(world nil (((w er l d) 1)))
English dictionary (cont)

• The LIMSI dictionary phone set (1993)
  – 45 phones
Linguistic knowledge (cont)

• General questions (nasals, fricatives, liquids)
Linguistic knowledge (cont)

• Vowel questions
Linguistic knowledge (cont)

• Consonant questions
  – fortis (articulated with more force) vs. lenis (articulated with less force)
  – apical (tongue-tip)
  – strident
  – syllabic
  – fricative
  – affricate
Linguistic knowledge (cont)

• Questions used in HTK (⇐ state tying)
Ch4: Decoding

• This chapter describes several decoding techniques suitable for recognition of continuous speech using HMMs
• It is concerned with the use of cross-word context-dependent acoustic models and long-span language models
• The ideal decoder:
  – 4.2 Time-synchronous decoding
    • 4.2.1 Token passing
    • 4.2.2 Beam pruning
    • 4.2.3 N-best decoding
    • 4.2.4 Limitations
    • 4.2.5 Back-off implementation
  – 4.3 Best-first decoding
    • 4.3.1 A* decoding
    • 4.3.2 The stack decoder for speech recognition
  – 4.4 A hybrid approach
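The token-passing and beam-pruning ideas of 4.2.1/4.2.2 can be shown in miniature. In the toy sketch below (an illustration, not the thesis decoder), every active state holds one token carrying the best log score of any path reaching it; each frame, tokens propagate along transitions, and any token falling more than `beam` below the frame's best is discarded.

```python
import math

def token_passing(obs_ll, trans_ll, beam=10.0):
    """Time-synchronous Viterbi by token passing with beam pruning.

    obs_ll[t][s]: log observation likelihood of state s at frame t.
    trans_ll[s][s2]: log transition score (use -inf for no arc).
    Returns the best log score over states after the last frame.
    """
    n_states = len(trans_ll)
    tokens = {0: 0.0}  # all paths start in state 0
    for frame in obs_ll:
        new = {}
        for s, score in tokens.items():
            for s2 in range(n_states):
                t = trans_ll[s][s2]
                if t == -math.inf:
                    continue
                cand = score + t + frame[s2]
                if cand > new.get(s2, -math.inf):
                    new[s2] = cand  # keep only the best token per state
        best = max(new.values())
        # beam pruning: drop tokens far below the frame's best
        tokens = {s: v for s, v in new.items() if v >= best - beam}
    return max(tokens.values())
```

Pruning makes the search approximate (a pruned path can never recover), which is one of the limitations that motivates the N-best and best-first alternatives in the rest of the chapter.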
Ch4: Decoding (cont)

4.1 Requirements
  – Ideal decoder: it should find the most likely grammatical hypothesis for an unknown utterance
    • Acoustic model likelihood
    • Language model likelihood
Ch4: Decoding (cont)

4.1 Requirements (cont)
  – The ideal decoder would have the following characteristics:
    • Efficiency: ensure that the system does not lag behind the speaker
    • Accuracy: find the most likely grammatical sequence of words for each utterance
    • Scalability: the computation required by the decoder should increase less than linearly with the size of the vocabulary
    • Versatility: allow a variety of constraints and knowledge sources to be incorporated directly into the search without compromising its efficiency (n-gram language models + cross-word context-dependent models)
Conclusion

• Implement the HTK right-biphone task and triphone task
ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er
13
Ch3 Context dependency in speech (cont)
bull Local effectsndash Context Dependent Phonetic Models
bull IN LIMSIndash 45 monophone context (Festival CMU 41)
raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)
raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context
raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)
ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy
ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
14
English dictionary
bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones
(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))
15
English dictionary (cont)
bull The LIMSI dictionary phones set (1993)ndash 45 phones
16
Linguistic knowledge (cont)
鼻音摩擦音流音
bull General questions
17
Linguistic knowledge (cont)
bull Vowel questions
18
Linguistic knowledge (cont)
bull Consonant questions
發音時很用力的子音發音較不費力的子音
舌尖音
刺耳的
音節主音
摩擦音
破擦音
19
Linguistic knowledge (cont)
bull Questions which is used in HTK
lt= State tying
20
Ch4Decoding
bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM
bull It is concerned with the use of cross word context dependent acoustic and long span language models
bull Ideal decoderndash 42 Time-Synchronous decoding
bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation
ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition
ndash 44 A Hybrid approach
21
Ch4Decoding (cont)
41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe
sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood
22
Ch4Decoding (cont)
41 Requirements (cont)ndash The ideal decoder would have following characteristics
bull Efficiency Ensure that the system does not lag behind the speaker
bull Accuracy Find the most likely grammatical sequence of words for each utterance
bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary
bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)
23
Conclusion
bull Implement HTK right biphone task and triphone task
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
-
7
Ch3 Context dependency in speech
bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog
nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses
bull Signal parameterisationbull Model structure
ndash Ensure that their between class variance is higher than the within class variance
8
Ch3 Context dependency in speech (cont)
bull Most of the variability inherent in speech is due to contextual effectsndash Session effects
bull Speaker effects ndash Major source of variation
bull Environmental effectsndash Control by minimizing the background noise and
ensuring that the same microphone is usedndash Local effects
bull Utterancendash Co-articulation stress emphasis
bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased
9
Ch3 Context dependency in speech (cont)
bull Session effectsndash Speaker dependent system (SD) is significantly more accurate
than a similar speaker independent system (SI)ndash Speaker effects
bull Gender and agebull Dialectbull Style
ndash In order to making the SI system to simulate SD system we can do
bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker
10
Ch3 Context dependency in speech (cont)
bull Session effects (cont)ndash Operating recognizers in parallel
bull Disadvantage ndash The computational load appears to rises linearly with the number of
systemsbull Advantage
ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech
Speaker typeanswer
11
Ch3 Context dependency in speech (cont)
bull Session effects (cont)ndash Adapting the recognizer to match the new speaker
bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use
parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker
MAPMLLR
12
Ch3 Context dependency in speech (cont)
bull Local effectsndash Co-articulation means that the acoustic realization of a phone in
a particular phonetic context is more consistent than the same phone occurring in a variety of contexts
ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er
13
Ch3 Context dependency in speech (cont)
bull Local effectsndash Context Dependent Phonetic Models
bull IN LIMSIndash 45 monophone context (Festival CMU 41)
raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)
raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context
raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)
ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy
ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
14
English dictionary
bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones
(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))
15
English dictionary (cont)
bull The LIMSI dictionary phones set (1993)ndash 45 phones
16
Linguistic knowledge (cont)
鼻音摩擦音流音
bull General questions
17
Linguistic knowledge (cont)
bull Vowel questions
18
Linguistic knowledge (cont)
bull Consonant questions
發音時很用力的子音發音較不費力的子音
舌尖音
刺耳的
音節主音
摩擦音
破擦音
19
Linguistic knowledge (cont)
bull Questions which is used in HTK
lt= State tying
20
Ch4Decoding
bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM
bull It is concerned with the use of cross word context dependent acoustic and long span language models
bull Ideal decoderndash 42 Time-Synchronous decoding
bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation
ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition
ndash 44 A Hybrid approach
21
Ch4Decoding (cont)
41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe
sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood
22
Ch4Decoding (cont)
41 Requirements (cont)ndash The ideal decoder would have following characteristics
bull Efficiency Ensure that the system does not lag behind the speaker
bull Accuracy Find the most likely grammatical sequence of words for each utterance
bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary
bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)
23
Conclusion
bull Implement HTK right biphone task and triphone task
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
-
8
Ch3 Context dependency in speech (cont)
bull Most of the variability inherent in speech is due to contextual effectsndash Session effects
bull Speaker effects ndash Major source of variation
bull Environmental effectsndash Control by minimizing the background noise and
ensuring that the same microphone is usedndash Local effects
bull Utterancendash Co-articulation stress emphasis
bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased
9
Ch3 Context dependency in speech (cont)
bull Session effectsndash Speaker dependent system (SD) is significantly more accurate
than a similar speaker independent system (SI)ndash Speaker effects
bull Gender and agebull Dialectbull Style
ndash In order to making the SI system to simulate SD system we can do
bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker
10
Ch3 Context dependency in speech (cont)
bull Session effects (cont)ndash Operating recognizers in parallel
bull Disadvantage ndash The computational load appears to rises linearly with the number of
systemsbull Advantage
ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech
Speaker typeanswer
11
Ch3 Context dependency in speech (cont)
bull Session effects (cont)ndash Adapting the recognizer to match the new speaker
bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use
parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker
MAPMLLR
12
Ch3 Context dependency in speech (cont)
bull Local effectsndash Co-articulation means that the acoustic realization of a phone in
a particular phonetic context is more consistent than the same phone occurring in a variety of contexts
ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er
13
Ch3 Context dependency in speech (cont)
bull Local effectsndash Context Dependent Phonetic Models
bull IN LIMSIndash 45 monophone context (Festival CMU 41)
raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)
raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context
raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)
ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy
ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
14
English dictionary
bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones
(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))
15
English dictionary (cont)
bull The LIMSI dictionary phones set (1993)ndash 45 phones
16
Linguistic knowledge (cont)
鼻音摩擦音流音
bull General questions
17
Linguistic knowledge (cont)
bull Vowel questions
18
Linguistic knowledge (cont)
bull Consonant questions
發音時很用力的子音發音較不費力的子音
舌尖音
刺耳的
音節主音
摩擦音
破擦音
19
Linguistic knowledge (cont)
bull Questions which is used in HTK
lt= State tying
20
Ch4Decoding
bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM
bull It is concerned with the use of cross word context dependent acoustic and long span language models
bull Ideal decoderndash 42 Time-Synchronous decoding
bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation
ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition
ndash 44 A Hybrid approach
21
Ch4Decoding (cont)
41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe
sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood
22
Ch4Decoding (cont)
41 Requirements (cont)ndash The ideal decoder would have following characteristics
bull Efficiency Ensure that the system does not lag behind the speaker
bull Accuracy Find the most likely grammatical sequence of words for each utterance
bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary
bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)
23
Conclusion
bull Implement HTK right biphone task and triphone task
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
-
9
Ch3 Context dependency in speech (cont)
bull Session effectsndash Speaker dependent system (SD) is significantly more accurate
than a similar speaker independent system (SI)ndash Speaker effects
bull Gender and agebull Dialectbull Style
ndash In order to making the SI system to simulate SD system we can do
bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker
10
Ch3 Context dependency in speech (cont)
bull Session effects (cont)ndash Operating recognizers in parallel
bull Disadvantage ndash The computational load appears to rises linearly with the number of
systemsbull Advantage
ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech
Speaker typeanswer
11
Ch3 Context dependency in speech (cont)
bull Session effects (cont)ndash Adapting the recognizer to match the new speaker
bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use
parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker
MAPMLLR
12
Ch3 Context dependency in speech (cont)
bull Local effectsndash Co-articulation means that the acoustic realization of a phone in
a particular phonetic context is more consistent than the same phone occurring in a variety of contexts
ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er
13
Ch3 Context dependency in speech (cont)
bull Local effectsndash Context Dependent Phonetic Models
bull IN LIMSIndash 45 monophone context (Festival CMU 41)
raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)
raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context
raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)
ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy
ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
14
English dictionary
bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones
(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))
15
English dictionary (cont)
bull The LIMSI dictionary phones set (1993)ndash 45 phones
16
Linguistic knowledge (cont)
鼻音摩擦音流音
bull General questions
17
Linguistic knowledge (cont)
bull Vowel questions
18
Linguistic knowledge (cont)
bull Consonant questions
發音時很用力的子音發音較不費力的子音
舌尖音
刺耳的
音節主音
摩擦音
破擦音
19
Linguistic knowledge (cont)
bull Questions which is used in HTK
lt= State tying
20
Ch4Decoding
bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM
bull It is concerned with the use of cross word context dependent acoustic and long span language models
bull Ideal decoderndash 42 Time-Synchronous decoding
bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation
ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition
ndash 44 A Hybrid approach
21
Ch4Decoding (cont)
41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe
sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood
22
Ch4Decoding (cont)
41 Requirements (cont)ndash The ideal decoder would have following characteristics
bull Efficiency Ensure that the system does not lag behind the speaker
bull Accuracy Find the most likely grammatical sequence of words for each utterance
bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary
bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)
23
Conclusion
bull Implement HTK right biphone task and triphone task
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
-
10
Ch3 Context dependency in speech (cont)
bull Session effects (cont)ndash Operating recognizers in parallel
bull Disadvantage ndash The computational load appears to rises linearly with the number of
systemsbull Advantage
ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech
Speaker typeanswer
11
Ch3 Context dependency in speech (cont)
bull Session effects (cont)ndash Adapting the recognizer to match the new speaker
bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use
parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker
MAPMLLR
12
Ch3 Context dependency in speech (cont)
bull Local effectsndash Co-articulation means that the acoustic realization of a phone in
a particular phonetic context is more consistent than the same phone occurring in a variety of contexts
ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er
13
Ch3 Context dependency in speech (cont)
bull Local effectsndash Context Dependent Phonetic Models
bull IN LIMSIndash 45 monophone context (Festival CMU 41)
raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)
raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context
raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)
ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy
ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
14
English dictionary
bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones
(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))
15
English dictionary (cont)
bull The LIMSI dictionary phones set (1993)ndash 45 phones
16
Linguistic knowledge (cont)
鼻音摩擦音流音
bull General questions
17
Linguistic knowledge (cont)
bull Vowel questions
18
Linguistic knowledge (cont)
bull Consonant questions
發音時很用力的子音發音較不費力的子音
舌尖音
刺耳的
音節主音
摩擦音
破擦音
19
Linguistic knowledge (cont)
bull Questions which is used in HTK
lt= State tying
20
Ch4Decoding
bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM
bull It is concerned with the use of cross word context dependent acoustic and long span language models
bull Ideal decoderndash 42 Time-Synchronous decoding
bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation
ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition
ndash 44 A Hybrid approach
21
Ch4Decoding (cont)
41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe
sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood
22
Ch4Decoding (cont)
41 Requirements (cont)ndash The ideal decoder would have following characteristics
bull Efficiency Ensure that the system does not lag behind the speaker
bull Accuracy Find the most likely grammatical sequence of words for each utterance
bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary
bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)
23
Conclusion
bull Implement HTK right biphone task and triphone task
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
-
11
Ch3 Context dependency in speech (cont)
bull Session effects (cont)ndash Adapting the recognizer to match the new speaker
bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use
parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker
MAPMLLR
12
Ch3 Context dependency in speech (cont)
bull Local effectsndash Co-articulation means that the acoustic realization of a phone in
a particular phonetic context is more consistent than the same phone occurring in a variety of contexts
ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er
13
Ch3 Context dependency in speech (cont)
bull Local effectsndash Context Dependent Phonetic Models
bull IN LIMSIndash 45 monophone context (Festival CMU 41)
raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)
raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context
raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)
ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy
ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
14
English dictionary
bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones
(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))
15
English dictionary (cont)
bull The LIMSI dictionary phones set (1993)ndash 45 phones
16
Linguistic knowledge (cont)
鼻音摩擦音流音
bull General questions
17
Linguistic knowledge (cont)
bull Vowel questions
18
Linguistic knowledge (cont)
bull Consonant questions
發音時很用力的子音發音較不費力的子音
舌尖音
刺耳的
音節主音
摩擦音
破擦音
19
Linguistic knowledge (cont)
bull Questions which is used in HTK
lt= State tying
20
Ch4Decoding
bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM
bull It is concerned with the use of cross word context dependent acoustic and long span language models
bull Ideal decoderndash 42 Time-Synchronous decoding
bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation
ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition
ndash 44 A Hybrid approach
21
Ch4Decoding (cont)
41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe
sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood
22
Ch4Decoding (cont)
41 Requirements (cont)ndash The ideal decoder would have following characteristics
bull Efficiency Ensure that the system does not lag behind the speaker
bull Accuracy Find the most likely grammatical sequence of words for each utterance
bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary
bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)
23
Conclusion
bull Implement HTK right biphone task and triphone task
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
-
12
Ch3 Context dependency in speech (cont)
bull Local effectsndash Co-articulation means that the acoustic realization of a phone in
a particular phonetic context is more consistent than the same phone occurring in a variety of contexts
ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er
13
Ch3 Context dependency in speech (cont)
bull Local effectsndash Context Dependent Phonetic Models
bull IN LIMSIndash 45 monophone context (Festival CMU 41)
raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)
raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context
raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)
ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy
ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
14
English dictionary
bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones
(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))
15
English dictionary (cont)
bull The LIMSI dictionary phones set (1993)ndash 45 phones
16
Linguistic knowledge (cont)
鼻音摩擦音流音
bull General questions
17
Linguistic knowledge (cont)
bull Vowel questions
18
Linguistic knowledge (cont)
bull Consonant questions
發音時很用力的子音發音較不費力的子音
舌尖音
刺耳的
音節主音
摩擦音
破擦音
19
Linguistic knowledge (cont)
bull Questions which is used in HTK
lt= State tying
20
Ch4Decoding
bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM
bull It is concerned with the use of cross word context dependent acoustic and long span language models
bull Ideal decoderndash 42 Time-Synchronous decoding
bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation
ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition
ndash 44 A Hybrid approach
21
Ch4Decoding (cont)
41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe
sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood
22
Ch4Decoding (cont)
41 Requirements (cont)ndash The ideal decoder would have following characteristics
bull Efficiency Ensure that the system does not lag behind the speaker
bull Accuracy Find the most likely grammatical sequence of words for each utterance
bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary
bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)
23
Conclusion
bull Implement HTK right biphone task and triphone task
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
-
13
Ch3 Context dependency in speech (cont)
bull Local effectsndash Context Dependent Phonetic Models
bull IN LIMSIndash 45 monophone context (Festival CMU 41)
raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)
raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context
raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)
ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy
ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
14
English dictionary
bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones
(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))
15
English dictionary (cont)
bull The LIMSI dictionary phones set (1993)ndash 45 phones
16
Linguistic knowledge (cont)
鼻音摩擦音流音
bull General questions
17
Linguistic knowledge (cont)
bull Vowel questions
18
Linguistic knowledge (cont)
• Consonant questions
Fortis (strongly articulated) vs. lenis (weakly articulated) consonants
Apical (tongue-tip) consonants
Strident
Syllabic
Fricative
Affricate
19
Linguistic knowledge (cont)
• Questions used in HTK
⇐ State tying
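As a hedged sketch of how such questions drive state tying (the question sets and the tiny tree below are invented for illustration, not the thesis' actual trees): each node asks whether a triphone's left or right context belongs to a phone class, and all triphones reaching the same leaf share one tied state.

```python
# Illustrative phone classes, in the spirit of the question lists above.
NASAL = {"m", "n", "ng"}
VOWEL = {"ae", "ey", "ih", "ax", "ow", "er"}

def context(triphone):
    """Split an 'l-c+r' unit into (left, centre, right) phones."""
    left, rest = triphone.split("-")
    centre, right = rest.split("+")
    return left, centre, right

def tied_state(triphone):
    """Walk a two-question decision tree and return the tied-state leaf."""
    left, _, right = context(triphone)
    if right in VOWEL:          # Q1: is the right context a vowel?
        return "leaf_rvowel"
    if left in NASAL:           # Q2: is the left context a nasal?
        return "leaf_lnasal"
    return "leaf_other"
```

For example, `s-t+ey` and `d-ch+ih` both land in `leaf_rvowel` and would share parameters, which is how decision-tree clustering keeps the 95,221 possible triphones trainable from sparse data.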
20
Ch4 Decoding
• This chapter describes several decoding techniques suitable for recognition of continuous speech using HMMs.
• It is concerned with the use of cross-word context-dependent acoustic models and long-span language models.
• Ideal decoder
  – 4.2 Time-Synchronous Decoding
    • 4.2.1 Token passing
    • 4.2.2 Beam pruning
    • 4.2.3 N-Best decoding
    • 4.2.4 Limitations
    • 4.2.5 Back-Off implementation
  – 4.3 Best-First Decoding
    • 4.3.1 A* Decoding
    • 4.3.2 The stack decoder for speech recognition
  – 4.4 A Hybrid approach
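Token passing with beam pruning (the first two techniques in the outline) can be sketched as follows; this is a minimal, hedged illustration over a hand-made HMM, not the thesis' decoder. Each state holds a token carrying a log likelihood; every frame, tokens propagate along transitions, the best token per state survives, and tokens falling more than `beam` below the frame's best are discarded.

```python
import math

def token_pass(log_obs, log_trans, beam=10.0):
    """Viterbi-style token passing.

    log_obs[t][s]  : per-frame state log observation likelihoods
    log_trans[s][s2]: log transition probabilities (-inf = no arc)
    Returns the surviving token score in each state after the last frame.
    """
    n = len(log_trans)
    tokens = [0.0] + [-math.inf] * (n - 1)       # start in state 0
    for frame in log_obs:
        new = [-math.inf] * n
        for s, tok in enumerate(tokens):
            if tok == -math.inf:
                continue                          # no live token here
            for s2, ltr in enumerate(log_trans[s]):
                if ltr == -math.inf:
                    continue                      # no transition s -> s2
                score = tok + ltr + frame[s2]
                if score > new[s2]:               # keep best token per state
                    new[s2] = score
        best = max(new)
        # beam pruning: kill tokens far below the frame's best hypothesis
        tokens = [t if t >= best - beam else -math.inf for t in new]
    return tokens

# Toy two-state left-to-right model, two frames of observations.
trans = [[-1.0, -1.0], [-math.inf, 0.0]]
obs = [[-1.0, -5.0], [-1.0, -1.0]]
scores = token_pass(obs, trans)
```

With a tight beam the search stays cheap but risks pruning the eventual best path; with a wide beam it approaches exact Viterbi, which is exactly the efficiency/accuracy trade-off section 4.2 develops.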
21
Ch4 Decoding (cont)
• 4.1 Requirements
  – Ideal decoder: it should find the most likely grammatical hypothesis for an unknown utterance, combining:
    • Acoustic model likelihood
    • Language model likelihood
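In other words, the ideal decoder picks the word sequence W maximising p(O|W)·P(W), worked in log space so the two likelihoods simply add. A hedged one-line sketch (the hypotheses and scores below are made up for illustration):

```python
def best_hypothesis(hyps):
    """hyps: list of (words, acoustic_loglik, language_loglik) triples.
    Returns the triple maximising the summed log likelihood."""
    return max(hyps, key=lambda h: h[1] + h[2])

hyps = [
    ("steak and chips", -120.0, -8.0),   # acoustically slightly worse...
    ("stake and chips", -119.5, -12.0),  # ...but far more likely language
]
winner = best_hypothesis(hyps)
```

Here the language model overrules the marginally better acoustic score, which is why both knowledge sources must be in the search together rather than applied one after the other.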
22
Ch4 Decoding (cont)
• 4.1 Requirements (cont)
  – The ideal decoder would have the following characteristics:
    • Efficiency: ensure that the system does not lag behind the speaker.
    • Accuracy: find the most likely grammatical sequence of words for each utterance.
    • Scalability: the computation required by the decoder should increase less than linearly with the size of the vocabulary.
    • Versatility: allow a variety of constraints and knowledge sources (e.g. n-gram language models and cross-word context-dependent models) to be incorporated directly into the search without compromising its efficiency.
23
Conclusion
• Implement the right-biphone task and the triphone task in HTK.