PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION
Zi-He Chen, Yuan-Fu Liao, and Yau-Tarng Juang
ICASSP 2005
Presenter: Fang-Hui CHU
2005/10/24, Speech Lab, NTNU
Reference
"Combination of Acoustic and Prosodic Information for Robust Speaker Identification," Yuan-Fu Liao, Zhi-Xian Zhuang, Zi-He Chen, and Yau-Tarng Juang, ROCLING 2005
"Eigen-Prosody Analysis for Robust Speaker Recognition under Mismatch Handset Environment," Zi-He Chen, Yuan-Fu Liao, and Yau-Tarng Juang, ICSLP 2004
Outline
Introduction
VQ-Based Prosodic Modeling
AKI Unseen Handset Estimation
Eigen-Prosody Analysis
Speaker Identification Experiments
Conclusions
Introduction
A speaker identification system in a telecommunication network environment needs to be robust against the distortion of mismatched handsets.
Prosodic features are known to be less sensitive to handset mismatch.
Several successful techniques:
GMMs: the per-frame pitch and energy values are extracted and modeled using traditional distribution models.
This may not adequately capture the temporal dynamic information of the prosodic feature contours.
Introduction (cont.)
Several successful techniques (cont.):
N-gram: the dynamics of the pitch and energy trajectories are described by sequences of symbols and modeled by n-gram statistics.
DHMM: the sequences of prosody symbols are further modeled by state observation and transition probabilities.
These approaches usually require a large amount of training/test data to reach reasonable performance.
In this paper, a VQ-based prosody modeling and an eigen-prosody analysis (EPA) approach are integrated to add robustness to a conventional cepstral-feature-based GMM closed-set speaker identification system under mismatched unseen handsets and limited training/test data.
Introduction (cont.)
[Diagram] The speaker's enrollment speech is converted by VQ-based prosodic modeling into sequences of prosody states; the resulting text document records the detailed prosody/speaking style of the speaker. EPA (via LSA) then builds an eigen-prosody space to represent the constellation of speakers.
In this way, the speaker identification problem is transformed into a task similar to full-text document retrieval.
Introduction (cont)Introduction (cont)
Since EPA utilizes prosody-level information it could be further Since EPA utilizes prosody-level information it could be further fused with acoustic-level information to complement each otherfused with acoustic-level information to complement each other
The The a priori a priori knowledge interpolation (AKI) approach is chosenknowledge interpolation (AKI) approach is chosenIt utilizes the maximum likelihood linear regression (MLLR) to It utilizes the maximum likelihood linear regression (MLLR) to compensate handset mismatchcompensate handset mismatch
2005102420051024 Speech Lab NTNU 88
VQ-Based Prosodic Modeling
In this study, syllables are chosen as the basic processing units.
Five types of prosodic features:
the slope of the pitch contour of a vowel segment
the lengthening factor of a vowel segment
the average log-energy difference between two vowels
the value of the pitch jump between two vowels
the pause duration between two syllables
The prosodic features are normalized according to the underlying vowel class to remove non-prosodic (context-information) effects:
x̂ = (x − μ_vowel) / σ_vowel
where μ_vowel and σ_vowel are the mean and variance of the vowel class over the whole corpus.
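As a rough sketch, the per-vowel-class normalization above amounts to a z-score computed over the whole corpus. The helper name is hypothetical, and dividing by the standard deviation (rather than the raw variance) is an assumption; the slide only names the mean and variance of the vowel class.

```python
import math
from collections import defaultdict

def normalize_by_vowel_class(features, vowel_labels):
    """x_hat = (x - mu_vowel) / sigma_vowel, with mu/sigma per vowel class
    estimated over the whole corpus (hypothetical helper, see lead-in)."""
    grouped = defaultdict(list)
    for x, v in zip(features, vowel_labels):
        grouped[v].append(x)

    stats = {}
    for v, xs in grouped.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        stats[v] = (mu, math.sqrt(var) or 1.0)  # guard against zero variance

    return [(x - stats[v][0]) / stats[v][1]
            for x, v in zip(features, vowel_labels)]
```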
Prosodic Modeling
Prosodic Modeling (cont)Prosodic Modeling (cont)
To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm
2005102420051024 Speech Lab NTNU 1111
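The codebook step can be sketched as follows. Since the exact EM configuration is not given on the slide, this uses plain k-means as a simple stand-in for the EM-trained M-codeword quantizer; all function names are hypothetical.

```python
import random

def train_codebook(vectors, M, iters=20, seed=0):
    """Learn M codewords by k-means (a toy stand-in for the EM training
    described in the paper)."""
    rng = random.Random(seed)
    codebook = rng.sample(vectors, M)
    for _ in range(iters):
        # assign each feature vector to its nearest codeword
        clusters = [[] for _ in range(M)]
        for v in vectors:
            j = min(range(M),
                    key=lambda m: sum((a - b) ** 2
                                      for a, b in zip(v, codebook[m])))
            clusters[j].append(v)
        # re-estimate each codeword as its cluster mean
        for m, c in enumerate(clusters):
            if c:
                codebook[m] = tuple(sum(col) / len(c) for col in zip(*c))
    return codebook

def quantize(v, codebook):
    """Label a prosodic feature vector with its nearest prosody state."""
    return min(range(len(codebook)),
               key=lambda m: sum((a - b) ** 2
                                 for a, b in zip(v, codebook[m])))
```

Running `quantize` over a syllable's feature vectors turns an utterance into the sequence of prosody-state symbols used in the later slides.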
Prosodic Modeling (cont.)
For example, state 6 is the phrase-start state; states 3 and 4 are the major breaks.
Prosodic Labeling
AKI Unseen Handset EstimationAKI Unseen Handset Estimation
The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets
Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch
The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined
where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge
ĥ = Σₙ wₙ hₙ,  n = 1…N
where H = {hₙ = (Aₙ, bₙ, Tₙ), n = 1…N} is the set of a priori knowledge.
The MLLR adaptation of each mixture component is
μ̂ = A μ + b
Σ̂ = Bᵀ T B,  with Σ⁻¹ = C Cᵀ and B = C⁻¹
where μ and μ̂ are the original and adapted mixture mean vectors, Σ and Σ̂ are the original and adapted mixture variance matrices, A and T are the MLLR mean and variance transformation matrices, B is the inverse of the Choleski factor C of the original inverse variance matrix Σ⁻¹, and b is the bias of the mixture component.
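A minimal sketch of the model-space idea, assuming the interpolation weights wₙ have already been estimated elsewhere (the weight-estimation step and all names here are assumptions; only the forms ĥ = Σ wₙ hₙ and μ̂ = Aμ + b follow the slide):

```python
import numpy as np

def interpolate_handset(transforms, weights):
    """AKI sketch: build the unseen handset's transform h_hat as a weighted
    combination of the seen handsets' MLLR transforms (A_n, b_n)."""
    A_hat = sum(w * A for w, (A, b) in zip(weights, transforms))
    b_hat = sum(w * b for w, (A, b) in zip(weights, transforms))
    return A_hat, b_hat

def adapt_mean(mu, A, b):
    """MLLR mean adaptation: mu_hat = A @ mu + b."""
    return A @ mu + b
```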
AKI Unseen Handset Estimation (cont.)
Eigen-Prosody Analysis
The EPA procedure includes:
VQ-based prosodic modeling and labeling
Segmenting the sequences of prosody states to extract important prosody keywords
Calculating the occurrence statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrix
Applying the singular value decomposition (SVD) technique to decompose the matrix and build an eigen-prosody space
Measuring the speaker distance using the cosine of the angle between two speaker vectors
Eigen-Prosody Analysis (cont.)
Prosody keyword extractionProsody keyword extraction
After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary
First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed
After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones
2005102420051024 Speech Lab NTNU 1818
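The uni-gram/bi-gram counting and thresholding described above can be sketched as follows (the function name and the single `min_count` threshold are simplifying assumptions; the slide sets thresholds from the histogram):

```python
from collections import Counter

def extract_keywords(state_sequences, min_count):
    """Count uni-grams and bi-grams of prosody states over all speakers'
    documents; keep only high-frequency ones as the keyword dictionary."""
    counts = Counter()
    for seq in state_sequences:
        counts.update((s,) for s in seq)   # single prosody words (uni-grams)
        counts.update(zip(seq, seq[1:]))   # prosody word pairs (bi-grams)
    return {w for w, c in counts.items() if c >= min_count}
```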
Prosody Keyword-Speaker Occurrence Matrix Statistics
The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words.
The occurrence counts of a speaker's keywords are recorded in a prosody keyword list vector to represent the long-term prosodic behavior of that speaker.
The prosody keyword-speaker occurrence matrix A is made up of the collection of all speakers' prosody keyword list vectors.
To emphasize uncommon keywords and de-emphasize very common ones, the inverse document frequency (IDF) weighting method is applied.
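A sketch of the IDF weighting step, under the assumption of the standard log(N/df) form (the slide names IDF but not the exact variant); each row is one keyword, each column one speaker:

```python
import math

def idf_weight(counts):
    """Weight a keyword-by-speaker count matrix so that keywords used by
    few speakers are emphasized and ubiquitous ones are de-emphasized."""
    n_speakers = len(counts[0])
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c > 0)   # how many speakers use this keyword
        idf = math.log(n_speakers / df) if df else 0.0
        weighted.append([c * idf for c in row])
    return weighted
```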
Eigen-Prosody analysisEigen-Prosody analysis
In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space
Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as
Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m
atricesatrices
TKKKK
T VUAVUA
2005102420051024 Speech Lab NTNU 2020
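The rank-K truncation can be sketched directly with a standard SVD routine (the helper name is hypothetical):

```python
import numpy as np

def rank_k_approx(A, K):
    """Decompose A = U @ diag(s) @ Vt and keep only the K largest singular
    values, giving the eigen-prosody space A_K = U_K S_K V_K^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :K], s[:K], Vt[:K, :]
```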
Eigen-Prosody Analysis
Score measurementScore measurement
The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t
he eigen-prosody speaker space byhe eigen-prosody speaker space by
iKQ
iKT
QiKQ
Pi
KKTQQ
v
vvS
Uy
1
)(
2005102420051024 Speech Lab NTNU 2222
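The folding-in and scoring steps above can be sketched as follows (function names are hypothetical; the projection v_Q = y_Qᵀ U_K Σ_K⁻¹ is the usual LSA folding-in form reconstructed from the slide):

```python
import numpy as np

def fold_in_query(y_q, U_K, s_K):
    """Project a pseudo query document y_q into the eigen-prosody space:
    v_q = y_q^T @ U_K @ S_K^{-1}."""
    return y_q @ U_K @ np.diag(1.0 / s_K)

def cosine_score(v_q, v_i):
    """Speaker score: cosine of the angle between query and speaker vectors."""
    return float(v_q @ v_i / (np.linalg.norm(v_q) * np.linalg.norm(v_i)))
```

Folding a training speaker's own keyword vector back in recovers that speaker's row of V, so the cosine score against it approaches 1.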
Speaker Identification Experiments
HTIMIT database and experiment conditions: a total of 384 speakers, each giving ten utterances.
The set of 384×10 utterances was then played back and recorded through nine other different handsets, including four carbon-button handsets, four electret handsets, and one portable cordless phone.
In this paper, all experiments were performed on the 302 speakers (151 female and 151 male) who have all ten utterances.
For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech.
The other ten three-utterance sessions of each speaker, one from each of the ten handsets, were used as the evaluation data, respectively.
Speaker Identification Experiments (cont.)
To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers.
For each speaker, a MAP-adapted GMM (MAP-GMM) was built, adapted from the UBM using that speaker's own enrollment speech.
A 32-state prosodic model was trained.
367 prosody keywords were extracted to form a sparse 367×302-dimensional matrix A.
The dimension of the eigen-prosody space is 3.
38 mel-frequency cepstral coefficients (MFCCs) were used,
with a window size of 30 ms and a frame shift of 10 ms.
Speaker Identification Experiments (cont.)
Conclusions
This paper presents an EPA+AKI+MAP-GMM/CMS fusion approach.
Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets.
It is therefore a promising method for robust speaker identification under mismatched environments and limited available data.
2005102420051024 Speech Lab NTNU 22
ReferenceReference
Combination of Acoustic and Prosodic Information for Robust SpCombination of Acoustic and Prosodic Information for Robust Speaker Identification Yuan-Fu Liao Zhi-Xian Zhuang Zi-He Cheeaker Identification Yuan-Fu Liao Zhi-Xian Zhuang Zi-He Chen and Yau-Tarng Juang ROCLING 2005n and Yau-Tarng Juang ROCLING 2005
Eigen-Prosody Analysis for Robust Speaker Recognition under Eigen-Prosody Analysis for Robust Speaker Recognition under Mismatch Handset Environment Zi-He Chen Yuan-Fu Liao and Mismatch Handset Environment Zi-He Chen Yuan-Fu Liao and Yau-Tarng Juang ICSLP 2004Yau-Tarng Juang ICSLP 2004
2005102420051024 Speech Lab NTNU 33
OutlineOutline
IntroductionIntroduction
VQ-Based Prosodic ModelingVQ-Based Prosodic Modeling
AKI Unseen Handset EstimationAKI Unseen Handset Estimation
Eigen-Prosody AnalysisEigen-Prosody Analysis
Speaker Identification ExperimentsSpeaker Identification Experiments
ConclusionsConclusions
2005102420051024 Speech Lab NTNU 44
IntroductionIntroduction
A speaker identification system in telecommunication network enA speaker identification system in telecommunication network environment needs to be robust against distortion of mismatch handvironment needs to be robust against distortion of mismatch handsetssets
Prosodic features are known to be less sensitive to handset mismatchProsodic features are known to be less sensitive to handset mismatch
Several successful techniques Several successful techniques GMMs the per-frame pitch and energy values are extracted and modeled GMMs the per-frame pitch and energy values are extracted and modeled using traditional distribution modelsusing traditional distribution models
May not adequately capture the temporal dynamic information of the prosMay not adequately capture the temporal dynamic information of the prosodic feature contoursodic feature contours
2005102420051024 Speech Lab NTNU 55
Introduction (cont)Introduction (cont)
Several successful techniques (cont) Several successful techniques (cont) N-gram the dynamics of the pitch and energy trajectories are described bN-gram the dynamics of the pitch and energy trajectories are described by sequences of symbols and modeled by the n-gram statisticsy sequences of symbols and modeled by the n-gram statistics
DHMM the sequences of prosody symbols are further modeled by the staDHMM the sequences of prosody symbols are further modeled by the state observation and transition probabilitieste observation and transition probabilities
usually require large amount of trainingtest data to reach a reasonable perfusually require large amount of trainingtest data to reach a reasonable performanceormance
In this paper In this paper a VQ-based prosody modeling and an eigen-prosody analysis approach are a VQ-based prosody modeling and an eigen-prosody analysis approach are integrated together to add robustness to conventional cepstral features-basintegrated together to add robustness to conventional cepstral features-based GMMs close-set speaker identification system under the situation of mied GMMs close-set speaker identification system under the situation of mismatch unseen handsets and limited trainingtest datasmatch unseen handsets and limited trainingtest data
2005102420051024 Speech Lab NTNU 66
Introduction (cont)Introduction (cont)
Speakerrsquos enrollment
speech
Sequences ofProsody states
EPA
VQ-Based Prosodic Modeling
Text document records the detailprosodyspeaking
style of the speaker
Eigen-prosody space to representthe constellation
of speakers LSA
By this way the speaker identification problem is transformed into a
full-text document retrieval-similar task
2005102420051024 Speech Lab NTNU 77
Introduction (cont)Introduction (cont)
Since EPA utilizes prosody-level information it could be further Since EPA utilizes prosody-level information it could be further fused with acoustic-level information to complement each otherfused with acoustic-level information to complement each other
The The a priori a priori knowledge interpolation (AKI) approach is chosenknowledge interpolation (AKI) approach is chosenIt utilizes the maximum likelihood linear regression (MLLR) to It utilizes the maximum likelihood linear regression (MLLR) to compensate handset mismatchcompensate handset mismatch
2005102420051024 Speech Lab NTNU 88
VQ-Based Prosodic ModelingVQ-Based Prosodic Modeling
In this study syllables are chosen as the basic processing unitsIn this study syllables are chosen as the basic processing units
Five types of prosodic features Five types of prosodic features the slope of pitch contour of a vowel segmentthe slope of pitch contour of a vowel segment
lengthening factor of a vowel segmentlengthening factor of a vowel segment
average log-energy difference between two vowelsaverage log-energy difference between two vowels
value of pitch jump between two vowelsvalue of pitch jump between two vowels
pause duration between two syllablespause duration between two syllables
The prosody features are normalized according to underline vowel class to The prosody features are normalized according to underline vowel class to remove any non-prosodic effects (context-information)remove any non-prosodic effects (context-information)
vowel
voweluxx
ˆ
mean of the vowel class of whole corpusuvowel
vowel variance of the vowel class of whole corpus
2005102420051024 Speech Lab NTNU 99
Prosodic ModelingProsodic Modeling
2005102420051024 Speech Lab NTNU 1010
Prosodic Modeling (cont)Prosodic Modeling (cont)
To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm
2005102420051024 Speech Lab NTNU 1111
Prosodic Modeling (cont)Prosodic Modeling (cont)
For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks
2005102420051024 Speech Lab NTNU 1212
Prosodic labelingProsodic labeling
2005102420051024 Speech Lab NTNU 1313
AKI Unseen Handset EstimationAKI Unseen Handset Estimation
The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets
Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch
The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined
where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge
N
nnnhh
1
ˆ
h
1
1
ˆˆ
ˆˆˆˆ
CB
CC
BTB
buAuhu
T
T
μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1
is the bias of the mixture component
A T
b
2005102420051024 Speech Lab NTNU 1414
AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)
2005102420051024 Speech Lab NTNU 1515
Eigen-Prosody AnalysisEigen-Prosody Analysis
The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling
Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords
Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix
Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space
Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors
2005102420051024 Speech Lab NTNU 1616
Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)
2005102420051024 Speech Lab NTNU 1717
Prosody keyword extractionProsody keyword extraction
After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary
First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed
After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones
2005102420051024 Speech Lab NTNU 1818
Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics
The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words
The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker
The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors
To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied
2005102420051024 Speech Lab NTNU 1919
Eigen-Prosody analysisEigen-Prosody analysis
In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space
Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as
Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m
atricesatrices
TKKKK
T VUAVUA
2005102420051024 Speech Lab NTNU 2020
Eigen-Prosody Analysis Eigen-Prosody Analysis
2005102420051024 Speech Lab NTNU 2121
Score measurementScore measurement
The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t
he eigen-prosody speaker space byhe eigen-prosody speaker space by
iKQ
iKT
QiKQ
Pi
KKTQQ
v
vvS
Uy
1
)(
2005102420051024 Speech Lab NTNU 2222
Speaker Identification ExperimentsSpeaker Identification Experiments
HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances
The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone
In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces
For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech
The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively
2005102420051024 Speech Lab NTNU 2323
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers
For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built
a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained
367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA
the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3
38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including
Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms
2005102420051024 Speech Lab NTNU 2424
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
2005102420051024 Speech Lab NTNU 2525
ConclusionsConclusions
This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach
Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets
It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata
2005102420051024 Speech Lab NTNU 55
Introduction (cont)Introduction (cont)
Several successful techniques (cont) Several successful techniques (cont) N-gram the dynamics of the pitch and energy trajectories are described bN-gram the dynamics of the pitch and energy trajectories are described by sequences of symbols and modeled by the n-gram statisticsy sequences of symbols and modeled by the n-gram statistics
DHMM the sequences of prosody symbols are further modeled by the staDHMM the sequences of prosody symbols are further modeled by the state observation and transition probabilitieste observation and transition probabilities
usually require large amount of trainingtest data to reach a reasonable perfusually require large amount of trainingtest data to reach a reasonable performanceormance
In this paper In this paper a VQ-based prosody modeling and an eigen-prosody analysis approach are a VQ-based prosody modeling and an eigen-prosody analysis approach are integrated together to add robustness to conventional cepstral features-basintegrated together to add robustness to conventional cepstral features-based GMMs close-set speaker identification system under the situation of mied GMMs close-set speaker identification system under the situation of mismatch unseen handsets and limited trainingtest datasmatch unseen handsets and limited trainingtest data
Introduction (cont.)
Speaker's enrollment speech → VQ-based prosodic modeling → sequences of prosody states → a text document recording the detailed prosody/speaking style of the speaker → EPA (LSA) → an eigen-prosody space representing the constellation of speakers
In this way, the speaker identification problem is transformed into a task similar to full-text document retrieval
Introduction (cont.)
Since EPA utilizes prosody-level information, it can be fused with acoustic-level information so the two complement each other
The a priori knowledge interpolation (AKI) approach is chosen; it utilizes maximum likelihood linear regression (MLLR) to compensate for handset mismatch
VQ-Based Prosodic Modeling
In this study, syllables are chosen as the basic processing units
Five types of prosodic features are used:
- slope of the pitch contour of a vowel segment
- lengthening factor of a vowel segment
- average log-energy difference between two vowels
- pitch jump between two vowels
- pause duration between two syllables
The prosodic features are normalized according to the underlying vowel class to remove non-prosodic (context) effects:
\hat{x} = \frac{x - \mu_{vowel}}{\sigma_{vowel}}

where \mu_{vowel} is the mean and \sigma_{vowel}^2 the variance of the vowel class over the whole corpus
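This per-vowel-class z-score normalization can be sketched in Python (the helper name and the toy pitch-like values are illustrative, not from the paper):

```python
from collections import defaultdict
from math import sqrt

def normalize_by_vowel_class(samples):
    """samples: list of (vowel_class, feature_value) pairs.
    Returns each value z-scored with its own class's corpus statistics."""
    groups = defaultdict(list)
    for vowel, x in samples:
        groups[vowel].append(x)
    stats = {}
    for vowel, xs in groups.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        stats[vowel] = (mean, sqrt(var) or 1.0)  # guard against zero variance
    return [(v, (x - stats[v][0]) / stats[v][1]) for v, x in samples]

data = [("a", 120.0), ("a", 140.0), ("i", 200.0), ("i", 220.0)]
print(normalize_by_vowel_class(data))
# -> [('a', -1.0), ('a', 1.0), ('i', -1.0), ('i', 1.0)]
```

Note how the two vowel classes end up on the same scale even though their raw pitch ranges differ, which is exactly the context effect the normalization removes.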
Prosodic Modeling
Prosodic Modeling (cont.)
To build the prosodic model, the prosodic features are vector quantized into M codewords using the expectation-maximization (EM) algorithm
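The quantization step can be sketched with plain k-means, a hard-assignment simplification of the EM-based VQ the slide describes (the toy 2-D prosodic vectors and M = 2 are illustrative):

```python
import random

def kmeans(points, M, iters=20, seed=0):
    """Quantize feature vectors into M codewords (hard-EM / k-means)."""
    rng = random.Random(seed)
    codebook = rng.sample(points, M)
    for _ in range(iters):
        # E-step: assign each point to its nearest codeword
        clusters = [[] for _ in range(M)]
        for p in points:
            j = min(range(M),
                    key=lambda k: sum((a - b) ** 2 for a, b in zip(p, codebook[k])))
            clusters[j].append(p)
        # M-step: move each codeword to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                codebook[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return codebook

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
print(sorted(kmeans(pts, 2)))
```

Each resulting codeword then serves as one "prosody state" label when the feature sequence is quantized.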
Prosodic Modeling (cont.)
For example, state 6 is the phrase-start state, and states 3 and 4 are the major breaks
Prosodic labeling
AKI Unseen Handset Estimation
The concept of AKI is to first collect a set of characteristics of seen handsets as the a priori knowledge to construct a space of handsets
Model-space AKI using the MLLR model transformation is chosen to compensate for the handset mismatch
The estimate of the characteristic of a test handset is defined as

\hat{h} = \sum_{n=1}^{N} \alpha_n h_n

where H = \{h_n = (A_n, b_n, T_n),\ n = 1 \ldots N\} is the set of a priori knowledge (interpolation weights denoted \alpha_n). The adapted model parameters follow the MLLR transforms

\hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = B^{T} T B

where \mu and \hat{\mu} are the original and adapted mixture mean vectors, \Sigma and \hat{\Sigma} are the original and adapted mixture variance matrices, A and T are the MLLR mean and variance transformation matrices, b is the bias of the mixture component, and B is the inverse of the Choleski factor of the original inverse variance matrix \Sigma^{-1}
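Assuming the usual reading of AKI — interpolate the seen-handset transforms, then apply the MLLR mean transform μ̂ = Aμ + b — a toy sketch could look like this (the handset values and equal weights are made up):

```python
def interpolate_handset(priors, weights):
    """h_hat = sum_n alpha_n * h_n over (A, b) pairs; A as nested lists."""
    dim = len(priors[0][1])
    A_hat = [[sum(w * h[0][i][j] for w, h in zip(weights, priors))
              for j in range(dim)] for i in range(dim)]
    b_hat = [sum(w * h[1][i] for w, h in zip(weights, priors))
             for i in range(dim)]
    return A_hat, b_hat

def adapt_mean(A, b, mu):
    """MLLR mean adaptation: mu_hat = A mu + b."""
    return [sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
            for i in range(len(mu))]

# two seen handsets (identity, and a 2x scaling with bias), mixed equally
h1 = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
h2 = ([[2.0, 0.0], [0.0, 2.0]], [1.0, 1.0])
A_hat, b_hat = interpolate_handset([h1, h2], [0.5, 0.5])
print(adapt_mean(A_hat, b_hat, [2.0, 4.0]))  # [3.5, 6.5]
```

In the actual method the weights would be estimated from the test utterance rather than fixed, and the variance transform T would be interpolated and applied the same way.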
AKI Unseen Handset Estimation (cont.)
Eigen-Prosody Analysis
The EPA procedure includes:
- VQ-based prosodic modeling and labeling
- segmenting the sequences of prosody states to extract important prosody keywords
- calculating the occurrence statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrix
- applying singular value decomposition (SVD) to the matrix to build an eigen-prosody space
- measuring the speaker distance using the cosine of the angle between two speaker vectors
Eigen-Prosody Analysis (cont.)
Prosody keyword extraction
After the prosodic state labeling, the prosody text documents of all speakers are searched for important prosody keywords in order to establish a prosody keyword dictionary
First, all possible combinations of the prosody words, including single words and word pairs (unigrams and bigrams), are listed and their frequency statistics are computed
After calculating the histogram of all prosody word frequencies, thresholds are set to keep only the high-frequency ones
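The unigram/bigram counting and frequency thresholding can be sketched as follows (the state labels and the threshold value are made up for illustration):

```python
from collections import Counter

def extract_keywords(state_sequence, min_count=2):
    """Count prosody-state unigrams and bigrams; keep the frequent ones."""
    counts = Counter(state_sequence)                       # unigrams
    counts.update(zip(state_sequence, state_sequence[1:])) # bigrams
    return {w for w, c in counts.items() if c >= min_count}

seq = ["s6", "s3", "s6", "s3", "s6", "s4"]
print(sorted(extract_keywords(seq), key=str))
```

The surviving unigrams and bigrams form the "prosody keyword dictionary" used in the next step.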
Prosody keyword-speaker occurrence matrix statistics
The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words
The occurrence counts of a speaker's keywords are booked in a prosody keyword list vector to represent the long-term prosodic behavior of that speaker
The prosody keyword-speaker occurrence matrix A is made up of the collection of all speakers' prosody keyword list vectors
To emphasize uncommon keywords and deemphasize very common ones, inverse document frequency (IDF) weighting is applied
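A minimal sketch of IDF weighting on the keyword-by-speaker count matrix (toy counts; the exact IDF variant used in the paper is not specified, so plain log(N/df) is assumed):

```python
from math import log

def idf_weight(A):
    """Weight each row (keyword) of the keyword-by-speaker count matrix by
    idf = log(n_speakers / n_speakers_containing_keyword)."""
    n = len(A[0])
    out = []
    for row in A:
        df = sum(1 for c in row if c > 0)   # document (speaker) frequency
        idf = log(n / df) if df else 0.0
        out.append([c * idf for c in row])
    return out

A = [[3, 0, 0],   # rare keyword: boosted
     [1, 1, 1]]   # keyword used by every speaker: zeroed out
print(idf_weight(A))
```

A keyword that every speaker uses carries no discriminative information, and this weighting removes it entirely.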
Eigen-prosody analysis
To reduce the dimension of the prosody space, the sparse prosody keyword-speaker occurrence matrix A is further analyzed using SVD to find a compact eigen-prosody space
Given an m-by-n (m >> n) matrix A of rank R, A is decomposed and then approximated using only the largest K singular values as

A = U \Sigma V^{T} \approx A_K = U_K \Sigma_K V_K^{T}

where A_K, U_K, V_K, and \Sigma_K are the rank-reduced versions of the respective matrices
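The truncation can be sketched, here for K = 1 using pure-Python power iteration on AᵀA (a real system would call a library SVD routine):

```python
from math import sqrt

def matvec(M, v):
    return [sum(r[j] * v[j] for j in range(len(v))) for r in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_singular(A, iters=100):
    """Largest singular triple (sigma, u, v) by power iteration on A^T A."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

A = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma, u, v = top_singular(A)
print(round(sigma, 6))  # 3.0
```

Repeating this with deflation (or a library call) yields the K leading triples that span the eigen-prosody space.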
Eigen-Prosody Analysis
Score measurement
The test utterances of a test speaker are first labeled and parsed to form the pseudo query document y_Q, which is then transformed into the query vector v_Q in the eigen-prosody speaker space by

v_Q = y_Q^{T} U_K \Sigma_K^{-1}

and scored against each speaker vector v_i, i = 1 \ldots P, with the cosine measure

S(v_Q, v_i) = \frac{v_Q \cdot v_i}{\|v_Q\| \, \|v_i\|}
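Assuming the standard LSA folding-in formula v_Q = y_Qᵀ U_K Σ_K⁻¹ and cosine scoring, a toy sketch (the 2-keyword space and all values are illustrative):

```python
from math import sqrt

def fold_in(y_q, U_K, sigma_K):
    """Project a query count vector into the K-dim eigen-prosody space:
    v_Q = y_Q^T U_K Sigma_K^{-1}."""
    K = len(sigma_K)
    return [sum(y_q[i] * U_K[i][k] for i in range(len(y_q))) / sigma_K[k]
            for k in range(K)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

U_K = [[1.0, 0.0], [0.0, 1.0]]   # toy left singular vectors
sigma_K = [2.0, 1.0]             # toy singular values
v_q = fold_in([4.0, 0.0], U_K, sigma_K)          # -> [2.0, 0.0]
print(cosine(v_q, [1.0, 0.0]), cosine(v_q, [0.0, 1.0]))  # 1.0 0.0
```

The identified speaker is simply the one whose vector v_i maximizes the cosine score.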
Speaker Identification Experiments
HTIMIT database and experimental conditions: 384 speakers in total, each giving ten utterances
The set of 384 × 10 utterances was then played back and recorded through nine other handsets, comprising four carbon-button handsets, four electret handsets, and one portable cordless phone
In this paper, all experiments were performed on the 302 speakers (151 female and 151 male) who have all ten utterances
For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech
The remaining three utterances of each speaker, recorded through each of the ten handsets, formed ten three-utterance evaluation sessions
Speaker Identification Experiments (cont.)
To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers
For each speaker, a MAP-adapted GMM (MAP-GMM) was adapted from the UBM using his or her own enrollment speech
A 32-state prosodic model was trained
367 prosody keywords were extracted to form a sparse 367 × 302 matrix A
The dimension of the eigen-prosody space is 3
38 mel-frequency cepstral coefficients (MFCCs) were used, with a window size of 30 ms and a frame shift of 10 ms
Speaker Identification Experiments (cont.)
Conclusions
This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach
Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets
It is therefore a promising method for robust speaker identification under mismatched environments with limited available data
2005102420051024 Speech Lab NTNU 2626
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
-
2005102420051024 Speech Lab NTNU 66
Introduction (cont)Introduction (cont)
Speakerrsquos enrollment
speech
Sequences ofProsody states
EPA
VQ-Based Prosodic Modeling
Text document records the detailprosodyspeaking
style of the speaker
Eigen-prosody space to representthe constellation
of speakers LSA
By this way the speaker identification problem is transformed into a
full-text document retrieval-similar task
2005102420051024 Speech Lab NTNU 77
Introduction (cont)Introduction (cont)
Since EPA utilizes prosody-level information it could be further Since EPA utilizes prosody-level information it could be further fused with acoustic-level information to complement each otherfused with acoustic-level information to complement each other
The The a priori a priori knowledge interpolation (AKI) approach is chosenknowledge interpolation (AKI) approach is chosenIt utilizes the maximum likelihood linear regression (MLLR) to It utilizes the maximum likelihood linear regression (MLLR) to compensate handset mismatchcompensate handset mismatch
2005102420051024 Speech Lab NTNU 88
VQ-Based Prosodic ModelingVQ-Based Prosodic Modeling
In this study syllables are chosen as the basic processing unitsIn this study syllables are chosen as the basic processing units
Five types of prosodic features Five types of prosodic features the slope of pitch contour of a vowel segmentthe slope of pitch contour of a vowel segment
lengthening factor of a vowel segmentlengthening factor of a vowel segment
average log-energy difference between two vowelsaverage log-energy difference between two vowels
value of pitch jump between two vowelsvalue of pitch jump between two vowels
pause duration between two syllablespause duration between two syllables
The prosody features are normalized according to underline vowel class to The prosody features are normalized according to underline vowel class to remove any non-prosodic effects (context-information)remove any non-prosodic effects (context-information)
vowel
voweluxx
ˆ
mean of the vowel class of whole corpusuvowel
vowel variance of the vowel class of whole corpus
2005102420051024 Speech Lab NTNU 99
Prosodic ModelingProsodic Modeling
2005102420051024 Speech Lab NTNU 1010
Prosodic Modeling (cont)Prosodic Modeling (cont)
To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm
2005102420051024 Speech Lab NTNU 1111
Prosodic Modeling (cont)Prosodic Modeling (cont)
For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks
2005102420051024 Speech Lab NTNU 1212
Prosodic labelingProsodic labeling
2005102420051024 Speech Lab NTNU 1313
AKI Unseen Handset EstimationAKI Unseen Handset Estimation
The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets
Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch
The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined
where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge
N
nnnhh
1
ˆ
h
1
1
ˆˆ
ˆˆˆˆ
CB
CC
BTB
buAuhu
T
T
μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1
is the bias of the mixture component
A T
b
2005102420051024 Speech Lab NTNU 1414
AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)
2005102420051024 Speech Lab NTNU 1515
Eigen-Prosody AnalysisEigen-Prosody Analysis
The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling
Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords
Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix
Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space
Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors
2005102420051024 Speech Lab NTNU 1616
Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)
2005102420051024 Speech Lab NTNU 1717
Prosody keyword extractionProsody keyword extraction
After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary
First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed
After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones
2005102420051024 Speech Lab NTNU 1818
Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics
The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words
The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker
The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors
To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied
2005102420051024 Speech Lab NTNU 1919
Eigen-Prosody analysisEigen-Prosody analysis
In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space
Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as
Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m
atricesatrices
TKKKK
T VUAVUA
2005102420051024 Speech Lab NTNU 2020
Eigen-Prosody Analysis Eigen-Prosody Analysis
2005102420051024 Speech Lab NTNU 2121
Score measurementScore measurement
The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t
he eigen-prosody speaker space byhe eigen-prosody speaker space by
iKQ
iKT
QiKQ
Pi
KKTQQ
v
vvS
Uy
1
)(
2005102420051024 Speech Lab NTNU 2222
Speaker Identification ExperimentsSpeaker Identification Experiments
HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances
The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone
In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces
For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech
The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively
2005102420051024 Speech Lab NTNU 2323
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers
For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built
a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained
367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA
the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3
38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including
Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms
2005102420051024 Speech Lab NTNU 2424
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
2005102420051024 Speech Lab NTNU 2525
ConclusionsConclusions
This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach
Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets
It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata
2005102420051024 Speech Lab NTNU 2626
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
-
2005102420051024 Speech Lab NTNU 77
Introduction (cont)Introduction (cont)
Since EPA utilizes prosody-level information it could be further Since EPA utilizes prosody-level information it could be further fused with acoustic-level information to complement each otherfused with acoustic-level information to complement each other
The The a priori a priori knowledge interpolation (AKI) approach is chosenknowledge interpolation (AKI) approach is chosenIt utilizes the maximum likelihood linear regression (MLLR) to It utilizes the maximum likelihood linear regression (MLLR) to compensate handset mismatchcompensate handset mismatch
2005102420051024 Speech Lab NTNU 88
VQ-Based Prosodic ModelingVQ-Based Prosodic Modeling
In this study syllables are chosen as the basic processing unitsIn this study syllables are chosen as the basic processing units
Five types of prosodic features Five types of prosodic features the slope of pitch contour of a vowel segmentthe slope of pitch contour of a vowel segment
lengthening factor of a vowel segmentlengthening factor of a vowel segment
average log-energy difference between two vowelsaverage log-energy difference between two vowels
value of pitch jump between two vowelsvalue of pitch jump between two vowels
pause duration between two syllablespause duration between two syllables
The prosody features are normalized according to underline vowel class to The prosody features are normalized according to underline vowel class to remove any non-prosodic effects (context-information)remove any non-prosodic effects (context-information)
vowel
voweluxx
ˆ
mean of the vowel class of whole corpusuvowel
vowel variance of the vowel class of whole corpus
2005102420051024 Speech Lab NTNU 99
Prosodic ModelingProsodic Modeling
2005102420051024 Speech Lab NTNU 1010
Prosodic Modeling (cont)Prosodic Modeling (cont)
To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm
2005102420051024 Speech Lab NTNU 1111
Prosodic Modeling (cont)Prosodic Modeling (cont)
For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks
2005102420051024 Speech Lab NTNU 1212
Prosodic labelingProsodic labeling
2005102420051024 Speech Lab NTNU 1313
AKI Unseen Handset EstimationAKI Unseen Handset Estimation
The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets
Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch
The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined
where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge
N
nnnhh
1
ˆ
h
1
1
ˆˆ
ˆˆˆˆ
CB
CC
BTB
buAuhu
T
T
μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1
is the bias of the mixture component
A T
b
2005102420051024 Speech Lab NTNU 1414
AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)
2005102420051024 Speech Lab NTNU 1515
Eigen-Prosody AnalysisEigen-Prosody Analysis
The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling
Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords
Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix
Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space
Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors
2005102420051024 Speech Lab NTNU 1616
Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)
2005102420051024 Speech Lab NTNU 1717
Prosody keyword extractionProsody keyword extraction
After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary
First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed
After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones
2005102420051024 Speech Lab NTNU 1818
Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics
The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words.
The occurrence counts of the keywords of a speaker are recorded in a prosody keyword list vector to represent the long-term prosodic behavior of that specific speaker.
The prosody keyword–speaker occurrence matrix A is made up of the collection of all speakers' prosody keyword list vectors.
To emphasize the uncommon keywords and to de-emphasize the very common ones, the inverse document frequency (IDF) weighting method is applied.
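A sketch of building the weighted matrix A. Note the slide does not specify the exact IDF formula; log(N/df) is one common variant and is assumed here, as are the toy keyword names and counts.

```python
import math

def idf_weighted_matrix(keyword_counts, keywords):
    """Build the keyword-by-speaker count matrix A and apply IDF
    weighting: rare keywords are boosted, ubiquitous ones damped."""
    n_speakers = len(keyword_counts)
    A = []
    for k in keywords:
        row = [counts.get(k, 0) for counts in keyword_counts]
        df = sum(1 for c in row if c > 0)          # speakers using keyword k
        w = math.log(n_speakers / df) if df else 0.0
        A.append([c * w for c in row])
    return A

# keyword "a" occurs for both speakers (weight 0), "b" only for the second
A = idf_weighted_matrix([{"a": 2}, {"a": 1, "b": 3}], ["a", "b"])
```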
Eigen-prosody analysis
In order to reduce the dimension of the prosody space, the sparse prosody keyword–speaker occurrence matrix A is further analyzed using SVD to find a compact eigen-prosody space.
Given an m-by-n (m >> n) matrix A of rank R, A is decomposed and further approximated using only the K largest singular values as
A = U Σ V^T ≈ A_K = U_K Σ_K V_K^T
where A_K, U_K, V_K, and Σ_K are the rank-reduced versions of the respective matrices.
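The rank-K truncation can be sketched with numpy; the matrix sizes here are toy values, not the paper's 367×302 matrix.

```python
import numpy as np

# Toy keyword-by-speaker occurrence matrix (m = 20 keywords >> n = 6 speakers).
rng = np.random.default_rng(0)
A = rng.random((20, 6))

# Full decomposition A = U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the K largest singular values: A_K = U_K @ diag(s_K) @ V_K^T.
K = 3
U_K, s_K, Vt_K = U[:, :K], s[:K], Vt[:K, :]
A_K = U_K @ np.diag(s_K) @ Vt_K
```

By the Eckart–Young theorem, A_K is the best rank-K approximation in Frobenius norm, and its error is exactly the energy of the discarded singular values.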
Eigen-Prosody Analysis
Score measurement
The test utterances of a test speaker are first labeled and parsed to form a pseudo query document y_Q, which is then transformed into a query vector v_Q in the eigen-prosody speaker space by
v_Q = y_Q^T U_K Σ_K^-1
The score against speaker i (i = 1 ~ P) is the cosine of the angle between the two speaker vectors, S(v_Q, v_i) = (v_Q · v_i) / (||v_Q|| ||v_i||).
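A sketch of the fold-in and cosine scoring above. The toy matrix is illustrative, and the choice to derive the enrolled speaker vectors with the same fold-in projection is an assumption of this sketch (the paper takes them from the SVD itself).

```python
import numpy as np

def fold_in(y, U_K, s_K):
    """v = y^T U_K Sigma_K^-1: project a keyword-count vector into the space."""
    return y @ U_K / s_K

def identify(y_q, U_K, s_K, speaker_vecs):
    """Return the index of the speaker with the highest cosine similarity."""
    v_q = fold_in(y_q, U_K, s_K)
    sims = [float(v_q @ v / (np.linalg.norm(v_q) * np.linalg.norm(v)))
            for v in speaker_vecs]
    return int(np.argmax(sims)), sims

# toy keyword-by-speaker matrix: 3 keywords, 2 speakers
A = np.array([[2., 0.], [0., 1.], [2., 0.]])
U_K, s_K, Vt_K = np.linalg.svd(A, full_matrices=False)
speaker_vecs = [fold_in(A[:, i], U_K, s_K) for i in range(A.shape[1])]

# a query identical to speaker 0's profile should score cosine 1 with it
best, sims = identify(np.array([2., 0., 2.]), U_K, s_K, speaker_vecs)
```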
Speaker Identification Experiments
HTIMIT database and experimental conditions: 384 speakers in total, each of whom gave ten utterances.
The set of 384×10 utterances was then played back and recorded through nine other different handsets, including four carbon-button (碳墨式) handsets, four electret (電子式) handsets, and one portable cordless phone.
In this paper, all experiments were performed on the 302 speakers (151 females and 151 males) who have all ten utterances.
For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech.
The other ten three-utterance sessions of each speaker, from the ten handsets respectively, were used as the evaluation data.
Speaker Identification Experiments (cont.)
To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers.
For each speaker, a MAP-adapted GMM (MAP-GMM) was built by adapting the UBM with his or her own enrollment speech.
A 32-state prosodic model was trained.
367 prosody keywords were extracted to form a sparse 367×302-dimensional matrix A.
The dimension of the eigen-prosody space is 3.
38 mel-frequency cepstral coefficients (MFCCs) were used, with a window size of 30 ms and a frame shift of 10 ms.
Speaker Identification Experiments (cont.)
Conclusions
This paper presents an EPA+AKI+MAP-GMM/CMS fusion approach.
Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets.
It is therefore a promising method for robust speaker identification under mismatched environments with limited available data.
VQ-Based Prosodic Modeling
In this study, syllables are chosen as the basic processing units.
Five types of prosodic features:
the slope of the pitch contour of a vowel segment
the lengthening factor of a vowel segment
the average log-energy difference between two vowels
the value of the pitch jump between two vowels
the pause duration between two syllables
The prosodic features are normalized according to the underlying vowel class to remove non-prosodic (context) effects:
x̂ = (x − u_vowel) / σ_vowel
where u_vowel and σ_vowel are the mean and variance of the vowel class over the whole corpus.
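The normalization can be sketched as a per-class z-score; dividing by the standard deviation rather than the raw variance is the usual convention and is assumed here, as are the toy feature values and vowel labels.

```python
from statistics import mean, pstdev

def normalize_by_vowel_class(values, classes):
    """Z-normalize each prosodic feature value by the statistics of
    its vowel class, estimated over the whole corpus."""
    stats = {}
    for v in set(classes):
        xs = [x for x, c in zip(values, classes) if c == v]
        stats[v] = (mean(xs), pstdev(xs) or 1.0)   # guard zero spread
    return [(x - stats[c][0]) / stats[c][1] for x, c in zip(values, classes)]

# two vowel classes with different means/spreads normalize to the same scale
normed = normalize_by_vowel_class([1, 3, 10, 14], ["a", "a", "i", "i"])
```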
Prosodic Modeling
Prosodic Modeling (cont.)
To build the prosodic model, the prosodic features are vector-quantized into M codewords using the expectation-maximization (EM) algorithm.
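The codebook training can be sketched as follows. The slide names EM training; this minimal Lloyd/k-means loop is a simpler hard-assignment stand-in, not the paper's exact procedure, and the toy vectors are assumptions.

```python
import random

def train_codebook(vectors, M, iters=20, seed=0):
    """Minimal Lloyd (k-means) codebook training: a hard-assignment
    stand-in for the EM training named on the slide."""
    rng = random.Random(seed)
    codebook = rng.sample(vectors, M)
    for _ in range(iters):
        # assign each feature vector to its nearest codeword
        clusters = [[] for _ in range(M)]
        for x in vectors:
            j = min(range(M),
                    key=lambda m: sum((a - b) ** 2 for a, b in zip(x, codebook[m])))
            clusters[j].append(x)
        # re-estimate each codeword as its cluster centroid
        codebook = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else codebook[m]
            for m, cl in enumerate(clusters)
        ]
    return codebook

vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
codebook = train_codebook(vectors, M=2)
```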
Prosodic Modeling (cont.)
For example, state 6 is the phrase-start state, and states 3 and 4 are the major breaks.
Prosodic labeling
AKI Unseen Handset Estimation
The concept of AKI is to first collect a set of characteristics of seen handsets as the a priori knowledge to construct a space of handsets.
Model-space AKI using the MLLR model transformation is chosen to compensate for the handset mismatch.
The estimate of the characteristic of a test handset is defined as
ĥ = Σ_{n=1~N} α_n h_n
where H = {h_n = (A_n, b_n, T_n), n = 1~N} is the set of a priori knowledge and α_n are the interpolation weights.
Each transform adapts a mixture component as
μ̂ = h(μ) = A μ + b,  Σ̂ = B^T T B
where μ and μ̂ are the original and adapted mixture mean vectors; Σ and Σ̂ are the original and adapted mixture variances; A and T are the MLLR mean and variance transformation matrices; b is the bias of the mixture component; and B = C^-1 is the inverse of the Choleski factor C of the original inverse variance matrix Σ^-1 = C C^T.
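A sketch of the interpolation and mean adaptation above, assuming the interpolation weights are already given (the paper estimates them from the test speech; the hypothetical `transforms` and `weights` values are for illustration only, and the variance transform T_n is omitted).

```python
import numpy as np

# Hypothetical seen-handset mean transforms h_n = (A_n, b_n).
transforms = [(np.eye(2), np.zeros(2)), (2.0 * np.eye(2), np.ones(2))]
weights = [0.5, 0.5]  # assumed given for this sketch

# Interpolate the a priori transforms to estimate the unseen handset.
A_hat = sum(w * A for w, (A, _) in zip(weights, transforms))
b_hat = sum(w * b for w, (_, b) in zip(weights, transforms))

# MLLR mean adaptation of one mixture mean: mu_hat = A_hat @ mu + b_hat.
mu = np.array([1.0, 1.0])
mu_hat = A_hat @ mu + b_hat
```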
AKI Unseen Handset Estimation (cont.)
Eigen-Prosody Analysis
The procedure of the EPA includes:
VQ-based prosodic modeling and labeling
Segmenting the sequences of prosody states to extract important prosody keywords
Calculating the occurrence statistics of these prosody keywords for each speaker to form a prosody keyword–speaker occurrence matrix
Applying the singular value decomposition (SVD) technique to decompose the matrix to build an eigen-prosody space
Measuring the speaker distance using the cosine of the angle between two speaker vectors
2005102420051024 Speech Lab NTNU 1616
Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)
2005102420051024 Speech Lab NTNU 1717
Prosody keyword extractionProsody keyword extraction
After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary
First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed
After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones
2005102420051024 Speech Lab NTNU 1818
Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics
The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words
The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker
The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors
To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied
2005102420051024 Speech Lab NTNU 1919
Eigen-Prosody analysisEigen-Prosody analysis
In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space
Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as
Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m
atricesatrices
TKKKK
T VUAVUA
2005102420051024 Speech Lab NTNU 2020
Eigen-Prosody Analysis Eigen-Prosody Analysis
2005102420051024 Speech Lab NTNU 2121
Score measurementScore measurement
The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t
he eigen-prosody speaker space byhe eigen-prosody speaker space by
iKQ
iKT
QiKQ
Pi
KKTQQ
v
vvS
Uy
1
)(
2005102420051024 Speech Lab NTNU 2222
Speaker Identification ExperimentsSpeaker Identification Experiments
HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances
The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone
In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces
For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech
The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively
2005102420051024 Speech Lab NTNU 2323
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers
For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built
a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained
367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA
the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3
38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including
Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms
2005102420051024 Speech Lab NTNU 2424
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
2005102420051024 Speech Lab NTNU 2525
ConclusionsConclusions
This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach
Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets
It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata
2005102420051024 Speech Lab NTNU 2626
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
-
2005102420051024 Speech Lab NTNU 99
Prosodic ModelingProsodic Modeling
2005102420051024 Speech Lab NTNU 1010
Prosodic Modeling (cont)Prosodic Modeling (cont)
To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm
2005102420051024 Speech Lab NTNU 1111
Prosodic Modeling (cont)Prosodic Modeling (cont)
For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks
2005102420051024 Speech Lab NTNU 1212
Prosodic labelingProsodic labeling
2005102420051024 Speech Lab NTNU 1313
AKI Unseen Handset EstimationAKI Unseen Handset Estimation
The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets
Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch
The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined
where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge
N
nnnhh
1
ˆ
h
1
1
ˆˆ
ˆˆˆˆ
CB
CC
BTB
buAuhu
T
T
μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1
is the bias of the mixture component
A T
b
2005102420051024 Speech Lab NTNU 1414
AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)
2005102420051024 Speech Lab NTNU 1515
Eigen-Prosody AnalysisEigen-Prosody Analysis
The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling
Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords
Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix
Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space
Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors
2005102420051024 Speech Lab NTNU 1616
Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)
2005102420051024 Speech Lab NTNU 1717
Prosody keyword extractionProsody keyword extraction
After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary
First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed
After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones
2005102420051024 Speech Lab NTNU 1818
Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics
The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words
The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker
The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors
To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied
2005102420051024 Speech Lab NTNU 1919
Eigen-Prosody analysisEigen-Prosody analysis
In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space
Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as
Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m
atricesatrices
TKKKK
T VUAVUA
2005102420051024 Speech Lab NTNU 2020
Eigen-Prosody Analysis Eigen-Prosody Analysis
2005102420051024 Speech Lab NTNU 2121
Score measurementScore measurement
The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t
he eigen-prosody speaker space byhe eigen-prosody speaker space by
iKQ
iKT
QiKQ
Pi
KKTQQ
v
vvS
Uy
1
)(
2005102420051024 Speech Lab NTNU 2222
Speaker Identification ExperimentsSpeaker Identification Experiments
HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances
The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone
In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces
For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech
The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively
2005102420051024 Speech Lab NTNU 2323
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers
For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built
a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained
367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA
the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3
38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including
Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms
2005102420051024 Speech Lab NTNU 2424
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
2005102420051024 Speech Lab NTNU 2525
ConclusionsConclusions
This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach
Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets
It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata
2005102420051024 Speech Lab NTNU 2626
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
-
2005102420051024 Speech Lab NTNU 1010
Prosodic Modeling (cont)Prosodic Modeling (cont)
To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm
2005102420051024 Speech Lab NTNU 1111
Prosodic Modeling (cont)Prosodic Modeling (cont)
For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks
2005102420051024 Speech Lab NTNU 1212
Prosodic labelingProsodic labeling
2005102420051024 Speech Lab NTNU 1313
AKI Unseen Handset EstimationAKI Unseen Handset Estimation
The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets
Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch
The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined
where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge
N
nnnhh
1
ˆ
h
1
1
ˆˆ
ˆˆˆˆ
CB
CC
BTB
buAuhu
T
T
μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1
is the bias of the mixture component
A T
b
2005102420051024 Speech Lab NTNU 1414
AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)
2005102420051024 Speech Lab NTNU 1515
Eigen-Prosody AnalysisEigen-Prosody Analysis
The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling
Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords
Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix
Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space
Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors
2005102420051024 Speech Lab NTNU 1616
Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)
2005102420051024 Speech Lab NTNU 1717
Prosody keyword extraction
After the prosodic state labeling, the prosody text documents of all speakers are searched for important prosody keywords in order to establish a prosody keyword dictionary.
First, all possible combinations of the prosody words, including single words and word pairs (uni-grams and bi-grams), are listed and their frequency statistics are computed.
After calculating the histogram of all prosody-word frequencies, thresholds are set to keep only the high-frequency ones.
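As a concrete illustration of the step above, here is a minimal sketch of the uni-gram/bi-gram counting with a frequency threshold (the function name, the toy state labels, and the threshold value are illustrative, not from the paper):

```python
from collections import Counter

def extract_keywords(state_sequences, min_count=2):
    """Collect prosody-word uni-grams and bi-grams over all prosody-state
    sequences and keep only the high-frequency ones as 'keywords'."""
    counts = Counter()
    for seq in state_sequences:
        counts.update((s,) for s in seq)   # uni-grams
        counts.update(zip(seq, seq[1:]))   # bi-grams (adjacent state pairs)
    return {w for w, c in counts.items() if c >= min_count}

# toy prosody-state sequences (e.g. state 6 = phrase start, 3/4 = major breaks)
docs = [[6, 3, 3, 4], [6, 3, 4, 4], [6, 3, 3]]
keywords = extract_keywords(docs)
```

A real system would tune `min_count` from the frequency histogram rather than fix it in advance.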
Prosody keyword-speaker occurrence matrix statistics
The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words.
The occurrence counts of a speaker's keywords are recorded in a prosody keyword list vector that represents the long-term prosodic behavior of that speaker.
The prosody keyword-speaker occurrence matrix A is the collection of all speakers' prosody keyword list vectors.
To emphasize uncommon keywords and de-emphasize very common ones, the inverse document frequency (IDF) weighting method is applied.
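The paper does not give the exact IDF formula; one common variant, applied to a toy keyword-by-speaker count matrix, might look like:

```python
import numpy as np

def idf_weight(A):
    """IDF-weight a keyword-by-speaker occurrence matrix A
    (rows: prosody keywords, columns: speakers): keywords used by few
    speakers are emphasized, keywords every speaker uses get weight 0."""
    n_speakers = A.shape[1]
    df = np.count_nonzero(A, axis=1)             # speakers using each keyword
    idf = np.log(n_speakers / np.maximum(df, 1)) # 0 when all speakers use it
    return A * idf[:, None]

A = np.array([[3., 2., 4.],    # keyword used by all 3 speakers
              [5., 0., 0.]])   # keyword used by speaker 0 only
W = idf_weight(A)
```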
Eigen-prosody analysis
To reduce the dimension of the prosody space, the sparse prosody keyword-speaker occurrence matrix A is further analyzed using SVD to find a compact eigen-prosody space.
Given an m-by-n (m >> n) matrix A of rank R, A is decomposed and then approximated using only the K largest singular values:

A = U Σ V^T ≈ A_K = U_K Σ_K V_K^T

where A_K, U_K, V_K, and Σ_K are the rank-reduced versions of the respective matrices.
Score measurement
The test utterances of a test speaker are first labeled and parsed to form a pseudo query document y_Q, which is then transformed into a query vector v_Q in the eigen-prosody speaker space by

v_Q = y_Q^T U_K Σ_K^{-1}

The score of the test speaker against enrolled speaker i is then the cosine of the angle between the two speaker vectors:

S(v_Q, v_i) = (v_Q · v_i) / (||v_Q|| ||v_i||)
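A toy sketch of this fold-in and cosine scoring (the standard LSA-style projection v_Q = y_Q^T U_K Σ_K^{-1} is assumed; the matrix is random stand-in data):

```python
import numpy as np

def fold_in(y_q, U_K, s_K):
    """Project a query keyword-count vector into the K-dim eigen-prosody
    space: v_q = y_q^T U_K S_K^{-1} (S_K is diagonal, so divide by s_K)."""
    return (y_q @ U_K) / s_K

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
A = rng.random((50, 8))              # toy keyword-by-speaker matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
K = 3
U_K, s_K = U[:, :K], s[:K]

# each speaker's vector: fold in that speaker's own keyword counts
speakers = [fold_in(A[:, i], U_K, s_K) for i in range(A.shape[1])]

y_q = A[:, 5]                        # query built from speaker 5's counts
v_q = fold_in(y_q, U_K, s_K)
scores = [cosine(v_q, v) for v in speakers]
```

Because the query reuses speaker 5's counts, its folded-in vector coincides with that speaker's vector and the cosine score reaches 1.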
Speaker Identification Experiments
HTIMIT database and experiment conditions: a total of 384 speakers, each of whom gave ten utterances.
The set of 384 × 10 utterances was then played back and recorded through nine other handsets, including four carbon-button (碳墨式) handsets, four electret (電子式) handsets, and one portable cordless phone.
In this paper, all experiments were performed on the 302 speakers (151 female and 151 male) who have all ten utterances.
For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech.
The other ten three-utterance sessions of each speaker, one from each of the ten handsets, were used as the evaluation data.
Speaker Identification Experiments (cont.)
To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers.
For each speaker, a MAP-adapted GMM (MAP-GMM) was adapted from the UBM using that speaker's own enrollment speech.
A 32-state prosodic model was trained.
367 prosody keywords were extracted to form a sparse 367 × 302 matrix A.
The dimension of the eigen-prosody space is 3.
38 mel-frequency cepstral coefficients (MFCCs) were used, with a window size of 30 ms and a frame shift of 10 ms.
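The paper does not spell out the MAP update used to derive each MAP-GMM from the UBM; a common relevance-MAP mean adaptation (Reynolds-style, with an illustrative relevance factor r = 16) looks like:

```python
import numpy as np

def map_adapt_means(ubm_means, occupancy, data_means, r=16.0):
    """Relevance-MAP update of GMM means:
    m_hat_k = a_k * E_k + (1 - a_k) * m_k,  with  a_k = n_k / (n_k + r).
    Mixtures that see little enrollment data stay close to the UBM;
    well-observed mixtures move toward the speaker's data."""
    a = (occupancy / (occupancy + r))[:, None]
    return a * data_means + (1.0 - a) * ubm_means

ubm = np.array([[0.0, 0.0], [1.0, 1.0]])  # two 2-D mixture means
n = np.array([0.0, 1e6])                  # soft frame counts per mixture
E = np.array([[5.0, 5.0], [2.0, 2.0]])    # per-mixture enrollment data means
adapted = map_adapt_means(ubm, n, E)
```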
Conclusions
This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach.
Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets.
It is therefore a promising method for robust speaker identification under mismatched environments with limited available data.
Prosodic Modeling (cont.)
For example, state 6 is the phrase-start state, and states 3 and 4 are the major-break states.

Prosodic labeling
AKI Unseen Handset Estimation
The concept of AKI is to first collect a set of characteristics of seen handsets as the a priori knowledge, in order to construct a space of handsets.
Model-space AKI, using the MLLR model transformation, is chosen to compensate for the handset mismatch.
The estimate of the characteristic of a test handset is defined as

ĥ = Σ_{n=1}^{N} α_n h_n

where H = {h_n = (A_n, b_n, T_n), n = 1~N} is the set of a priori knowledge and the α_n are interpolation weights. Each handset characteristic h adapts the mixture components of a model via MLLR:

μ̂ = A μ + b
Σ̂ = B^T T B

where μ and μ̂ are the original and adapted mixture mean vectors, Σ and Σ̂ are the original and adapted mixture covariance matrices, A and T are the MLLR mean and variance transformation matrices, B is the inverse of the Choleski factor of the original inverse covariance matrix Σ^{-1} (i.e., Σ^{-1} = C C^T and B = C^{-1}), and b is the bias of the mixture component.
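Under these definitions, the MLLR adaptation of a single mixture component can be sketched as follows (a numpy sanity check, not the paper's code; note that with T = I the variance comes back unchanged, which verifies the B construction):

```python
import numpy as np

def mllr_adapt(mu, Sigma, A, b, T):
    """Adapt one Gaussian component with an MLLR transform h = (A, b, T):
    mean:     mu_hat    = A mu + b
    variance: Sigma_hat = B^T T B, where Sigma^{-1} = C C^T (Choleski)
              and B = C^{-1} is the inverse Choleski factor."""
    mu_hat = A @ mu + b
    C = np.linalg.cholesky(np.linalg.inv(Sigma))  # Sigma^{-1} = C C^T
    B = np.linalg.inv(C)
    return mu_hat, B.T @ T @ B

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
# identity transform: the component should come back unchanged
mu_hat, Sigma_hat = mllr_adapt(mu, Sigma, np.eye(2), np.zeros(2), np.eye(2))
```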
The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively
2005102420051024 Speech Lab NTNU 2323
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers
For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built
a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained
367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA
the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3
38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including
Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms
2005102420051024 Speech Lab NTNU 2424
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
2005102420051024 Speech Lab NTNU 2525
ConclusionsConclusions
This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach
Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets
It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata
2005102420051024 Speech Lab NTNU 2626
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
-
AKI Unseen Handset Estimation (cont.)

Eigen-Prosody Analysis
The EPA procedure includes:
VQ-based prosodic modeling and labeling
Segmenting the sequences of prosody states to extract important prosody keywords
Calculating the occurrence statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrix
Applying the singular value decomposition (SVD) technique to decompose the matrix and build an eigen-prosody space
Measuring the speaker distance using the cosine of the angle between two speaker vectors
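The first step above, VQ-based prosodic modeling and labeling, can be sketched as follows. This is a minimal illustration rather than the paper's exact model: the codebook size (8 here, for brevity; the experiments use 32 states) and the toy pitch/energy features are assumptions.

```python
import numpy as np

def train_vq_codebook(features, n_states=8, n_iter=20, seed=0):
    """Train a VQ codebook (plain Lloyd/k-means) on prosodic feature vectors."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), n_states, replace=False)]
    for _ in range(n_iter):
        # assign each frame to its nearest codeword
        labels = np.argmin(((features[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
        for k in range(n_states):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook

def label_prosody_states(features, codebook):
    """Map a feature sequence to its prosody-state index sequence (the 'prosody text')."""
    return np.argmin(((features[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)

# toy pitch/energy pairs, one row per frame
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 2))
cb = train_vq_codebook(feats)
states = label_prosody_states(feats, cb)
```

Labeling each utterance this way turns a speaker's speech into a discrete state sequence that the later keyword-extraction steps treat as text.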
Eigen-Prosody Analysis (cont.)

Prosody keyword extraction
After the prosodic state labeling, the prosody text documents of all speakers are searched to find important prosody keywords, in order to establish a prosody keyword dictionary.
First, all possible combinations of the prosody words, including single words and word pairs (unigrams and bigrams), are listed and their frequency statistics are computed.
After calculating the histogram of all prosody word frequencies, thresholds are set to keep only the high-frequency ones.
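The keyword-extraction step can be sketched as below; the threshold value and the toy state sequences are assumptions, since the slides give only the general procedure.

```python
from collections import Counter

def extract_keywords(state_sequences, min_count=3):
    """Count unigrams and bigrams of prosody states across all speakers,
    then keep only the high-frequency ones as the keyword dictionary."""
    counts = Counter()
    for seq in state_sequences:
        counts.update((s,) for s in seq)  # unigrams (single prosody words)
        counts.update(zip(seq, seq[1:]))  # bigrams (prosody word pairs)
    return {w for w, c in counts.items() if c >= min_count}

# two toy prosody "documents" (state sequences)
docs = [[0, 1, 1, 2, 0, 1], [1, 2, 0, 1, 1, 2]]
keywords = extract_keywords(docs, min_count=2)
```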
Prosody keyword-speaker occurrence matrix statistics
The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words.
The occurrence counts of a speaker's keywords are recorded in a prosody keyword list vector to represent the long-term prosodic behavior of that specific speaker.
The prosody keyword-speaker occurrence matrix A is made up of the collection of all speakers' prosody keyword list vectors.
To emphasize the uncommon keywords and to deemphasize the very common ones, the inverse document frequency (IDF) weighting method is applied.
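The IDF weighting can be sketched as follows. The slides do not give the exact formula, so the common log(N/df) variant is assumed here.

```python
import numpy as np

def idf_weight(A):
    """Apply inverse-document-frequency weighting to a keyword-by-speaker
    count matrix A (rows: keywords, columns: speakers)."""
    n_speakers = A.shape[1]
    df = np.count_nonzero(A, axis=1)              # how many speakers use each keyword
    idf = np.log(n_speakers / np.maximum(df, 1))  # rare keywords get large weights
    return A * idf[:, None]

A = np.array([[2.0, 0.0, 1.0],   # keyword used by 2 of 3 speakers
              [1.0, 1.0, 1.0]])  # keyword used by all speakers -> weight 0
W = idf_weight(A)
```

A keyword shared by every speaker carries no discriminative information, so its row is driven to zero, exactly the emphasis/deemphasis effect described above.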
Eigen-prosody analysis

In order to reduce the dimension of the prosody space, the sparse prosody keyword-speaker occurrence matrix A is further analyzed using SVD to find a compact eigen-prosody space.
Given an m by n (m >> n) matrix A of rank R, A is decomposed and further approximated using only the largest K singular values as

A = U Σ V^T ≈ A_K = U_K Σ_K V_K^T

where A_K, U_K, V_K, and Σ_K are the rank-reduced versions of the respective matrices.
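The truncated SVD can be sketched with NumPy; the matrix sizes (367 keywords by 302 speakers, K = 3) follow the experiment section, while the random matrix stands in for the real keyword counts.

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k approximation A_K = U_K Sigma_K V_K^T via the full SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.normal(size=(367, 302))        # keyword-by-speaker matrix (paper's sizes)
Uk, sk, Vtk = truncated_svd(A, k=3)    # the paper keeps a 3-dimensional space
A_k = Uk @ np.diag(sk) @ Vtk           # best rank-3 approximation of A
```

The columns of Vtk.T then locate each enrolled speaker in the compact eigen-prosody space.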
Eigen-Prosody Analysis (cont.)

Score measurement
The test utterances of a test speaker are first labeled and parsed to form the pseudo query document y_Q, and then transformed into the query vector v_Q in the eigen-prosody speaker space by

v_Q = y_Q^T U_K Σ_K^(-1)

S(v_Q, v_i) = (v_Q · v_i) / (||v_Q|| ||v_i||),  i = 1, ..., P

where v_i is the vector of speaker i in the eigen-prosody space and P is the number of enrolled speakers.
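The query fold-in and cosine scoring can be sketched as follows; the toy matrix sizes are assumptions, and the speaker vectors are taken as the rows of V_K as in standard latent semantic analysis.

```python
import numpy as np

def fold_in_query(y_q, Uk, sk):
    """Project a query keyword-count vector into the eigen-prosody space:
    v_Q = y_Q^T U_K Sigma_K^(-1)."""
    return (y_q @ Uk) / sk

def cosine_scores(v_q, V):
    """Cosine similarity between the query vector and each speaker vector
    (one speaker per row of V)."""
    num = V @ v_q
    den = np.linalg.norm(V, axis=1) * np.linalg.norm(v_q)
    return num / den

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))                      # toy keyword-by-speaker matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :3], s[:3], Vt[:3, :].T          # speaker vectors are rows of Vk
y_q = np.abs(rng.normal(size=50))                  # toy query keyword counts
scores = cosine_scores(fold_in_query(y_q, Uk, sk), Vk)
best = int(np.argmax(scores))                      # identified speaker index
```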
Speaker Identification Experiments

HTIMIT database and experiment conditions: a total of 384 speakers, each of whom gave ten utterances.
The set of 384 x 10 utterances was then played back and recorded through nine other different handsets, including four carbon-button and four electret handsets and one portable cordless phone.
In this paper, all experiments were performed on the 302 speakers (151 females and 151 males) who have all ten utterances.
For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech.
The other ten three-utterance sessions of each speaker, one from each of the ten handsets, were used as the evaluation data.
Speaker Identification Experiments (cont.)

To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers.
For each speaker, a MAP-adapted GMM (MAP-GMM) was built by adapting the UBM with that speaker's own enrollment speech.
A 32-state prosodic model was trained.
367 prosody keywords were extracted to form a sparse 367 x 302-dimensional matrix A.
The dimension of the eigen-prosody space is 3.
38 mel-frequency cepstral coefficients (MFCCs) were used, with a window size of 30 ms and a frame shift of 10 ms.
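The MAP adaptation of the UBM mentioned above can be sketched as follows. Only the standard relevance-MAP update of the mixture means is shown, with identity covariances for brevity; the relevance factor r = 16 is a common choice and is not stated in the slides.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, frames, r=16.0):
    """Relevance-MAP adaptation of UBM mixture means toward a speaker's data."""
    # posterior responsibility of each mixture for each frame
    d2 = ((frames[:, None, :] - ubm_means[None]) ** 2).sum(-1)
    log_p = np.log(ubm_weights)[None] - 0.5 * d2
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    n_k = post.sum(axis=0)                       # soft occupation counts
    ex_k = post.T @ frames                       # first-order statistics
    alpha = n_k / (n_k + r)                      # data-dependent adaptation weight
    return alpha[:, None] * (ex_k / np.maximum(n_k, 1e-10)[:, None]) \
        + (1.0 - alpha)[:, None] * ubm_means

rng = np.random.default_rng(0)
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_weights = np.array([0.5, 0.5])
frames = rng.normal(loc=[0.5, 0.5], scale=0.3, size=(100, 2))  # data near mixture 0
adapted = map_adapt_means(ubm_means, ubm_weights, frames)
```

Mixtures that see little of the speaker's data keep their UBM means, which is what makes MAP adaptation workable with only seven enrollment utterances per speaker.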
Speaker Identification Experiments (cont.)

Conclusions
This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach.
Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets.
It is therefore a promising method for robust speaker identification under mismatched environments and with limited available data.
2005102420051024 Speech Lab NTNU 2626
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
-
2005102420051024 Speech Lab NTNU 1515
Eigen-Prosody AnalysisEigen-Prosody Analysis
The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling
Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords
Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix
Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space
Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors
2005102420051024 Speech Lab NTNU 1616
Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)
2005102420051024 Speech Lab NTNU 1717
Prosody keyword extractionProsody keyword extraction
After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary
First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed
After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones
2005102420051024 Speech Lab NTNU 1818
Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics
The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words
The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker
The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors
To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied
2005102420051024 Speech Lab NTNU 1919
Eigen-Prosody analysisEigen-Prosody analysis
In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space
Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as
Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m
atricesatrices
TKKKK
T VUAVUA
2005102420051024 Speech Lab NTNU 2020
Eigen-Prosody Analysis Eigen-Prosody Analysis
2005102420051024 Speech Lab NTNU 2121
Score measurementScore measurement
The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t
he eigen-prosody speaker space byhe eigen-prosody speaker space by
iKQ
iKT
QiKQ
Pi
KKTQQ
v
vvS
Uy
1
)(
2005102420051024 Speech Lab NTNU 2222
Speaker Identification ExperimentsSpeaker Identification Experiments
HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances
The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone
In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces
For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech
The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively
2005102420051024 Speech Lab NTNU 2323
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers
For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built
a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained
367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA
the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3
38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including
Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms
2005102420051024 Speech Lab NTNU 2424
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
2005102420051024 Speech Lab NTNU 2525
ConclusionsConclusions
This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach
Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets
It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata
2005102420051024 Speech Lab NTNU 2626
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
-
2005102420051024 Speech Lab NTNU 1616
Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)
2005102420051024 Speech Lab NTNU 1717
Prosody keyword extractionProsody keyword extraction
After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary
First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed
After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones
2005102420051024 Speech Lab NTNU 1818
Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics
The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words
The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker
The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors
To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied
2005102420051024 Speech Lab NTNU 1919
Eigen-Prosody analysisEigen-Prosody analysis
In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space
Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as
Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m
atricesatrices
TKKKK
T VUAVUA
2005102420051024 Speech Lab NTNU 2020
Eigen-Prosody Analysis Eigen-Prosody Analysis
2005102420051024 Speech Lab NTNU 2121
Score measurementScore measurement
The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t
he eigen-prosody speaker space byhe eigen-prosody speaker space by
iKQ
iKT
QiKQ
Pi
KKTQQ
v
vvS
Uy
1
)(
2005102420051024 Speech Lab NTNU 2222
Speaker Identification ExperimentsSpeaker Identification Experiments
HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances
The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone
In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces
For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech
The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively
2005102420051024 Speech Lab NTNU 2323
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers
For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built
a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained
367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA
the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3
38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including
Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms
2005102420051024 Speech Lab NTNU 2424
Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)
2005102420051024 Speech Lab NTNU 2525
ConclusionsConclusions
This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach
Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets
It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata
2005102420051024 Speech Lab NTNU 2626
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
-
2005102420051024 Speech Lab NTNU 1717
Prosody keyword extractionProsody keyword extraction
After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary
First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed
After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones
2005102420051024 Speech Lab NTNU 1818
Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics
The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words
The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker
The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors
To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied
Eigen-prosody analysis
In order to reduce the dimension of the prosody space, the sparse prosody keyword-speaker occurrence matrix A is further analyzed using singular value decomposition (SVD) to find a compact eigen-prosody space.
Given an m × n (m >> n) matrix A of rank R, A is decomposed and further approximated using only the largest K singular values as

A = U Σ V^T ≈ A_K = U_K Σ_K V_K^T

where A_K, U_K, V_K, and Σ_K are the rank-reduced versions of the respective matrices.
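A minimal numerical sketch of this rank-K truncation; the matrix sizes here are placeholders, not the paper's 367 × 302 setup.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the sparse keyword-by-speaker matrix A (m >> n).
m, n, K = 50, 8, 3
A = rng.random((m, n))

# Thin SVD; numpy returns singular values in descending order,
# so keeping the first K columns/rows keeps the K largest.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_K, s_K, Vt_K = U[:, :K], s[:K], Vt[:K, :]
A_K = U_K @ np.diag(s_K) @ Vt_K   # rank-K approximation of A

# By the Eckart-Young theorem, A_K is the best rank-K approximation
# of A in the Frobenius norm.
err = np.linalg.norm(A - A_K)
```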
Eigen-Prosody Analysis
Score measurement
The test utterances of a test speaker are first labeled and parsed to form a pseudo query document y_Q, which is then transformed into the query vector v_Q in the eigen-prosody speaker space by

v_Q = y_Q^T U_K Σ_K^{-1}

The match score S(v_Q, v_i) is then measured between v_Q and each enrolled speaker vector v_i, i = 1, …, P.
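Assuming the standard latent-semantic-analysis folding-in formulation that the slide's equation follows, the query projection and scoring can be sketched as below. The matrix sizes, the choice of Σ_K V_K^T columns as speaker vectors, and the cosine score are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, P, K = 50, 5, 3            # keywords, enrolled speakers, eigen-prosody dims
A = rng.random((m, P))        # keyword-speaker occurrence matrix (stand-in)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_K, S_K = U[:, :K], np.diag(s[:K])
# Columns of Sigma_K V_K^T serve as the enrolled speaker vectors v_i.
speaker_vecs = (S_K @ Vt[:K, :]).T          # shape (P, K)

# Pseudo query document y_Q: keyword counts from the test utterances.
y_Q = rng.random(m)
# Folding-in: v_Q = y_Q^T U_K Sigma_K^{-1}
v_Q = y_Q @ U_K @ np.linalg.inv(S_K)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(v_Q, v_i) for v_i in speaker_vecs]
best = int(np.argmax(scores))               # identified speaker index
```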
Speaker Identification Experiments
HTIMIT database and experiment conditions: a total of 384 speakers, each of whom gave ten utterances.
The set of 384 × 10 utterances was then played back and recorded through nine other different handsets, including four carbon-button handsets, four electret handsets, and one portable cordless phone.
In this paper, all experiments were performed on the 302 speakers (151 female and 151 male) who have all ten utterances.
For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech.
The other ten three-utterance sessions of each speaker, one from each of the ten handsets, were used as the evaluation data, respectively.
Speaker Identification Experiments (cont.)
To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers.
For each speaker, a MAP-adapted GMM (MAP-GMM), adapted from the UBM using his or her own enrollment speech, was built.
A 32-state prosodic model was trained.
367 prosody keywords were extracted to form a sparse 367 × 302-dimensional matrix A.
The dimension of the eigen-prosody space is 3.
38 mel-frequency cepstral coefficients (MFCCs), including
Window size of 30 ms and frame shift of 10 ms.
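A sketch of the 30 ms / 10 ms framing described above. The 16 kHz sampling rate and the Hamming window are assumptions, and the downstream FFT/mel/DCT stages of MFCC extraction are omitted.

```python
import numpy as np

sr = 16000                       # assumed sampling rate
win = int(0.030 * sr)            # 30 ms analysis window -> 480 samples
hop = int(0.010 * sr)            # 10 ms frame shift     -> 160 samples

signal = np.zeros(sr)            # one second of (dummy) speech
n_frames = 1 + (len(signal) - win) // hop
# Slice the signal into overlapping frames; each row is one analysis window.
frames = np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])
frames *= np.hamming(win)        # typical windowing before the spectral stages
```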
Speaker Identification Experiments (cont.)
Conclusions
This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach.
Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets.
It is therefore a promising method for robust speaker identification under mismatched environments and limited available data.