Post on 13-Dec-2015
July 2011
Age and Gender Recognition from Speech Patterns Based on Supervised Non-Negative Matrix Factorization
Mohamad Hasan Bahari
Hugo Van hamme
Outline
Introduction and Motivations
Age and Gender Recognition
Corpora
Supervised Non-negative Matrix Factorization
Proposed Method
Results
Conclusions and Future Research
Introduction
Confirming the identity of individuals
Biometric characteristics:
• Fingerprint
• Face
• Iris
• Hand geometry
• Ear shape
• Voice pattern
• …
Choosing a characteristic:
• Availability
• Reliability
Motivation
In many real-world cases, only speech patterns are available (kidnapping, threatening calls, …)
Speech patterns can carry a lot of useful information:
• Gender
• Age
• Dialect (original or previous regions)
• Membership of a particular social group
• …
To help identify a criminal
To narrow down the number of suspects
Goal:
To extract different physical and psychological characteristics of the speaker from his/her voice patterns (Speaker Profiling).
Physical:
1. Gender
2. Age
3. Accent
4. …
Psychological:
1. Anxiousness
2. Stress
3. Confidence
4. …
Age and Gender Recognition
Three approaches:
I. Directly from speech signal.
II. Modeling the speech generation system.
III. Modeling the hearing system.
I. Directly from the speech signal
Several acoustic features vary with age:
1) Fundamental frequency
2) Speech rate
3) Sound pressure level
4) …
Approach: find all acoustic features that vary with age and their exact relation to the speaker's age.
Advantage: conceptually simple and computationally inexpensive.
Drawback: these features are also affected by many other parameters, such as weight, height, voice quality, emotional condition, …
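To make the first acoustic feature concrete, the sketch below estimates the fundamental frequency of a single voiced frame with a plain autocorrelation search. It is only an illustration under stated assumptions: the function name `estimate_f0`, the 60-400 Hz search range, and the synthetic test frame are all hypothetical and do not come from the slides.

```python
import math

def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    Picks the lag with the highest autocorrelation inside the pitch
    range [f_min, f_max]. The range is an assumption of this sketch.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    mean = sum(frame) / len(frame)
    x = [v - mean for v in frame]  # remove the DC offset

    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic frame: a 120 Hz tone plus one harmonic, 50 ms at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 120 * n / sample_rate)
         + 0.5 * math.sin(2 * math.pi * 240 * n / sample_rate)
         for n in range(400)]
f0 = estimate_f0(frame, sample_rate)
print(round(f0, 1))  # close to 120 Hz, limited by the integer lag resolution
```

A single feature like this illustrates why the direct approach is cheap, but it says nothing about the other factors that influence the same measurement.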
Effect of Age and Gender on speech (Fundamental frequency) [1]
[1] W. S. Brown, R. J. Morris, H. Hollien, and E. Howell, Journal of Voice, vol. 5, pp. 310–315, 1991.
Age is only one of the inputs affecting speech and, consequently, the acoustic features.
It is impossible to estimate age without considering the other inputs.
Perceptions of gender and age have a significant mutual impact on each other.
II. Modeling the speech generation system
This is an input estimation problem.
Drawback: modeling the speech generation system of the speaker is very difficult.
III. Modeling the hearing system
To solve the speech recognition problem, the hearing system is modeled using Hidden Markov Models (HMMs).
This approach reuses the tools applied in speech recognition (HMMs).
Advantage: well established.
Advantage: accurate in recognizing content.
Drawback: there is a difference between the perceived age of a speaker and their actual age.
Drawback: computationally complex.
Corpora
Category name     Age      Number of speakers
Young Male        18-35    85
Young Female      18-35    53
Middle Male       36-45    160
Middle Female     36-45    41
Senior Male       46-81    191
Senior Female     46-81    25
555 speakers from the N-best evaluation corpus [1]
The corpus contains live and read commentaries, news, interviews, and reports broadcast in Belgium
Different age groups and genders
[1] D. A. Van Leeuwen, J. Kessens, E. Sanders, and H. van den Heuvel, In proc. Interspeech, pp. 2571-2574, 2009.
SNMF
Non-negative Matrix Factorization (NMF) is a popular machine learning algorithm [1].
It can be used in supervised or unsupervised mode.
Supervised NMF (SNMF) is a pattern recognition method [1]:
• It is very effective when the input space is high-dimensional.
• It is a generative classifier.
• It can directly classify patterns into multiple classes (no need to recast the problem as multiple binary classifications).
[1] H. Van hamme, In proc. Interspeech, Australia, pp. 2554-2557, 2008.
Problem Statement:
Given a training data set $S^{tr} = \{(x_1, y_1), \ldots, (x_n, y_n), \ldots, (x_N, y_N)\}$:
$x_n$ is a vector of observed characteristics of the $n$-th data item
$y_n$ is a label vector representing the class that $x_n$ belongs to
Goal:
Approximate a classifier function $g$ such that $\hat{y} = g(x^{tst})$ is as close as possible to the true label,
where $x^{tst}$ is an unseen observation.
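As a small illustration of the notation, the sketch below encodes the class of a training item as a label vector y_n and decodes an estimated label vector back to a category. The one-hot encoding, the category order, and the helper names are assumptions of this sketch; the slides only state that y_n represents the class of x_n.

```python
# The category order is an assumption made for this sketch.
CATEGORIES = ["Young Male", "Young Female", "Middle Male",
              "Middle Female", "Senior Male", "Senior Female"]

def label_vector(category):
    """Return a one-hot label vector for one of the six categories."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return [1.0 if c == category else 0.0 for c in CATEGORIES]

def decode(y_hat):
    """Map an estimated label vector to the category with the largest entry."""
    best = max(range(len(y_hat)), key=lambda i: y_hat[i])
    return CATEGORIES[best]

y = label_vector("Middle Female")
print(y, decode(y))
```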
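SNMF in Training Phase:
First step: collect the training labels and the training data in one matrix
$$V_S^{tr} = [\, y_1 \;\cdots\; y_{N_{tr}} \,], \qquad V_B^{tr} = [\, x_1 \;\cdots\; x_{N_{tr}} \,], \qquad V^{tr} = \begin{bmatrix} V_S^{tr} \\ V_B^{tr} \end{bmatrix}$$
Second step: factorize this matrix
$$V^{tr} = \begin{bmatrix} V_S^{tr} \\ V_B^{tr} \end{bmatrix} \approx W^{tr} H^{tr} = \begin{bmatrix} W_S^{tr} \\ W_B^{tr} \end{bmatrix} H^{tr}$$
Extended Kullback-Leibler divergence:
$$D_{KL}\left(V^{tr} \,\|\, W^{tr} H^{tr}\right) = \sum_{m,n} \left( V^{tr}_{mn} \log \frac{V^{tr}_{mn}}{[W^{tr} H^{tr}]_{mn}} - V^{tr}_{mn} + [W^{tr} H^{tr}]_{mn} \right)$$
Multiplicative updating formula:
$$H^{tr} \leftarrow H^{tr} \odot \left[ (W^{tr})^{T} \left( V^{tr} \oslash (W^{tr} H^{tr}) \right) \right] \oslash \left[ (W^{tr})^{T} \mathbf{1}_{M \times N} \right]$$
$$W^{tr} \leftarrow W^{tr} \odot \left[ \left( V^{tr} \oslash (W^{tr} H^{tr}) \right) (H^{tr})^{T} \right] \oslash \left[ \mathbf{1}_{M \times N} (H^{tr})^{T} \right]$$
Here $\odot$ and $\oslash$ denote element-wise multiplication and division, and $\mathbf{1}_{M \times N}$ is an all-ones matrix of the same size as $V^{tr}$.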
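The training phase can be sketched in NumPy as follows. This is a minimal sketch rather than the authors' implementation: the random initialization, the fixed iteration count, the small constant `eps`, the one-hot labels, and the toy data are all assumptions introduced for illustration.

```python
import numpy as np

def train_snmf(V_S, V_B, n_components, n_iter=200, eps=1e-12, seed=0):
    """Factorize V = [V_S; V_B] ~ W H with KL multiplicative updates.

    V_S: (n_classes, N) label matrix, V_B: (n_features, N) data matrix,
    both non-negative. Returns W_S, W_B and H.
    """
    rng = np.random.default_rng(seed)
    V = np.vstack([V_S, V_B])                 # stack labels on top of data
    M, N = V.shape
    W = rng.random((M, n_components)) + eps   # random non-negative start
    H = rng.random((n_components, N)) + eps
    ones = np.ones((M, N))
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    n_classes = V_S.shape[0]
    return W[:n_classes], W[n_classes:], H

def kl_divergence(V, WH, eps=1e-12):
    """Extended KL divergence; entries with V_mn = 0 contribute WH_mn."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

# Toy problem: 2 classes, 3 non-negative features, 6 training items.
rng = np.random.default_rng(1)
labels = np.array([0, 0, 0, 1, 1, 1])
V_S = np.eye(2)[:, labels]                    # one-hot label vectors as columns
prototypes = np.array([[5.0, 1.0],
                       [1.0, 5.0],
                       [3.0, 3.0]])           # hypothetical class profiles
V_B = prototypes[:, labels] + 0.1 * rng.random((3, 6))

W_S, W_B, H = train_snmf(V_S, V_B, n_components=2)
V = np.vstack([V_S, V_B])
W = np.vstack([W_S, W_B])
print(kl_divergence(V, W @ H))
```

Because the updates are multiplicative, factors that start non-negative stay non-negative, so no explicit projection step is needed.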
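SNMF in Testing Phase:
First step: estimate the activations of the unseen observation, with $W_B^{tr}$ kept fixed
$$x^{tst} \approx W_B^{tr} H^{tst}, \qquad H^{tst} = \arg\min_{H} D_{KL}\left(x^{tst} \,\|\, W_B^{tr} H\right)$$
Second step: estimate the label vector
$$\hat{y} = g(x^{tst}) = W_S^{tr} H^{tst}$$
Extended Kullback-Leibler divergence:
$$D_{KL}\left(x^{tst} \,\|\, W_B^{tr} H^{tst}\right) = \sum_{m} \left( x^{tst}_{m} \log \frac{x^{tst}_{m}}{[W_B^{tr} H^{tst}]_{m}} - x^{tst}_{m} + [W_B^{tr} H^{tst}]_{m} \right)$$
Multiplicative updating formula:
$$H^{tst} \leftarrow H^{tst} \odot \left[ (W_B^{tr})^{T} \left( x^{tst} \oslash (W_B^{tr} H^{tst}) \right) \right] \oslash \left[ (W_B^{tr})^{T} \mathbf{1}_{M \times 1} \right]$$
Here $\odot$ and $\oslash$ denote element-wise multiplication and division, and $\mathbf{1}_{M \times 1}$ is an all-ones vector of the same size as $x^{tst}$.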
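The testing phase can be sketched in the same way. In this minimal NumPy sketch the factor matrices are hypothetical hand-written values standing in for trained ones, and taking the class with the largest entry of the estimated label vector is an assumed decision rule that the slides do not spell out.

```python
import numpy as np

def snmf_classify(x, W_S, W_B, n_iter=200, eps=1e-12):
    """Estimate activations h for x ~ W_B h, then return (W_S h, h).

    W_B is kept fixed; only h is updated with the KL multiplicative rule.
    """
    h = np.ones(W_B.shape[1])                     # positive starting point
    col_sums = W_B.T @ np.ones(W_B.shape[0])      # W_B^T 1
    for _ in range(n_iter):
        h *= (W_B.T @ (x / (W_B @ h + eps))) / (col_sums + eps)
    return W_S @ h, h

# Hypothetical trained factors: 2 classes, 3 features, 2 components.
W_S = np.array([[1.0, 0.0],
                [0.0, 1.0]])
W_B = np.array([[5.0, 1.0],
                [1.0, 5.0],
                [3.0, 3.0]])

x_tst = np.array([4.8, 1.2, 3.1])   # resembles the first column of W_B
y_hat, h_tst = snmf_classify(x_tst, W_S, W_B)
predicted_class = int(np.argmax(y_hat))  # assumed decision rule
print(y_hat, predicted_class)
```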
Proposed Method
1. Feature selection
2. Acoustic modeling
3. Supervector making procedure
4. Training phase
5. Testing phase
1. Feature selection
• Mel spectra
• Mean normalization
• Vocal tract length normalization
• Augmented with their first- and second-order time derivatives
[Diagram: Speech signal → Feature selection → Feature vectors]
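Two of the listed steps, mean normalization and the derivative augmentation, can be sketched as below. This is an illustration under stated assumptions: the input is a hypothetical matrix of 24 mel-spectral coefficients per frame, the derivatives are simple finite differences from `np.gradient` rather than whatever scheme the authors used, and mel analysis and vocal tract length normalization are not covered.

```python
import numpy as np

def mean_normalize(features):
    """Subtract the per-coefficient mean over time; features is (T, D)."""
    return features - features.mean(axis=0, keepdims=True)

def add_derivatives(features):
    """Append first- and second-order time derivatives, giving (T, 3*D)."""
    delta = np.gradient(features, axis=0)    # finite differences over time
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([features, delta, delta2])

# Hypothetical input: 50 frames of 24 mel-spectral coefficients.
rng = np.random.default_rng(0)
mel = rng.random((50, 24))
feats = add_derivatives(mean_normalize(mel))
print(feats.shape)
```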
2. Acoustic modeling
Speaker-independent model:
• An HMM with a shared pool of 49740 Gaussians models the observations in 3873 cross-word context-dependent tied triphone states.
Adaptation method:
• The speaker-dependent mixture weights of each speaker are obtained by re-estimating the speaker-independent weights on a forced alignment of that speaker's training data, computed with the speaker-independent acoustic model.
The result of this step is 555 speaker-adapted models.
[Diagram: Speaker-independent model → Speaker adaptation method → Model of the speaker]
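The weight re-estimation can be illustrated with the generic EM update for mixture weights when the Gaussians themselves stay fixed. This is a sketch of that textbook update, not necessarily the exact recipe used by the authors; the function name, the three-Gaussian state, and the likelihood values are hypothetical.

```python
def reestimate_state_weights(si_weights, frame_likelihoods):
    """One EM re-estimation of the mixture weights of a single state.

    si_weights: speaker-independent weights of the state (sum to 1).
    frame_likelihoods: one list per frame aligned to this state, holding
        the likelihood of that frame under each Gaussian of the state.
    """
    occupancy = [0.0] * len(si_weights)
    for likes in frame_likelihoods:
        joint = [w * p for w, p in zip(si_weights, likes)]
        total = sum(joint)
        for j, value in enumerate(joint):
            occupancy[j] += value / total   # posterior of Gaussian j
    n_frames = len(frame_likelihoods)
    return [occ / n_frames for occ in occupancy]

# Hypothetical state with three Gaussians and three aligned frames.
si_weights = [0.5, 0.3, 0.2]
frames = [[0.9, 0.1, 0.1],
          [0.8, 0.2, 0.1],
          [0.1, 0.1, 0.7]]
new_weights = reestimate_state_weights(si_weights, frames)
print(new_weights)
```

The forced alignment decides which frames belong to which state; the update above is then applied state by state.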
3. Supervector making procedure
The Gaussian Mixture Model (GMM) of each speaker-adapted HMM is:
$$f^{s}(o_t) = \sum_{j=1}^{J} w_j^{s}\, \mathcal{N}\left(o_t;\, \mu_j^{s}, \Sigma_j^{s}\right)$$
Three types of supervectors:
1. Means
2. Variances
3. Weights
Weights supervectors:
[Equation: the weights supervector, formed by stacking the mixture weights $w_j^{s}$ into one vector]
The result of this step is one supervector for each of the 555 speakers.
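A weights supervector can be sketched as a plain concatenation of the per-state mixture weights of a speaker-adapted model. The speaker names, the number of states, and the weight values below are hypothetical; the sketch also arranges the supervectors as columns of a data matrix, which is an assumption about how they are fed to SNMF.

```python
def weight_supervector(state_weights):
    """Concatenate the mixture-weight vectors of all states into one list."""
    return [w for weights in state_weights for w in weights]

# Hypothetical adapted models: two states with 3 and 2 Gaussians each.
speakers = {
    "spk_a": [[0.7, 0.2, 0.1], [0.5, 0.5]],
    "spk_b": [[0.1, 0.3, 0.6], [0.9, 0.1]],
}
supervectors = {name: weight_supervector(sw) for name, sw in speakers.items()}

# Data matrix with one column per speaker (rows are supervector entries).
names = sorted(supervectors)
dim = len(supervectors[names[0]])
V_B = [[supervectors[name][i] for name in names] for i in range(dim)]
print(dim, len(V_B[0]))
```

Mixture weights are non-negative by construction, which is what makes them directly usable as input to a non-negative factorization.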
4. Training phase
5. Testing phase
Results
Evaluation methodology: 5-fold cross-validation (five independent runs)
In each of the five runs:
• The training set is the speech data of 444 speakers
• The testing set is the speech data of 111 speakers
Database split per run (TST = testing part, TR = training part):
Run 1: TST TR TR TR TR
Run 2: TR TST TR TR TR
…
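The split sizes follow directly from dividing 555 speakers into five equal parts. The sketch below reproduces them; the random shuffling and the seed are assumptions, since the slides do not say how speakers were assigned to the parts.

```python
import random

def five_fold_splits(speaker_ids, n_folds=5, seed=0):
    """Yield (train, test) speaker lists, one pair per run."""
    ids = list(speaker_ids)
    random.Random(seed).shuffle(ids)              # assumed random assignment
    folds = [ids[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train, test

splits = list(five_fold_splits(range(555)))
for run, (train, test) in enumerate(splits, start=1):
    print(f"Run {run}: {len(train)} training speakers, {len(test)} testing speakers")
```

Splitting by speaker, as here, keeps all speech of a test speaker out of the training set.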
Gender recognition accuracy is 96%.
Age group recognition: relative confusion matrix
AC \ CL   YM     YF     MM     MF     SM     SF
YM        0.13   0.03   0.58   0      0.26   0
YF        0.02   0.77   0.04   0.11   0.057  0
MM        0.06   0.01   0.44   0.01   0.47   0
MF        0      0.54   0.02   0.24   0.17   0.02
SM        0.03   0.01   0.19   0      0.76   0
SF        0      0.2    0.08   0.28   0.28   0.16

Category name     Prior (%)   Accuracy (%)
Young Male        15          13
Young Female      10          77
Middle Male       29          44
Middle Female     7           24
Senior Male       34          76
Senior Female     4           16
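The priors can be recomputed from the speaker counts on the Corpora slide and set against the reported per-category accuracies. The sketch below does only that arithmetic; treating the prior as the chance level of each category is an assumption of this illustration.

```python
# Speaker counts from the Corpora slide and accuracies (%) from the table above.
counts = {"Young Male": 85, "Young Female": 53, "Middle Male": 160,
          "Middle Female": 41, "Senior Male": 191, "Senior Female": 25}
accuracy = {"Young Male": 13, "Young Female": 77, "Middle Male": 44,
            "Middle Female": 24, "Senior Male": 76, "Senior Female": 16}

total = sum(counts.values())
priors = {c: 100.0 * n / total for c, n in counts.items()}

# Categories recognized more often than their prior would suggest.
above_chance = [c for c in counts if accuracy[c] > priors[c]]

for c in counts:
    print(f"{c:14s} prior {priors[c]:5.1f}%  accuracy {accuracy[c]:3d}%")
print(above_chance)
```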
Conclusions and Future Research
Conclusions:
1. A new age and gender recognition method based on SNMF was proposed.
2. Supervectors of GMM weights were used as features.
3. The method was evaluated on the N-best corpus.
4. Gender recognition accuracy is 96%.
5. Age group recognition accuracy is significantly higher than chance level.
Future research:
1. Age estimation instead of age group recognition.
2. Using supervectors of GMM means and variances, and combining these features.
Thank You for Your Attention