Emotional Speech Recognition
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of this task would be
incomplete without mentioning the people who made it possible, whose constant guidance
and encouragement crowned all the efforts with success.
I express my heartfelt thanks to Mrs. A. Vijayalakshmi, my project supervisor, and
Dr. J. V. R. Ravindra, PG coordinator, Department of Electronics & Communication Engineering,
Vardhaman College of Engineering, for their valuable guidance and encouragement during this
project work.
I am particularly thankful to Prof. Y. Pandu Rangaiah, Head, Department of Electronics
and Communication Engineering, for his immense support and encouragement, which helped me
mould my project into a successful one.
I express my gratitude to Dr. N. Sambasiva Rao, Principal, for having provided all the
facilities and support.
I also thank all the staff of Electronics and Communication Engineering Department for
their valuable support and generous advice.
Finally, I thank all my friends and family for their continuous support and enthusiastic
help.
(R.UGANDHAR)
Abstract
A human conveys emotion as well as linguistic information via speech signals. The
emotion in speech makes verbal communications natural, emphasizes a speaker’s intention, and
shows one’s psychological state. During expressive speech, the voice is enriched to convey not
only the intended semantic message but also the emotional state of the speaker. The pitch
contour is one of the important properties of speech that is affected by this emotional
modulation. Although pitch features have been commonly used to recognize emotions, it is not
clear what aspects of the pitch contour are the most emotionally salient. This report presents an
analysis of the statistics derived from the pitch contour. First, pitch features derived from
emotional speech samples are compared with the ones derived from neutral speech, by using
symmetric Kullback–Leibler distance. Then, the emotionally discriminative power of the pitch
features is quantified by comparing nested logistic regression models. The results indicate that
gross pitch contour statistics such as mean, maximum, minimum, and range are more
emotionally prominent than features describing the pitch shape. Also, analyzing the pitch
statistics at the utterance level is found to be more accurate and robust than analyzing the pitch
statistics for shorter speech regions (e.g., voiced segments). Finally, the best features are selected
to build a binary emotion detection system for distinguishing between emotional versus neutral
speech.
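As a rough illustration of the analysis outlined above, the sketch below (Python; the helper names are hypothetical, not part of this work) extracts the gross utterance-level pitch statistics the abstract identifies as most salient, and scores how far an emotional feature distribution sits from a neutral one using the symmetric Kullback-Leibler distance, under the simplifying assumption that each feature is modelled as a univariate Gaussian.

```python
import numpy as np

def pitch_statistics(f0):
    """Gross utterance-level pitch statistics (mean, max, min, range).

    f0 is a sequence of per-frame F0 values in Hz, with 0 marking
    unvoiced frames, which are excluded from the statistics.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[f0 > 0]
    return {
        "mean": voiced.mean(),
        "max": voiced.max(),
        "min": voiced.min(),
        "range": voiced.max() - voiced.min(),
    }

def symmetric_kl_gaussian(x, y):
    """Symmetric Kullback-Leibler distance between two feature samples,
    each fitted with a univariate Gaussian: KL(p||q) + KL(q||p)."""
    mx, my = np.mean(x), np.mean(y)
    vx, vy = np.var(x), np.var(y)
    kl_xy = 0.5 * (np.log(vy / vx) + (vx + (mx - my) ** 2) / vy - 1.0)
    kl_yx = 0.5 * (np.log(vx / vy) + (vy + (my - mx) ** 2) / vx - 1.0)
    return kl_xy + kl_yx
```

In this scheme, a feature with a larger symmetric KL distance between its emotional and neutral distributions is taken to be more emotionally discriminative, which is how the gross statistics (mean, maximum, minimum, range) can be ranked against pitch-shape features.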
CONTENTS
Acknowledgement iii
Abstract iv
Contents v
List of Figures vii
List of Tables viii
1. INTRODUCTION 1
1.1. Literature Survey 2
1.2. Existing System 3
1.3. Proposed System Approach 4
2. SPEECH PROCESSING-BASIC TERMINOLOGY 7
2.1. Statistical Model of Audio Signal Representation 7
2.1.1. The Short-Time Fourier Transform 9
2.1.2. Basic Terminology in Speech 10
2.2. Speech Production 12
2.2.1. The Human Voice 12
2.2.2. Factors Associated With Speech 14
2.2.3. Special Types of Voiced and Unvoiced Sounds 17
2.3. Pitch 19
2.3.1. Speech Production and Various Properties of Speech Signals 19
2.3.2. How Does Pitch Relate To Speech Production 20
2.4. Pitch Estimation 21
2.5. Global Parameters of F0 Contour 22
2.6. Stressed Syllables of Accentuation 22
2.7. Syllable Based F0 Contours 23
2.7.1. Direction of F0 Contour 23
2.7.2. Steepness of Rising and Falling F0 23
2.7.3. Progression of F0 with Respect to Accentuation 23
2.7.4. F0-Contours of Sentences 24
3. SPEECH DATABASES AND DIFFERENT PITCH FEATURES 28
3.1. Databases 28
3.1.1. Non-Emotional Corpus 28
3.1.2. Emotional Corpus 29
3.2. Normalization 31
3.2.1. Energy Normalization 31
3.2.2. Pitch Normalization 31
3.3. Statistical Analysis of Pitch Features 32
4. METHODOLOGY 36
4.1. Discriminant Analysis Using Symmetric Kullback-Leibler Distance 39
4.2. Logistic Regression Analysis 41
4.2.1. Regression with Single Variable 43
4.2.2. Regression with Multiple Variables 44
4.3. Analysis of Pitch Features 44
4.3.1. Emotion Classification Techniques 45
4.3.2. Classification Techniques that Employ Prosody Features 46
4.3.3. Classification Techniques that Employ Statistics of Prosody Features 46
5. SIMULATION RESULTS 47
6. CONCLUSION AND FUTURE SCOPE 51
7. REFERENCES 52
APPENDIX 53
Software Requirement 53
I. MATLAB 53
II. PRAAT Speech Processing Software 69
LIST OF FIGURES
Figure 2.1(a) Time Domain Representation of an Audio Signal 8
Figure 2.1(b) Representation of Audio Signal in Time/Frequency Domain 8
Figure 2.1(c) Representation of Audio Signal in Frequency Domain 8
Figure 2.2 A Schematic View of Human Speech Production 12
Figure 2.3 Block Diagram of Human Speech Production System 13
Figure 2.4 Spectra of Voiced and Unvoiced Sounds 15
Figure 2.5 Physiological Components of Human Speech Production System 19
Figure 2.6 F0 Contours of Emotional and Neutral Utterances 25
Figure 2.6 F0 Contours of Emotional and Neutral Utterances 26
Figure 2.6 F0 Contours of Emotional and Neutral Utterances 27
LIST OF TABLES
Table 3.1 Summary of Databases 29
Table 3.2 Sentence and Voice Level Features Extracted from F0 33
Table 3.3 Additional Sentence Level Features 35
Table 4.1 Discriminant Classifiers for Emotion Recognition 46