Emotional Speech Recognition
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of this task would be
incomplete without mentioning the people who made it possible, whose constant guidance
and encouragement crowned all the efforts with success.
I express my heartfelt thanks to Mrs. A. Vijayalakshmi, my project supervisor, and
Dr. J. V. R. Ravindra, PG coordinator, Department of Electronics & Communication Engineering,
Vardhaman College of Engineering, for their valuable guidance and encouragement during this
project work.
I am particularly thankful to Prof. Y. Pandu Rangaiah, Head, Department of Electronics
and Communication Engineering, for his immense support and encouragement, which helped me
mould my project into a successful one.
I express my gratitude to Dr. N. Sambasiva Rao, Principal, for having provided all the
facilities and support.
I also thank all the staff of Electronics and Communication Engineering Department for
their valuable support and generous advice.
Finally, I thank all my friends and family for their continuous support and enthusiastic
help.
(R.UGANDHAR)
Abstract
A human conveys emotion as well as linguistic information via speech signals. The
emotion in speech makes verbal communications natural, emphasizes a speaker’s intention, and
shows one’s psychological state. During expressive speech, the voice is enriched to convey not
only the intended semantic message but also the emotional state of the speaker. The pitch
contour is one of the important properties of speech that is affected by this emotional
modulation. Although pitch features have been commonly used to recognize emotions, it is not
clear what aspects of the pitch contour are the most emotionally salient. This report presents an
analysis of the statistics derived from the pitch contour. First, pitch features derived from
emotional speech samples are compared with the ones derived from neutral speech, by using
symmetric Kullback–Leibler distance. Then, the emotionally discriminative power of the pitch
features is quantified by comparing nested logistic regression models. The results indicate that
gross pitch contour statistics such as mean, maximum, minimum, and range are more
emotionally prominent than features describing the pitch shape. Also, analyzing the pitch
statistics at the utterance level is found to be more accurate and robust than analyzing the pitch
statistics for shorter speech regions (e.g., voiced segments). Finally, the best features are selected
to build a binary emotion detection system for distinguishing between emotional versus neutral
speech.
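As a rough illustration of the analysis outlined above, the sketch below (Python; the helper names are hypothetical, not part of this work) extracts the gross utterance-level pitch statistics the abstract identifies as most salient, and scores how far an emotional feature distribution sits from a neutral one using the symmetric Kullback-Leibler distance, under the simplifying assumption that each feature is modelled as a univariate Gaussian.

```python
import numpy as np

def pitch_statistics(f0):
    """Gross utterance-level pitch statistics (mean, max, min, range).

    f0 is a sequence of per-frame F0 values in Hz, with 0 marking
    unvoiced frames, which are excluded from the statistics.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[f0 > 0]
    return {
        "mean": voiced.mean(),
        "max": voiced.max(),
        "min": voiced.min(),
        "range": voiced.max() - voiced.min(),
    }

def symmetric_kl_gaussian(x, y):
    """Symmetric Kullback-Leibler distance between two feature samples,
    each fitted with a univariate Gaussian: KL(p||q) + KL(q||p)."""
    mx, my = np.mean(x), np.mean(y)
    vx, vy = np.var(x), np.var(y)
    kl_xy = 0.5 * (np.log(vy / vx) + (vx + (mx - my) ** 2) / vy - 1.0)
    kl_yx = 0.5 * (np.log(vx / vy) + (vy + (my - mx) ** 2) / vx - 1.0)
    return kl_xy + kl_yx
```

In this scheme, a feature with a larger symmetric KL distance between its emotional and neutral distributions is taken to be more emotionally discriminative, which is how the gross statistics (mean, maximum, minimum, range) can be ranked against pitch-shape features.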
CONTENTS
Acknowledgement iii
Abstract iv
Contents v
List of Figures vii
List of Tables viii
1. INTRODUCTION 1
1.1. Literature Survey 2
1.2. Existing System 3
1.3. Proposed System Approach 4
2. SPEECH PROCESSING-BASIC TERMINOLOGY 7
2.1. Statistical Model of Audio Signal Representation 7
2.1.1. The Short-Time Fourier Transform 9
2.1.2. Basic Terminology in Speech 10
2.2. Speech Production 12
2.2.1. The Human Voice 12
2.2.2. Factors Associated With Speech 14
2.2.3. Special Types of Voiced and Unvoiced Sounds 17
2.3. Pitch 19
2.3.1. Speech Production and Various Properties of Speech Signals 19
2.3.2. How Does Pitch Relate To Speech Production 20
2.4. Pitch Estimation 21
2.5. Global Parameters of F0 Contour 22
2.6. Stressed Syllables of Accentuation 22
2.7. Syllable Based F0 Contours 23
2.7.1. Direction of F0 Contour 23
2.7.2. Steepness of Rising and Falling F0 23
2.7.3. Progression of F0 with Respect to Accentuation 23
2.7.4. F0-Contours of Sentences 24
3. SPEECH DATABASES AND DIFFERENT PITCH FEATURES 28
3.1. Databases 28
3.1.1. Non-Emotional Corpus 28
3.1.2. Emotional Corpus 29
3.2. Normalization 31
3.2.1. Energy Normalization 31
3.2.2. Pitch Normalization 31
3.3. Statistical Analysis of Pitch Features 32
4. METHODOLOGY 36
4.1. Discriminant Analysis Using Symmetric Kullback-Leibler Distance 39
4.2. Logistic Regression Analysis 41
4.2.1. Regression with Single Variable 43
4.2.2. Regression with Multiple Variables 44
4.3. Analysis of Pitch Features 44
4.3.1. Emotion Classification Techniques 45
4.3.2. Classification Techniques that Employ Prosody Features 46
4.3.3. Classification Techniques that Employ Statistics of Prosody Features 46
5. SIMULATION RESULTS 47
6. CONCLUSION AND FUTURE SCOPE 51
7. REFERENCES 52
APPENDIX 53
Software Requirement 53
I. MATLAB 53
II. PRAAT Speech Processing Software 69
LIST OF FIGURES
Figure 2.1(a) Time Domain Representation of an Audio Signal 8
Figure 2.1(b) Representation of Audio Signal in Time/Frequency Domain 8
Figure 2.1(c) Representation of Audio Signal in Frequency Domain 8
Figure 2.2 A Schematic View of Human Speech Production 12
Figure 2.3 Block Diagram of Human Speech Production System 13
Figure 2.4 Spectra of Voiced and Unvoiced Sounds 15
Figure 2.5 Physiological Components of Human Speech Production System 19
Figure 2.6 F0 Contours of Emotional and Neutral Utterances 25
Figure 2.6 F0 Contours of Emotional and Neutral Utterances 26
Figure 2.6 F0 Contours of Emotional and Neutral Utterances 27
LIST OF TABLES
Table 3.1 Summary of Databases 29
Table 3.2 Sentence and Voice Level Features Extracted from F0 33
Table 3.3 Additional Sentence Level Features 35
Table 4.1 Discriminant Classifiers for Emotion Recognition 46