Introducing Attribute Features to Foreign Accent Recognition
Hamid Behravan, Ville Hautamäki, Sabato Marco Siniscalchi, Tomi Kinnunen and
Chin-Hui Lee
School of Computing, University of Eastern Finland
Faculty of Architecture and Engineering, University of Enna "Kore", Italy
School of ECE, Georgia Institute of Technology, USA
[Figure labels: Gender, Language, Content, Dialect, Emotion, Age, Accent, Personality]
What Is Accent And Foreign Accent?
Accent refers to different ways of speaking a language. Foreign spoken accents are caused by the influence of one's first language on the second one.
Approaches in Accent Recognition Systems
- Phonotactic modeling: based on phonemes and phone distributions
- Acoustic modeling: based on spectral characteristics of the audio signals
[Figure: phonotactic system block diagram. Labels: Speech Signal, Feature Extraction, Phone Recognizer, N-gram Statistics and Modeling, Scoring, Transcription dictionary]
[Figure: acoustic system block diagram. Labels: Speech Signal, Feature Extraction, Time Differences, Gaussian Mixture Modeling, Scoring, Dialect/Accent labels, Training Phase, Decision]
Universal Attribute Characterization
Manner of articulation describes how the tongue, lips, jaw, and other speech organs make contact when producing a sound:
- Nasal: /m/, /n/
- Stop: /p/, /b/, /k/
- Fricative: /f/, /v/, /θ/
- Vowel: /a/, /e/, /i/
- Approximant: /w/, /r/
- Glide: /j/
Voicing refers to the presence or absence of vocal fold vibration.
Place of articulation specifies the position at which a constriction in the vocal tract occurs.
Example: /d/
- Place of articulation = alveolar
- Manner of articulation = stop
- Voicing = voiced (the vocal folds are vibrating)
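The three descriptors above can be read as a small record per phone. The following is a minimal sketch, assuming a hand-written lookup table (`PHONE_ATTRIBUTES` and `describe` are hypothetical names, not part of the presented system) that covers only a few consonants with their standard phonetic descriptions.

```python
from typing import NamedTuple


class Attributes(NamedTuple):
    place: str    # where the constriction occurs
    manner: str   # how the articulators make contact
    voiced: bool  # whether the vocal folds vibrate


# Hypothetical illustration table; a real system detects attributes from audio.
PHONE_ATTRIBUTES = {
    "p": Attributes("bilabial", "stop", False),
    "b": Attributes("bilabial", "stop", True),
    "m": Attributes("bilabial", "nasal", True),
    "f": Attributes("labiodental", "fricative", False),
    "v": Attributes("labiodental", "fricative", True),
    "d": Attributes("alveolar", "stop", True),
    "n": Attributes("alveolar", "nasal", True),
    "k": Attributes("velar", "stop", False),
}


def describe(phone: str) -> str:
    """Return a one-line attribute description of a phone."""
    a = PHONE_ATTRIBUTES[phone]
    voicing = "voiced" if a.voiced else "voiceless"
    return f"/{phone}/: place={a.place}, manner={a.manner}, voicing={voicing}"


print(describe("d"))  # /d/: place=alveolar, manner=stop, voicing=voiced
```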
Why Speech Attributes?
- Language-universal descriptors
- Statistics of their co-occurrences change from one language to another
- Only a small number of universal attributes completely characterize any spoken language (Siniscalchi et al. 2013)
- Universal attribute detectors can be designed by sharing data among different languages
How Are Attributes Extracted From Audio Signals?
The internal structure of an attribute detector
For each frame f, the detector outputs three posterior probabilities: P(model | f), P(anti-model | f), and P(noise | f).
[Figure: per-detector feature vectors and the resulting attribute feature vector, with examples for a Finnish speaker and an Indian speaker]
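One way to picture the attribute feature vector is as the concatenation of every detector's three posteriors for a frame. The sketch below assumes six detectors (so 6 x 3 = 18 dimensions, matching the 18-dimensional attribute features reported later); the detector count, the softmax normalization, and the random stand-in scores are all assumptions for illustration, not the authors' detector.

```python
import numpy as np

N_DETECTORS = 6  # assumption: 6 detectors x 3 posteriors = 18 dimensions
N_CLASSES = 3    # model, anti-model, noise


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attribute_features(detector_scores):
    """Map raw scores of shape (n_frames, N_DETECTORS, N_CLASSES)
    to posterior feature vectors of shape (n_frames, N_DETECTORS * N_CLASSES)."""
    posteriors = softmax(detector_scores, axis=-1)
    return posteriors.reshape(posteriors.shape[0], -1)


rng = np.random.default_rng(0)
# Random stand-in for the outputs of real attribute detectors.
scores = rng.normal(size=(100, N_DETECTORS, N_CLASSES))
features = attribute_features(scores)
print(features.shape)  # (100, 18)
```

Each consecutive triple in a row sums to one, since it holds the three posteriors of a single detector.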
I-vector Approach
S = m + T w
S: utterance-dependent GMM supervector
m: UBM mean supervector
T: total variability matrix
w: i-vector
[Figure: Signal, supervector S = m + T w, i-vector w, Cosine Scoring]
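Cosine scoring compares the i-vector w of a test utterance with a representative i-vector per accent. A minimal sketch follows, assuming each accent is modeled by the mean of its training i-vectors and using toy 4-dimensional vectors; the function names and the toy data are hypothetical, and the slides do not state how the accent models are built.

```python
import numpy as np


def cosine_score(w_test, w_model):
    """Cosine similarity between two i-vectors."""
    return float(w_test @ w_model / (np.linalg.norm(w_test) * np.linalg.norm(w_model)))


def build_accent_models(ivectors, labels):
    """Assumed model: the mean of each accent's training i-vectors."""
    labels = np.asarray(labels)
    return {str(a): ivectors[labels == a].mean(axis=0) for a in np.unique(labels)}


# Toy data: two accents clustered around different directions.
rng = np.random.default_rng(0)
dim = 4
centers = {"english": np.eye(dim)[0], "estonian": np.eye(dim)[1]}
train = np.vstack([c + 0.1 * rng.normal(size=(20, dim)) for c in centers.values()])
labels = [name for name in centers for _ in range(20)]

models = build_accent_models(train, labels)
w_test = centers["estonian"] + 0.1 * rng.normal(size=dim)
scores = {accent: cosine_score(w_test, m) for accent, m in models.items()}
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

Because the score depends only on the angle between vectors, it is insensitive to the magnitude of the i-vectors.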
Finnish National Foreign Language Certificate Corpus
Accent     #Speakers  #Train  #Test
Spanish    15         60      25
Albanian   19         67      30
Kurdish    21         83      35
Turkish    22         84      34
English    23         92      37
Estonian   28         153     63
Arabic     42         166     67
Russian    235        599     211
Two Examples From FSD Corpus
- Language is Finnish; the accent is English
- Language is Finnish; the accent is Estonian
Features (dimensionality)   Avg_EER (%)   Avg. detection cost
SDC+MFCC (56)               15.00         7.00
Attributes (18)             12.54         5.07
Attribute+∆ (36)            11.33         4.79
Attribute+∆+∆∆ (54)         11.00         4.59
Avg. detection cost: average detection cost from the NIST evaluation metric in LRE 2009
Attribute Features Outperform Spectral Baseline System
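The ∆ and ∆∆ rows append first- and second-order time derivatives to the 18 attribute posteriors, which accounts for the growth from 18 to 36 to 54 dimensions. The sketch below assumes a simple central-difference derivative via `numpy.gradient`; the slides do not say which delta formula was used, and regression-window deltas are a common alternative.

```python
import numpy as np


def add_deltas(features, order=1):
    """Append time derivatives up to `order` along the frame axis.

    features: array of shape (n_frames, dim), n_frames >= 2.
    Returns an array of shape (n_frames, dim * (order + 1)).
    """
    blocks = [features]
    for _ in range(order):
        # Central differences in the interior, one-sided at the edges.
        blocks.append(np.gradient(blocks[-1], axis=0))
    return np.hstack(blocks)


rng = np.random.default_rng(0)
attrs = rng.random(size=(50, 18))         # stand-in attribute features
with_delta = add_deltas(attrs, order=1)   # Attribute+∆
with_ddelta = add_deltas(attrs, order=2)  # Attribute+∆+∆∆
print(with_delta.shape, with_ddelta.shape)  # (50, 36) (50, 54)
```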
Temporal Context Capturing Using PCA
Context of C frames: stacked dimensionality = 18 * C, with C = 5, 20, 30
PCA dimensionality reduction: d is set to retain 99% of the cumulative variance, giving d = 23, 50, 96
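The context step can be sketched as stacking C frames into one long vector and then projecting with PCA, keeping the smallest d whose components reach 99% cumulative variance. This is a minimal illustration under assumptions: the exact windowing used by the authors is not given in the slides, the helper names are hypothetical, and the random input below compresses far less than real attribute features would.

```python
import numpy as np


def stack_context(frames, C):
    """Concatenate each run of C consecutive frames:
    (T, dim) -> (T - C + 1, dim * C)."""
    T = frames.shape[0]
    return np.hstack([frames[i:T - C + 1 + i] for i in range(C)])


def pca_fit(X, retained=0.99):
    """Return the mean and the top-d principal directions, where d is the
    smallest number of components reaching `retained` cumulative variance."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    d = min(int(np.searchsorted(cumulative, retained)) + 1, len(s))
    return mean, vt[:d]


def pca_transform(X, mean, components):
    """Project centered data onto the retained principal directions."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
frames = rng.random(size=(500, 18))  # stand-in attribute features
X = stack_context(frames, C=5)       # shape (496, 90)
mean, components = pca_fit(X, retained=0.99)
Z = pca_transform(X, mean, components)
print(X.shape, Z.shape)
```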
PCA features        Avg_EER (%)   Avg. detection cost
(C = 1, d = 18)     12.54         5.07
(C = 5, d = 23)     10.65         4.82
(C = 20, d = 50)    10.44         4.71
(C = 30, d = 96)    8.73          4.47
Increasing the Context Size Improves System Performance
Baseline Attribute
Accent     EER (%)   C_DET * 100
Turkish    3.82      2.01
Albanian   4.34      2.48
Arabic     7.46      4.04
English    8.11      4.20
Kurdish    8.57      4.67
Spanish    9.00      4.10
Estonian   12.70     6.11
Russian    15.54     8.17
Large Variations in EER (%)
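The EER reported in these tables is the operating point at which the miss rate and the false-alarm rate are equal. The following is a minimal sketch of one way to estimate it from detection scores, assuming a simple threshold sweep over the observed scores; the function name and the synthetic scores are hypothetical and this is not the NIST scoring tool.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss: a target trial scored below the threshold.
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False alarm: a non-target trial scored at or above the threshold.
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])


rng = np.random.default_rng(0)
targets = rng.normal(loc=1.0, scale=1.0, size=500)      # synthetic target scores
nontargets = rng.normal(loc=-1.0, scale=1.0, size=500)  # synthetic non-target scores
eer = equal_error_rate(targets, nontargets)
print(f"EER = {100 * eer:.2f}%")
```

An average EER can then be obtained by computing this value per accent and taking the mean.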
Some Attributes Are Individually Important Whereas Some Are Not
Features (Dim.)   Avg_EER
SDC+MFCC          13.82     7.87
Manner (18)       11.09     6.18
Place (27)        12.00     7.27
Attributes Also Do Well on NIST08 English Data
Conclusion
We introduced universal attribute units to the foreign accent recognition task.
We modeled the sequence of speech attribute feature vectors using the i-vector methodology.
Adding context information reduces Avg_EER from 12.54% to 8.73%.
Foreign Accent Detection Android App.