
Introducing Attribute Features to Foreign Accent Recognition

Hamid Behravan, Ville Hautamäki, Sabato Marco Siniscalchi, Tomi Kinnunen and Chin-Hui Lee

School of Computing, University of Eastern Finland

Faculty of Architecture and Engineering, University of Enna "Kore", Italy

School of ECE, Georgia Institute of Technology, USA

[Figure labels: Gender, Language, Content, Dialect, Emotion, Age, Accent, Personality]

What Is Accent And Foreign Accent?

An accent is a distinctive way of speaking a language. A foreign accent arises when a speaker's first language influences how they speak a second language.

Approaches in Accent Recognition Systems

Phonotactic modeling: based on phonemes and phone distributions

Acoustic modeling: based on spectral characteristics of the audio signals

[Diagram, phonotactic system: Speech Signal, Feature Extraction, Phone Recognizer, N-gram Statistics and Modeling, Scoring; with a Transcription dictionary]

[Diagram, acoustic system: Speech Signal, Feature Extraction, Time Differences, Gaussian Mixture Modeling, Scoring]

[Additional diagram labels: Dialect/Accent labels, Training Phase, Decision]
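The N-gram statistics step of the phonotactic approach can be illustrated with a minimal sketch. It assumes the phone recognizer has already produced a sequence of phone labels; the example sequence and the function names are hypothetical, and the slides do not specify the N-gram order or any smoothing.

```python
from collections import Counter


def ngram_counts(phones, n=2):
    """Count the n-grams in a decoded phone sequence."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))


def ngram_relative_frequencies(phones, n=2):
    """Turn n-gram counts into relative frequencies (no smoothing)."""
    counts = ngram_counts(phones, n)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


# Hypothetical phone recognizer output for one utterance.
phones = ["t", "a", "l", "o", "t", "a"]
bigrams = ngram_counts(phones, n=2)
freqs = ngram_relative_frequencies(phones, n=2)
```

Per-accent models would then be trained on such statistics and compared at scoring time; that part is left out here because the slides do not describe it.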

Universal Attribute Characterization

Manner of articulation describes how the tongue, lips, jaw, and other speech organs interact to produce a sound:

- Nasal: /m/, /n/
- Stop: /p/, /b/, /k/
- Fricative: /f/, /v/, /θ/
- Vowel: /a/, /e/, /i/
- Approximant: /w/, /r/
- Glide: /j/

Voicing refers to the presence or absence of vocal fold vibration.

Place of articulation specifies the position at which a constriction in the vocal tract occurs.

Example: /d/

- Place of articulation = alveolar
- Manner of articulation = stop
- Voicing = voiced (the vocal folds are vibrating)
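As a small illustration of this characterization, a phone can be represented by its place, manner, and voicing. The sketch below is only a lookup table: the /d/ entry comes from the slide, the other entries are standard phonetic descriptions added for illustration, and the names `Attributes`, `PHONE_ATTRIBUTES`, and `phones_with_manner` are hypothetical.

```python
from typing import NamedTuple


class Attributes(NamedTuple):
    place: str
    manner: str
    voiced: bool


# /d/ is the example from the slide; the remaining entries are standard
# phonetic descriptions added here for illustration.
PHONE_ATTRIBUTES = {
    "d": Attributes("alveolar", "stop", True),
    "p": Attributes("bilabial", "stop", False),
    "b": Attributes("bilabial", "stop", True),
    "k": Attributes("velar", "stop", False),
    "m": Attributes("bilabial", "nasal", True),
    "n": Attributes("alveolar", "nasal", True),
    "f": Attributes("labiodental", "fricative", False),
    "v": Attributes("labiodental", "fricative", True),
}


def phones_with_manner(manner):
    """Return the phones in the table that share a manner of articulation."""
    return sorted(p for p, a in PHONE_ATTRIBUTES.items() if a.manner == manner)


nasals = phones_with_manner("nasal")
```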

Why Speech Attributes?

- They are language-universal descriptors

- The statistics of their co-occurrences change from one language to another

- A small number of universal attributes is enough to characterize any spoken language (Siniscalchi et al. 2013)

- Universal attribute detectors can be designed by sharing data among different languages

How Are Attributes Extracted From Audio Signals?

The internal structure of an attribute detector

For each frame f, the detector outputs three posterior probabilities: P(model | f), P(anti-model | f), and P(noise | f).

[Figure labels: Feature vector, Attribute feature vector; Example: Finnish speaker; Example: Indian speaker]
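One way to read the attribute feature vector is as the concatenation of the three posteriors from every detector. With the six manner classes listed earlier this gives 6 × 3 = 18 dimensions, which matches the 18-dimensional attribute features in the results. The sketch below assumes that layout; the detector ordering, the function name, and the posterior values are hypothetical.

```python
# Assumed detector set: the six manner classes named in the slides.
MANNER_DETECTORS = ["nasal", "stop", "fricative", "vowel", "approximant", "glide"]


def frame_feature_vector(posteriors, detectors=MANNER_DETECTORS):
    """Concatenate (P(model|f), P(anti-model|f), P(noise|f)) over all detectors."""
    vector = []
    for name in detectors:
        p_model, p_anti, p_noise = posteriors[name]
        vector.extend([p_model, p_anti, p_noise])
    return vector


# Hypothetical detector outputs for a single frame.
frame_posteriors = {name: (0.1, 0.8, 0.1) for name in MANNER_DETECTORS}
frame_posteriors["vowel"] = (0.9, 0.05, 0.05)

features = frame_feature_vector(frame_posteriors)
```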

I-vector Approach

S = m + T w

- S: utterance-dependent GMM supervector
- m: UBM mean supervector
- T: total variability matrix
- w: i-vector

[Diagram: Signal, S = m + T w, w → Cosine Scoring]
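Cosine scoring compares a test i-vector with a target i-vector through the cosine of the angle between them. The sketch below assumes each accent is represented by a single model i-vector (for example, an average over its training utterances), which the slides do not state; the three-dimensional vectors are made up for illustration.

```python
import math


def cosine_score(w_test, w_target):
    """Cosine similarity between two i-vectors of equal length."""
    dot = sum(a * b for a, b in zip(w_test, w_target))
    norm = math.sqrt(sum(a * a for a in w_test)) * math.sqrt(sum(b * b for b in w_target))
    if norm == 0.0:
        raise ValueError("cosine score is undefined for a zero vector")
    return dot / norm


def classify(w_test, accent_models):
    """Pick the accent whose model i-vector scores highest against w_test."""
    return max(accent_models, key=lambda accent: cosine_score(w_test, accent_models[accent]))


# Hypothetical low-dimensional i-vectors, for illustration only.
accent_models = {
    "Russian": [1.0, 0.2, 0.0],
    "Estonian": [0.0, 1.0, 0.3],
}
predicted = classify([0.9, 0.1, 0.1], accent_models)
```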

Finnish National Foreign Language Certificate Corpus

Accent      #Speakers   #Train   #Test
Spanish         15         60      25
Albanian        19         67      30
Kurdish         21         83      35
Turkish         22         84      34
English         23         92      37
Estonian        28        153      63
Arabic          42        166      67
Russian        235        599     211

Two Examples From FSD Corpus

- The language is Finnish; the accent is English

- The language is Finnish; the accent is Estonian

Features (dimensionality)   Avg_EER (%)   Avg. detection cost
SDC+MFCC (56)                  15.00            7.00
Attributes (18)                12.54            5.07
Attribute+∆ (36)               11.33            4.79
Attribute+∆+∆∆ (54)            11.00            4.59

Avg. detection cost: average detection cost from the NIST evaluation metric in LRE 2009

Attribute Features Outperform Spectral Baseline System
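The ∆ and ∆∆ rows are read here as first- and second-order time differences appended to the 18-dimensional attribute features, which is consistent with the dimensionalities 36 and 54. The sketch below uses a simple symmetric difference with edge replication; the slides do not say which delta formula or window was used, so this is an assumption.

```python
def deltas(frames):
    """Symmetric time difference (x[t+1] - x[t-1]) / 2, replicating edge frames."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        out.append([(n - p) / 2.0 for p, n in zip(prev_frame, next_frame)])
    return out


def append_deltas(frames, order=2):
    """Append delta streams up to the given order to every frame."""
    streams = [frames]
    for _ in range(order):
        streams.append(deltas(streams[-1]))
    return [sum((stream[t] for stream in streams), []) for t in range(len(frames))]


# Hypothetical 18-dimensional attribute features for five frames.
frames = [[float(t)] * 18 for t in range(5)]
with_delta = append_deltas(frames, order=1)         # 36 dimensions per frame
with_delta_delta = append_deltas(frames, order=2)   # 54 dimensions per frame
```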

Temporal Context Capturing Using PCA

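C consecutive 18-dimensional attribute feature vectors are stacked into one context vector of dimension 18 × C, with C = 5, 20, 30.

PCA dimensionality reduction is then applied to the stacked vectors.

The output dimension d is set to retain 99% of the cumulative variance in PCA, giving d = 23, 50, 96.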
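The two steps, stacking context frames and choosing d from the cumulative variance, can be sketched as follows. The sliding window with a step of one frame is an assumption, as the slides do not specify how windows are placed, and the eigenvalues are invented; in practice they would come from the PCA of the stacked training vectors.

```python
def stack_context(frames, context):
    """Concatenate `context` consecutive frames (sliding window, step of one frame)."""
    return [sum(frames[t:t + context], []) for t in range(len(frames) - context + 1)]


def components_for_variance(eigenvalues, retain=0.99):
    """Smallest number of leading components whose variance share reaches `retain`."""
    values = sorted(eigenvalues, reverse=True)
    total = sum(values)
    cumulative = 0.0
    for d, value in enumerate(values, start=1):
        cumulative += value
        if cumulative / total >= retain:
            return d
    return len(values)


# Ten hypothetical 18-dimensional frames, stacked with C = 5.
frames = [[float(t)] * 18 for t in range(10)]
stacked = stack_context(frames, context=5)

# Made-up PCA eigenvalues, for illustration only.
d = components_for_variance([95.0, 4.5, 0.4, 0.1], retain=0.99)
```

With C = 5 each stacked vector has 18 × 5 = 90 dimensions before PCA, which the slides report as reduced to d = 23.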

PCA features        Avg_EER (%)   Avg. detection cost
(C = 1, d = 18)        12.54            5.07
(C = 5, d = 23)        10.65            4.82
(C = 20, d = 50)       10.44            4.71
(C = 30, d = 96)        8.73            4.47

Increasing the Context Size Improves System Performance
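To put the gains in relative terms, the average EER values reported above can be compared against the no-context setting (C = 1). This is plain arithmetic on the reported numbers; the function name is hypothetical.

```python
def relative_reduction(baseline, value):
    """Relative reduction in percent of `value` with respect to `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Average EER (%) per context size C, as reported in the slides.
avg_eer = {1: 12.54, 5: 10.65, 20: 10.44, 30: 8.73}

reductions = {
    context: relative_reduction(avg_eer[1], eer) for context, eer in avg_eer.items()
}
```

For C = 30 this works out to a relative reduction of roughly 30% in average EER.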

[Figure legend: Baseline, Attribute]

Accent      EER (%)   C_DET × 100
Turkish       3.82       2.01
Albanian      4.34       2.48
Arabic        7.46       4.04
English       8.11       4.20
Kurdish       8.57       4.67
Spanish       9.00       4.10
Estonian     12.70       6.11
Russian      15.54       8.17

Large Variations in EER (%)
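For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false acceptance rate equals the false rejection rate. The sketch below estimates it by sweeping a threshold over the observed scores and taking the point where the two rates are closest. It is a generic illustration with made-up scores, not the evaluation tool used for these results.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a decision threshold over all observed scores."""
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Made-up detection scores, for illustration only.
eer_separable = equal_error_rate([2.0, 3.0], [0.0, 1.0])
eer_overlapping = equal_error_rate([1.0, 3.0], [0.0, 2.0])
```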

Some Attributes Are Individually Important, Whereas Others Are Not

Features (Dim.)   Avg_EER (%)   Avg. detection cost
SDC+MFCC             13.82            7.87
Manner (18)          11.09            6.18
Place (27)           12.00            7.27

Attributes Also Do Well on NIST08 English Data

Conclusion

- We introduced universal attribute units to the foreign accent recognition task.

- We modeled the sequence of speech attribute feature vectors using the i-vector methodology.

- Adding context information reduces the average EER (Avg_EER) from 12.54% to 8.73%.

Foreign Accent Detection Android App.

