LID/SID - Research Stay at BUT: Last Presentation


Page 1: LID/SID - Research Stay at BUT Last Presentation

LID/SID - Research Stay at BUT: Last Presentation

Luis Fernando D’Haro, Polytechnical University of Madrid

Granted by the “José Castillejo” fellowship, Ministry of Education - Spanish Government

February 20th, 2012

Page 2: LID/SID - Research Stay at BUT Last Presentation


Outline

Research stay goals

Work on phonotactic LID:
  Discriminative n-grams
  New phonotactic system using i-vectors and a multinomial subspace model

Work on LID-RATS: VAD and LID

Future work

Page 3: LID/SID - Research Stay at BUT Last Presentation


Research Stay Goals

To work with the most recent techniques for LID, such as i-Vectors, sGMM, WCCN, and score calibration/fusion.

To test our ranking templates and discriminative n-gram selection approach together with the acoustic i-Vector system on the LID task. Ideas: fusion of scores; selection of discriminative n-grams.

Collaboration on current BUT campaigns: RATS, LRE, SRE.

Publications.

Page 4: LID/SID - Research Stay at BUT Last Presentation


Work on Phonotactic LID

LID based on ranking positions and distance. Original idea: [Cavnar and Trenkle, 1994].

Page 5: LID/SID - Research Stay at BUT Last Presentation

Improvements to the Ranking Approach

One ranking for each n-gram order.

Golf position: all n-grams with the same number of occurrences share the same position in the ranking.

Discriminative positions in the ranking: the n-grams most relevant for each language (i.e., very frequent in one language but not in the others) are pushed to higher positions of the ranking.

A new formula inspired by tf-idf, providing normalized scores (between 1 and -1).

Advantages: high-order n-grams (up to 5-grams). More details in [Caraballo et al., 2010].
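
A minimal sketch of the ranking idea described above (the [Cavnar and Trenkle, 1994] out-of-place measure plus the golf-position tie handling); the discriminative re-ranking and the tf-idf-inspired score normalization of the actual system are not reproduced here, and all function names and the penalty value are illustrative:

```python
# Ranking-based phonotactic LID sketch: per-language n-gram rankings with
# "golf" positions (ties share a position) and an out-of-place distance at
# test time. Illustrative only.
from collections import Counter


def golf_ranking(ngram_counts):
    """Map each n-gram to a rank; n-grams with equal counts share the rank."""
    ranking, position, last_count = {}, 0, None
    for ngram, count in sorted(ngram_counts.items(), key=lambda kv: -kv[1]):
        if count != last_count:          # only advance the rank when the count drops
            position += 1
            last_count = count
        ranking[ngram] = position
    return ranking


def train_templates(train_data, order=3):
    """train_data: dict language -> list of phone sequences. One ranking per language."""
    templates = {}
    for lang, utterances in train_data.items():
        counts = Counter()
        for phones in utterances:
            counts.update(tuple(phones[i:i + order]) for i in range(len(phones) - order + 1))
        templates[lang] = golf_ranking(counts)
    return templates


def out_of_place_score(test_phones, template, order=3, penalty=1000):
    """Sum of rank differences between the test ranking and a language template."""
    counts = Counter(tuple(test_phones[i:i + order])
                     for i in range(len(test_phones) - order + 1))
    test_rank = golf_ranking(counts)
    return sum(abs(r - template.get(ng, penalty)) for ng, r in test_rank.items())


def classify(test_phones, templates):
    # The language with the smallest out-of-place distance wins.
    return min(templates, key=lambda lang: out_of_place_score(test_phones, templates[lang]))
```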


Page 6: LID/SID - Research Stay at BUT Last Presentation

Experiments on LRE09

Baseline: phonotactic PCA [Mikolov et al., 2010], using soft-count n-grams from different phone recognizers.

Our system uses only the normalized scores generated by the ranking, not its classifier; our baseline classifier based on the distance among languages did not work well.

Approaches: comparison/fusion with the PCA system; fusion with the acoustic iVector system (400 iVectors, 2048 Gaussians); selection of discriminative n-grams, with the goal of reducing the input vector of n-gram soft-counts.

Database:
Train: 9763 segments (345 hours, ~500 utterances per language)
Dev: 38134 segments from the 23 languages of LRE09
Test: 41545 segments


Page 7: LID/SID - Research Stay at BUT Last Presentation

Comparison with phonotactic PCA

Baseline approach:
Feature vector: expected n-gram phoneme counts estimated from lattices, for all possible trigrams and the most frequent four-grams, e.g. (Hungarian phone recognizer):
3-grams: 33^3 = 35 937
4-grams: 33^4 = 1 185 921
PCA is then applied to reduce the vector size (baseline: 1000).

Discriminative approach:
Original templates (up to 4-grams):
Engl: 45_2025_100K_200K
Russ: 47_2209_100K_200K
Hung: 33_1089_35K_200K
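
A sketch of the baseline feature pipeline under stated assumptions: one fixed-length vector of trigram soft-counts per utterance (33^3 = 35 937 dimensions for the Hungarian recognizer), reduced with PCA. scikit-learn's PCA is used here as a stand-in for the actual BUT tooling, and the accumulation of expected counts from lattices is abstracted away:

```python
# Trigram soft-count vectors + PCA reduction (baseline phonotactic-PCA sketch).
import numpy as np
from itertools import product
from sklearn.decomposition import PCA

N_PHONES = 33                                            # Hungarian phone set
TRIGRAMS = list(product(range(N_PHONES), repeat=3))      # 35937 trigrams
TRIGRAM_INDEX = {tg: i for i, tg in enumerate(TRIGRAMS)}


def soft_count_vector(weighted_trigrams):
    """weighted_trigrams: iterable of ((p1, p2, p3), expected_count) pairs,
    e.g. accumulated from lattice posteriors. Returns one dense count vector."""
    vec = np.zeros(len(TRIGRAMS))
    for trigram, expected_count in weighted_trigrams:
        vec[TRIGRAM_INDEX[trigram]] += expected_count
    return vec


# Stack one vector per training utterance, then reduce with PCA.
# Placeholder random data just to make the sketch runnable; the slides use
# 1000 PCA components, kept at 100 here because of the tiny toy sample size.
X_train = np.abs(np.random.randn(200, len(TRIGRAMS)))
pca = PCA(n_components=100)
X_reduced = pca.fit_transform(X_train)
```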


Page 8: LID/SID - Research Stay at BUT Last Presentation

Results*

* Problems reproducing the results reported in the paper.

No good results in almost all cases; there is a big difference compared with the baseline, which uses only 3-grams and PCA.


Cavg (%)                      30 s    10 s    3 s
Baseline_3g_Hung_PCA1000*     4.58    12.61   25.69
DiscriminativeRanking_Engl    8.89    13.97   26.17
DiscriminativeRanking_Russ    9.59    12.96   24.21
DiscriminativeRanking_Hung    10.73   14.93   26.45
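
The tables report Cavg, the NIST LRE average detection cost. Below is a simplified closed-set sketch of the metric, assuming hard per-trial decisions and the usual constants (C_miss = C_fa = 1, P_target = 0.5); the official scoring tool additionally handles out-of-set trials:

```python
# Simplified closed-set NIST LRE Cavg (average detection cost) sketch.
import numpy as np


def cavg(decisions, target_langs, true_langs, languages,
         c_miss=1.0, c_fa=1.0, p_target=0.5):
    """decisions[i]: True if trial i was accepted for target_langs[i];
    true_langs[i]: actual language of trial i."""
    decisions = np.asarray(decisions)
    target_langs = np.asarray(target_langs)
    true_langs = np.asarray(true_langs)
    n = len(languages)
    cost = 0.0
    for lt in languages:
        tgt = (target_langs == lt) & (true_langs == lt)
        p_miss = np.mean(~decisions[tgt]) if tgt.any() else 0.0   # missed targets
        fa_cost = 0.0
        for ln in languages:
            if ln == lt:
                continue
            non = (target_langs == lt) & (true_langs == ln)
            p_fa = np.mean(decisions[non]) if non.any() else 0.0  # false acceptances
            fa_cost += c_fa * (1 - p_target) * p_fa / (n - 1)
        cost += c_miss * p_target * p_miss + fa_cost
    return cost / n
```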

Page 9: LID/SID - Research Stay at BUT Last Presentation

Selection of Discriminative N-grams

Goal: help PCA to reduce the size of the feature vector by first selecting the most discriminative n-grams and then applying PCA:
Reducing from 35K to approx. 8K for 3-grams.
Using 16K 4-grams instead of the 80K most frequent ones [Mikolov et al., 2010], concatenated with the 8K trigrams.
Selection is based on the discriminability among all languages.
We also tried using probabilities instead of vectors of counts.

Fusion with the acoustic i-Vector system: 600 iVectors + 2048 Gaussians.
Cavg for the baseline iVectors: 30 s: 2.40%, 10 s: 4.93%, 3 s: 14.04%.
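
A sketch of the selection step under stated assumptions: score every n-gram by how concentrated (and how frequent) it is in a single language, keep the top-k columns, then apply PCA as before. The scoring criterion below is illustrative, not the exact tf-idf-inspired formula of the ranking system:

```python
# Pre-selection of discriminative n-grams before PCA (illustrative criterion).
import numpy as np


def select_discriminative(lang_count_matrix, k):
    """lang_count_matrix: shape (n_languages, n_ngrams), total soft counts of
    each n-gram per language. Returns indices of the k most discriminative n-grams."""
    freqs = lang_count_matrix / lang_count_matrix.sum(axis=1, keepdims=True)
    # Share of each n-gram's relative-frequency mass owned by each language.
    rel = freqs / (freqs.sum(axis=0, keepdims=True) + 1e-12)
    concentration = rel.max(axis=0)                        # ~1 => frequent in one language only
    frequency = np.log1p(lang_count_matrix.max(axis=0))    # favour n-grams that are actually frequent
    return np.argsort(-(concentration * frequency))[:k]


# Usage sketch (hypothetical variable names): keep ~8K of the 35937 trigrams,
# concatenate with selected 4-grams, then apply PCA(1000) as in the baseline.
# idx3 = select_discriminative(counts_3g_per_language, 8000)
# idx4 = select_discriminative(counts_4g_per_language, 16000)
# X = np.hstack([X3[:, idx3], X4[:, idx4]])
```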


Page 10: LID/SID - Research Stay at BUT Last Presentation

Results – Disc. Phonotactic System


[Bar chart: Cavg (%) for the 30 s, 10 s, and 3 s conditions (y-axis 0-30) for BASE3G_1KPCA, 3gCounts_1KPCA, 3gProbs_1KPCA, 3g-4gCounts_1KPCA, 3g-4gProbs_1KPCA, and Base+3gProbs_1KPCA; the corresponding numbers are in the table on Page 24.]

Page 11: LID/SID - Research Stay at BUT Last Presentation

Results – Disc. Phonotactic System + iVectors


[Bar chart: Cavg (%) for the 30 s, 10 s, and 3 s conditions (y-axis 0-16) for the same systems fused with the acoustic iVector system, plus the iVector system alone; the fusion numbers are in the table on Page 24.]

Page 12: LID/SID - Research Stay at BUT Last Presentation

Conclusions - Phonotactic

For the template-based LID system we need to find better solutions for score normalization.

Discriminative n-gram selection helps both the phonotactic PCA system and the iVector system.

Better results using probabilities instead of counts, because of problems with the different lengths of the files. ToDo: test length normalization.

Find a better approach to the selection of high-order n-grams. ToDo: use clusters of scores in the discriminative approach to be able to handle high-order n-grams (currently implemented, but we did not try it this time).


Page 13: LID/SID - Research Stay at BUT Last Presentation

New Phonotactic system

Baseline: [Soufifar et al., 2011]
Use n-gram soft-counts from lattices.
Use subspace multinomial distributions for estimating iVectors.
Use the iVectors for classification with logistic regression (libLinear).

Differences:
Instead of n-gram soft-counts we use posterior-gram conditional counts.
Use the original features, or iVectors, or PCA on the original features.
Use multiclass logistic regression + length normalization.
Results on bigrams and trigrams (no time for fine tuning).

Same training, dev, and test sets as for LRE09.
Fusion with the acoustic iVector system.
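
A sketch of this back-end under stated assumptions: length-normalize each feature vector (phonotactic iVector, PCA projection, or raw counts) and train a multiclass logistic regression. scikit-learn is used as a stand-in for the libLinear back-end named above, and the hyper-parameters are illustrative:

```python
# Length normalization + multiclass logistic regression back-end sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression


def length_normalize(X):
    """Project every row onto the unit sphere (L2 length normalization)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def train_backend(X_train, y_train, C=1.0):
    """X_train: (n_utterances, dim) iVectors or PCA features; y_train: language labels."""
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(length_normalize(X_train), y_train)
    return clf


def score(clf, X_test):
    # Per-language log-posteriors, usable directly or fed to calibration/fusion.
    return clf.predict_log_proba(length_normalize(X_test))
```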


Page 14: LID/SID - Research Stay at BUT Last Presentation

Results - New Phonotactic iVector System


Cavg (%)                                             30 s   Fusion   10 s    Fusion   3 s     Fusion
Baseline iVector (600)                               2.40   -        4.93    -        14.04   -
2g_Hu1089_originalFeat                               5.20   1.66     15.34   3.75     29.61   13.42
2g_Hu_1089toPCA100_MCLR                              5.36   1.70     14.12   3.69     27.44   13.18
2g_100iVector_MCLR_LN                                5.03   1.55     10.71   3.53     23.74   12.79
Mehdi’s 600 iVectors HU                              3.05   -        8.10    -        21.39   -
Trig_600iVector_1089Multi_MultiClassLR_LengthNorm    3.15   1.25     8.66    3.09     21.45   12.15
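
The “Fusion” columns come from combining each phonotactic system with the acoustic iVector system at the score level. Below is a sketch of a common recipe (logistic-regression fusion trained on the dev scores); it is in the spirit of the usual FoCal-style fusion, not the exact BUT setup:

```python
# Score-level fusion of two LID systems via a small logistic-regression fuser.
import numpy as np
from sklearn.linear_model import LogisticRegression


def fuse(dev_scores_a, dev_scores_b, dev_labels, test_scores_a, test_scores_b):
    """*_scores_*: arrays of shape (n_trials, n_languages) with per-language
    scores from the phonotactic and acoustic systems; dev_labels: true languages."""
    dev_X = np.hstack([dev_scores_a, dev_scores_b])     # stack both score sets as features
    test_X = np.hstack([test_scores_a, test_scores_b])
    fuser = LogisticRegression(max_iter=1000)
    fuser.fit(dev_X, dev_labels)
    return fuser.predict_log_proba(test_X)              # fused per-language scores
```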

Page 15: LID/SID - Research Stay at BUT Last Presentation

Work on LID-RATS & VAD-RATS

Goals:
Test different noise reduction and speech enhancement algorithms.
Test different robust features.
Test different BUT VADs.
Combine with iVectors.

Database:
Eight noise conditions + clean data.
Experiments on the 2-minute condition and the short language list.
Train: 3458 files (115 h). Dev: 7331 files (244 h).


Page 16: LID/SID - Research Stay at BUT Last Presentation

Work on LID-RATS - Noise Tools and Algorithms

Ctucopy, developed at SpeechLab (FEE CTU - Prague):
Extended spectral subtraction [Sovka and Pollák, 1996].
Spectral subtraction with full-wave rectification, using internal and external VAD (i.e., BUT-VAD).

Wiener filter [Zavarehei, 2005].

QIO Aurora front-end from OGI [QIO, 2009]: internal NN-VAD + CMN/CVN + RASTA-LDA + Wiener filter.

ETSI Advanced Front End [ETSI, 2007]: 2-pass adaptive Wiener filter + internal VAD (uses energy information from the whole spectrum and from F0 regions).

Kalman filter [Murphy, 1998].
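
To illustrate the family of de-noising algorithms listed above, here is a minimal magnitude spectral subtraction, assuming the noise spectrum is estimated from frames a VAD has marked as non-speech; the real tools (Ctucopy, QIO, ETSI AFE) are considerably more elaborate, and the parameters below are illustrative:

```python
# Minimal magnitude spectral subtraction over windowed signal frames.
import numpy as np


def spectral_subtraction(frames, speech_mask, over_subtraction=1.0, floor=0.02):
    """frames: (n_frames, frame_len) windowed signal frames;
    speech_mask: boolean array, True where the VAD detected speech."""
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[~speech_mask].mean(axis=0)          # average noise magnitude spectrum
    clean_mag = mag - over_subtraction * noise_mag      # subtract the noise estimate
    clean_mag = np.maximum(clean_mag, floor * mag)      # spectral floor against musical noise
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=frames.shape[1], axis=1)
```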


Page 17: LID/SID - Research Stay at BUT Last Presentation

Work on LID-RATS - Common and New Features

MFCC/PLP + delta and delta-delta.
SDC: Shifted Delta Cepstra.
PNCC: proposed by [Kim and Stern, 2010] at CMU.
Spectral delta-delta: proposed by [Kumar et al., 2011] at CMU.
RPLP: proposed by [Rajnoha and Pollák, 2011] at SpeechLab, FEE CTU Prague; a hybrid between MFCC and PLP.

Tests with/without RASTA, VTLN, and CMN/CVN.
Tests of new positions of the filterbank, after studying the spectrogram and the noise-reduction effects: without NR: 300-3200 Hz, with NR: 500-3000 Hz.
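
A sketch of the SDC stacking listed above (as in the “7MFCC + 7SDC” baselines), using the common N-d-P-k = 7-1-3-7 setting; the exact configuration used in these experiments is an assumption:

```python
# Shifted Delta Cepstra (SDC): stack k shifted delta blocks of the cepstra.
import numpy as np


def sdc(cepstra, d=1, P=3, k=7):
    """cepstra: (n_frames, n_ceps) MFCC/PLP matrix.
    Returns (n_frames, n_ceps * k): k delta blocks, each computed over +/- d
    frames and shifted ahead by multiples of P frames."""
    n_frames = cepstra.shape[0]
    t = np.arange(n_frames)
    blocks = []
    for i in range(k):
        # Delta at a shift of i*P frames, clipping indices at the utterance edges.
        hi = np.clip(t + i * P + d, 0, n_frames - 1)
        lo = np.clip(t + i * P - d, 0, n_frames - 1)
        blocks.append(cepstra[hi] - cepstra[lo])
    return np.hstack(blocks)
```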

Page 18: LID/SID - Research Stay at BUT Last Presentation

Work on LID-RATS (120s)

System without Noise Reduction (VAD3)
Baseline: 7MFCC + CMN/CVN + RASTA + 7SDC + VTLN                  1.60
15 RPLP + CMN/CVN + RASTA + 7SDC                                 1.49
15 PNCC + Delta-Delta + CMN/CVN + RASTA                          2.17
Baseline with spectral delta-delta instead of SDC/delta-delta    2.48


System with Noise Reduction (VAD1)
Baseline: 7MFCC + CMN/CVN + RASTA + 7SDC + VTLN                          2.03
Baseline + extended spectral subtraction                                 2.75
Baseline + spectral subtraction with full-wave rectification + BUT-VAD   3.31
Baseline + Wiener filter                                                 9.24
Baseline + QIO                                                           2.09

Page 19: LID/SID - Research Stay at BUT Last Presentation

Conclusions - RATS-LID

No improvement when using de-noising techniques; the QIO toolkit provided the best result.

Important improvements due to the correct selection of the low- and high-frequency bands.

RPLP: new robust features for LID.

PNCC: promising features for LID, but the training time is high.

Spectral delta-delta is slightly better than the traditional delta-deltas, but not better than SDC.

The use of RASTA and CMN/CVN is completely necessary for high performance; short-term CMN/CVN did not provide better results.
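
Since CMN/CVN is singled out as essential, here is the per-utterance version as a short sketch; the short-term variant mentioned above applies the same normalization over a sliding window instead of the whole utterance:

```python
# Per-utterance cepstral mean and variance normalization (CMN/CVN).
import numpy as np


def cmn_cvn(features):
    """features: (n_frames, n_dims). Zero mean, unit variance per dimension."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / np.maximum(std, 1e-12)
```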


Page 20: LID/SID - Research Stay at BUT Last Presentation

Future Work

Discriminative n-grams:
New techniques for working with higher n-gram orders.
Better combination of the information from parallel phoneme recognizers.
Write a joint paper based on LRE09.

Phonotactic iVectors: promising results; check the combination of parallel phone recognizers; incorporate discriminative information.

LRE/SRE: try collaborations in the following NIST evaluations.


Page 21: LID/SID - Research Stay at BUT Last Presentation


Page 22: LID/SID - Research Stay at BUT Last Presentation

Bibliography I

Caraballo, M. A. et al. 2010. “A Discriminative Text Categorization Technique for Language Identification built into a PPRLM System”. FALA, pp. 193-196.

Cavnar, W. B. and Trenkle, J. M. 1994. “N-Gram-Based Text Categorization”. SDAIR-94, pp. 161-175.

ETSI: Advanced Front End V1.1.5. 2007. Available at http://www.etsi.org/WebSite/Technologies/DistributedSpeechRecognition.aspx

Kim, C. and Stern, R. M. 2010. “Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring”. ICASSP, pp. 4574-4577.

Mikolov et al. 2010. “PCA-based feature extraction for phonotactic language recognition”. Odyssey, pp. 251-255.

Murphy, K. 1998. “Kalman filter toolbox for Matlab”. Available at http://www.cs.ubc.ca/~murphyk/Software/Kalman/kalman.html


Page 23: LID/SID - Research Stay at BUT Last Presentation

Bibliography II

Qualcomm-ICSI-OGI (QIO) Aurora front end. 2009. Available at ftp://ftp.icsi.berkeley.edu/pub/speech/papers/qio/aurora-front-end/

Rajnoha, J. and Pollák, P. 2011. “ASR Systems in Noisy Environment: Analysis and Solutions for Increasing Noise Robustness”. Radioengineering, Vol. 20, No. 1, April 2011, pp. 74-84.

Soufifar, M. et al. 2011. “iVector approach to phonotactic language recognition”. Interspeech, pp. 2913-2916.

Sovka, P. and Pollák, P. 1996. “Extended spectral subtraction”. Eurospeech, pp. 963-966.

Zavarehei, E. 2005. Wiener filter implementation in Matlab. Available at http://www.mathworks.com/matlabcentral/fileexchange/7673-wiener-filter/content/WienerScalart96.m


Page 24: LID/SID - Research Stay at BUT Last Presentation

Results - Discriminative Phonotactic System


                                               30 s            10 s            3 s
Cavg (%)                                       Phon.   +iVec   Phon.   +iVec   Phon.   +iVec
Baseline_3g_Hung + PCA(1000)*                  4.58    1.54    12.61   3.52    25.69   12.67
Disc3g + PCA(1000) Counts                      5.50    1.72    14.31   3.95    27.13   13.68
Disc3g + PCA(1000) Probs                       4.83    1.60    10.33   3.69    21.75   12.69
Disc3g + Disc4g + PCA(1000) Counts             4.32    1.48    12.52   3.43    25.83   12.75
Disc3g + Disc4g + PCA(1000) Probs              5.43    1.65    11.50   3.77    22.98   12.80
Fusion: Baseline + Disc3g + PCA(1000) Probs    3.48    1.48    8.49    3.49    20.58   12.41

Page 25: LID/SID - Research Stay at BUT Last Presentation

Posterior-gram system
