The TUM Cumulative DTW Approach for the Spoken Web Search Task
Cyril Joder, Felix Weninger, Martin Wöllmer, Björn Schuller
Institute for Human-Machine Communication, Technische Universität München
Summary
• Not a "system"
• Low-level features only
• No ASR
• Little "engineering"
• Method of integrating discriminative training into DTW
Mediaeval 2012 Workshop 2
Cumulative DTW (CDTW)
• Limitations of DTW:
  – Only one local cost function (distance)
  – Usually manual parameter tuning
• Idea:
  – Use different local cost functions for each step
  – Automatic learning of these functions as a combination of general features
From DTW to CDTW
• Local cost function:
[Figure: DTW step pattern. Cell (i, j) is reached from (i, j−1), (i−1, j) and (i−1, j−1), with step weights α1, α2, α3.]
• Dynamic Programming:
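
The recursion for this slide is not reproduced here, so the following is only a minimal sketch of a conventional weighted-step DTW, assuming a Euclidean local distance and a cumulative score of the form S(i, j) = min over the three predecessors of (S(predecessor) + α_k · d(i, j)). The function name `dtw` and the assignment of α1, α2, α3 to the horizontal, vertical and diagonal steps are assumptions made for illustration.

```python
import numpy as np

def dtw(x, y, alphas=(1.0, 1.0, 1.0)):
    """Weighted-step DTW cost between x (I x D) and y (J x D).

    alphas = (horizontal, vertical, diagonal) step weights; this mapping
    of alpha_1..alpha_3 to steps is an assumption, not from the slides.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    I, J = len(x), len(y)
    # Euclidean local distance between every pair of frames (assumed).
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    S = np.full((I, J), np.inf)
    S[0, 0] = d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            candidates = []
            if j > 0:  # step from (i, j-1)
                candidates.append(S[i, j - 1] + alphas[0] * d[i, j])
            if i > 0:  # step from (i-1, j)
                candidates.append(S[i - 1, j] + alphas[1] * d[i, j])
            if i > 0 and j > 0:  # step from (i-1, j-1)
                candidates.append(S[i - 1, j - 1] + alphas[2] * d[i, j])
            S[i, j] = min(candidates)
    return S[-1, -1]
```

With a single fixed distance and hand-set weights, this illustrates the two limitations listed earlier: one local cost function and manual parameter tuning.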
From DTW to CDTW
• Local step function:
[Figure: CDTW step pattern. Cell (i, j) is reached from (i, j−1), (i−1, j) and (i−1, j−1), with step functions s1(i, j), s2(i, j), s3(i, j).]
• Dynamic Programming:
Softmax?
+ Differentiable
  – Allows for an optimization of the …
+ Combines several alignment paths
  – More robust to local changes
- Only gives a score (not the optimal path)
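
A minimal sketch of how a soft minimum can replace the hard min in the DTW recursion, assuming the form S(i, j) = softmin over the three predecessors of (S(predecessor) + s_k(i, j)) with softmin(v) = −log Σ exp(−v). The names `softmin` and `cdtw_score`, the step ordering and the zero initial score are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def softmin(values):
    """Smooth, differentiable stand-in for min: -log(sum(exp(-v)))."""
    v = np.asarray(values, dtype=float)
    m = v.min()  # subtract the minimum for numerical stability
    return m - np.log(np.exp(-(v - m)).sum())

def cdtw_score(step_costs):
    """Cumulative soft-min score for step costs of shape (3, I, J).

    step_costs[k, i, j] plays the role of s_k(i, j); k = 0, 1, 2 is taken
    to mean the horizontal, vertical and diagonal step (an assumption).
    """
    step_costs = np.asarray(step_costs, dtype=float)
    _, I, J = step_costs.shape
    S = np.full((I, J), np.inf)
    S[0, 0] = 0.0  # assumed initial score
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            candidates = []
            if j > 0:
                candidates.append(S[i, j - 1] + step_costs[0, i, j])
            if i > 0:
                candidates.append(S[i - 1, j] + step_costs[1, i, j])
            if i > 0 and j > 0:
                candidates.append(S[i - 1, j - 1] + step_costs[2, i, j])
            # Every incoming path contributes, not only the cheapest one.
            S[i, j] = softmin(candidates)
    return S[-1, -1]
```

Because every predecessor contributes to each cell, the result is a smooth function of the step costs, which is what makes gradient-based learning possible. The same property means there is no single best path to backtrack.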
Features
• Acoustic descriptors: MFCC++ (D=36)
  – HTK, 25 ms, CMN, global normalization
• Features (k=1…D):
  – Local distance
  – "Local self-similarity"
  – Distance / product of the self-similarities
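
As a rough illustration of how per-dimension features could feed the step functions, the sketch below computes only the local distance feature and combines it linearly into three step costs. The absolute difference as the per-dimension distance, the linear combination with a bias, and the names `local_distance_features` and `step_costs` are all assumptions. The self-similarity features are left out because their exact definitions are not given here.

```python
import numpy as np

def local_distance_features(x, y):
    """Per-dimension absolute difference between frames.

    x has shape (I, D), y has shape (J, D); the result has shape (I, J, D).
    Using the absolute difference is an assumption made for illustration.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.abs(x[:, None, :] - y[None, :, :])

def step_costs(features, weights, biases):
    """Combine features linearly into one cost map per step.

    features: (I, J, F), weights: (3, F), biases: (3,) -> costs: (3, I, J).
    The linear form is hypothetical; the slides only say "combination".
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    biases = np.asarray(biases, dtype=float)
    return np.einsum("kf,ijf->kij", weights, features) + biases[:, None, None]
```

In such a setup, `weights` and `biases` would be the parameters learned by gradient descent, with one row per step.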
Decision
• Are the two sequences instances of the same word/expression?
• Learning of the parameters:
  – Backpropagation (stochastic gradient descent)
  – Training data: queries/utterances of the dev set
• Decision: S(I, J) / (I + J)
Search Procedure
Given a query and an utterance:
1) Feature extraction
2) Candidate search in the utterance
3) CDTW comparison
4) Score post-processing
Candidate Search
• Align query with entire utterance
  – CDTW with backtracking
  – "Scores" for each point
• Extract potential starts and ends
  – Peak-picking of scores
• Filter by duration
  – Only allow warping factors < 2
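
The peak-picking and duration filter can be sketched as follows. This is a simplified illustration: it assumes peaks are plain local maxima of a one-dimensional score sequence, and it reads "warping factor" as the ratio between candidate duration and query duration in either direction. The names `pick_peaks` and `filter_by_duration` are hypothetical.

```python
import numpy as np

def pick_peaks(scores, min_height=-np.inf):
    """Return indices of local maxima in a 1-D score sequence."""
    s = np.asarray(scores, dtype=float)
    peaks = []
    for t in range(1, len(s) - 1):
        if s[t] > s[t - 1] and s[t] >= s[t + 1] and s[t] >= min_height:
            peaks.append(t)
    return peaks

def filter_by_duration(starts, ends, query_len, max_warp=2.0):
    """Keep (start, end) pairs whose duration is within the warping limit.

    The warping factor is taken as max(duration / query_len,
    query_len / duration); this interpretation is an assumption.
    """
    candidates = []
    for a in starts:
        for b in ends:
            duration = b - a + 1
            if duration <= 0:
                continue
            warp = max(duration / query_len, query_len / duration)
            if warp < max_warp:
                candidates.append((a, b))
    return candidates
```

For a 10-frame query, this keeps only candidates longer than 5 and shorter than 20 frames, which keeps the number of full CDTW comparisons manageable.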
CDTW Score Post-Processing
• Same decision function as for learning
  – Many false positives
  – Bias toward some queries
• Heuristic post-processing:
  – For each query, subtract a specific threshold
  – Threshold: 90th percentile of the CDTW scores for that query
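
The per-query normalization can be sketched as below, using `numpy.percentile` with its default linear interpolation. The dictionary layout and the name `postprocess_scores` are assumptions for illustration, as is the convention that larger scores indicate better matches.

```python
import numpy as np

def postprocess_scores(scores_by_query, percentile=90.0):
    """Subtract a query-specific threshold from every score of that query.

    scores_by_query maps a query id to the scores of all its candidates.
    The threshold is the given percentile of those scores.
    """
    adjusted = {}
    for query, scores in scores_by_query.items():
        s = np.asarray(scores, dtype=float)
        threshold = np.percentile(s, percentile)
        adjusted[query] = s - threshold
    return adjusted
```

After this step a single global threshold at zero can be applied across queries, which counteracts the bias toward queries whose raw scores are systematically high.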
Results
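| run     | devQ-devC | evalQ-devC | devQ-evalC | evalQ-evalC |
|---------|-----------|------------|------------|-------------|
| P(miss) | 55.6%     | 59.5%      | 60.2%      | 54.5%       |
| P(FA)   | 1.18%     | 1.13%      | 1.17%      | 1.13%       |
| ATWV    | 0.263     | 0.333      | 0.164      | 0.290       |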
• Great improvement over naive DTW
  – ATWV = 0.065 on devQ-devC
• ATWV scores depend on the run
Results
• DET curves similar
• CDTW seems to generalize well
• Decision function has to be improved
Conclusion
• CDTW: promising results
  – Data-based approach with satisfactory results
  – Significantly outperforms (naive) DTW
  – Good generalization
• Future work:
  – Decision function
  – Acoustic descriptors
  – Integrate "hard" path constraints into search