Broadcast News Segmentation using Metadata and Speech-To-Text Information to Improve Speech Recognition
Sebastien Coquoz,
Swiss Federal Institute of Technology (EPFL)
International Computer Science Institute (ICSI)
March 16, 2004
Outline
• General idea
• ASR system used
• Exploratory work
• Strategies
• Results
• Conclusion
General idea
Use Metadata (SUs) and Speech-To-Text (STT) information to improve later STT passes (feedback loop)
Why segmentation?
Why segment the audio stream?
• Important to give « linguistically coherent » pieces to the language model
• Remove « non-speech » (e.g. long silences, laughter, music, other noises, …)
Why use metadata extraction (MDE)?
• MDE gives information about sentence and speaker breaks
• Speaker labels improve the efficiency of the acoustic model and sentences improve the efficiency of the language model
• BBN’s error analysis of Broadcast News recognition revealed a higher error rate at segment boundaries
→ this may be caused by missing the true sentence boundaries
Metadata and STT information
MDE object used:
Sentence-like units (SUs): each SU expresses a thought or idea and generally corresponds to a sentence. Each SU has a confidence measure, timing information (starting point and duration) and a cluster label.
STT object used:
Lexemes: describe the words that were assumed to be uttered. Each word has timing information (beginning and duration).
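The two objects can be pictured as small records. The sketch below is only an illustration of the fields listed above: the class names, field names and the helper `words_in_su` are assumptions, not the format used by the actual MDE or STT tools.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SU:
    """Sentence-like unit from the MDE output (field names are assumptions)."""
    start: float       # starting point, in seconds
    duration: float    # duration, in seconds
    confidence: float  # confidence measure attached to the SU
    cluster: str       # cluster label

    @property
    def end(self) -> float:
        return self.start + self.duration


@dataclass(frozen=True)
class Lexeme:
    """Word hypothesis from the STT output (field names are assumptions)."""
    word: str
    start: float     # beginning of the word, in seconds
    duration: float  # duration of the word, in seconds

    @property
    def end(self) -> float:
        return self.start + self.duration


def words_in_su(su: SU, lexemes: list[Lexeme]) -> list[Lexeme]:
    """Return the lexemes whose beginning falls inside the SU (hypothetical helper)."""
    return [lx for lx in lexemes if su.start <= lx.start < su.end]
```

Attaching the lexemes to an SU in this way gives the later processing steps access to both the segment duration and the pauses between its words.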
ASR system used
The system used is a simplified SRI BN evaluation system.
Recognition steps:
1. Segment the waveforms
2. Cluster the segments into « pseudo-speakers »
3. Compute and normalize features (Mel cepstrum)
4. Do first pass recognition with non-crossword acoustic models and bigram language model
5. Generate lattices
6. Expand lattices using 5-gram language model
7. Adapt acoustic models for each « pseudo-speaker »
8. Generate new lattices using the adapted acoustic models
9. Expand new lattices using 5-gram language model
10. Score the resulting hypotheses
Types of segmentation
Baseline
• Classifies frames into « speech » and « non-speech » using a 2-state HMM
• Uses inter-word silences and speaker turns to segment the BN shows
MDE-based
• Uses sentence and speaker breaks to define an initial segmentation
• Further processes the segments using different strategies presented later
Baseline vs. MDE-based segmentation
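To make the baseline's first step concrete, here is a minimal sketch of a 2-state HMM speech / non-speech frame classifier decoded with the Viterbi algorithm. It is not the SRI implementation: the per-frame log-likelihoods are assumed to come from some acoustic model, and the transition probabilities (0.99 to stay, 0.01 to switch) and function names are illustrative assumptions.

```python
import math


def viterbi_speech_nonspeech(loglik, log_stay=math.log(0.99), log_switch=math.log(0.01)):
    """Label frames as 0 (non-speech) or 1 (speech) with a 2-state HMM.

    loglik: list of (loglik_nonspeech, loglik_speech) pairs, one per frame.
    The stay/switch transition log-probabilities are illustrative values.
    """
    if not loglik:
        return []
    # Uniform prior over the two states for the first frame.
    score = [math.log(0.5) + loglik[0][s] for s in range(2)]
    backpointers = []
    for frame in loglik[1:]:
        new_score, ptr = [], []
        for s in range(2):
            cands = [score[p] + (log_stay if p == s else log_switch) for p in range(2)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + frame[s])
        score = new_score
        backpointers.append(ptr)
    # Trace back the best state sequence from the last frame.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def speech_runs(labels):
    """Turn frame labels into (first_frame, end_frame_exclusive) speech runs."""
    runs, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i
        elif lab == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(labels)))
    return runs
```

The high cost of switching states smooths the frame decisions, so an isolated frame that looks like speech inside a silence does not open a new segment.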
Baseline experiments
Comments:
• The baseline segmentation is the one presented above
• The results (shown later) obtained are:
• the current best results
• the baselines that ultimately have to be improved
• No additional processing step is applied to modify the segments
« Cheating » experiments (1)
Why?
• See if there is room for improvement when using MDE-based segmentation
How?
• Use transcripts written by humans to segment the Broadcast News audio stream and apply processing strategies to improve recognition (i.e. use true information)
« Cheating » experiments (2)
Results: Baseline vs. « Cheating » experiments

| WER | Baseline seg | Cheating seg (using SU) | Cheating seg (SU + proc) |
|---|---|---|---|
| Wtd avg on 6 shows | 14.0 | 14.2 | 13.0 |

There is room for improvement!
Overview of the processing steps
Broadcast News Shows
0. Segmentation using SUs
1. First strategy: splitting of long segments
2. Second strategy: concatenation of short segments
3. Third strategy: addition of time pads
Final segmentation
First strategy: splitting of long segments
Why?
• Overly long segments may cover more than one sentence → confusing for the language model
How?
• Use automatically generated transcripts and MDE
• Segments that are too short must not be processed → bad for the efficiency of the language model
• Take two features into account for the decision tree:
• The duration of segments
• The pause between words
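A minimal sketch of how such a decision could look, assuming the words of a segment are available with their timing. The thresholds `MAX_DURATION` and `MIN_PAUSE` are hypothetical values chosen for illustration, and the rule (split at the longest inter-word pause) is one plausible reading of the two features, not the decision tree actually used.

```python
MAX_DURATION = 10.0  # hypothetical: only segments longer than this (seconds) are split
MIN_PAUSE = 0.3      # hypothetical: shortest inter-word pause accepted as a split point


def split_long_segment(words, max_duration=MAX_DURATION, min_pause=MIN_PAUSE):
    """Recursively split one segment at its longest inter-word pause.

    words: list of (word, start, duration) tuples in time order.
    Returns a list of word lists, one per resulting segment.
    """
    if len(words) < 2:
        return [words]
    seg_duration = words[-1][1] + words[-1][2] - words[0][1]
    if seg_duration <= max_duration:
        return [words]  # short enough: leave the segment untouched
    # Pause preceding word i, paired with i so max() also gives the cut index.
    pauses = [
        (words[i][1] - (words[i - 1][1] + words[i - 1][2]), i)
        for i in range(1, len(words))
    ]
    best_pause, cut = max(pauses)
    if best_pause < min_pause:
        return [words]  # no pause long enough to justify a cut
    return (
        split_long_segment(words[:cut], max_duration, min_pause)
        + split_long_segment(words[cut:], max_duration, min_pause)
    )
```

For example, a 12.5 s segment with a single 0.5 s pause in the middle would be cut into two 6 s segments, while a 12 s segment without any pause would be left as it is.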
Second strategy: concatenation of short segments
Why?
• Short segments are not optimal for the language model
• Short segments increase the WER because all their words are close to the boundaries (cf. BBN’s error analysis)
How?
• Take three features into account for the decision tree:
• Pause between segments
• Sum of the duration of two neighbors
• Cluster label
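The three features can be combined in a simple greedy merge, sketched below. The `Segment` type, the thresholds `MAX_PAUSE` and `MAX_TOTAL`, and the exact rule (merge when the pause is short, the summed duration is small and the cluster labels match) are assumptions made for illustration only.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float   # seconds
    end: float     # seconds
    cluster: str   # cluster label


MAX_PAUSE = 0.5  # hypothetical: largest pause (seconds) bridged by a merge
MAX_TOTAL = 8.0  # hypothetical: largest summed duration (seconds) of two neighbors


def concatenate_short_segments(segments, max_pause=MAX_PAUSE, max_total=MAX_TOTAL):
    """Greedily merge each segment into its left neighbor when all three tests pass."""
    merged = []
    for seg in segments:
        if merged:
            prev = merged[-1]
            pause = seg.start - prev.end
            total = (prev.end - prev.start) + (seg.end - seg.start)
            if pause <= max_pause and total <= max_total and prev.cluster == seg.cluster:
                merged[-1] = Segment(prev.start, seg.end, prev.cluster)
                continue
        merged.append(seg)
    return merged
```

Because the merged segment replaces its left neighbor, several short segments in a row can be chained together until one of the three tests fails.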
Third strategy: addition of time pads
Why?
• Prevent words from being only partially included
• Because the windowing in the front end has a scope of up to 8 frames (4 on each side) → better to have enough padding
How?
• Take one feature into account for the decision tree:
• The pause between segments
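One way to let the pause between segments drive the padding is sketched below: each boundary is extended by a fixed pad, limited to half of the pause to the neighboring segment so that padded segments never overlap. The value of `MAX_PAD` and the half-pause rule are assumptions for illustration, not the thresholds used in the experiments.

```python
MAX_PAD = 0.25  # hypothetical maximum pad, in seconds


def add_time_pads(segments, max_pad=MAX_PAD):
    """Extend each (start, end) segment on both sides.

    The pad on each side is at most max_pad and at most half of the pause
    to the neighboring segment. The first segment is never padded before
    time 0; the end of the audio is not known here, so the last segment
    simply receives the full pad on its right side.
    """
    padded = []
    last = len(segments) - 1
    for i, (start, end) in enumerate(segments):
        if i == 0:
            left = min(max_pad, start)
        else:
            left = min(max_pad, (start - segments[i - 1][1]) / 2)
        if i == last:
            right = max_pad
        else:
            right = min(max_pad, (segments[i + 1][0] - end) / 2)
        # Negative pauses (overlapping input segments) get no pad at all.
        padded.append((start - max(0.0, left), end + max(0.0, right)))
    return padded
```

With a short pause the two neighbors share it equally, and with a long pause both receive the full pad.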
Examples of improvements (1)
1) Real sentence: … and strictly limits state authority over how and when water is used …
Recognized sentence:
With baseline segmentation (cuts in middle of sentence): … and stricter limits data arty over how and when watery hues …
With MDE-based segmentation: … and strict_ limits state authority over how and when water issues …
[Figure: time axes marking the segmentation point; recognition errors shown in red]
Examples of improvements (2)
2) Real sentence: … I didn’t know if we would pull off the games. I didn’t know if this community would ever rally around the Olympics again. …
Recognized sentence:
With baseline segmentation (doesn’t cut at end of sentence): … pull off the games that had not this community would ever rally around …
With MDE-based segmentation: … pull off the game_ I didn’t know _ this community would ever rally around …
Results for the development set

| WER | Baseline seg | Step 0: SU seg | SU seg + step 1 | SU seg + steps 1 & 2 | SU seg + steps 1 & 2 & 3 |
|---|---|---|---|---|---|
| Wtd avg on 6 shows | 14.0 | 14.4 | 14.2 | 14.0 | 13.3 |

The improvement is 0.7% absolute and 5% relative!
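The absolute and relative figures follow directly from the baseline WER (14.0) and the WER after all three steps (13.3). A small worked check, with a hypothetical helper name:

```python
def wer_improvement(baseline_wer, new_wer):
    """Return (absolute, relative-in-percent) WER reduction against the baseline."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


absolute, relative = wer_improvement(14.0, 13.3)
print(f"{absolute:.1f} absolute, {relative:.1f}% relative")  # 0.7 absolute, 5.0% relative
```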
Results for the evaluation set

| WER | Baseline seg | Step 0: SU seg | SU seg + step 1 | SU seg + steps 1 & 2 | SU seg + steps 1 & 2 & 3 |
|---|---|---|---|---|---|
| Wtd avg on 6 shows | 18.7 | 19.8 | 19.7 | 19.6 | 18.4 |

The improvement is 0.3% absolute and 1.6% relative!
Dev results vs. Eval results
Observations:
• No « cheating » information is available for the eval set → not sure how well the SU detection is working
• Improvements from step 0 (SU segmentation) to the final segmentation are similar for the dev set and the eval set: 1.1% absolute (7.6% relative) for the dev set and 1.3% absolute (6.6% relative) for the eval set → SU information not optimized for eval
• Respective improvements are quite uneven across shows → suggests that the strategies are show dependent, not channel dependent
Future work
• Further optimize the thresholds for the three strategies
• Find a representation to choose a specific value of the thresholds for each show individually (i.e. fully adapt the decision trees to each show)
• Use Metadata objects such as the confidence measure of each SU and diarization to further improve the strategies
Conclusion
• Development of a new segmentation method based on Metadata and Speech-To-Text information
• Use features given by MDE and STT information in decision trees for each processing step
• Results indicate the promise of this approach
• There still seems to be room for improvement through further development
Acknowledgments
I would like to thank:
• Prof. Bourlard & Prof. Morgan
• Barbara & Andreas
• Yang
• IM2 for supporting my experience