2001/03/29 Chin-Kai Wu, CS, NTHU
Speech and Language Technologies for Audio Indexing and Retrieval
JOHN MAKHOUL, FELLOW, IEEE,
FRANCIS KUBALA, TIMOTHY LEEK, DABEN LIU, LONG NGUYEN,
RICHARD SCHWARTZ, AND AMIT SRIVASTAVA, MEMBER, IEEE
PROCEEDINGS OF THE IEEE, VOL. 88, NO. 8, AUGUST 2000
Outline
Introduction
Indexing and Browsing with Rough’n’Ready
  Rough’n’Ready System
  Indexing and Browsing
Statistical Modeling Paradigm
Speech Recognition
Speaker Recognition
  Segmentation
  Clustering
  Identification
Introduction
Much information will be in the form of speech from various sources.
It’s now possible to start building automatic content-based indexing and retrieval tools.
The Rough’n’Ready system provides a rough transcription of the speech that is ready for browsing.
The technologies incorporated in the system include speech/speaker recognition, name spotting, topic classification, story segmentation and information retrieval.
Rough’n’Ready system
[Figure: system diagram. Labels: ActiveX controls, MP3, Dual P733-MHz, Collect/Manage Archive, Interact with browser]
Indexing and Browsing
Indexing and Browsing (Cont’d)
[Figure callouts: Speaker, People, Place, Organization, Topic Labels]
Indexing and Browsing (Cont’d)
Selected from over 5500 topic labels
Statistical Modeling Paradigm
Choose the output (the desired recognized sequence of the data) that maximizes P(output|input, model)
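The decision rule above can be sketched in a few lines. This is a minimal illustration, not the system's search: it assumes a small explicit list of candidate outputs and scores each one as log P(input|output) + log P(output), which by Bayes' rule ranks candidates the same way as P(output|input) because P(input) is shared. The function name `decode`, the field names, and all scores are hypothetical.

```python
def decode(hypotheses):
    """Return the hypothesis with the highest posterior score.

    Each hypothesis carries log P(input | output) and log P(output);
    their sum equals log P(output | input) up to a constant shared by all.
    """
    return max(hypotheses, key=lambda h: h["log_likelihood"] + h["log_prior"])

# Made-up scores for two candidate outputs of the same input.
hypotheses = [
    {"output": "recognize speech", "log_likelihood": -120.0, "log_prior": -8.0},
    {"output": "wreck a nice beach", "log_likelihood": -118.5, "log_prior": -14.0},
]
best = decode(hypotheses)
print(best["output"])
```

The second candidate fits the input slightly better, but its lower prior makes the first candidate the overall winner.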
Speech Recognition
Statistical models: acoustic models, language models
Acoustic model
  Describes the time-varying evolution of feature vectors for each sound or phoneme
  Employs hidden Markov models (HMMs)
  Gaussian mixtures model the feature vectors for each HMM state
  Special acoustic models for nonspeech events: music, silence/noise, laughter, breath, and lip-smack
Language model: N-gram language model
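To make the Gaussian mixture part of the acoustic model concrete, the sketch below evaluates the log-likelihood of one feature vector under a mixture attached to a single HMM state. It assumes diagonal covariances and uses made-up weights, means, and variances; the function names are illustrative and not taken from the system.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """Log of sum_k w_k * N(x; mean_k, var_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)  # subtract the largest term for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical two-component mixture for one HMM state, 2-D features.
weights = [0.7, 0.3]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
near = gmm_log_likelihood([0.1, -0.2], weights, means, variances)
far = gmm_log_likelihood([10.0, 10.0], weights, means, variances)
print(near > far)
```

A vector close to one of the component means scores higher than one far from both, which is what lets the HMM prefer states whose mixtures match the observed frames.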
Speech Recognition (Cont’d)
Multipass recognition search strategy
Fast-match pass
  Narrows the search space
  Followed by other passes with more accurate models that operate on the smaller search space
Backward pass
  Generates the top-scoring N-best word sequences (100 <= N <= 300)
N-best rescoring pass: Tree Rescoring algorithm
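The idea behind the final pass can be sketched as plain list rescoring: an earlier pass supplies N candidate word sequences, and a more detailed scorer reorders them. This is not the Tree Rescoring algorithm itself, which is not described on the slide; the hypotheses, the scores, and the name `rescore_nbest` are invented for illustration.

```python
def rescore_nbest(nbest, rescorer):
    """Re-rank N-best hypotheses with a more accurate scoring function."""
    return sorted(nbest, key=rescorer, reverse=True)

# Hypothetical N-best list from an earlier pass: (word sequence, first-pass score).
nbest = [
    ("the cat sat", -10.0),
    ("the cat sad", -9.5),
    ("a cat sat", -11.0),
]

# Stand-in for a detailed model: a lookup table of made-up scores.
detailed_scores = {"the cat sat": -8.0, "the cat sad": -12.0, "a cat sat": -9.0}

reranked = rescore_nbest(nbest, lambda hyp: detailed_scores[hyp[0]])
print(reranked[0][0])
```

Because only a few hundred candidates reach this pass, the detailed scorer can be far more expensive per hypothesis than the models used in the earlier passes.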
Speech Recognition (Cont’d)
Speedup algorithms
  Fast Gaussian Computation (FGC)
  Grammar Spreading
  N-Best Tree Rescoring
Word error rate (PII 450-MHz processor, 60000-word vocabulary)
  3 x RT => 21.4%
  10 x RT => 17.5%
  230 x RT => 14.8%
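For readers unfamiliar with the metric, word error rate is conventionally the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. The sketch below computes it with a standard edit-distance recurrence; the example sentences are made up, and the slide does not say which scoring tool produced the figures above.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(round(wer, 4))
```

The function assumes a non-empty reference; an empty one would divide by zero.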
Speaker Recognition
Speaker segmentation
  Segregates audio streams based on the speaker
Speaker clustering
  Groups together audio segments that are from the same speaker
Speaker identification
  Recognizes those speakers of interest whose voices are known to the system
Speaker Segmentation
Two-stage approach to speaker change detection
  First: detect speech/nonspeech boundaries
  Second: perform the actual speaker segmentation within the speech segments
First stage
  Collapse the phonemes into three broad classes (vowels, fricatives, and obstruents)
  Include five nonspeech models (music, silence/noise, laughter, breath, and lip-smack)
  5-state HMM
  Detection is reliable over 90% of the time
Speaker Segmentation (Cont’d)
Second stage
  Hypothesize a speaker change boundary at every phone boundary located in the first stage
  The speaker change decision takes the form of a likelihood ratio (λ) test
[Figure: decision rule. Nonspeech region: λ <= t versus λ > t. Speech region: λ <= t + α versus λ > t + α. Outcome labels: same speaker, otherwise]
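One way a likelihood ratio test with region-dependent thresholds can work is sketched below. Everything specific here is an assumption rather than the slide's method: features are scalars modeled by single Gaussians, λ is defined as the likelihood of one model for both segments divided by the likelihood of separate models, thresholds are applied to log λ, and "same speaker" is taken to be the side where log λ exceeds the threshold. The values of t and α are made up.

```python
import math

def ml_variance(samples, floor=1e-6):
    """Maximum-likelihood variance of a list of scalars, floored to stay positive."""
    mean = sum(samples) / len(samples)
    return max(sum((s - mean) ** 2 for s in samples) / len(samples), floor)

def log_likelihood_ratio(x, y):
    """log λ for 1-D Gaussians: one model for x + y versus separate models.

    With ML estimates the constants cancel, leaving only log-variance terms.
    The value is at most 0; values near 0 favor a single speaker.
    """
    z = x + y  # list concatenation: the merged segment
    return 0.5 * (len(x) * math.log(ml_variance(x))
                  + len(y) * math.log(ml_variance(y))
                  - len(z) * math.log(ml_variance(z)))

def same_speaker(x, y, in_speech_region, t, alpha):
    """Assumed rule: same speaker when log λ exceeds the region's threshold."""
    threshold = t + alpha if in_speech_region else t
    return log_likelihood_ratio(x, y) > threshold

# Made-up scalar features for three short segments.
a = [0.1, -0.2, 0.0, 0.2, -0.1]
b = [0.0, 0.1, -0.1, 0.2, -0.2]
c = [5.1, 4.8, 5.0, 5.2, 4.9]
print(same_speaker(a, b, True, t=-2.0, alpha=0.5))
print(same_speaker(a, c, False, t=-2.0, alpha=0.5))
```

Segments `a` and `b` share a distribution, so merging them costs almost nothing in likelihood; `a` and `c` differ in mean, so the merged model fits much worse and the test reports a change.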
Speaker Clustering
The likelihood ratio test is used repeatedly to group cluster pairs that are deemed most similar until all segments are grouped into one cluster and a complete cluster tree is generated
Then find the cut of the tree that is optimal based on a criterion with two terms
  Log of the determinant of the within-cluster dispersion matrix
  Compensation for the previous term
where
  K: number of clusters for any particular cut of the tree
  Nj: number of feature vectors in cluster j
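The tree-building step described above can be sketched as a greedy merge loop. This is a simplified illustration under stated assumptions: scalar features, single-Gaussian cluster models, and a log likelihood ratio (one merged model versus two separate ones) as the similarity score. The helper functions are defined here so the block runs on its own, all names are hypothetical, and the cut-selection criterion is left out because its formula is not recoverable from the slide.

```python
import math
from itertools import combinations

def ml_variance(samples, floor=1e-6):
    """Maximum-likelihood variance, floored to stay positive."""
    mean = sum(samples) / len(samples)
    return max(sum((s - mean) ** 2 for s in samples) / len(samples), floor)

def log_likelihood_ratio(x, y):
    """log λ for merged versus separate 1-D Gaussians; near 0 means similar."""
    z = x + y
    return 0.5 * (len(x) * math.log(ml_variance(x))
                  + len(y) * math.log(ml_variance(y))
                  - len(z) * math.log(ml_variance(z)))

def build_cluster_tree(segments):
    """Merge the most similar pair of clusters until one cluster remains.

    Returns the merges in order; each entry is (members_a, members_b),
    where members are tuples of original segment indices.
    """
    clusters = {(i,): list(seg) for i, seg in enumerate(segments)}
    merges = []
    while len(clusters) > 1:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: log_likelihood_ratio(clusters[pair[0]],
                                                  clusters[pair[1]]),
        )
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b))
    return merges

# Made-up segments: the first two resemble each other, the third does not.
segments = [
    [0.1, -0.2, 0.0, 0.2, -0.1],
    [0.0, 0.1, -0.1, 0.2, -0.2],
    [5.1, 4.8, 5.0, 5.2, 4.9],
]
merges = build_cluster_tree(segments)
print(merges)
```

The recorded merge order is the cluster tree; any cut of that tree corresponds to stopping the loop after a given number of merges.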
Speaker Clustering (Cont’d)
The algorithm performs well regardless of the true number of speakers, producing clusters of high purity
Purity, defined as the percentage of frames that are correctly clustered, was measured at 95.8%
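A purity figure of this kind can be computed as sketched below. The sketch assumes that a frame counts as correctly clustered when its true speaker is the majority speaker of its cluster, which is a common convention but is not spelled out on the slide; the labels and the function name are made up.

```python
from collections import Counter

def cluster_purity(frame_clusters, frame_speakers):
    """Fraction of frames whose true speaker is their cluster's majority speaker."""
    by_cluster = {}
    for cluster, speaker in zip(frame_clusters, frame_speakers):
        by_cluster.setdefault(cluster, Counter())[speaker] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(frame_speakers)

# Eight frames, two clusters; one frame of speaker B landed in cluster 0.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
speakers = ["A", "A", "A", "B", "B", "B", "B", "B"]
purity = cluster_purity(clusters, speakers)
print(purity)
```

Here seven of the eight frames sit with their cluster's majority speaker, giving a purity of 0.875.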
Speaker Identification
Every speaker cluster created in the speaker clustering stage is identified by gender
The gender of a speaker segment is then determined by computing the log likelihood ratio between the male and female models
This approach has resulted in a 2.3% error in gender detection
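The log likelihood ratio decision can be illustrated with a deliberately small model. The sketch assumes each gender model is a single 1-D Gaussian over one pitch-like feature, which is far simpler than a deployed system would be; the model parameters, feature values, and function names are all invented.

```python
import math

def avg_log_gaussian(frames, mean, var):
    """Average per-frame log density under a 1-D Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (f - mean) ** 2 / var)
        for f in frames
    ) / len(frames)

def detect_gender(frames, male_model, female_model):
    """Label a segment by the sign of the male-versus-female log likelihood ratio."""
    llr = (avg_log_gaussian(frames, *male_model)
           - avg_log_gaussian(frames, *female_model))
    return "male" if llr > 0 else "female"

# Hypothetical (mean, variance) models over a pitch-like feature in Hz.
male_model = (120.0, 400.0)
female_model = (210.0, 400.0)
label = detect_gender([115.0, 130.0, 125.0], male_model, female_model)
print(label)
```

A positive ratio means the male model explains the frames better; a negative one favors the female model.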
Speaker Identification (Cont’d)
In the DARPA Broadcast News corpus, 20% of the speaker segments are from 20 known speakers
This is an open-set problem: the data contains both known and unknown speakers, and the system has to determine the identity of the known-speaker segments and reject the unknown-speaker segments
Speaker Identification (Cont’d)
The system made three types of errors
False identification rate of 0.1%, where a known-speaker segment was mistaken to be from another known speaker
False rejection rate of 3.0%, where a known-speaker segment was classified as unknown
False acceptance rate of 0.8%, where an unknown-speaker segment was classified as coming from one of the known speakers
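The three error types can be tallied from labeled results as sketched below. The denominators are an assumption, since the slide does not state them: false identification and false rejection are divided by the number of known-speaker segments, and false acceptance by the number of unknown-speaker segments. The speaker names and the function name are hypothetical.

```python
def open_set_error_rates(results, unknown="unknown"):
    """Tally the three error types from (true_label, predicted_label) pairs.

    Assumed denominators: known-speaker segments for false identification
    and false rejection, unknown-speaker segments for false acceptance.
    """
    known = [(t, p) for t, p in results if t != unknown]
    unknowns = [(t, p) for t, p in results if t == unknown]
    false_id = sum(1 for t, p in known if p != unknown and p != t)
    false_rej = sum(1 for t, p in known if p == unknown)
    false_acc = sum(1 for t, p in unknowns if p != unknown)
    return {
        "false_identification": false_id / len(known) if known else 0.0,
        "false_rejection": false_rej / len(known) if known else 0.0,
        "false_acceptance": false_acc / len(unknowns) if unknowns else 0.0,
    }

# Made-up segment results: (true speaker, speaker reported by the system).
results = [
    ("alice", "alice"), ("alice", "bob"), ("bob", "unknown"), ("bob", "bob"),
    ("unknown", "unknown"), ("unknown", "alice"),
]
rates = open_set_error_rates(results)
print(rates)
```

Keeping the three counts separate matters in an open-set task because raising the rejection threshold trades false acceptances for false rejections.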