2001/03/29 Chin-Kai Wu, CS, NTHU

Speech and Language Technologies for Audio Indexing and Retrieval

JOHN MAKHOUL, FELLOW, IEEE, FRANCIS KUBALA, TIMOTHY LEEK, DABEN LIU, LONG NGUYEN, RICHARD SCHWARTZ, AND AMIT SRIVASTAVA, MEMBER, IEEE

PROCEEDINGS OF THE IEEE, VOL. 88, NO. 8, AUGUST 2000


Outline

Introduction
Indexing and Browsing with Rough’n’Ready
  Rough’n’Ready System
  Indexing and Browsing
Statistical Modeling Paradigm
Speech Recognition
Speaker Recognition
  Segmentation
  Clustering
  Identification


Introduction

Much information will be in the form of speech from various sources.

It’s now possible to start building automatic content-based indexing and retrieval tools.

The Rough’n’Ready system provides a rough transcription of the speech that is ready for browsing.

The technologies incorporated in the system include speech/speaker recognition, name spotting, topic classification, story segmentation and information retrieval.


Rough’n’Ready System

[Figure: system architecture. Labels: ActiveX controls, MP3, Dual P733-MHz, Collect/Manage Archive, Interact with browser]


Indexing and Browsing


Indexing and Browsing (Cont’d)

[Figure: browser view. Callouts: Speaker, People, Place, Organization, Topic Labels]


Indexing and Browsing (Cont’d)

Selected from over 5500 topic labels


Statistical Modeling Paradigm

Maximize P(output|input, model)

(desired recognized sequence of the data)
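Written out, the objective on this slide is a maximum a posteriori decision. The symbols below are notation assumed here, not taken from the slides: x is the input (audio features), y ranges over candidate outputs, and M is the trained model. For speech recognition, Bayes' rule splits the same objective into an acoustic-model term P(X | W) and a language-model term P(W), with W a word sequence and X the feature vectors.

```latex
\hat{y} = \arg\max_{y} \, P(y \mid x, M)
\qquad
\hat{W} = \arg\max_{W} \, P(W \mid X) = \arg\max_{W} \, P(X \mid W)\, P(W)
```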


Speech Recognition

Statistical models: acoustic models, language models

Acoustic model
  Describes the time-varying evolution of feature vectors for each sound or phoneme
  Employs hidden Markov models (HMMs)
  Gaussian mixtures model the feature vectors for each HMM state
  Special acoustic models for nonspeech events: music, silence/noise, laughter, breath, and lip-smack

Language model: N-gram language model
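As an illustration of the N-gram idea, here is a minimal bigram (N = 2) sketch. The toy corpus, the function names, and the add-k smoothing are assumptions made for this example; the slides do not say which N or which smoothing the system uses.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences with boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def bigram_prob(prev, cur, context_counts, bigram_counts, vocab_size, k=1.0):
    """Add-k smoothed estimate of P(cur | prev)."""
    return (bigram_counts[(prev, cur)] + k) / (context_counts[prev] + k * vocab_size)

# Toy corpus (hypothetical data, for illustration only).
corpus = [["the", "news", "tonight"], ["the", "news", "today"]]
context_counts, bigram_counts = train_bigram(corpus)
vocab = {w for s in corpus for w in s} | {"</s>"}

p_news = bigram_prob("the", "news", context_counts, bigram_counts, len(vocab))
p_today = bigram_prob("the", "today", context_counts, bigram_counts, len(vocab))
```

A seen continuation ("the news") receives a higher probability than an unseen one ("the today"), which is what lets the recognizer prefer plausible word sequences.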


Speech Recognition (Cont’d)

Multipass recognition search strategy

Fast-match pass
  Narrows the search space
  Followed by other passes with more accurate models that operate on the smaller search space

Backward pass
  Generates the top-scoring N-best word sequences (100 <= N <= 300)

N-best rescoring pass: Tree Rescoring algorithm
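The idea behind rescoring an N-best list can be sketched as follows. This is plain list rescoring, not the Tree Rescoring algorithm named on the slide, and the scores, the weight, and the toy language model are all hypothetical.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=1.0):
    """Re-rank (words, acoustic_logprob) hypotheses by acoustic + weighted LM score."""
    rescored = [
        (acoustic + lm_weight * lm_logprob(words), words)
        for words, acoustic in nbest
    ]
    rescored.sort(key=lambda item: item[0], reverse=True)
    return [words for _, words in rescored]

def toy_lm_logprob(words):
    # Hypothetical stand-in for a more accurate language model.
    table = {
        ("recognize", "speech"): -1.0,
        ("wreck", "a", "nice", "beach"): -6.0,
    }
    return table.get(tuple(words), -20.0)

# Hypothetical first-pass output: the acoustically best hypothesis comes first.
nbest = [
    (["wreck", "a", "nice", "beach"], -10.0),
    (["recognize", "speech"], -11.0),
]
best = rescore_nbest(nbest, toy_lm_logprob)[0]
```

Because the list is short (a few hundred entries at most), a model that would be too expensive for the full search can be applied to every hypothesis.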


Speech Recognition (Cont’d)

Speedup algorithms
  Fast Gaussian Computation (FGC)
  Grammar Spreading
  N-Best Tree Rescoring

Word error rate (PII 450-MHz processor, 60000-word vocabulary; RT = real time)
  3 x RT => 21.4%
  10 x RT => 17.5%
  230 x RT => 14.8%
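Word error rate is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below follows that standard definition; the function name and the example sentences are made up for illustration, and a non-empty reference is assumed.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev holds the previous row of the edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```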


Speaker Recognition

Speaker segmentation
  Segregates audio streams based on the speaker

Speaker clustering
  Groups together audio segments that are from the same speaker

Speaker identification
  Recognizes those speakers of interest whose voices are known to the system


Speaker Segmentation

Two-stage approach to speaker change detection
  First: detect speech/nonspeech boundaries
  Second: perform the actual speaker segmentation within the speech segments

First stage
  Collapse the phonemes into three broad classes (vowels, fricatives, and obstruents)
  Include five nonspeech models (music, silence/noise, laughter, breath, and lip-smack)
  5-state HMMs
  Detection is reliable over 90% of the time


Speaker Segmentation (Cont’d)

Second stage
  Hypothesizes a speaker change boundary at every phone boundary located in the first stage
  The speaker change decision takes the form of a likelihood ratio (λ) test

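[Decision diagram: in a nonspeech region, λ is compared with the threshold t (λ <= t vs. λ > t); in a speech region, λ is compared with t + α (λ <= t + α vs. λ > t + α). Outcome labels: "Same speaker" and "otherwise"]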
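A likelihood ratio test of this kind can be sketched as below. The slides do not give the formula for λ, so this example assumes a generalized likelihood ratio with one 1-D Gaussian per segment, works with log λ, and assumes that values above the threshold mean "same speaker". The threshold t and the bias α come from the slide; every other name and number is hypothetical.

```python
import math

def gaussian_loglik(samples):
    """Log-likelihood of samples under their own maximum-likelihood 1-D Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    var = max(var, 1e-8)  # floor to avoid log(0) on constant segments
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def log_likelihood_ratio(x, y):
    """log lambda: one shared model for x+y versus separate models for x and y."""
    return gaussian_loglik(x + y) - gaussian_loglik(x) - gaussian_loglik(y)

def same_speaker(x, y, t, alpha=0.0, in_speech_region=False):
    """Assumed rule: same speaker when log lambda exceeds the region's threshold."""
    threshold = t + alpha if in_speech_region else t
    return log_likelihood_ratio(x, y) > threshold

# Hypothetical scalar features for three segments.
seg_a = [0.0, 0.1, -0.1, 0.05, -0.05]
seg_b = [0.02, -0.08, 0.07, -0.03, 0.1]
seg_c = [5.0, 5.1, 4.9, 5.05, 4.95]

a_b_same = same_speaker(seg_a, seg_b, t=-5.0)
a_c_same = same_speaker(seg_a, seg_c, t=-5.0)
```

With this orientation, a positive α raises the bar for declaring "same speaker" inside speech regions. Whether the original system biases the decision in that direction is not recoverable from the slide.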


Speaker Clustering

The likelihood ratio test is used repeatedly to group cluster pairs that are deemed most similar until all segments are grouped into one cluster and a complete cluster tree is generated

Find the cut of the tree that is optimal based on a criterion, where

K: number of clusters for any particular cut of the tree

Nj: number of feature vectors in cluster j

Log of determinant of the within-cluster dispersion matrix

Compensation for the previous term


Speaker Clustering (Cont’d)

The algorithm performs well regardless of the true number of speakers, producing clusters of high purity

The purity is defined as the percentage of frames that are correctly clustered, measured as 95.8%
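One common way to compute such a purity figure is sketched below. It assumes that a frame counts as "correctly clustered" when its true speaker is the majority speaker of its cluster; the slides do not spell this out, and the labels are invented.

```python
from collections import Counter

def cluster_purity(cluster_ids, speaker_ids):
    """Fraction of frames whose speaker is the majority speaker of their cluster."""
    per_cluster = {}
    for cluster, speaker in zip(cluster_ids, speaker_ids):
        per_cluster.setdefault(cluster, Counter())[speaker] += 1
    correct = sum(max(counts.values()) for counts in per_cluster.values())
    return correct / len(cluster_ids)

# Ten hypothetical frames: cluster 0 is mostly speaker A, cluster 1 mostly B.
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
speakers = ["A", "A", "A", "B", "B", "B", "B", "B", "B", "A"]
purity = cluster_purity(clusters, speakers)
```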


Speaker Identification

Every speaker cluster created in the speaker clustering stage is identified by gender

The gender of a speaker segment is then determined by computing the log likelihood ratio between the male and female models

This approach has resulted in a 2.3% error in gender detection
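The log likelihood ratio decision can be illustrated with a toy example. Here each gender model is a single 1-D Gaussian over one pitch-like feature, with made-up parameters. The slides do not describe the form or the features of the real male and female models, so this is only a stand-in for the decision rule.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

# Hypothetical (mean, std) models over a pitch-like feature in Hz.
MALE = (120.0, 25.0)
FEMALE = (210.0, 30.0)

def gender_llr(frames):
    """Sum over frames of log p(frame | male) - log p(frame | female)."""
    return sum(gaussian_logpdf(f, *MALE) - gaussian_logpdf(f, *FEMALE) for f in frames)

def classify_gender(frames):
    """Positive log likelihood ratio favors the male model, otherwise female."""
    return "male" if gender_llr(frames) > 0 else "female"

low_pitch = classify_gender([115.0, 130.0, 122.0])
high_pitch = classify_gender([200.0, 220.0, 190.0])
```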


Speaker Identification (Cont’d)

In the DARPA Broadcast News corpus, 20% of the speaker segments are from 20 known speakers

This is an open-set problem: the data contain both known and unknown speakers, so the system has to determine the identity of the known-speaker segments and reject the unknown-speaker segments
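An open-set decision can be sketched as "pick the best-matching known speaker, then reject if the match is too weak". The score dictionaries, speaker names, and threshold below are hypothetical; the slides do not describe how the system scores or rejects segments.

```python
def identify_open_set(scores, threshold):
    """Return the best-scoring known speaker, or None to reject as unknown."""
    best_speaker = max(scores, key=scores.get)
    return best_speaker if scores[best_speaker] >= threshold else None

# Hypothetical per-speaker log-likelihood scores for two segments.
known_segment = {"anchor_1": -2.1, "anchor_2": -7.4}
unknown_segment = {"anchor_1": -9.8, "anchor_2": -9.1}

accepted = identify_open_set(known_segment, threshold=-5.0)
rejected = identify_open_set(unknown_segment, threshold=-5.0)
```

Raising the threshold trades false acceptances for false rejections, which is why open-set systems report both rates.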


Speaker Identification (Cont’d)

The system produced three types of errors

False identification rate of 0.1%, where a known-speaker segment was mistaken to be from another known speaker

False rejection rate of 3.0%, where a known-speaker segment was classified as unknown

False acceptance rate of 0.8%, where an unknown-speaker segment was classified as coming from one of the known speakers
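The three error types can be tallied from labeled segments as in the sketch below. It assumes each rate is taken over all segments; the slides do not state the denominators. The labels are invented, with None standing for an unknown speaker.

```python
def open_set_error_rates(truth, predicted):
    """Tally the three open-set error types; None marks an unknown speaker."""
    pairs = list(zip(truth, predicted))
    n = len(pairs)
    # Known speaker labeled as a different known speaker.
    false_id = sum(t is not None and p is not None and t != p for t, p in pairs)
    # Known speaker rejected as unknown.
    false_reject = sum(t is not None and p is None for t, p in pairs)
    # Unknown speaker accepted as one of the known speakers.
    false_accept = sum(t is None and p is not None for t, p in pairs)
    return {
        "false_identification": false_id / n,
        "false_rejection": false_reject / n,
        "false_acceptance": false_accept / n,
    }

truth = ["A", "A", "B", None, None, "B", None, "A"]
predicted = ["A", "B", None, None, "A", "B", None, "A"]
rates = open_set_error_rates(truth, predicted)
```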
