Computer Science Department

A Speech / Music Discriminator using RMS and Zero-crossings

Costas Panagiotakis and George Tziritas

Department of Computer Science, University of Crete, Heraklion, Greece


Presentation Organization

I. Introduction
II. Segmentation
III. Classification
IV. Results
V. Conclusion

EUSIPCO 2002, Toulouse, France


Introduction (1/3)

Input

Figure 1: Original sound signal (sample rate 44100 or 22050 Hz)

Output

Figure 2: Real time segmentation and classification (Speech, Music, Silence)



Introduction (2/3)

Approaches

Basic purpose

•Feature extraction (energy, frequency)

•Feature-based segmentation and classification

•Real time segmentation and classification

•Algorithmic - computation constraints

•Low feature number

•Low change extraction error (20 msec)

•Low minimum distance between two changes (1 sec)

•High accuracy (95 %)



Introduction (3/3)

Basic Features

Root Mean Square (RMS): signal energy

Zero Crossings (ZC): mean frequency

•Computed every 20 msec

•Independent characteristics

Figure 3: RMS in music

Figure 4: RMS in speech

Figure 5: ZC in music

Figure 6: ZC in speech

RMS of a frame of N samples x(n):

$$A = \sqrt{\frac{1}{N} \sum_{n} x^2(n)}$$
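The two basic features can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `frame_features`, the use of non-overlapping frames, and the convention that a sample equal to zero counts as positive are all assumptions.

```python
import math

def frame_features(samples, sample_rate, frame_ms=20):
    """Return (rms, zc): one RMS value and one zero-crossing count per frame.

    Assumption: frames are non-overlapping and a trailing partial frame is dropped.
    """
    n = int(sample_rate * frame_ms / 1000)  # samples per 20 msec frame
    rms, zc = [], []
    for start in range(0, len(samples) - n + 1, n):
        frame = samples[start:start + n]
        # A = sqrt((1/N) * sum of x(n)^2)
        rms.append(math.sqrt(sum(x * x for x in frame) / n))
        # Count sign changes between consecutive samples (zero counts as positive).
        zc.append(sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)))
    return rms, zc
```

At 44100 Hz a 20 msec frame holds 882 samples, so one second of audio yields 50 RMS values and 50 ZC values.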



Segmentation (1/3)

Basic characteristics
•RMS based
•χ2 distribution fits well the RMS histograms

Two stage algorithm

Stage 1
•1 sec accuracy (low computation cost)

Stage 2
•20 msec accuracy (high computation cost)

Figure 7: Histogram RMS in speech, approximation by χ2 distribution

Figure 8: Histogram RMS in speech, approximation by χ2 distribution

$$p(x) = \frac{x^{a} e^{-x/b}}{b^{a+1}\,\Gamma(a+1)}, \quad x \ge 0$$

$$a = \frac{m^2}{s^2} - 1, \qquad b = \frac{s^2}{m}$$

m : mean, s2 : variance
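A minimal sketch of fitting this distribution to a set of RMS values by matching the mean m and variance s2. It assumes b acts as a scale parameter (exponent -x/b), which is what the relations a = m2/s2 - 1 and b = s2/m imply; the names `fit_chi2` and `chi2_pdf` are illustrative only.

```python
import math
from statistics import mean, pvariance

def fit_chi2(rms_values):
    """Moment estimates: a = m^2 / s^2 - 1, b = s^2 / m.

    Assumption: the values are not all identical (s^2 > 0) and m > 0.
    """
    m = mean(rms_values)
    s2 = pvariance(rms_values)
    return m * m / s2 - 1.0, s2 / m

def chi2_pdf(x, a, b):
    """p(x) = x^a * exp(-x/b) / (b^(a+1) * Gamma(a+1)) for x > 0, else 0."""
    if x <= 0:
        return 0.0
    # Evaluated in the log domain to avoid overflow for large a.
    return math.exp(a * math.log(x) - x / b
                    - (a + 1) * math.log(b) - math.lgamma(a + 1))
```

With these estimates the fitted density has mean (a + 1)·b = m and variance (a + 1)·b² = s2, so only two numbers per frame need to be stored.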


Segmentation (2/3)

Stage 1
•Partitioning in 1 sec frames (50 RMS values)
•Change in Frame i: Frame i-1 and Frame i+1 have to differ
•Computation of frame distance D (Matusita distance) using frame similarity ρ

$$\rho(p_1, p_2) = \int \sqrt{p_1(x)\, p_2(x)}\, dx$$

$$D(i) = 1 - \rho(p_{i-1}, p_{i+1})$$

•Frame i is a candidate for Stage 2 (there is a change) if D(i) > threshold and D(i) is a local maximum


[Diagram: RMS over time, partitioned in 1 sec frames (Frame i-1, Frame i, Frame i+1, Frame i+2); the distance is LOW within a homogeneous region and HIGH at a change in frame i]
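Stage 1 can be sketched as below. The sketch assumes each 1 sec frame is summarized by the fitted parameters (a, b) of a density of the form x^a e^(-x/b) / (b^(a+1) Γ(a+1)); under that assumption the similarity integral has a closed form, which is derived here and not taken from the slides. The function names and the tie handling in the local-maximum test are illustrative.

```python
import math
from statistics import mean, pvariance

def fit_chi2(values):
    """Moment estimates (a, b); assumes the frame is not constant or silent."""
    m, s2 = mean(values), pvariance(values)
    return m * m / s2 - 1.0, s2 / m

def similarity(p1, p2):
    """rho(p1, p2) = integral of sqrt(p1(x) * p2(x)) dx, in closed form."""
    (a1, b1), (a2, b2) = p1, p2
    alpha = (a1 + a2) / 2.0 + 1.0
    log_rho = (math.lgamma(alpha)
               - 0.5 * (math.lgamma(a1 + 1) + math.lgamma(a2 + 1))
               + alpha * math.log(2.0 * b1 * b2 / (b1 + b2))
               - 0.5 * ((a1 + 1) * math.log(b1) + (a2 + 1) * math.log(b2)))
    return math.exp(log_rho)

def stage1_candidates(rms, threshold, frame_len=50):
    """Return (candidate frame indices, list of distances D(i))."""
    frames = [rms[k:k + frame_len]
              for k in range(0, len(rms) - frame_len + 1, frame_len)]
    params = [fit_chi2(f) for f in frames]
    dist = [0.0] * len(frames)
    for i in range(1, len(frames) - 1):
        # D(i) compares the neighbours of frame i, not frame i itself.
        dist[i] = 1.0 - similarity(params[i - 1], params[i + 1])
    candidates = [i for i in range(1, len(frames) - 1)
                  if dist[i] > threshold
                  and dist[i] >= dist[i - 1] and dist[i] >= dist[i + 1]]
    return candidates, dist
```

Because only the two fitted parameters of each frame enter the distance, the cost per frame is constant once the frame mean and variance are known.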


Segmentation (3/3)

Stage 2
•20 msec accuracy

•For each candidate frame i from Stage 1:
1. move two successive frames (1 sec) located before and after frame i
2. find the time instant where the two successive frames have the maximum Matusita distance between their RMS distributions

•Possible oversegmentation

Figure 10: The RMS data and the distance D

Figure 11: The segmentation result and the RMS data
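A rough sketch of the Stage 2 refinement, under the assumption that the boundary is moved in 20 msec steps across candidate frame i and that the 1 sec windows just before and just after the boundary are compared. The search range, the helper names, and the closed-form similarity (valid for densities of the form x^a e^(-x/b) / (b^(a+1) Γ(a+1))) are assumptions, not details given on the slides.

```python
import math
from statistics import mean, pvariance

def fit_chi2(values):
    """Moment estimates (a, b); assumes a non-constant, non-silent window."""
    m, s2 = mean(values), pvariance(values)
    return m * m / s2 - 1.0, s2 / m

def similarity(p1, p2):
    """Closed-form integral of sqrt(p1 * p2) for the two fitted densities."""
    (a1, b1), (a2, b2) = p1, p2
    alpha = (a1 + a2) / 2.0 + 1.0
    return math.exp(math.lgamma(alpha)
                    - 0.5 * (math.lgamma(a1 + 1) + math.lgamma(a2 + 1))
                    + alpha * math.log(2.0 * b1 * b2 / (b1 + b2))
                    - 0.5 * ((a1 + 1) * math.log(b1) + (a2 + 1) * math.log(b2)))

def refine_change(rms, i, frame_len=50):
    """Return (t, d): RMS index t (20 msec units) of the largest distance d."""
    lo = max(frame_len, i * frame_len)
    hi = min(len(rms) - frame_len, (i + 1) * frame_len)
    best_t, best_d = None, -1.0
    for t in range(lo, hi + 1):
        before = fit_chi2(rms[t - frame_len:t])   # 1 sec ending at t
        after = fit_chi2(rms[t:t + frame_len])    # 1 sec starting at t
        d = 1.0 - similarity(before, after)
        if d > best_d:
            best_t, best_d = t, d
    return best_t, best_d
```

Multiplying the returned index by 0.02 gives the change instant in seconds.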



Classification (1/4)

Basic purpose
Segment classification into one of the following classes:
•Music
•Speech
•Silence

Main Algorithm
•Hypothesis: segmentation gives homogeneous segments
•Input: basic characteristics RMS, ZC
•Computation of the actual features of each segment
•Classification based on the actual feature values




Classification (2/4)

Actual Features specification

•Normalized RMS variance, $\sigma_A^2$

$$\sigma_A^2 = \frac{\operatorname{var}(\mathrm{RMS})}{\left(\operatorname{mean}(\mathrm{RMS})\right)^2}$$

Usually (86 %) $\sigma_A^2$(music) < $\sigma_A^2$(speech)

•The probability of null ZC, ZC0
Always ZC0(music) = 0
Usually (40%) ZC0(speech) > 0

•Maximal mean frequency, max(ZC)
Almost always in speech max(ZC) < 2.4 kHz
In 2% of the cases in music max(ZC) > 2.4 kHz
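The three features on this slide could be computed from the per-frame RMS and ZC values of a segment roughly as follows. The function name is hypothetical, and the conversion from a zero-crossing count to a frequency in Hz (two crossings per cycle) is an assumption, since the slides do not state how ZC is scaled.

```python
from statistics import mean, pvariance

def actual_features(rms, zc, frame_ms=20):
    """Return (normalized RMS variance, ZC0, maximal mean frequency in Hz)."""
    # sigma_A^2 = var(RMS) / mean(RMS)^2
    norm_var = pvariance(rms) / mean(rms) ** 2
    # ZC0: fraction of frames with no zero crossing at all.
    zc0 = sum(1 for z in zc if z == 0) / len(zc)
    # Assumption: a tone of frequency f crosses zero about 2*f times per second.
    frames_per_second = 1000.0 / frame_ms
    max_freq_hz = max(zc) * frames_per_second / 2.0
    return norm_var, zc0, max_freq_hz
```

Under this scaling, the 2.4 kHz limit corresponds to 96 zero crossings in a 20 msec frame.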



Classification (3/4)

Actual Features specification

•Joint RMS/ZC measure, Cz
Speech: high correlation between RMS and ZC; many void intervals with low RMS and ZC
Music: RMS and ZC essentially independent

•Void intervals frequency, Fu
Void intervals detection (20 msec), with i ∈ A ranging over the frames of the segment:

(RMS < T1) && (RMS < 0.1•max(RMS(i))) && (RMS < T2) || (ZC = 0)

Group neighboring silent intervals

Fu : frequency of grouped silent intervals

Always in speech Fu > 0.6

In at least 65% of music Fu < 0.6
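The void interval test and the grouping step can be sketched as below. Several points are assumptions: the thresholds T1 and T2 are passed in because their values are not given, the condition is read with `&&` binding tighter than `||`, and Fu is taken to be the number of grouped void intervals per second.

```python
def void_frames(rms, zc, t1, t2):
    """Flag each 20 msec frame as void (True) or not (False)."""
    peak = max(rms)  # max of RMS(i) over the segment
    return [(r < t1 and r < 0.1 * peak and r < t2) or z == 0
            for r, z in zip(rms, zc)]

def void_interval_frequency(rms, zc, t1, t2, frame_ms=20):
    """Fu: grouped void intervals per second (the unit is an assumption)."""
    flags = void_frames(rms, zc, t1, t2)
    # A group starts at a void frame whose predecessor is not void.
    groups = sum(1 for k, f in enumerate(flags)
                 if f and (k == 0 or not flags[k - 1]))
    duration_s = len(flags) * frame_ms / 1000.0
    return groups / duration_s
```

Grouping neighboring void frames means a 100 msec pause counts once, not five times.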



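Classification (4/4)

Silence segment recognition

Segment is silence if E < Threshold

$$E = 0.7 \operatorname*{median}_{i \in A} \mathrm{RMS}(i) + 0.3\, \frac{1}{|A|} \sum_{i \in A} \mathrm{RMS}(i)$$

where A is the set of frames in the segment.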


Decision making algorithm

[Flowchart: silence segment check → Silence; otherwise actual features check → Speech or Music]
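A hedged sketch of the decision flow. The silence measure follows the weighted median/mean combination of the segment's RMS values; the feature checks use only the statements marked "always" or "almost always" on the feature slides, and their order is an assumption. The slides give no thresholds for the normalized variance or for Cz, so segments not resolved by these rules are returned as "undecided" here instead of guessing.

```python
from statistics import mean, median

def silence_measure(rms):
    """E = 0.7 * median(RMS) + 0.3 * mean(RMS) over the frames of a segment."""
    return 0.7 * median(rms) + 0.3 * mean(rms)

def classify_segment(rms, zc0, max_freq_hz, fu, silence_threshold):
    """Return 'silence', 'speech', 'music' or 'undecided' for one segment."""
    # Step 1: silence segment check.
    if silence_measure(rms) < silence_threshold:
        return "silence"
    # Step 2: actual features check (rule order is an assumption).
    if zc0 > 0:               # ZC0 is always 0 in music
        return "speech"
    if max_freq_hz > 2400.0:  # speech almost always stays below 2.4 kHz
        return "music"
    if fu < 0.6:              # speech always has Fu > 0.6
        return "music"
    return "undecided"        # would need the variance and Cz features
```

The value of `silence_threshold` is not stated in the slides and has to be chosen for the recording level at hand.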



Results

Data
•11,328 sec speech
•3,131 sec music

Data source
•70% audio CDs
•15% WWW
•15% recordings

Segmentation performance
•97% detection probability
•Change accuracy ~ 0.2 sec

Actual features performance

[Bar charts: accuracy for individual features and feature combinations ($\sigma_A^2$, Cz, ZC0, Fu, All)]


Conclusion

Complexity
•Minimum complexity O(N)
•Low computation cost

Summary
•Real time segmentation and classification in three classes
•Energy distribution (RMS) suffices for segmentation
•RMS and ZC suffice for classification
•Purpose: minimum cost and high performance

Future extension
•Content-based indexing and retrieval of audio signals
•Pre-processing stage for speech recognition



Segmentation - Classification Demo


Sound Player Demo
