btp_y11uc158 _ppt.pdf
TRANSCRIPT
-
Perceptual WPT and time-adaptive level thresholding based enhancement of
degraded speech
Presented by
Nitesh Kumar Chaudhary
Department of Electronics & Communication Engineering
The LNM Institute Of Information Technology, Jaipur
Under the Supervision of
Dr. Navneet Upadhyay
-
Why speech enhancement?
The presence of noise in speech can significantly reduce the intelligibility of speech and degrade automatic speech recognition performance.
Noise reduction has therefore become an important issue in speech signal processing systems, such as speech coding and speech recognition systems.
Common sources of degradation include:
(a) Additive acoustic noise - noise added to the speech signal when it is recorded in an environment with noticeable background noise, such as an aircraft cockpit.
(b) Acoustic reverberation - results from the additive effect of multiple reflections of an acoustic signal.
(c) Convolutive channel effects - an uneven or band-limited response, which can result when the communication channel is not modeled well enough for the channel equalizer to remove the channel impulse response.
-
(d) Electrical interference
(e) Codec distortion - distortion caused by the coding algorithm due to compression
(f) Distortion introduced by recording apparatus - poor response of microphone
Keywords: Perceptual Wavelet packet transform (PWPT), Time adaptive Thresholding,
TEO, Probability of detection Pd and false alarm Pf, Masking.
-
Block Diagram
[Figure: block diagram. Blocks: Noisy Signal X(n); Perceptual WPT; Teager Energy Operator; Critical Band Selection; Level-dependent Thresholding; Inverse PWPT; VAS & Time-adaptive Thresholding; Recovered Clean Signal Y(n). Intermediate signals: Wj,m(k), tj,m(k), Mj,m(k), Lj,m(k) and Wm(n), with m = 1...17.]
-
Perceptual Wavelet Packet Transform :
The Wavelet Packet Transform (WPT) is a time-frequency analysis tool: it maps
the signal into a domain that contains both time and frequency information.
In wavelet analysis, a signal is split into an approximation and a detail. The
approximation is then itself split into a second-level approximation and detail,
and the process is repeated.
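The difference between the two splitting schemes can be sketched in a few lines. This is a minimal illustration that assumes the Haar filter pair for brevity (the slides evaluate Daubechies filters); the function names are hypothetical.

```python
import math

def haar_split(x):
    """Split an even-length sequence into (approximation, detail) halves."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    detail = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return approx, detail

def wavelet_tree(x, levels):
    """Wavelet analysis: only the approximation is split again."""
    details = []
    approx = list(x)
    for _ in range(levels):
        approx, d = haar_split(approx)
        details.append(d)
    return approx, details

def wavelet_packet_tree(x, levels):
    """Wavelet packet analysis: every node, approximation or detail, is split."""
    nodes = {(0, 0): list(x)}
    for j in range(levels):
        for m in range(2 ** j):
            a, d = haar_split(nodes[(j, m)])
            nodes[(j + 1, 2 * m)] = a      # approximation child
            nodes[(j + 1, 2 * m + 1)] = d  # detail child
    return nodes

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = wavelet_tree(signal, 2)
packets = wavelet_packet_tree(signal, 2)
```

With two levels, the wavelet tree yields one approximation and two detail vectors, while the packet tree yields four level-2 nodes. Node (2,0) of the packet tree is the same vector as the level-2 approximation.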
In the corresponding perceptual wavelet packet transform, each detail coefficient
vector is also decomposed into two parts, in the same way as the approximation
vector. 17 critical bands are selected because, for speech sampled at 8 kHz,
17 critical bands are required to cover the entire frequency range.
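As a rough check of the 17-band claim, the sketch below lists the nominal frequency range of each leaf node of the decomposition tree. It assumes the leaf set read off the tree on the slides, nodes (5,0)-(5,7), (4,4)-(4,9) and (3,5)-(3,7), and assumes that node indices are in frequency order, which the slides do not state.

```python
FS = 8000          # sampling rate in Hz, as stated on the slide
NYQUIST = FS / 2   # highest representable frequency

# Leaf nodes (level j, index m), assumed from the decomposition tree.
LEAVES = (
    [(5, m) for m in range(8)]        # (5,0) ... (5,7)
    + [(4, m) for m in range(4, 10)]  # (4,4) ... (4,9)
    + [(3, m) for m in range(5, 8)]   # (3,5) ... (3,7)
)

def band_edges(j, m):
    """Nominal band of node (j, m), assuming frequency-ordered node indices."""
    width = NYQUIST / 2 ** j
    return m * width, (m + 1) * width

bands = [band_edges(j, m) for j, m in LEAVES]
for (j, m), (low, high) in zip(LEAVES, bands):
    print(f"node ({j},{m}): {low:6.0f} - {high:6.0f} Hz")
```

Under these assumptions the 17 leaves tile 0 to 4000 Hz without gaps: 125 Hz bands up to 1 kHz, 250 Hz bands up to 2.5 kHz, and 500 Hz bands above that.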
-
(0,0)
(1,0) (1,1)
(2,0) (2,1) (2,2) (2,3)
(3,0) (3,1) (3,2) (3,3) (3,4) (3,5) (3,6) (3,7)
(4,0) (4,1) (4,2) (4,3) (4,4) (4,5) (4,6) (4,7) (4,8) (4,9)
(5,0) (5,1) (5,2) (5,3) (5,4) (5,5) (5,6) (5,7)
[Figure: wavelet decomposition tree; vertical axis: decomposition level.]
[Figure: noisy signal wavelet packet decomposition; x-axis: sample point (0.5 to 2 x 10^4); y-axis: signal magnitude (-0.3 to 0.4); 32 plotted series, data1 to data32.]
-
TEO & level dependent thresholding
The Teager energy operator (TEO) is a powerful non-linear operator that has been
successfully used in various speech applications. It can be used to estimate the
second-moment angular bandwidth of a signal and the moments of a signal's
duration and of its spectrum.
TEO can determine the energy of quite complicated functions. For a given
band-limited signal, the TEO introduced by Kaiser is given by
$\Psi[x(n)] = x^2(n) - x(n+1)\,x(n-1)$
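A minimal sketch of the discrete operator, assuming the standard Kaiser form x(n)^2 - x(n+1) x(n-1). The function name is hypothetical, and the two boundary samples are simply dropped.

```python
import math

def teager_energy(x):
    """Discrete TEO: psi[x(n)] = x(n)^2 - x(n+1) * x(n-1) for interior samples."""
    return [x[n] ** 2 - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]

# For x(n) = A cos(omega n) the operator returns the constant A^2 sin^2(omega),
# so it tracks both amplitude and frequency of the tone.
A, omega = 2.0, 0.3
tone = [A * math.cos(omega * n) for n in range(50)]
energy = teager_energy(tone)
expected = (A * math.sin(omega)) ** 2
```

Every element of `energy` equals `expected` up to rounding, which is why the operator is useful as an instantaneous energy measure on narrow sub-bands.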
A time-adaptive threshold is computed for the wavelet coefficients so that
noise that varies over time is taken into account.
[Equation: threshold expression; not legible in the source]
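The sketch below does not reproduce the threshold rule of this slide. As a stand-in, it assumes soft thresholding with a universal threshold estimated separately for each node (in the style of Donoho and Johnstone), only to illustrate what a level-dependent threshold looks like in code. All function names are hypothetical.

```python
import math

def median(values):
    """Median of a non-empty list."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])

def universal_threshold(coeffs):
    """Assumed rule: sigma * sqrt(2 ln N), with sigma = median(|c|) / 0.6745."""
    sigma = median([abs(c) for c in coeffs]) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(len(coeffs)))

def soft_threshold(coeffs, thr):
    """Shrink each coefficient toward zero by thr; zero it if |c| <= thr."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]

def denoise_nodes(nodes):
    """Apply one threshold per node, so each level and band gets its own value."""
    return {key: soft_threshold(c, universal_threshold(c))
            for key, c in nodes.items()}

shrunk = soft_threshold([3.0, -0.5, 0.2, -2.0], 1.0)
```

A time-adaptive variant would recompute the threshold over successive frames rather than once per node.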
-
Masking Construction:
For a selected band, the mask is obtained by
$M_{j,m}(k) = t_{j,m}(k) * H_j(k)$
where $*$ denotes the convolution operation and $H_j(k)$ is a $256/2^j$-point level-dependent
Hamming window.
The voice activity shape V(n) is calculated by
$V(n) = \sum_{m=1}^{17} W_m(n)$
where $W_m(n)$ is the inverse perceptual wavelet packet transform of $M_{j,m}(k)$.
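The mask step amounts to smoothing a sub-band sequence with a Hamming window. The sketch below assumes the sequence being smoothed is the Teager energy of one sub-band and that the window length is 256 / 2^j for level j. Both are readings of the slide, and the helper names are hypothetical.

```python
import math

def hamming(n_points):
    """Symmetric Hamming window with n_points samples."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n_points - 1))
            for k in range(n_points)]

def convolve_same(x, h):
    """Linear convolution of x with h, cropped to len(x) and centred on h."""
    half = (len(h) - 1) // 2
    out = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            idx = n + half - k
            if 0 <= idx < len(x):
                acc += hk * x[idx]
        out.append(acc)
    return out

def build_mask(teo_coeffs, level, base_length=256):
    """Smooth one sub-band with an assumed (base_length / 2**level)-point window."""
    window = hamming(max(1, base_length // 2 ** level))
    return convolve_same(teo_coeffs, window)

# A single burst of Teager energy spreads into a smooth bump.
teo = [0.0] * 20 + [1.0] + [0.0] * 20
mask = build_mask(teo, level=5)  # 256 / 2**5 = 8-point window
```

Deeper levels hold fewer coefficients per second, so a shorter window covers a comparable time span.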
-
Time-adaptive threshold calculation: to determine the time-adaptive threshold value AWT, an iterative algorithm has been proposed.
[Equation: iterative update rule for AWT(i); not legible in the source]
where AWT(i) is the time-adaptive threshold value of frame i, and Frame(i) is defined as
$\mathrm{Frame}(i) = [V((i-1) \cdot 160 + 1), \ldots, V(i \cdot 160)]$
Noise is defined as
$\mathrm{Noise}(n) = p \cdot \{E[V^{(2)}(n)] + \mathrm{Mean}(\mathrm{Frame}(i))\} / 2$
where $E[V^{(k)}(n)]$ is the mean of $V^{(k)}(n)$.
The voice-active regions are characterized by $V(n) > \mathrm{AWT}$.
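The final decision can be sketched as a per-frame comparison. The code assumes non-overlapping frames of 160 samples and takes the per-frame thresholds as given input, since the update rule for AWT is not reproduced here. The function names and the threshold values are hypothetical.

```python
FRAME_LEN = 160  # samples per frame, assumed from the Frame(i) definition

def frame_means(v):
    """Mean of V(n) over consecutive, non-overlapping frames."""
    n_frames = len(v) // FRAME_LEN
    return [sum(v[i * FRAME_LEN:(i + 1) * FRAME_LEN]) / FRAME_LEN
            for i in range(n_frames)]

def voice_active(v, awt):
    """Flag sample n as voice-active when V(n) > AWT(i) for its frame i.

    awt holds one threshold per frame; samples past the last full frame
    reuse the last threshold.
    """
    flags = []
    for n, value in enumerate(v):
        i = min(n // FRAME_LEN, len(awt) - 1)
        flags.append(value > awt[i])
    return flags

# Toy shape: one quiet frame followed by one loud frame.
v = [0.1] * FRAME_LEN + [0.9] * FRAME_LEN
awt = [0.5, 0.5]  # hypothetical thresholds, one per frame
flags = voice_active(v, awt)
```

At the 8 kHz sampling rate used in the slides, a 160-sample frame corresponds to 20 ms of speech.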
-
Level 3
[Figure: Level 3, node-by-node denoising. Two panel groups: noisy signal and denoised signal at level 3 of the wavelet tree, each showing nodes (3,5), (3,6) and (3,7). x-axis: frequency in Hz (0 to 3500); y-axis: signal amplitude (-1 to 1).]
-
Level 4
[Figure: Level 4, node-by-node denoising. Two panel groups: noisy signal and denoised signal at level 4 of the wavelet tree, each showing nodes (4,4) to (4,9). x-axis: frequency in Hz (0 to 1600); y-axis: amplitude (-1 to 1).]
-
Level 5
[Figure: Level 5, node-by-node denoising. Two panel groups: noisy signal and denoised signal at level 5 of the wavelet tree, each showing nodes (5,0) to (5,7). x-axis: frequency in Hz (0 to 800); y-axis: amplitude (-1 to 1).]
-
Evaluation
To verify the effectiveness of the proposed algorithms, we compared the speech detection
and false-alarm probabilities.
The proposed methods are all evaluated by receiver operating characteristic (ROC)
curves, which show how well the VAD discriminates between noise-only and noisy
speech frames in terms of the probability of correct detection (Pd) and the probability of
false alarm (Pf), such that
[Equations: definitions of Pd and Pf; not legible in the source]
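A minimal sketch of how the two probabilities could be computed from frame labels. It assumes the usual frame-level definitions (stated in the docstring), which may differ from the ones used on the slide. The function name and the example labels are hypothetical.

```python
def detection_rates(reference, decision):
    """Frame-level Pd and Pf from boolean labels (True = speech).

    Assumed definitions:
      Pd = speech frames detected as speech / all speech frames
      Pf = noise-only frames detected as speech / all noise-only frames
    """
    speech = [d for r, d in zip(reference, decision) if r]
    noise = [d for r, d in zip(reference, decision) if not r]
    pd = sum(speech) / len(speech) if speech else 0.0
    pf = sum(noise) / len(noise) if noise else 0.0
    return pd, pf

reference = [False, False, True, True, True, True, False, False]
decision = [False, True, True, True, False, True, False, False]
pd, pf = detection_rates(reference, decision)
print(f"Pd = {pd:.2f}, Pf = {pf:.2f}")
```

Sweeping the detection threshold and plotting the resulting (Pf, Pd) pairs gives one ROC curve per configuration.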
-
Performance Evaluation
[Figure: ROC curve. x-axis: Pf, probability of false alarm; y-axis: Pd, probability of detection; logarithmic axes. Annotation: 20.6710 dB. Legend: shape-preserving, linear.]
-
Table 1
(Pd: probability of correct detection; Pf: probability of false alarm; CP time: computation time)

Wavelet filter type (filter length)   Pd (%)   Pf (%)   CP time (s)
Daubechies 2                          86.4     15.6     2.872
Daubechies 4                          89.3     11.7     2.884
Daubechies 8                          91.8      9.2     3.023
Daubechies 10                         94.3      5.7     3.074
Daubechies 12                         94.5      5.5     3.898
Daubechies 14                         94.8      5.2     3.899
The cost-performance (CP) is defined as
[Equation: definition of CP; not legible in the source]
where the CP time is the average PWPT processing time of a specific wavelet. Considering the
cost-performance ratios given in Table 1, the Daubechies wavelet filter with length 12,
which has the best CP ratio, is recommended for the proposed algorithm.