demo end of speech
TRANSCRIPT
-
8/16/2019 Demo End of Speech
1/45
PCS Research & Advanced Technology Labs
Speech Lab
How to deal with the noise in real systems?
Hsiao-Chun Wu
Motorola PCS Research and Advanced
Technology Labs, Speech Laboratory
Phone: (815) 884-3071
-
Speech Lab November 14, 2000
Why do we need to study noise?
Noise exists everywhere, and it degrades the performance of signal
processing in practice. Since noise cannot be avoided by system
engineers, modern "noise-processing" technology has been researched
and designed to overcome this problem. Hence many related research
areas have been emerging, such as signal detection, signal
enhancement/noise suppression, and channel equalization.
-
How to deal with noise? Cut it off!

• Spectral Truncation – Spectral Subtraction
  S~(f) = R(f) − N~(f) = S(f) + N(f) − N~(f) ≈ S(f)
• Time Truncation – Signal Detection
  r(τ) ≈ n(τ) for τ ∈ T_noise, so those samples are truncated
• Spatial and/or Temporal Filtering – Equalization – Array Signal Separation (Blind Source Separation)
  s~(t) = w(t) ⊗ h(t) ⊗ s(t) ≈ s(t),   S~(f) = W(f) H(f) S(f) ≈ S(f)
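The spectral-subtraction line above can be sketched in a few lines of code. This is a minimal illustration only: the half-wave rectification and the use of a single noise-only segment as the estimate N~(f) are common textbook choices assumed here, not part of the slide.

```python
import numpy as np

def spectral_subtraction(noisy, noise_est):
    """Estimate S(f) ~ R(f) - N~(f): subtract a noise magnitude
    estimate from the noisy magnitude, keeping the noisy phase."""
    R = np.fft.rfft(noisy)
    noise_mag = np.abs(np.fft.rfft(noise_est))
    mag = np.maximum(np.abs(R) - noise_mag, 0.0)  # half-wave rectify
    return np.fft.irfft(mag * np.exp(1j * np.angle(R)), n=len(noisy))
```

In practice N~(f) is averaged over many noise-only frames; a single frame, as here, leaves "musical noise" artifacts.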
-
Session 1. On-line Automatic End-of-speech Detection
Algorithm (Time Truncation)
1. Project goal.
2. Review of current methods.
3. Introduction to voice metric based end-of-speech detector.
-
1. Project Goal:
• Problem
– Digit-dial recognition with unknown digit string length
• Solution 1
– a fixed-length capture window, such as 10 seconds? (inconvenient for users)
• Solution 2
– Dynamic termination of data capture? (need a robust detection
algorithm)
-
• Research and design a robust dynamic termination mechanism for speech
recognizer.
– a new on-line automatic end-of-speech detection algorithm with small
computational complexity.
• Design a more robust front end to improve the recognition accuracy for
speech recognizers.
– the new algorithm also reduces excessive feature extraction on redundant
trailing noise.
-
2. Review of Current Methods:
Most speech detection algorithms can be characterized into three categories.
• Frame energy detection
– short-term frame energy (20 msec) can be used for speech/noise
classification.
– it is not robust at high background noise levels.
• Zero-crossing rate detection
– the short-term zero-crossing rate can also be used for speech/noise
classification.
– it is not robust across a wide variety of noise types.
• Higher-order-spectral detection
– short-term higher-order spectra can be used for speech/noise
classification.
– it incurs heavy computational complexity, and its threshold is
difficult to pre-determine.
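The first two methods above are cheap per-frame statistics. A minimal sketch makes the classification idea concrete (the thresholds are illustrative values, not ones used in any product):

```python
import numpy as np

def frame_energy(frame):
    """Short-term energy of one frame (e.g. 20 ms)."""
    return float(np.mean(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

def classify_frame(frame, energy_thresh=0.01, zcr_thresh=0.3):
    """Toy speech/noise decision: voiced speech has high energy and a
    low zero-crossing rate; broadband noise has a high ZCR."""
    if frame_energy(frame) > energy_thresh and zero_crossing_rate(frame) < zcr_thresh:
        return "speech"
    return "noise"
```

As the slide notes, both statistics fail at low SNR or for atypical noise types; they are shown only to illustrate the mechanism.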
-
3. Introduction to Voice Metric Based End-of-speech
Detector:
• End-of-speech detection using voice metric features is based on Mel-band
energies. Voice metric features are robust over a wide variety of background
noise. The voice-metric based speech/noise classifier was originally used in
the IS-127 CELP speech coder standard. We modify and enhance the voice-metric
features to design a new end-of-speech detector for the Motorola voice
recognition front end (VR LITE III).
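As a rough sketch of the voice-metric idea (not the IS-127 algorithm): estimate a per-band SNR against a noise floor and map each band through a monotone score table, so the frame score grows with band-wise speech evidence. The band layout, score table, and 6 dB step below are hypothetical placeholders; a real front end would use Mel-spaced bands and the standardized table.

```python
import numpy as np

def band_energies(frame, n_bands=8):
    """Split the FFT power spectrum into n_bands equal-width bands
    (a real front end would use Mel-spaced bands)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    return np.array([spec[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])

SCORE_TABLE = np.array([0, 1, 2, 4, 8, 16])  # hypothetical scores per SNR bin

def voice_metric(frame, noise_floor):
    """Sum of per-band scores; higher means more speech-like."""
    snr_db = 10 * np.log10(band_energies(frame) / noise_floor)
    idx = np.clip((snr_db // 6).astype(int), 0, len(SCORE_TABLE) - 1)
    return int(SCORE_TABLE[idx].sum())
```

A speech frame then scores well above a noise-only frame, which is what the end-of-speech buffer thresholds.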
-
[Table: voice metric score table]
-
[Block diagram: the end-of-speech detector added to the original VR LITE front end. Raw data → FFT → Mel spectrum → SNR estimate → voice metric, with pre- and post-S/N classifiers producing voice metric scores; an EOS buffer with threshold adaptation, a "speech start?" test, and a silence-duration threshold decide (yes/no) when data capture stops.]
-
[Flowchart: speech input → front end with end-of-speech detector → segmentation of speech into frames → frame buffer → feature vector → VR LITE recognition engine. After frame i the detector asks "end of speech?": if no, it advances to the next frame i+1; if yes, data capture terminates.]
-
[Figure: captured waveforms, 6.51 seconds versus 3.7 seconds.]
-
[Figure: end-point detection for the string "2-2-9-1-7-8" in a car at 55 mph (horizontal axis in seconds): the end point, a correct detection with its correct-detection time error, and a false detection with its false-detection time error.]
-
4. Simulation Results: (Simulation is done over the Motorola digit-string
database, including 16 speakers and 15,166 variable-length digit strings under 7
different conditions. The silence threshold is 1.85 seconds.)
A. Receiver Operating Characteristic (ROC) curve: the ROC curve plots the
end-of-speech detection rate against the false (early) detection rate. We
compare two different methods, namely, (1) the new voice-metric based
end-of-speech detector and (2) the old speech/noise flag based end-of-speech
detector.
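An ROC point is produced by sweeping the detector's decision threshold and counting, at each setting, how often the detector fires at a true end of speech versus how often it fires early. A generic sketch over per-utterance detector scores (the score values in the example are toy numbers):

```python
def roc_points(end_scores, early_scores, thresholds):
    """For each threshold, return (false_detection_rate, detection_rate):
    the fraction of early firings and of correct end-of-speech firings."""
    pts = []
    for th in thresholds:
        det = sum(s >= th for s in end_scores) / len(end_scores)
        fa = sum(s >= th for s in early_scores) / len(early_scores)
        pts.append((fa, det))
    return pts
```

Lowering the threshold moves the operating point up and to the right along the curve.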
-
[Figure: ROC curve, detection rate (%) versus false detection rate (%).]
-
• B. String-accuracy-convergence (SAC) curve: the SAC
curve plots the string recognition accuracy against the false (early)
detection rate. We compare the same two methods, namely, (1) the new
voice-metric based end-of-speech detector and (2) the old speech/noise
flag based end-of-speech detector.
-
[Figure: SAC curve, string recognition accuracy (%) versus false detection rate (%).]
-
C. Table of detection results: (This table illustrates the result among the
Madison sub-database, including data files with 1.85 seconds or more of
silence after the end of speech.)

Condition | Avg. Time Error | Avg. False-Detection Time Error | Avg. Correct-Detection Time Error | False Detection Rate | Number of Strings | Total Detection Rate
Overall | 1.98 sec | 1.68 sec | 1.85 sec | 0.47% | 7,418 | 86.08%
Office Close-talk | 1.97 sec | 0 sec | 1.93 sec | 0% | 907 | 9…
-
(This table illustrates the result over the small database collected by Motorola
PCS CSSRL. All digit strings are recorded in a 15-second fixed window.)

Condition | Avg. Time Error | Avg. False-Detection Time Error | Avg. Correct-Detection Time Error | False Detection Rate | Number of Strings | Total Detection Rate | String Recognition Accuracy (w/ EOS) | String Recognition Accuracy (w/o EOS)
Overall | 1.82 seconds | 0 seconds | 1.82 seconds | 0% | 121 | 96.69% | 50.41% | 29.75%
Office Close-talk | 1.5 seconds | 0 seconds | 1.5 seconds | 0% | 21 | 100% | 66.67% | 61.90%
Office Arm-length | 1.4 seconds | 0 seconds | 1.4 seconds | 0% | 20 | 100% | 65.00% | 65.00%
Café Close-talk | 1.76 seconds | 0 seconds | 1.76 seconds | 0% |
-
Analysis of the Simulation Result: Why didn't EOS
detection work well in babble noise?
-
Optimal Detection Decision
• Bayes classifier
• Likelihood Ratio Test

Decide H_s when log f(x|H_s) > log f(x|H_n), otherwise decide H_n.

Equivalently, compare the log-likelihood ratio to a Bayes threshold:

  L(x) = log f(x|H_s) − log f(x|H_n);  decide H_s if L(x) > T_Bayes, else H_n.
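A worked instance of the likelihood-ratio test with scalar Gaussian hypotheses (the means, variance, and threshold are made-up illustration values, not parameters from the detector):

```python
import math

def log_gauss(x, mu, sigma):
    """log N(x; mu, sigma^2)"""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def lrt_decision(x, mu_s=3.0, mu_n=0.0, sigma=1.0, t_bayes=0.0):
    """Decide H_s when L(x) = log f(x|H_s) - log f(x|H_n) exceeds T_Bayes."""
    L = log_gauss(x, mu_s, sigma) - log_gauss(x, mu_n, sigma)
    return "H_s" if L > t_bayes else "H_n"
```

With equal variances the test reduces to a simple threshold on x itself, which is why the practical question is choosing (and adapting) the threshold, not evaluating the densities.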
-
Digit "one" in a close-talking mic, quiet office
-
Digit "one" in a hands-free mic, 55 mi/h car
-
Digit "one" in a far-talking mic, cafeteria
-
5. Conclusion:
• The new voice-metric based end-of-speech detector is robust over a wide
variety of background noise.
• The new detector adds only a small computational overhead and can be
implemented in real time.
• It can improve recognition performance by discarding the extra noise
captured by a fixed data-capture window.
• It still needs further improvement in babble-noise environments.
-
Session 2. Speech Enhancement Algorithms: Blind
Source Separation Methods (Spatial and Temporal Filtering)
1. Motivation and research goal.
2. Statement of the "blind source separation" problem.
3. Principles of blind source separation.
4. Criteria for blind source separation.
5. Application to blind channel equalization for digital
communication systems.
6. Simulation and comparison.
7. Summary and conclusion.
-
1. Motivation:
• Mimic the human auditory system, which differentiates the subject signals
from other sounds, such as interfering sources and background noise, for clear
recognition of the subject contents.
• 'One of the most striking facts about our ears is that we have two of them--and
yet we hear one acoustic world; only one voice per speaker.' (E. C. Cherry and
W. K. Taylor. Some further experiments on the recognition of speech, with one
and two ears. Journal of the Acoustical Society of America, 26:554-559, 1954)
• The ''cocktail party effect''--the ability to focus one's listening attention on a
single talker among a cacophony of conversations and background noise--has
been recognized for some time. This specialized listening ability may be due to
characteristics of the human speech production system, the auditory system, or
high-level perceptual and language processing.
-
Research Goal:
Design a preprocessor built on digital-signal-processing speech
enhancement algorithms. The input signals are collected through
multiple sensor (microphone) arrays; after the embedded signal
processing algorithms run, clearly separated signals appear at the
output.
-
[Diagram: audio input → blind source separation algorithms → enhanced output.]
-
2. Problem Statement of Blind Source Separation:
What is "Blind Source Separation"?

[Diagram: M signals (Signal 1 … Signal M) picked up by N sensors (Sensor 1 … Sensor N) as the received input signals.]

Given the N linearly mixed received input signals,
we need to recover the M statistically independent
sources as much as possible (N ≥ M).
-
Formulation of Blind Source Separation Problem:
A received signal vector from the array, X(t), is the original source vector S(t)
passed through the channel distortion H(t), such that X(t) = H(t) ⊗ S(t), where

  X(t) = [x_1(t), …, x_N(t)]^T,  S(t) = [s_1(t), …, s_M(t)]^T

and H(t) is the N×M matrix of channel impulse responses [h_ij(t)].
We need to estimate a separator W(t), an N×N matrix [w_pq(t)], such that

  S~(t) = [s~_1(t), …, s~_M(t), 0, …, 0]^T = W(t) ⊗ X(t) ≈ S(t).
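To make the notation concrete, here is a degenerate (memoryless) instance: two independent sources, an instantaneous 2×2 mixing matrix H, and the ideal separator W = H⁻¹. In the blind problem H is unknown and W must be learned from X alone; this sketch only checks the algebra S~ = W ⊗ X.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 1000))            # two independent source signals
H = np.array([[1.0, 0.5],
              [0.3, 1.0]])                 # instantaneous (memoryless) channel
X = H @ S                                  # received mixtures: x(t) = H s(t)
W = np.linalg.inv(H)                       # ideal separator (unknown in practice)
S_hat = W @ X                              # recovered sources
```

With convolutive channels, H(t) and W(t) become filter matrices and the products become convolutions, but the structure of the problem is the same.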
-
3. Principles of Blind Source Separation:
The independence measurement: Shannon's mutual information.

  I(y_1, …, y_N) = Σ_{i=1}^{N} H(y_i) − H(y_1, …, y_N) ≥ 0

  I(y_1, …, y_N) = E[log f_Y(y_1, …, y_N)] − Σ_{i=1}^{N} E[log f_i(y_i)]

I vanishes exactly when the outputs y_i are independent, so minimizing it over the separator drives the outputs toward independent sources.
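The mutual-information measure can be evaluated directly for discrete outputs from their joint pmf; it is zero exactly when the outputs are independent. A small sketch for two variables:

```python
import numpy as np

def mutual_information(joint):
    """I(y1; y2) = sum p(y1,y2) log[ p(y1,y2) / (p(y1) p(y2)) ], in nats."""
    p = joint / joint.sum()
    p1 = p.sum(axis=1, keepdims=True)   # marginal of y1
    p2 = p.sum(axis=0, keepdims=True)   # marginal of y2
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (p1 @ p2)[nz])).sum())
```

For continuous separator outputs the densities must be estimated, which is why practical criteria (next slide) replace I with tractable surrogates.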
-
4. Criteria to Separate Independent Sources:
• Constrained Entropy (Wu, IJCNN99):
  – J_1(W) = −log|det(W)| − Σ_{i=1}^{N} E[log f_i(y_i)]
• Hadamard Measure (Wu, ICA99):
  – J_2 = log det(diag(C_y)) − log det(C_y), where C_y = E[y y^T]
• Frobenius Norm (Wu, NNSP97):
  – J_3 = || C_y − diag(C_y) ||_F^2
• Quadratic Gaussianity (Wu, NNSP99):
  – J_4 = Σ_i ∫_{−∞}^{∞} ( f_i(y) − φ_i(y) )^2 dy, with φ_i a matched Gaussian density
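The constrained-entropy criterion can be checked numerically: with a unit-Laplacian source prior, −log f(y) = |y| + const, so J(W) = −log|det W| + Σ_i E|y_i| up to a constant. The criterion should be smaller at the true unmixing matrix than at a non-separating W. A sketch with synthetic Laplacian sources (the mixing matrix is an arbitrary example, not one from the slides):

```python
import numpy as np

def constrained_entropy(W, X):
    """J(W) = -log|det W| - sum_i E[log f(y_i)], unit-Laplacian prior,
    additive constant dropped."""
    Y = W @ X
    return -np.log(abs(np.linalg.det(W))) + np.abs(Y).mean(axis=1).sum()

rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 200_000))          # independent unit-Laplacian sources
H = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = H @ S
J_unmix = constrained_entropy(np.linalg.inv(H), X)   # separating W
J_mixed = constrained_entropy(np.eye(2), X)          # leaves sources mixed
```

A gradient-based (e.g. anti-Hebbian) learning rule descends J over W instead of evaluating it at candidate matrices as done here.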
-
5. Application to Blind Single-Channel Equalization
for Digital Communication Systems:
We apply the minimization of the modified constrained entropy (the
single-channel form of the constrained-entropy criterion) to adapt an
equalizer w(t) = [w_0, w_1, …] for a digital channel h(t). Assume a PAM
signal constellation with symbols s(t), passing through a digital channel

  h(t) = [c(t, 0.11) + 0.8 c(t−1, 0.11) − 0.4 c(t−3, 0.11)] · W_6T(t),

where c(t, β) = sinc(t/T) · cos(πβt/T) / (1 − 4β²t²/T²) is the raised-cosine
function with roll-off factor β, and W_6T(t) = rect(t/6T) is a rectangular
window. The input signal to the equalizer is

  r(t) = Σ_τ h(τ) s(t−τ) + n(t),

where n(t) is the background noise. We applied generalized anti-Hebbian
learning to adapt w(t) such that w(t) ⊗ h(t) ≈ δ(t).
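The channel on this slide can be built directly from the raised-cosine definition. A sketch with T = 1 (the singular points t = ±T/(2β) are handled by the function's known limit there):

```python
import numpy as np

def raised_cosine(t, beta, T=1.0):
    """c(t, beta) = sinc(t/T) cos(pi beta t / T) / (1 - 4 beta^2 t^2 / T^2)."""
    t = np.asarray(t, dtype=float)
    den = 1.0 - (2.0 * beta * t / T) ** 2
    limit = (np.pi / 4.0) * np.sinc(1.0 / (2.0 * beta))  # value at t = +-T/(2 beta)
    safe = np.where(np.abs(den) < 1e-12, 1.0, den)       # avoid divide-by-zero
    c = np.sinc(t / T) * np.cos(np.pi * beta * t / T) / safe
    return np.where(np.abs(den) < 1e-12, limit, c)

# channel taps h(t) = c(t, 0.11) + 0.8 c(t-1, 0.11) - 0.4 c(t-3, 0.11),
# sampled at integer t inside the 6-symbol window
t = np.arange(-2, 4)
h = raised_cosine(t, 0.11) + 0.8 * raised_cosine(t - 1, 0.11) - 0.4 * raised_cosine(t - 3, 0.11)
```

Because c(t, β) is a Nyquist pulse (zero at nonzero integer t), the symbol-rate taps of this channel are simply [1, 0.8, 0, −0.4] centered at t = 0, which is the impulse response the equalizer must invert.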
-
[Figure: signal-to-interference ratio (dB) versus signal-to-noise ratio (dB).]
-
[Figure: bit error rate versus signal-to-noise ratio (dB).]
-
6. Simulation and Comparison:
The simulation compares our generalized anti-Hebbian learning, the
SDIF algorithm, and Lee's Infomax method (Lee, IJCNN97) over three
real recordings downloaded from the Salk Institute, University of
California at San Diego.
-
New VR LITE Front-end: Blind Source Separation + End-of-speech Detection

Schemes | Avg. Detection Time Error | Avg. False-Detection Time Error | Avg. Correct-Detection Time Error | Number of Strings | False Detection Rate | Total Detection Rate
EOS only | 0.256 seconds | 0.155 seconds | 0.317 seconds | 14 | 7.1…
-
7. Conclusion and Future Research:
• The computational cost of blind source separation needs
to be reduced.
• Test BSS for EOS detection under microphone arrays of the
same kind.
• Incorporate other array signal processing techniques (beamformer?)
to improve speech detection and recognition.