Babel Audio Analysis - Ellis 2014-01-09 /15
Acoustic Analysis of Babel Recording Conditions
Dan Ellis
Columbia & ICSI
Team SWORDFISH
[email protected] http://labrosa.ee.columbia.edu/
1. Environment Types
2. Acoustic Condition Histograms
3. Microphone Condition
4. Conclusions
-
1. Environment Types
• Each Babel recording has envType metadata
STREET, PUBLIC_PLACE, HOME_OFFICE_MOBILE, VEHICLE, CAR_KIT, HOME_OFFICE_LANDLINE
• Substantial effect of condition & lang, e.g.:
BP_105 CK vs. HOM
BP_108 CK vs. HOL
[Figure: BP Kaldi on dev. Error breakdown by condition (sub, del, ins) for BP_101, BP_104, BP_105, BP_106, BP_107 and their sum. x-axis: condition (CK, V, PP, S, HOM, HOL, TOT); y-axis: error rate (0 to 0.6).]
Counts printed in each panel, by condition:
         CK    V   PP    S  HOM  HOL  TOT
BP_101   14    9   13   13   54   17  120
BP_104   11    6   22   12   83    2  136
BP_105    9    8   20   15   64   11  127
BP_106   10   13   36   30   43   14  146
BP_107    8   10   25   13   60   16  132
Sum      52   46  116   83  304   60  661
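A per-condition error breakdown of the kind plotted here can be sketched as follows. This is a minimal illustration, not the scoring pipeline behind the slide: the record fields (`sub`, `del`, `ins`, `ref`) and the example counts are hypothetical, and the slide does not say whether counts were pooled per condition (as done here) or per-utterance rates were averaged. The abbreviations CK, V, PP, S, HOM, HOL are taken from the plot axes.

```python
from collections import defaultdict

# Abbreviations as they appear on the plot axes.
ENV_ABBREV = {
    "CAR_KIT": "CK", "VEHICLE": "V", "PUBLIC_PLACE": "PP", "STREET": "S",
    "HOME_OFFICE_MOBILE": "HOM", "HOME_OFFICE_LANDLINE": "HOL",
}

def error_breakdown(utterances):
    """Pool sub/del/ins counts per condition (plus TOT) and divide by reference words."""
    totals = defaultdict(lambda: {"sub": 0, "del": 0, "ins": 0, "ref": 0})
    for utt in utterances:
        for key in (ENV_ABBREV[utt["envType"]], "TOT"):
            for field in ("sub", "del", "ins", "ref"):
                totals[key][field] += utt[field]
    return {
        cond: {k: t[k] / t["ref"] for k in ("sub", "del", "ins")}
        for cond, t in totals.items()
    }

# Made-up counts purely to exercise the function; these are not Babel results.
utterances = [
    {"envType": "CAR_KIT", "sub": 30, "del": 20, "ins": 5, "ref": 100},
    {"envType": "CAR_KIT", "sub": 10, "del": 10, "ins": 0, "ref": 100},
    {"envType": "HOME_OFFICE_LANDLINE", "sub": 20, "del": 10, "ins": 10, "ref": 200},
]
rates = error_breakdown(utterances)
for cond, r in rates.items():
    print(cond, {k: round(v, 3) for k, v in r.items()})
```

Pooling weights long recordings more heavily than short ones; averaging per-utterance rates would weight every recording equally.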
-
Variations
• Per-condition averages hide spreads
still substantial differences between conditions
[Figure: error breakdown by condition for BP_101, BP_104, BP_105, BP_106, BP_107 and their sum, with the same per-panel counts as on the previous slide. x-axis: condition (CK, V, PP, S, HOM, HOL, TOT); y-axis: error rate (0 to 1).]
But… within-condition variation exceeds between-condition
-
Characterizing Acoustics
• Mel Spectrogram
• Histogram of dB energy in each Mel band
• Energy covariance
[Figure: BABEL_OP1_206_65882_20121201_174526_outLine_PP. Panels: Mel spectrogram (freq / Hz vs. time / s, 0 to 3 s); per-band level histogram (level / dB vs. freq / Hz); band energy covariance (freq / Hz vs. freq / Hz). Mel band ticks at 579, 1177, 2137, 3882 Hz.]
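The three descriptors on this slide can be sketched with NumPy alone. This is a minimal sketch under stated assumptions, not the analysis code behind the plots: the 8 kHz sampling rate is inferred from the axis ticks topping out near 4 kHz, while the frame length, hop, number of Mel bands, Mel formula, and 2 dB histogram bins are all illustrative choices. White noise stands in for a Babel recording.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mel_spectrogram_db(x, sr, n_fft=256, hop=80, n_mels=20):
    """Mel-band energies in dB, shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_power = mel_filterbank(sr, n_fft, n_mels) @ power.T
    return 10.0 * np.log10(mel_power + 1e-10)

def band_histograms(mel_db, edges):
    """Histogram of dB levels in each Mel band; each row sums to 1 if all levels are in range."""
    counts = np.stack([np.histogram(band, bins=edges)[0] for band in mel_db])
    return counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)

# Synthetic stand-in signal: 2 s of white noise at an assumed 8 kHz rate.
rng = np.random.default_rng(0)
sr = 8000
x = rng.standard_normal(2 * sr)

mel_db = mel_spectrogram_db(x, sr)       # Mel spectrogram
edges = np.arange(-100.0, 101.0, 2.0)    # 2 dB bins (illustrative)
hists = band_histograms(mel_db, edges)   # per-band level histograms
cov = np.cov(mel_db)                     # band-by-band energy covariance
print(mel_db.shape, hists.shape, cov.shape)
```

Pooling a set of utterances, as in the later per-language and per-envType views, would amount to accumulating the histogram counts over files before normalizing.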
-
Example: BP_101 Car_Kit
• Gating by subbands
[Figure: two recordings compared, BABEL_BP_101_43086_20111025_160708_inLine − CK (CAR KIT) and BABEL_BP_101_84943_20111020_144955_inLine − HOL (HOME LAND). Panels include spectrogram excerpts (freq / Hz vs. time / s), level / dB over time, a "1500 Hz chan" dB panel for each recording, and per-band level histograms (level / dB vs. freq / Hz, Mel band ticks at 579, 1177, 2137, 3882 Hz).]
-
Top-Level View - BP Languages
• All BP_1xx dev set utterances pooled
[Figure: pooled per-band level histograms (level / dB vs. freq / Hz, Mel band ticks at 579, 1177, 2137, 3882 Hz), one panel per corpus, with the count printed in each panel: BP_101 − ALL (120), BP_104 − ALL (136), BP_105 − ALL (127), BP_106 − ALL (146), BP_107 − ALL (132), Corpora − ALL (661).]
-
Drilling Down: BP_105 (Turkish)
• DEV utterances sorted by envType
[Figure: per-band level histograms (level / dB vs. freq / Hz), one panel per envType, with the count printed in each panel: bp_105 − CAR_KIT (9), bp_105 − VEHICLE (8), bp_105 − PUBLIC_PLACE (20), bp_105 − STREET (15), bp_105 − HOME_OFFICE_MOBILE (64), bp_105 − HOME_OFFICE_LANDLINE (11).]
-
BP_105 Car_Kit vs. HO_Mob
[Figure: per-band level histograms (level / dB vs. freq / Hz, Mel band ticks at 579, 1177, 2137, 3882 Hz) for four CK and four HOM recordings, each labeled with its WER.]
Recording                                          Cond   WER
BABEL_BP_105_31256_20120531_015506_inLine          CK    79.6
BABEL_BP_105_42212_20120706_194059_inLine          CK    66.3
BABEL_BP_105_39774_20120623_021946_inLine          CK    54.9
BABEL_BP_105_31345_20120515_214849_inLine          CK    49.1
BABEL_BP_105_20213_20120123_011920_outLine         HOM   92.3
BABEL_BP_105_48536_20120208_212737_outLine         HOM   56.9
BABEL_BP_105_66883_20120207_051718_inLine          HOM   46.1
BABEL_BP_105_87806_20120201_235442_outLine         HOM   28.8
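The earlier remark that within-condition variation exceeds between-condition variation can be checked on the eight WER values shown on this slide. The sketch below splits the total variance into a between-condition and a within-condition part (law of total variance, population variances weighted by group size). The eight recordings appear to be examples chosen for the slide rather than a random sample, so this illustrates the calculation and is not an estimate for the full dev set.

```python
import numpy as np

# WER values as printed on the slide for BP_105.
wer = {
    "CK": [79.6, 66.3, 54.9, 49.1],
    "HOM": [92.3, 56.9, 46.1, 28.8],
}

def variance_split(groups):
    """Return (between-group, within-group) variance; the two sum to the total variance."""
    values = np.concatenate([np.asarray(v, dtype=float) for v in groups.values()])
    grand_mean = values.mean()
    between = sum(len(v) * (np.mean(v) - grand_mean) ** 2 for v in groups.values()) / values.size
    within = sum(len(v) * np.var(v) for v in groups.values()) / values.size
    return between, within

between, within = variance_split(wer)
print(f"between-condition variance: {between:.1f}")
print(f"within-condition variance:  {within:.1f}")
```

For these values the within-condition term dominates, in line with the remark on the Variations slide.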
-
Top-Level View - OP1 Languages
[Figure: pooled per-band level histograms (level / dB vs. freq / Hz, Mel band ticks at 579, 1177, 2137, 3882 Hz), one panel per corpus, with the count printed in each panel: OP1_102 − ALL (126), OP1_103 − ALL (125), OP1_201 − ALL (126), OP1_203 − ALL (131), OP1_206 − ALL (141), Corpora − ALL (649).]
-
Drilling Down: OP1_206 (Zulu)
[Figure: per-band level histograms (level / dB vs. freq / Hz), one panel per envType, with the count printed in each panel: op1_206 − CAR_KIT (9), op1_206 − VEHICLE (13), op1_206 − PUBLIC_PLACE (20), op1_206 − STREET (9), op1_206 − HOME_OFFICE_MOBILE (77), op1_206 − MICROPHONE (13).]
-
Drilling Down: OP1_201 (Creole)
[Figure: per-band level histograms (level / dB vs. freq / Hz), one panel per envType, with the count printed in each panel: op1_201 − CAR_KIT (12), op1_201 − VEHICLE (13), op1_201 − PUBLIC_PLACE (12), op1_201 − STREET (13), op1_201 − HOME_OFFICE_MOBILE (63), op1_201 − MICROPHONE (13).]
-
OP1_201 HO_Mob vs. Mic
[Figure: per-band level histograms (level / dB vs. freq / Hz, Mel band ticks at 579, 1177, 2137, 3882 Hz) for four HOM and four Mic recordings.]
HOM recordings:
BABEL_OP1_201_16184_20130305_081912_inLine
BABEL_OP1_201_47877_20130429_092603_outLine
BABEL_OP1_201_70110_20130224_022802_outLine
BABEL_OP1_201_99813_20130514_080612_outLine
Mic recordings:
BABEL_OP1_201_26074_20130522_003756_inLine
BABEL_OP1_201_54162_20130508_044116_inLine
BABEL_OP1_201_71263_20130602_021725_inLine
BABEL_OP1_201_82035_20130601_052036_inLine
-
2xx Wideband Mic Channel
• 48 kHz sampling rate
• Single noise floor
• Extremely broad-band
[Figure: Mic channel plot; axes labeled freq / Hz and level / dB.]
-
Blind SNR Estimation
• NIST STNR - histogram-based
confused by double noise floor
• SNRvad - VAD-based
tricked by noise gating
[Figure: scatter plot titled "NIST STNR vs SNRvad for BABEL OP1 206 train sphwav"; axes: NIST stnr and SNRvad.]
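To make the two estimator families concrete, here is a toy sketch on synthetic frame levels. Neither function is the NIST tool or SNRvad: `histogram_snr` is a simplified histogram-style stand-in (95th-percentile level minus the tallest histogram peak) and `vad_snr` uses a crude energy threshold as its speech detector. The speech and noise level distributions, the 70% gate duty, and the 20 dB gate depth are all assumed values chosen only to produce a second noise floor.

```python
import numpy as np

def histogram_snr(levels_db, bin_width=1.0):
    """Histogram-style estimate: 95th-percentile level minus the tallest histogram peak.

    Assumes non-speech frames form the tallest peak of the level histogram.
    """
    span = levels_db.max() - levels_db.min()
    n_bins = max(1, int(np.ceil(span / bin_width)))
    counts, edges = np.histogram(levels_db, bins=n_bins)
    peak = np.argmax(counts)
    noise_db = 0.5 * (edges[peak] + edges[peak + 1])
    return np.percentile(levels_db, 95) - noise_db

def vad_snr(levels_db, margin_db=15.0):
    """VAD-style estimate: mean power of 'speech' frames over mean power of the rest.

    The speech detector is a crude energy threshold set margin_db below the 95th percentile.
    """
    threshold = np.percentile(levels_db, 95) - margin_db
    power = 10.0 ** (levels_db / 10.0)
    is_speech = levels_db >= threshold
    return 10.0 * np.log10(power[is_speech].mean() / power[~is_speech].mean())

# Synthetic frame levels in dB (assumed values, not measured from Babel audio).
rng = np.random.default_rng(0)
speech = rng.normal(60.0, 5.0, 8000)
noise = rng.normal(30.0, 2.0, 12000)
gate_closed = rng.random(noise.size) < 0.7   # gate shut on 70% of noise frames
gated_noise = noise - 20.0 * gate_closed     # 20 dB attenuation while shut

single_floor = np.concatenate([speech, noise])
double_floor = np.concatenate([speech, gated_noise])

for name, levels in [("single floor", single_floor), ("double floor", double_floor)]:
    print(f"{name}: histogram {histogram_snr(levels):.1f} dB, VAD {vad_snr(levels):.1f} dB")
```

In this toy setup the histogram-style estimate should shift by roughly the gate depth, because the tallest peak moves to the gated floor even though the ungated background is unchanged. How a real VAD-based tool reacts depends on its speech detector, which this sketch does not model.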
-
Summary
• Acoustic environment affects performance
impact varies with different languages
• Cellphone noise gating gives complex histograms
worse in CAR_KIT
confuses simple SNR estimation
• Mic channel is extremely high quality
limited data - hard to exploit
• Opportunities
improved normalization?
clustering of channel characteristics