E6820 SAPR - Dan Ellis   L12 - Indexing   2007-04-12

EE E6820: Speech & Audio Processing & Recognition
Lecture 12: Multimedia Indexing
1. Spoken document retrieval
2. Audio databases
3. Open issues
Dan Ellis <dpw bia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering
Spring 2007

Spoken Document Retrieval (SDR)
• 20% WER is horrible for transcription
- is it good for anything else?
• Information Retrieval (IR)
- TREC/MUC ‘spoken documents’
- tolerant of word error rate, e.g.:
F0: THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELECTION SEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW
F4: AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADER DANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE PRESIDENTIAL CANDIDATES OF THE ELECTION
F5: THE LABOR MIGHT DO WELL TO REMEMBER THE LOST A MAJOR EPISODE OF TRANSATLANTIC CONNECT TO A CORPORATION IN BOTH CONSERVATIVE PARTY OFFICIALS FROM BRITAIN GOING TO WASHINGTON THEY WENT TO WOOD BUYS GEORGE BUSH ON HOW TO WIN A SECOND TO NONE IN LONDON THIS IS STEPHEN BEARD FOR MARKETPLACE
• Promising application area
- document retrieval already hit-and-miss
- plenty of untranscribed material
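
For readers who want the word error rate (WER) figure made concrete, here is a minimal sketch of the usual definition: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function name and the reference/hypothesis pair are made up for illustration; they are not taken from the lecture data.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitute = prev[j - 1] + (r != h)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example pair (an assumption, not real recognizer output)
ref = "the early returns of the election seemed to favor the local mayor"
hyp = "the early returns of the election seemed to fade before the local mayor"
print(f"WER = {word_error_rate(ref, hyp):.1%}")
```

Even with errors of this kind, most content words that a query would match ("election", "mayor") survive, which is the sense in which retrieval tolerates WER.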

The THISL SDR system
• Original task: BBC newsroom support
• How to build the database:
- automatically record news programs ‘off air’
- several hours per day → > 3,000 hrs
- run recognition the whole time
- problems storing audio!
[Figure: THISL system block diagram. Labels: Control, Text, Audio, Video, Receiver, ASR, NLP, Segmentation, ASR, IR, Database, Archive, Query, http]

Building a new recognizer
• No models available for BBC English
- need to develop a new recognizer based on US English Broadcast News, read British English...
• Training set: Manual transcription of 40 hours of news
- word-level transcription takes > 10x real-time
- Viterbi training, starting from read speech model
• Language model: 200M words of US & UK newspaper archives
• Dictionary: Standard UK-English + extensions
- many novel & foreign words

Vocabulary extension
• News always has novel words
• Starting point: Text-to-speech rules
- speech synthesizers’ rules for unknown words
- but novel words are often foreign names
• Sources to identify new words
- BBC ‘house style’ information
• Choose model by single acoustic example
- grab from TV subtitles?
[Figure: pronunciation selection diagram. Labels: Word strings ("The Dow Jones..."), Dictionary, Phone trees, Letter-to-phone trees, FSG building, FSG (alternative phone paths such as "d ow" or "d aw"), Acoustics, FSG decoder (dh dh dh dh ax ax dcl d...), Alignment (the = dh ax, dow = d aw, jones = jh...), Acoustic confidences (dh = 0.24, ax = 0.4, dcl = 0.81...)]

Audio segmentation
• Broadcast audio includes music, noise etc.
• Segmentation is important for recognition
- speaker identity tagging, model adaptation
- excluding nonspeech segments
• Can use generic models of similarity/difference
• Look at statistics of speech model output
- e.g. dynamism:
$$\frac{1}{N} \sum_{n} \sum_{i} \left[ p(q_n^i) - p(q_{n-1}^i) \right]^2$$
[Figure: scatter plot of Dynamism (0 to 0.3) vs. Entropy (0 to 3.5) for Speech, Music, and Speech+Music; spectrogram (frq/Hz vs. time/s) with the corresponding posteriors for speech, music, and speech+music]
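
As a rough illustration of these statistics, the sketch below computes dynamism (mean squared frame-to-frame change of the class posteriors, summed over classes and divided by the number of frames N, as in the slide's formula) together with the average per-frame entropy. The entropy definition (Shannon entropy in nats) and the toy posterior sequences are assumptions made for this example; the slide names "Entropy" only as a plot axis.

```python
import math


def dynamism(posteriors):
    """Mean squared frame-to-frame posterior change, summed over classes."""
    n_frames = len(posteriors)
    total = 0.0
    for prev, curr in zip(posteriors, posteriors[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / n_frames


def mean_entropy(posteriors):
    """Average per-frame Shannon entropy (nats) of the posterior vector."""
    total = 0.0
    for frame in posteriors:
        total -= sum(p * math.log(p) for p in frame if p > 0.0)
    return total / len(posteriors)


# Toy posteriors over three phone classes (made-up values):
# speech-like output is confident and changes quickly,
# music-like output is vague and nearly static.
speech_like = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05],
               [0.05, 0.05, 0.9], [0.9, 0.05, 0.05]]
music_like = [[0.4, 0.3, 0.3]] * 4

for name, post in [("speech-like", speech_like), ("music-like", music_like)]:
    print(name, round(dynamism(post), 3), round(mean_entropy(post), 3))
```

On these toy inputs the speech-like sequence has high dynamism and low entropy while the music-like one has the opposite pattern, which is the separation the scatter plot suggests.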

Information retrieval: Text document IR
• Given query terms T_q, document terms T_D(i), how to find and rank documents?
• Standard IR uses ‘inverted index’ to find:
- one entry per term, listing all documents containing that term
• Documents are ranked using “tf • idf”
- tf (term frequency) = how often term is in doc
- idf (inverse document frequency) = how many (how few) docs contain term
• Performance measures
- precision: (correct found)/(all found)
- recall: (correct found)/(all correct)
- mean reciprocal rank - for specific targets
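
A minimal sketch of the inverted index and tf • idf ranking described on this slide. The particular weighting (raw term count times log(N / document frequency)), the whitespace tokenizer, and the three toy documents are assumptions for illustration; real IR engines use many variants.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: count}; docs maps doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def rank(query, index, n_docs):
    """Score documents by summed tf * idf over the query terms, best first."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term appears in no document
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy collection
docs = {
    "d1": "election returns election president",
    "d2": "market news stock market",
    "d3": "president visits market",
}
index = build_inverted_index(docs)
for doc_id, score in rank("election president", index, len(docs)):
    print(doc_id, round(score, 3))
```

Only documents listed in the postings of some query term are ever scored, which is what makes the inverted index efficient for large collections.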

Queries in Thisl
• Original idea: speech in, speech out
• Try to ‘understand’ queries
- hand-built grammar:
- .. but keywords better
• Phonetic matching with speech input
- search ‘phone lattice’ recognizer output?

Thisl User Interface
[Figure: screenshot of the Thisl interface, annotated: Date filters, Program filter, Speech input, Pauses & sentence breaks shown, click-to-play]

Thisl SDR performance
• NIST Text Retrieval Conference (TREC), Spoken Documents track
- 500 hours of data → need fast recognition
- set of ‘evaluation queries’ + relevance judgments
• Components tried in different combinations
- different speech transcripts (subtitles, ASR)
- different IR engines & query processing
• Performance of systems
- ASR less important than IR (query expansion...)
[Figure: TREC9 Thisl results: Average Precision (axis ticks 0.4, 0.5) vs. WER / % (10 to 35), with story boundaries and with no story boundaries]
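
The results plot is in terms of average precision. As a hedged sketch, the code below uses the common non-interpolated definition (the mean of the precision values at the rank of each relevant document) alongside plain precision and recall; whether the TREC evaluation used exactly this variant is an assumption here, and the ranked list and relevance judgments are made up.

```python
def precision_recall(retrieved, relevant):
    """precision = correct found / all found; recall = correct found / all correct."""
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of precision measured at the rank of each relevant document."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero
    return total / len(relevant) if relevant else 0.0


# Hypothetical system output and relevance judgments for one query
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
print(precision_recall(ranked, relevant))
print(average_precision(ranked, relevant))
```

Unlike precision and recall on the full list, average precision rewards placing relevant documents near the top of the ranking.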

Speaker Identification
• Complement to speech recognition: Identify the speaker, regardless of the words
• Different forms of the problem:
- speaker segmentation
- speaker identification
- speaker verification
• Factors:
- amount of training data (10 s .. 20 min)
- amount of test data (3 s .. 5 min)
- number of competitors (10 .. 500)
- false accept vs. false reject
• Standard baseline
- large “universal background model” (UBM) (e.g. 2000 mixture GMM on MFCCs)
- likelihood ratio to speaker-specific model
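
The likelihood-ratio scoring in the baseline can be sketched as follows: each test frame is scored under a speaker-specific GMM and under the UBM, and the average log ratio is compared to a threshold. This is a toy illustration only. The models are tiny 2-component, 2-dimensional diagonal GMMs with made-up parameters, training (EM for the UBM, adaptation for the speaker model) is omitted, and the zero threshold is an assumption.

```python
import math


def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(frame, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        log_terms.append(log_p)
    peak = max(log_terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio: speaker model vs. background."""
    total = 0.0
    for frame in frames:
        total += gmm_log_likelihood(frame, *speaker_gmm)
        total -= gmm_log_likelihood(frame, *ubm)
    return total / len(frames)


# (weights, means, variances) with hypothetical values; a real UBM would
# have on the order of 2000 components trained on MFCCs from many speakers.
ubm = ([0.5, 0.5], [[0.0, 0.0], [4.0, 4.0]], [[4.0, 4.0], [4.0, 4.0]])
speaker = ([0.5, 0.5], [[1.0, 1.0], [3.0, 3.0]], [[0.5, 0.5], [0.5, 0.5]])

target_frames = [[1.1, 0.9], [2.9, 3.2], [0.8, 1.2]]
impostor_frames = [[-2.0, -1.5], [6.0, 5.5], [-1.0, 6.0]]

print("target  :", round(llr_score(target_frames, speaker, ubm), 2))
print("impostor:", round(llr_score(impostor_frames, speaker, ubm), 2))
```

Moving the accept threshold up or down trades false accepts against false rejects, the last factor listed on the slide.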

“Super Speaker ID”
• MFCC features don’t capture ‘high level’ info
• 2002 JHU project to investigate new features
- e.g. combined pitch/energy contour sequences:
- also phone ftrs...
• Favorable fusion with standard baseline
http://www.clsp.jhu.edu/ws02/groups/supersid/

Outline
1. Spoken Document Retrieval
2. Audio databases
- Nonspeech audio retrieval
- Personal audio archives
3. Open issues

Real-world audio
• Speech is only part of the audio world
- word transcripts are not the whole story
• Large audio datasets
- movie & TV soundtracks
- events such as sports, news ‘actualities’
- situation-based audio ‘awareness’
- personal audio recording
• Information from sound
- speaker identity, mood, interactions
- ‘events’: explosions, car tires, bounces...
- ambience: party, subway, woods
• Applications
- indexing, retrieval
- description/summarization
- intelligent reaction

Multimedia Description: MPEG-7
• MPEG has produced standards for audio / video data compression (MPEG-1/2/4)
• MPEG-7 is a standard for metadata: describing multimedia content
- because search and retrieval are so important
• Defines descriptions of time-specific tags, ways to define categories, specific category instances
• + Preliminary feature definitions e.g. for audio:
- spectrum: centroid, spread, flatness
- harmonicity: degree, stability
- pitch, attack time, melody structure ...
http://www.darmstadt.gmd.de/mobile/MPEG7/Documents.html
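
To give a feel for the spectral descriptors named above, here is a simplified sketch of centroid, spread, and flatness computed from a single power spectrum. These are textbook-style definitions on a linear frequency axis, chosen for clarity; the normative MPEG-7 descriptor definitions may differ in detail (frequency scale, banding), so treat this as an assumption-laden illustration rather than the standard's formulas. The two toy spectra are made up.

```python
import math


def spectral_descriptors(freqs, power):
    """Return (centroid in Hz, spread in Hz, flatness) of one power spectrum."""
    power = [max(p, 1e-12) for p in power]  # floor avoids log(0)
    total = sum(power)
    centroid = sum(f * p for f, p in zip(freqs, power)) / total
    spread = math.sqrt(
        sum((f - centroid) ** 2 * p for f, p in zip(freqs, power)) / total
    )
    # Flatness: geometric mean / arithmetic mean (1.0 for a flat spectrum)
    geo_mean = math.exp(sum(math.log(p) for p in power) / len(power))
    flatness = geo_mean / (total / len(power))
    return centroid, spread, flatness


freqs = [100.0 * k for k in range(1, 9)]      # bin centers, 100..800 Hz
flat_spectrum = [1.0] * 8                     # noise-like
peaky_spectrum = [1e-6] * 8                   # tone-like: one dominant bin
peaky_spectrum[2] = 1.0

print(spectral_descriptors(freqs, flat_spectrum))
print(spectral_descriptors(freqs, peaky_spectrum))
```

The tone-like spectrum has a centroid at its dominant bin, a small spread, and flatness near zero, while the noise-like spectrum has flatness of one.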

Muscle Fish “SoundFisher”
• Access to sound effects databases
• Features (time series contours):
- loudness, brightness, pitch, cepstra
• Query-by-example
- direct correlation of contours (normalized/not)
- comparison of value histograms (time-collapsed)
• Always global features
- a mixture of two sounds looks like neither
[Figure: block diagram. Sound segment database → Segment feature analysis → Feature vectors; Query example → Segment feature analysis → Search/comparison → Results]
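
Query-by-example with direct contour correlation can be sketched as below: every database sound is represented by a feature contour (here a single loudness-like contour), and entries are ranked by normalized (Pearson) correlation with the query contour. The sound names, contour values, and the restriction to equal-length contours are all assumptions made for this example, not details of the SoundFisher implementation.

```python
import math


def normalized_correlation(a, b):
    """Pearson correlation of two equal-length feature contours."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    if denom == 0.0:
        return 0.0  # a constant contour carries no shape information
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denom


def query_by_example(query, database):
    """Rank database entries by contour similarity to the query, best first."""
    scored = [(name, normalized_correlation(query, contour))
              for name, contour in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical loudness contours (one value per analysis frame)
database = {
    "door_slam": [0.9, 0.5, 0.2, 0.1, 0.05, 0.0],
    "siren": [0.2, 0.6, 0.9, 0.6, 0.2, 0.6],
    "swell": [0.0, 0.1, 0.3, 0.5, 0.8, 1.0],
}
query = [1.0, 0.6, 0.3, 0.1, 0.0, 0.0]  # sharp attack, fast decay

for name, score in query_by_example(query, database):
    print(f"{name:10s} {score:+.2f}")
```

Because the contour is computed over the whole sound, adding a second simultaneous source would change it everywhere, which illustrates the "global features" limitation noted on the slide.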

SoundFisher user interface
• Principal query mechanism is “sounds like”

HMM modeling of nonspeech
• No sub-units defined for nonspeech sounds
- but can still train HMMs with EM
• Final states depend on EM initialization
- labels / clusters
- transition matrix
• Have ideas of what we’d like to get
- investigate features/initialization to get there
[Figure: two spectrograms of “dogBarks2” (freq / kHz, 1 to 8, vs. time, 1.15 to 2.25), each labeled with an HMM state sequence: s4 s7 s3 s4 s3 s4 s2 s3 s2 s5 s3 s2 s7 s3 s2 s4 s2 s3 s7]

Indexing for soundtracks
• Any real-world audio will have multiple simultaneous sound sources
• Queries typically relate to one source only
- not a source in a particular context
• Need to index accordingly:
- analyze sound into source-related elements
- perform search & match in that domain
[Figure: block diagram. Continuous audio archive → Segment feature analysis → Element representations; Query example → Segment feature analysis → Search/comparison → Results]

Alarm sound detection
• Alarm sounds have particular structure
- people ‘know them when they hear them’
• Isolate alarms in sound mixtures
- sinusoid peaks have invariant properties
- cepstral coefficients are easy to model
[Figure: spectrogram panels (freq / Hz, 0 to 5000, vs. time / sec, 1 to 2.5); Pr(alarm) (0 to 1); “Speech + alarm” spectrogram (0 to 4000 Hz, 0 to 9 s)]

Personal Audio
• LifeLog / MyLifeBits / Remembrance Agent: Easy to record everything you hear
• Then what?
- prohibitively time consuming to search
- but .. applications if access easier
• Automatic content analysis / indexing...
[Figure: personal audio spectrograms: freq / Hz vs. time / min (50 to 250), and freq / Bark vs. clock time (14:30 to 18:30)]

Segmenting Personal Audio
• First step: segment into consistent ‘episodes’
• Variety of features:
- regular spectrum
- auditory spectrum
- MFCCs
- subband entropy
• Mean/variance over 1 min segments + BIC
• Best: 84% correct detect @ 2% false alarm
- mean audspec energy + entropy
[Figure: auditory spectrogram (freq / Bark, 5 to 20) vs. clock time (21:00 to 4:00) with hand-marked boundaries]
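
The BIC step can be illustrated with a small sketch: for a candidate boundary, compare modeling the whole feature sequence with one Gaussian against modeling the two sides with separate Gaussians, and accept the boundary if the likelihood gain exceeds the BIC penalty for the extra parameters. This version handles a single 1-D feature (for example, one per-minute mean energy value) and a single boundary; the function names, the penalty weight of 1.0, and the toy sequences are assumptions, not the configuration behind the quoted 84% figure.

```python
import math


def _log_var(values):
    """Log of the maximum-likelihood variance, floored to avoid log(0)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return math.log(max(var, 1e-12))


def delta_bic(values, t, penalty_weight=1.0):
    """BIC gain from splitting a 1-D sequence at index t (positive favors a split)."""
    n = len(values)
    left, right = values[:t], values[t:]
    gain = 0.5 * (n * _log_var(values)
                  - len(left) * _log_var(left)
                  - len(right) * _log_var(right))
    # The second Gaussian adds two free parameters: one mean, one variance
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty


def best_boundary(values, min_size=2):
    """Return (index, delta_bic) of the best single candidate boundary."""
    candidates = range(min_size, len(values) - min_size + 1)
    return max(((t, delta_bic(values, t)) for t in candidates),
               key=lambda item: item[1])


# Toy per-minute feature values: a level change after 15 minutes
changing = ([0.9 if i % 2 == 0 else 1.1 for i in range(15)]
            + [4.9 if i % 2 == 0 else 5.1 for i in range(15)])
steady = [0.9 if i % 2 == 0 else 1.1 for i in range(30)]

print(best_boundary(changing))   # boundary found, positive delta BIC
print(best_boundary(steady))     # best candidate still has negative delta BIC
```

A full segmenter would apply this test repeatedly (or in a sliding window) to find multiple episode boundaries, and would use multivariate Gaussians over several features.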

Detecting Speech Segments
• Segments with speech are most interesting
- high noise defeats Voice Activity Detection
• Voice Pitch as the strongest cue?
- periodicity + speech dynamics
- need noise-robust pitch tracker
• Improved detection in noise
[Figure: “Personal Audio - Speech + Noise”: cochlea filter output (freq / kHz, 0 to 8; level / dB, -100 to 0), lags (0 to 200), and pitch track + speaker active ground truth, over time/sec 0 to 10]
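
As a simple stand-in for the periodicity cue, the sketch below measures the peak of the normalized autocorrelation of one frame within a plausible pitch-lag range: a strongly periodic (voiced) frame gives a peak near 1, while noise gives a small value. This is a plain full-band autocorrelation, not the cochlea-filterbank, noise-robust tracker the slide refers to; the 8 kHz sample rate, the lag range, and the synthetic test signals are assumptions.

```python
import math
import random


def periodicity(frame, min_lag, max_lag):
    """Return (best_lag, peak) of the normalized autocorrelation in a lag range."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return min_lag, 0.0
    best_lag, best_val = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        acf = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        val = acf / energy  # biased estimate: shrinks as the lag grows
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag, best_val


sample_rate = 8000  # Hz, assumed for this toy example
# Synthetic "voiced" frame: 200 Hz sinusoid, 50 ms long
voiced = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
# Synthetic "unvoiced/noise" frame: seeded Gaussian noise
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]

# Lags 20..100 samples correspond to pitches of 400..80 Hz at 8 kHz
lag, peak = periodicity(voiced, 20, 100)
print("voiced: pitch ~", sample_rate / lag, "Hz, periodicity", round(peak, 2))
print("noise : periodicity", round(periodicity(noise, 20, 100)[1], 2))
```

Thresholding the peak value gives a crude voiced/unvoiced decision; in heavy noise the full-band autocorrelation degrades, which is the motivation for a more robust pitch tracker.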

Outline
1. Spoken document retrieval
2. Audio databases
3. Open issues
- Speech recognition
- Sound source separation
- Information extraction & visualization
- Learning from audio

Open issues 1: Speech recognition
• Speech recognition is good & improving
- but reaching asymptote: BN WER: 1997=22% 1999=14% 2004=9%
- training data: 1999=150h 2004=3500h
• Problem areas:
- noisy speech (meetings, cellphones)
- informal speech (casual conversations)
- speaker variations (style, accent)
• Is the current approach correct?
- MFCC-GMM-HMM systems are optimized
- new approaches can’t compete
- but: independence, classifier, HMMs...

Open issues 2: Sound mixtures
• Real-world sound always consists of mixtures
- we experience it in terms of separate sources
- ‘intelligent’ systems must do the same
• How to separate sound sources?
- exact decomposition (‘blind source separation’)
- extract cues
- overlap, masking → top-down approaches, analysis-by-synthesis
• How to represent & recognize sources?
- which features, attributes?
- hierarchy of general-to-specific classes...

Open issues 3: Information & visualization
• Spectrograms are OK for speech, often unsatisfactory for more complex sounds
- frequency axis, intensity axis – time axis?
- separate spatial, pitch, source dimensions
• Visualization may not be possible.. but helps us think about sound features
• Different representations for different aspects
- best for speech, music, environmental, ...
[Figure: spectrogram, freq / Hz (0 to 4000) vs. time / s (0 to 9)]

Open issues 4: Learning from audio
• HMMs (EM, Baum-Welch etc.) have had a huge impact on speech, handwriting ...
- very good for optimizing models
- little help for determining model structure
• Applicable to other audio tasks?
- e.g. textures, ambience, vehicles, instruments
• Problems:
- finding the right model structures
- constraining what the models learn: initial clustering, target labelling
• How to leverage large databases, bulk audio
- unsupervised acquisition of classes, features
- the analog of infant development

Outline
1. Spoken Document Retrieval
2. Audio Databases
3. Open issues

Course retrospective
Fundamentals
- L1: DSP
- L2: Acoustics
- L3: Pattern recognition
- L4: Auditory perception
Audio processing
- L5: Signal models
- L6: Music analysis/synthesis
- L7: Audio compression
- L8: Spatial sound & rendering
Applications
- L9: Speech recognition
- L10: Music retrieval
- L11: Signal separation
- L12: Multimedia indexing

Summary
• Large Vocabulary speech recognition
- errors are OK for indexing
- .. but still needs controlled audio quality
• Recognizing nonspeech audio
- lots of other kinds of acoustic events
- speech-style recognition can be applied
• Open questions
- lots of things that we don’t know

Final Presentations
• Two Sessions
Thursday April 26th, 10:00-12:30
Thursday May 3rd, 10:00-12:30
• 20 minute slots
e.g. 15 minute talk, 5 min Qs / discussion
• Background! Results! Examples! + Discussion!
• Special AV requirements?
Let me know...