
E6820 SAPR - Dan Ellis - L12 - Indexing - 2007-04-12

EE E6820: Speech & Audio Processing & Recognition

Lecture 12: Multimedia Indexing

1 Spoken document retrieval
2 Audio databases
3 Open issues

Dan Ellis <[email protected]>
http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering
Spring 2007


Spoken Document Retrieval (SDR)

• 20% WER is horrible for transcription
  - is it good for anything else?
• Information Retrieval (IR)
  - TREC/MUC ‘spoken documents’
  - tolerant of word error rate, e.g.:
    F0: THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELECTION SEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW
    F4: AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADER DANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE PRESIDENTIAL CANDIDATES OF THE ELECTION
    F5: THE LABOR MIGHT DO WELL TO REMEMBER THE LOST A MAJOR EPISODE OF TRANSATLANTIC CONNECT TO A CORPORATION IN BOTH CONSERVATIVE PARTY OFFICIALS FROM BRITAIN GOING TO WASHINGTON THEY WENT TO WOOD BUYS GEORGE BUSH ON HOW TO WIN A SECOND TO NONE IN LONDON THIS IS STEPHEN BEARD FOR MARKETPLACE
• Promising application area
  - document retrieval already hit-and-miss
  - plenty of untranscribed material
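The word error rate (WER) quoted on this slide is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch of that computation follows; the function name and the whitespace tokenization are illustrative assumptions, not part of the lecture.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumes whitespace-separated words and a non-empty reference.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,             # deletion of a reference word
                         cur[j - 1] + 1,          # insertion of a hypothesis word
                         prev[j - 1] + (r != h))  # substitution (or match, cost 0)
        prev = cur
    return prev[-1] / len(ref)
```

For retrieval, what matters is less the overall WER than whether the content words that queries are likely to contain survive recognition.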


The THISL SDR system

• Original task: BBC newsroom support
• How to build the database:
  - automatically record news programs ‘off air’
  - several hours per day → > 3,000 hrs
  - run recognition the whole time
  - problems storing audio!

[Figure: system block diagram. Blocks: Receiver, ASR, NLP, Segmentation, ASR, IR, Database, Archive, Query (http). Signal types: Control, Text, Audio, Video.]


Building a new recognizer

• No models available for BBC English
  - need to develop a new recognizer based on US English Broadcast News, read British English...
• Training set: Manual transcription of 40 hours of news
  - word-level transcription takes > 10x real-time
  - Viterbi training, starting from read speech model
• Language model: 200M words of US & UK newspaper archives
• Dictionary: Standard UK-English + extensions
  - many novel & foreign words


Vocabulary extension

• News always has novel words
• Starting point: Text-to-speech rules
  - speech synthesizers’ rules for unknown words
  - but novel words are often foreign names
• Sources to identify new words
  - BBC ‘house style’ information
• Choose model by single acoustic example
  - grab from TV subtitles?

[Figure: pronunciation-learning flow. Word strings ("The Dow Jones...") pass through the dictionary, phone trees and letter-to-phone trees (d o w → ax or d aw ...) into FSG building; the FSG decoder aligns the FSG with the acoustics (dh dh dh dh ax ax dcl d ...), giving an alignment (the = dh ax, dow = d aw, jones = jh ...) and acoustic confidences (dh = 0.24, ax = 0.4, dcl = 0.81 ...).]


Audio segmentation

• Broadcast audio includes music, noise etc.
• Segmentation is important for recognition
  - speaker identity tagging, model adaptation
  - excluding nonspeech segments
• Can use generic models of similarity/difference
• Look at statistics of speech model output
  - e.g. dynamism: $\frac{1}{N} \sum_n \sum_i \left[ p(q_n^i) - p(q_{n-1}^i) \right]^2$

[Figure: scatter plot of Entropy (0 to 3.5) against Dynamism (0 to 0.3) for Speech, Music, and Speech+Music; spectrogram (frq/Hz, 0 to 4000, against time/s, 0 to 12) with the corresponding posteriors for speech, music, and speech+music.]
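The dynamism statistic on this slide is the mean squared frame-to-frame change in the acoustic model's class posteriors; the figure pairs it with the entropy of those posteriors. A minimal sketch of both, assuming the posteriors arrive as a list of frames, each a list of per-class probabilities. The function names, and taking N as the number of frame pairs, are assumptions for illustration.

```python
import math


def dynamism(posteriors):
    """Mean over frame pairs of sum_i [p(q_n^i) - p(q_{n-1}^i)]^2."""
    if len(posteriors) < 2:
        raise ValueError("need at least two frames")
    total = 0.0
    for prev, cur in zip(posteriors, posteriors[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur))
    return total / (len(posteriors) - 1)


def mean_entropy(posteriors):
    """Average per-frame entropy (in nats) of the class posteriors."""
    total = 0.0
    for frame in posteriors:
        total += -sum(p * math.log(p) for p in frame if p > 0.0)
    return total / len(posteriors)
```

Posteriors that switch crisply between confident phone classes give high dynamism and low entropy; posteriors that stay diffuse and slowly varying give the opposite, which is the kind of separation the scatter plot in the figure suggests for speech versus music.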


Information retrieval: Text document IR

• Given query terms $T_q$, document terms $T_{D(i)}$, how to find and rank documents?
• Standard IR uses ‘inverted index’ to find:
  - one entry per term $T_D$, listing all documents $D(i)$ containing that term
• Documents are ranked using “tf • idf”
  - tf (term frequency) = how often term is in doc
  - idf (inverse document frequency) = how many (how few) docs contain term
• Performance measures
  - precision: (correct found)/(all found)
  - recall: (correct found)/(all correct)
  - mean reciprocal rank - for specific targets
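The inverted index and tf • idf ranking on this slide can be sketched in a few lines. This is an illustrative toy, not the Thisl IR engine: the whitespace tokenizer, the function names, and the particular idf form log(N / df) are assumptions, and real systems use more refined weightings.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, num_docs):
    """Score documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        postings = index.get(term, {})
        if not postings:
            continue  # term appears in no document
        idf = math.log(num_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical three-document collection
docs = {"d1": "election results election", "d2": "market news", "d3": "election news"}
index = build_inverted_index(docs)
ranked = rank("election", index, len(docs))
```

Only the documents listed under the query's terms are ever touched, which is what makes the inverted index practical for large archives.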


Queries in Thisl

• Original idea: speech in, speech out
• Try to ‘understand’ queries
  - hand-built grammar:
  - .. but keywords better
• Phonetic matching with speech input
  - search ‘phone lattice’ recognizer output?


Thisl User Interface

[Figure: screenshot of the Thisl user interface, annotated: date filters, program filter, speech input, pauses & sentence breaks shown, click-to-play.]


Thisl SDR performance

• NIST Text Retrieval Conference (TREC), Spoken Documents track
  - 500 hours of data → need fast recognition
  - set of ‘evaluation queries’ + relevance judgments
• Components tried in different combinations
  - different speech transcripts (subtitles, ASR)
  - different IR engines & query processing
• Performance of systems
  - ASR less important than IR (query expansion...)

[Figure: TREC9 Thisl results. Average Precision (about 0.4 to 0.5) against WER / % (10 to 35), with story boundaries and with no story boundaries.]
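Given a ranked result list and a set of relevance judgments, the retrieval measures used in this part of the lecture can be computed as below. This is a sketch using the common non-interpolated definition of average precision; whether the TREC figures above used exactly this variant is an assumption, and the function names are illustrative.

```python
def precision_recall(retrieved, relevant):
    """precision = correct found / all found; recall = correct found / all correct."""
    relevant = set(relevant)
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    Assumes `ranked` contains no duplicates.
    """
    relevant = set(relevant)
    hits, total = 0, 0.0
    for position, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, target):
    """1 / rank of a specific target document, or 0 if it is not returned."""
    for position, d in enumerate(ranked, start=1):
        if d == target:
            return 1.0 / position
    return 0.0
```

Averaging `reciprocal_rank` over a set of queries gives the mean reciprocal rank; averaging `average_precision` over queries gives the single-number summary plotted on the vertical axis of the figure.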


Speaker Identification

• Complement to speech recognition: Identify the speaker, regardless of the words
• Different forms of the problem:
  - speaker segmentation
  - speaker identification
  - speaker verification
• Factors:
  - amount of training data (10 s .. 20 min)
  - amount of test data (3 s .. 5 min)
  - number of competitors (10 .. 500)
  - false accept vs. false reject
• Standard baseline
  - large “universal background model” (UBM) (e.g. 2000 mixture GMM on MFCCs)
  - likelihood ratio to speaker-specific model
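The baseline scoring step, a likelihood ratio between a speaker-specific model and the UBM, can be sketched as follows. The sketch assumes diagonal-covariance GMMs stored as lists of (weight, means, variances) tuples and leaves out how the models are trained or adapted; the names and the data layout are illustrative assumptions.

```python
import math


def log_gmm(x, gmm):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    gmm: list of (weight, means, variances) tuples, one per mixture component.
    """
    logs = []
    for weight, means, variances in gmm:
        ll = math.log(weight)
        for xi, m, v in zip(x, means, variances):
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(ll)
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))


def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio of speaker model vs. UBM."""
    return sum(log_gmm(x, speaker_gmm) - log_gmm(x, ubm) for x in frames) / len(frames)


# Hypothetical 1-D models: a narrow speaker model inside a broad background model
speaker = [(1.0, [0.0], [1.0])]
ubm = [(0.5, [0.0], [4.0]), (0.5, [5.0], [4.0])]
```

For verification, the score is compared with a threshold, and moving that threshold trades false accepts against false rejects.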


“Super Speaker ID”

• MFCC features don’t capture ‘high level’ info
• 2002 JHU project to investigate new features
  - e.g. combined pitch/energy contour sequences:
  - also phone ftrs...
• Favorable fusion with standard baseline

http://www.clsp.jhu.edu/ws02/groups/supersid/


Outline

1 Spoken Document Retrieval
2 Audio databases
  - Nonspeech audio retrieval
  - Personal audio archives
3 Open issues


Real-world audio

• Speech is only part of the audio world
  - word transcripts are not the whole story
• Large audio datasets
  - movie & TV soundtracks
  - events such as sports, news ‘actualities’
  - situation-based audio ‘awareness’
  - personal audio recording
• Information from sound
  - speaker identity, mood, interactions
  - ‘events’: explosions, car tires, bounces...
  - ambience: party, subway, woods
• Applications
  - indexing, retrieval
  - description/summarization
  - intelligent reaction


Multimedia Description: MPEG-7

• MPEG has produced standards for audio / video data compression (MPEG-1/2/4)
• MPEG-7 is a standard for metadata: describing multimedia content
  - because search and retrieval are so important
• Defines descriptions of time-specific tags, ways to define categories, specific category instances
• + Preliminary feature definitions e.g. for audio:
  - spectrum: centroid, spread, flatness
  - harmonicity: degree, stability
  - pitch, attack time, melody structure ...

http://www.darmstadt.gmd.de/mobile/MPEG7/Documents.html
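To make the spectrum descriptors concrete, here is a sketch of centroid, spread and flatness computed from one frame's power spectrum. These are the generic textbook definitions on a linear frequency axis, not the normative MPEG-7 definitions (which differ in details such as the frequency scale); the function name and the small floor inside the logarithm are assumptions.

```python
import math


def spectral_shape(freqs, power):
    """Return (centroid, spread, flatness) for one power-spectrum frame.

    freqs: bin centre frequencies in Hz; power: non-negative power per bin.
    Assumes the frame has non-zero total power.
    """
    total = sum(power)
    # Centroid: power-weighted mean frequency
    centroid = sum(f * p for f, p in zip(freqs, power)) / total
    # Spread: power-weighted standard deviation around the centroid
    spread = math.sqrt(
        sum((f - centroid) ** 2 * p for f, p in zip(freqs, power)) / total
    )
    # Flatness: geometric mean / arithmetic mean (floor avoids log(0))
    eps = 1e-12
    geometric = math.exp(sum(math.log(p + eps) for p in power) / len(power))
    flatness = geometric / (total / len(power))
    return centroid, spread, flatness
```

A flat, noise-like spectrum gives flatness near 1, while a single strong sinusoid gives flatness near 0 and a spread near 0.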


Muscle Fish “SoundFisher”

• Access to sound effects databases
• Features (time series contours):
  - loudness, brightness, pitch, cepstra
• Query-by-example
  - direct correlation of contours (normalized/not)
  - comparison of value histograms (time-collapsed)
• Always global features
  - a mixture of two sounds looks like neither

[Figure: block diagram. Sound segment database → Segment feature analysis → Feature vectors; Query example → Segment feature analysis → Search/comparison → Results.]
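The two query-by-example comparisons named on this slide, direct correlation of contours and comparison of time-collapsed value histograms, can be sketched as below. This illustrates the idea only and is not SoundFisher's actual algorithm; the equal-length contours, the bin settings and the function names are assumptions.

```python
import math


def normalized_correlation(a, b):
    """Correlation of two equal-length contours after removing mean and scale."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den if den else 0.0


def histogram(values, lo, hi, bins):
    """Normalized histogram of contour values; time order is discarded."""
    counts = [0] * bins
    for v in values:
        k = min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
        counts[k] += 1
    return [c / len(values) for c in counts]


def histogram_distance(a, b, lo, hi, bins=8):
    """L1 distance between the value histograms of two contours."""
    ha, hb = histogram(a, lo, hi, bins), histogram(b, lo, hi, bins)
    return sum(abs(x - y) for x, y in zip(ha, hb))


def query_by_example(query, database):
    """Rank database entries by contour correlation with the query."""
    return sorted(database,
                  key=lambda name: normalized_correlation(query, database[name]),
                  reverse=True)


# Hypothetical loudness contours
database = {"rise": [1, 3, 5, 7], "fall": [3, 2, 1, 0], "bump": [0, 2, 2, 0]}
order = query_by_example([0, 1, 2, 3], database)
```

The histogram comparison ignores time order, so a rising and a falling contour over the same range look identical to it, whereas the correlation treats them as opposites.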


SoundFisher user interface

• Principal query mechanism is “sounds like”


HMM modeling of nonspeech

• No sub-units defined for nonspeech sounds
  - but can still train HMMs with EM
• Final states depend on EM initialization
  - labels / clusters
  - transition matrix
• Have ideas of what we’d like to get
  - investigate features/initialization to get there

[Figure: two spectrograms of “dogBarks2” (freq / kHz against time, about 1.15 to 2.25 s), each overlaid with an HMM state labeling (states s2, s3, s4, s5, s7).]


Indexing for soundtracks

• Any real-world audio will have multiple simultaneous sound sources
• Queries typically relate to one source only
  - not a source in a particular context
• Need to index accordingly:
  - analyze sound into source-related elements
  - perform search & match in that domain

[Figure: block diagram. Continuous audio archive → Segment analysis → Element representations; Query example → Segment analysis → Search/comparison → Results.]


Alarm sound detection

• Alarm sounds have particular structure
  - people ‘know them when they hear them’
• Isolate alarms in sound mixtures
  - sinusoid peaks have invariant properties
  - cepstral coefficients are easy to model

[Figure: spectrograms (freq / Hz, 0 to 5000, against time / sec, 1 to 2.5); a “Speech + alarm” example (freq / Hz, 0 to 4000, against time, 0 to 9 s) with the detector output Pr(alarm).]


Personal Audio

• LifeLog / MyLifeBits / Remembrance Agent: Easy to record everything you hear
• Then what?
  - prohibitively time consuming to search
  - but .. applications if access easier
• Automatic content analysis / indexing...

[Figure: spectrograms of a personal audio recording: freq / Hz against time / min (50 to 250), and freq / Bark against clock time (14:30 to 18:30).]


Segmenting Personal Audio

• First step: segment into consistent ‘episodes’
• Variety of features:
  - regular spectrum
  - auditory spectrum
  - MFCCs
  - subband entropy
• Mean/variance over 1 min segments + BIC
• Best: 84% correct detect @ 2% false alarm
  - mean audspec energy + entropy

[Figure: auditory spectrogram (freq / Bark, 5 to 20) against clock time (21:00 to 4:00), with hand-marked boundaries.]
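BIC segmentation asks whether a window of features is better explained by one Gaussian or by two Gaussians split at a candidate boundary, with a penalty for the extra parameters. A minimal one-dimensional sketch follows; the scalar feature, the variance floor, the penalty weight and the function names are assumptions, whereas the system described on this slide works on multi-dimensional per-minute feature statistics.

```python
import math


def variance(xs):
    """Maximum-likelihood (biased) variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def delta_bic(xs, split, penalty_weight=1.0, floor=1e-6):
    """BIC gain of modelling xs[:split] and xs[split:] with separate Gaussians.

    Positive values favour placing a boundary at `split`.
    """
    left, right = xs[:split], xs[split:]
    n, n1, n2 = len(xs), len(left), len(right)
    gain = 0.5 * (n * math.log(max(variance(xs), floor))
                  - n1 * math.log(max(variance(left), floor))
                  - n2 * math.log(max(variance(right), floor)))
    # The two-segment model has 2 extra parameters (one mean, one variance)
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


def best_boundary(xs, margin=2):
    """Return (split, delta_bic) for the best candidate boundary in xs."""
    candidates = range(margin, len(xs) - margin + 1)
    split = max(candidates, key=lambda k: delta_bic(xs, k))
    return split, delta_bic(xs, split)


# Hypothetical feature track with a level change half-way through
track = [0.0, 0.1, -0.1, 0.05, -0.05, 5.0, 5.1, 4.9, 5.05, 4.95]
split, score = best_boundary(track)
```

A boundary is accepted when the best ΔBIC is positive; raising `penalty_weight` trades missed boundaries against false alarms, the two quantities reported in the result on the slide.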


Detecting Speech Segments

• Segments with speech are most interesting
  - high noise defeats Voice Activity Detection
• Voice Pitch as the strongest cue?
  - periodicity + speech dynamics
  - need noise-robust pitch tracker
• Improved detection in noise

[Figure: “Personal Audio - Speech + Noise” spectrogram (freq / kHz, 0 to 8; level / dB, -100 to 0), with panels labelled cochlea filter and lags (0 to 200), and “Pitch Track + Speaker Active Ground Truth”, against time / sec (0 to 10).]
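The periodicity cue can be illustrated with a plain normalized autocorrelation: a voiced frame shows a strong peak at the lag of its pitch period, while noise does not. This is a textbook sketch and not the noise-robust pitch tracker the slide calls for; the function name and the choice of lag range are assumptions.

```python
import math


def periodicity(frame, min_lag, max_lag):
    """Return (best_lag, strength) from the normalized autocorrelation.

    strength is the autocorrelation at best_lag divided by the frame energy,
    so values near 1 indicate a strongly periodic frame. A silent frame, or
    one with no positive peak in the lag range, returns (None, 0.0).
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return None, 0.0
    best_lag, best = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag)) / energy
        if r > best:
            best_lag, best = lag, r
    return best_lag, best


# Hypothetical test signal: a sinusoid with a period of 50 samples
tone = [math.sin(2 * math.pi * n / 50) for n in range(400)]
lag, strength = periodicity(tone, 20, 80)
```

Flagging frames whose strength exceeds a threshold gives a crude voicing detector; combining that with how the pitch moves over time corresponds to the "periodicity + speech dynamics" idea on the slide.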


Outline

1 Spoken document retrieval
2 Audio databases
3 Open issues
  - Speech recognition
  - Sound source separation
  - Information extraction & visualization
  - Learning from audio


Open issues 1: Speech recognition

• Speech recognition is good & improving
  - but reaching asymptote: BN WER: 1997=22%, 1999=14%, 2004=9%
  - training data: 1999=150h, 2004=3500h
• Problem areas:
  - noisy speech (meetings, cellphones)
  - informal speech (casual conversations)
  - speaker variations (style, accent)
• Is the current approach correct?
  - MFCC-GMM-HMM systems are optimized
  - new approaches can’t compete
  - but: independence, classifier, HMMs...


Open issues 2: Sound mixtures

• Real-world sound always consists of mixtures
  - we experience it in terms of separate sources
  - ‘intelligent’ systems must do the same
• How to separate sound sources?
  - exact decomposition (‘blind source separation’)
  - extract cues
  - overlap, masking → top-down approaches, analysis-by-synthesis
• How to represent & recognize sources?
  - which features, attributes?
  - hierarchy of general-to-specific classes...


Open issues 3: Information & visualization

• Spectrograms are OK for speech, often unsatisfactory for more complex sounds
  - frequency axis, intensity axis – time axis?
  - separate spatial, pitch, source dimensions
• Visualization may not be possible.. but helps us think about sound features
• Different representations for different aspects
  - best for speech, music, environmental, ...

[Figure: spectrogram, freq / Hz (0 to 4000) against time / s (0 to 9).]


Open issues 4: Learning from audio

• HMMs (EM, Baum-Welch etc.) have had a huge impact on speech, handwriting...
  - very good for optimizing models
  - little help for determining model structure
• Applicable to other audio tasks?
  - e.g. textures, ambience, vehicles, instruments
• Problems:
  - finding the right model structures
  - constraining what the models learn: initial clustering, target labelling
• How to leverage large databases, bulk audio
  - unsupervised acquisition of classes, features
  - the analog of infant development


Outline

1 Spoken Document Retrieval
2 Audio Databases
3 Open issues


Course retrospective

Fundamentals
  L1: DSP
  L2: Acoustics
  L3: Pattern recognition
  L4: Auditory perception

Audio processing
  L5: Signal models
  L6: Music analysis/synthesis
  L7: Audio compression
  L8: Spatial sound & rendering

Applications
  L9: Speech recognition
  L10: Music retrieval
  L11: Signal separation
  L12: Multimedia indexing


Summary

• Large Vocabulary speech recognition
  - errors are OK for indexing
  - .. but still needs controlled audio quality
• Recognizing nonspeech audio
  - lots of other kinds of acoustic events
  - speech-style recognition can be applied
• Open questions
  - lots of things that we don’t know


Final Presentations

• Two Sessions
  Thursday April 26th, 10:00-12:30
  Thursday May 3rd, 10:00-12:30
• 20 minute slots
  e.g. 15 minute talk, 5 min Qs / discussion
• Background! Results! Examples! + Discussion!
• Special AV requirements? Let me know...
