mediaeval 2013 spoken web search results slides

Spoken Web Search at Mediaeval 2013

Xavier Anguera, Florian Metze, Andi Buzo, Igor Szoke and Luis Javier Rodriguez-Fuentes

Spoken Audio Search (or Query-by-Example Spoken-Term Detection)

Given a spoken query, we search for instances at the lexical level within spoken documents. It is similar to Spoken Term Detection (NIST STD2006, OpenKWS 2013) but…

Queries are spoken

Different speakers

Different acoustic conditions

No prior knowledge of the language(s) might be available

SWS history in Mediaeval

• SWS 2011 had 5 finishing participants and focused on 4 Indian languages
• SWS 2012 had 9 finishing participants and focused on 4 African languages
• SWS 2013 has 13 finishing (18 registered) participants and contains 9 languages

[Figure: number of teams (#teams, axis 0 to 18) and database size (axis 0 to 1400) for 2011, 2012 and 2013]

SWS 2013 evaluation setup

• A single search corpus with ~20 hours of data, collected from contributions of 9 languages
  – No transcription or language information is given to participants
• 500 queries for dev and 500 queries for eval
  – For each query, participants need to return all instances of that query in the search corpus

Mediaeval SWS 2013

• 9 languages in different acoustic contexts: 4 African languages (isiXhosa, isiZulu, Sepedi, Setswana), Albanian, Basque, Czech, non-native English, Romanian

Set            | #utts | Time      | Avg. length/utt.
Search corpus  | 10762 | 19:57:55h | 6.67s
Dev queries    |   505 |  0:11:26h | 1.35s
Extended dev*  |  1046 |  0:08:42h | 0.49s
Eval queries   |   503 |  0:11:37h | 1.38s
Extended eval* |  1037 |  0:08:57h | 0.51s
Total          | 13853 | 20:38:37h |

*Only Basque (3x) and Czech (10x) queries have extended versions
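The totals and per-utterance averages in this table can be re-derived from the utterance counts and durations. The sketch below is an illustrative check, not part of the evaluation tooling. The helper names are made up for the example, and it assumes the averages were truncated (not rounded) to two decimals, which is what reproduces the figures on the slide.

```python
def to_seconds(hms: str) -> int:
    """Convert 'H:MM:SS' (an optional trailing 'h' is ignored) to seconds."""
    hours, minutes, seconds = (int(p) for p in hms.rstrip("h").split(":"))
    return 3600 * hours + 60 * minutes + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds back as 'H:MM:SS'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def avg_length(n_utts: int, hms: str) -> float:
    """Average seconds per utterance, truncated to two decimals (assumption)."""
    return (to_seconds(hms) * 100 // n_utts) / 100


# (number of utterances, duration) as listed in the table
ROWS = {
    "Search corpus": (10762, "19:57:55"),
    "Dev queries": (505, "0:11:26h"),
    "Extended dev": (1046, "0:08:42h"),
    "Eval queries": (503, "0:11:37h"),
    "Extended eval": (1037, "0:08:57h"),
}

total_utts = sum(n for n, _ in ROWS.values())
total_time = to_hms(sum(to_seconds(t) for _, t in ROWS.values()))

for name, (n, t) in ROWS.items():
    print(f"{name}: {avg_length(n, t):.2f}s per utterance")
print(f"Total: {total_utts} utterances, {total_time}")
```

With plain rounding instead of truncation the search-corpus average would come out as 6.68s, so the truncation assumption matters for an exact match.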

Database distribution per language

Language | Number of utterances / total duration | Number of queries (dev / eval) | Speech quality (original sampling rate) | Recording environment
African - isiXhosa | 395 / 60 min. | 25 / 25 | Telephone speech, 8 kHz | Field recordings, read speech
African - isiZulu | 395 / 60 min. | 25 / 25 | Telephone speech, 8 kHz | Field recordings, read speech
African - Sepedi | 395 / 60 min. | 25 / 25 | Telephone speech, 8 kHz | Field recordings, read speech
African - Setswana | 395 / 60 min. | 25 / 25 | Telephone speech, 8 kHz | Field recordings, read speech
Albanian | 968 / 127 min. | 50 / 50 | PC microphone, 16 kHz | Lab environment, read speech
Basque | 1841 / 192 min. | 100 / 100 (recorded by mobile phone) | TV broadcast news, 16 kHz | Studio, read speech
Czech | 3667 / 252 min. | 94 / 93 | Telephone speech, 8 kHz | Telephone calls into radio broadcasts, spontaneous speech
Non-native English | 434 / 141 min. | 61 / 60 | High quality mic, 44 kHz | Conference lectures, spontaneous speech
Romanian | 2272 / 244 min. | 100 / 100 | PC microphone, 16 kHz | Lab environment, read speech

SWS 2013 participants

Team name | Country
Dto. Electricidad y electrónica, Universidad País Vasco | Spain
Speech@FIT, Brno University of Technology | Czech Republic
Telefonica Research | Spain
University Politehnica of Bucharest | Romania
School of Electrical and Computer Engineering, Georgia Institute of Technology | USA
L2F - INESC-ID | Portugal
Departament de sistemes informàtics i Computació, Universitat Politècnica de València | Spain
Audiolab, University of Zilina | Slovakia
LIA, University of Avignon | France
Technical University of Kosice | Slovakia
Universitat Pompeu Fabra | Spain
DSP-STL, Dept. of EE, The Chinese University of Hong Kong | Hong Kong
International Institute of Information Technology - Hyderabad | India

Non-finishing:
IAIS, Fraunhofer Institute | Germany
TATA Consultancy Services Ltd. | India
Indian Statistical Institute | India
Northwestern Polytechnical University of Xi’an | China
Toyota Technological Institute at Chicago | USA

(A second margin label on the slide marks the organizers' teams.)

Possible approaches to QbE-STD

• Pattern based: Dynamic Time Warping
• Lattice based: Acoustic Keyword Spotting
• Word-based: Full ASR

Knowledge required, as labelled in the diagram: language spoken, + acoustic models, + language models
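Of the approaches named on this slide, Dynamic Time Warping is the one that works directly on acoustic feature sequences. The following is a minimal sketch of subsequence DTW, a common way to use DTW for query-by-example search. It is not the method of any particular participant. The function names, the Euclidean frame distance and the length normalization are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Distance between two feature frames (sequences of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_dtw(query, doc, dist=euclidean):
    """Find the best match of `query` anywhere inside `doc`.

    Returns (cost, end), where `cost` is the accumulated alignment cost
    divided by the query length and `end` is the index of the document
    frame where the best match ends.
    """
    n, m = len(query), len(doc)
    # First row: the match may start at any document frame at no extra cost.
    prev = [dist(query[0], doc[j]) for j in range(m)]
    for i in range(1, n):
        cur = [math.inf] * m
        for j in range(m):
            best = prev[j]  # stay on the same document frame
            if j > 0:
                # diagonal step, or advance in the document only
                best = min(best, prev[j - 1], cur[j - 1])
            cur[j] = dist(query[i], doc[j]) + best
        prev = cur
    # Last row: the match may end at any document frame.
    end = min(range(m), key=prev.__getitem__)
    return prev[end] / n, end


# Toy 1-dimensional "features": the query appears at frames 2..4 of the document.
query = [[0.0], [1.0], [2.0]]
doc = [[5.0], [5.0], [0.0], [1.0], [2.0], [5.0]]
cost, end = subsequence_dtw(query, doc)
print(cost, end)
```

A real system would run this over every utterance in the search corpus and turn the normalized costs into detection scores. The free start and end are what distinguish this from plain DTW between two whole sequences.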

Followed approaches

Columns on the slide: Team name, DTW-like, AKWS (the per-team marks are not preserved in this text)

• Dto. Electricidad y electrónica, Universidad País Vasco
• Speech@FIT, Brno University of Technology
• Telefonica Research
• University Politehnica of Bucharest
• School of Electrical and Computer Engineering, Georgia Institute of Technology
• L2F - INESC-ID
• Dept. de sistemes informàtics i Computació, Universitat Politècnica de València
• Audiolab, University of Zilina
• LIA, University of Avignon
• Technical University of Kosice
• Universitat Pompeu Fabra
• DSP-STL, Dept. of EE, The Chinese University of Hong Kong
• International Institute of Information Technology - Hyderabad

Scoring metrics

• PRIMARY: Actual Term Weighted Value (ATWV) / Maximum Term Weighted Value (MTWV)
• Actual/minimum Cnxe
• Real-time factor
• Memory usage
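As a reminder of how the primary metric is built, here is a minimal sketch of the Term Weighted Value following the general NIST STD 2006 definition: TWV = 1 - (P_miss + beta * P_FA), averaged over queries that occur at least once in the reference. ATWV is this value at the system's own decision threshold, and MTWV is the maximum over all thresholds. The class and function names are illustrative. The value of `beta` and the one-trial-per-second convention for counting non-target trials are assumptions that must be taken from the evaluation plan in use.

```python
from dataclasses import dataclass


@dataclass
class TermCounts:
    n_true: int     # reference occurrences of the query
    n_correct: int  # detections matched to a reference occurrence
    n_false: int    # detections matched to nothing (false alarms)


def term_weighted_value(counts, speech_seconds, beta):
    """TWV at one decision threshold.

    `speech_seconds` is the duration of the search corpus; non-target
    trials per query are approximated as speech_seconds - n_true
    (one trial per second, an assumption borrowed from NIST STD 2006).
    """
    terms = [c for c in counts if c.n_true > 0]
    if not terms:
        raise ValueError("no query has reference occurrences")
    p_miss = sum(1 - c.n_correct / c.n_true for c in terms) / len(terms)
    p_fa = sum(c.n_false / (speech_seconds - c.n_true) for c in terms) / len(terms)
    return 1 - (p_miss + beta * p_fa)


# Toy example with made-up counts and an arbitrary beta.
counts = [TermCounts(n_true=4, n_correct=3, n_false=1),
          TermCounts(n_true=2, n_correct=2, n_false=0)]
print(term_weighted_value(counts, speech_seconds=104, beta=10.0))
```

A system that returns nothing scores 0 and a perfect system scores 1. Because false alarms are weighted by `beta`, a system with many false alarms can score below 0.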

Primary metric (dev)

Primary metric (eval)

Per-language results: average for the 10-best systems

Per-language results: African (eval)

Per-language results: Albanian (eval)

Per-language results: Basque (eval)

Per-language results: Czech (eval)

Per-language results: Non-native English (eval)

Per-language results: Romanian (eval)

DET dev

DET eval

Cnxe metric

Extended Queries

• 4 teams submitted 4 extended systems, making use of 3 repetitions of Basque queries and 10 repetitions of Czech queries available
  – TID: computes each query individually and then puts together all results
  – GTTS: DTW-aligns all queries above a minimum duration and searches with the resulting query
  – GeorgiaTech: builds a graphical keyword model using more than one instance
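The simplest of these strategies, searching with each spoken instance separately and pooling the results, can be sketched as follows. The slide does not say how the pooled results were combined, so the rule used here (merge overlapping detections in the same utterance and keep the highest score) is only an assumption, and the `Detection` type and `merge_detections` function are hypothetical names.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    utt: str      # utterance identifier in the search corpus
    start: float  # start time in seconds
    end: float    # end time in seconds
    score: float  # detection score, higher is better


def merge_detections(per_instance):
    """Pool detections from several instances of the same query.

    `per_instance` is a list with one list of detections per spoken
    instance. Overlapping detections in the same utterance are merged
    into one span that keeps the best score.
    """
    pooled = sorted((d for dets in per_instance for d in dets),
                    key=lambda d: (d.utt, d.start))
    merged = []
    for d in pooled:
        last = merged[-1] if merged else None
        if last is not None and last.utt == d.utt and d.start < last.end:
            merged[-1] = Detection(last.utt, last.start,
                                   max(last.end, d.end),
                                   max(last.score, d.score))
        else:
            merged.append(d)
    return merged


# Two instances of one query: both find the same region of utterance "u1".
instance_a = [Detection("u1", 1.0, 2.0, 0.6), Detection("u2", 0.0, 1.0, 0.3)]
instance_b = [Detection("u1", 1.5, 2.2, 0.9)]
for det in merge_detections([instance_a, instance_b]):
    print(det)
```

Other combination rules, such as averaging scores or counting how many instances agree, would fit the same structure.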

Extended systems

Real-Time Factor versus Memory usage

Real-Time Factor versus Memory usage (partial)

Take home messages

• The task was more complicated than in 2012
  – GTTS got MTWV-13 = 0.39, MTWV-12 = 0.51 (on 2013 data)
  – CUHK MTWV-12 = 0.74 (on 2012 data)
• It is possible to do QbE-STD on unknown/low-resource data

New things to watch out for in the posters session

• BUT:
  – Fusion of 26 systems (13 AKWS + 13 DTW)
  – M-norm normalization
• IIIT:
  – Articulatory Bottleneck features
• CUHK:
  – Tokenizer construction using Gaussian Component clustering
  – Query expansion using PSOLA
• L2F:
  – DTW candidate pre-selection
• GTTS:
  – Distance matrix normalization in DTW
• GeorgiaTech:
  – Low-resource speech modeling using EHMM Models
• LIA:
  – Use of I-vectors in SWS
• ARF:
  – DTW string matching algorithm with a novel scoring

Poster session

System presentations

• 16:30-16:45 "GTTS Systems for the SWS Task at MediaEval 2013", Luis Javier Rodriguez-Fuentes, DEE, Universidad del País Vasco

• 16:45-17:00 "The L2F Spoken Web Search system for Mediaeval 2013", Alberto Abad, L2F, INESC-ID

• 17:00-17:15 "BUT SWS 2013 - MASSIVE PARALLEL APPROACH", Lucas Ondel, Speech@BUT, Brno University of Technology

• 17:15-17:30 "The CMTECH Spoken Web Search System for MediaEval 2013", Ciro Gracia, UPF

• 17:30-17:45 Discussion and SWS 2014 teaser, Xavier Anguera
