
March 30, 2005 University of Kentucky

Using Oral History to Learn About Searching

Spontaneous Conversational Speech

Douglas W. Oard

College of Information Studies and Institute for Advanced Computer Studies

University of Maryland, College Park

http://www.umd.edu/

Outline

• Spoken word collections

• The MALACH Project

• Building an IR test collection

• First experiments

• Some things to think about

A Web of Speech?

                                              Web in 1995              Speech in 2004
Storage (words per $)                         300K                     1.5M
Internet backbone (simultaneous users)        250K                     30M
“Last mile” (download time)                   1 second (no graphics)   Streaming
Display capability (computers/US population)  10%                      100%
Search systems                                Lycos, Yahoo             SpeechBot, SingingFish

Spoken Word Collections

• Broadcast programming
– News, interview, talk radio, sports, entertainment
• Scripted stories
– Books on tape, poetry reading, theater
• Spontaneous storytelling
– Oral history, folklore
• Incidental recording
– Speeches, oral arguments, meetings, phone calls

Outline: The MALACH Project


Shoah Foundation Collection

• Substantial scale
– 116,000 hours; 52,000 interviews; 32 languages
• Spontaneous conversational speech
– Accents, elderly, emotional, …
• Accessible
– $100 million collection and digitization investment
• Manually indexed (10,000 hours)
– Segmented, thesaurus terms, people, summaries
• Users
– A department working full time on dissemination

Interview Excerpt

• Audio characteristics
– Accented (this one is unusually clear)
– Separate channels for interviewer / interviewee
• Dialog structure
– Interviewers have different styles
• Content characteristics
– Domain-specific terms
– Named entity mentions and relationships

The MALACH Project

• Speech Technology: English, Czech, Russian, Slovak
• Language Technology: Topic Segmentation, Categorization, Extraction, Translation
• Interactive Search Systems: Interface Development, User Studies
• Search Technology: Test Collection

[Figure: English ASR Accuracy. Word error rate (%), on a 0 to 100 scale, plotted from Jan-02 to Jan-05.]

Training: 200 hours from 800 speakers
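Word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a recognizer's output and a reference transcript, divided by the reference length. The slides do not show how the MALACH figures were scored, so the following is only a minimal sketch of that standard definition; the function name is an illustrative choice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of seven reference words.
wer = word_error_rate("from the radio and from the press",
                      "from the radio and from the dress")
```

Because insertions count as errors, this measure can exceed 100% when the hypothesis is much longer than the reference.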

Outline: Building an IR test collection


Who Uses the Collection?

• Discipline: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement
• Products: Book, Documentary film, Research paper, CDROM, Study guide, Obituary, Evidence, Personal use

Based on analysis of 280 written requests

Observational Studies

• 8 independent searchers
– Holocaust studies (2)
– German Studies
– History/Political Science
– Ethnography
– Sociology
– Documentary producer
– High school teacher
• 8 teamed searchers
– All high school teachers
• Thesaurus-based search
• Rich data collection
– Intermediary interaction
– Semi-structured interviews
– Observational notes
– Think-aloud
– Screen capture
• Qualitative analysis
– Theory-guided coding
– Abductive reasoning

Clarify themes and define activities

Search and view testimony

Powerful testimonies give teachers ideas on what to discuss in the classroom (topic) and how to introduce it (activity).

Group discussions clarify themes and define activities, which hone teachers’ criteria.

“The brainstorming really guided my search today, and I felt like I finally had a big enough chunk of time on this to really find something, but I need about a couple hundred more hours … Yesterday in my search, I just felt like I was kind of going around in the dark. But that productive writing session really directed my search, even though I stayed with [the same testimony] the whole time.”

8 teachers, working in groups of 4

Thesaurus-Based Search

Theme                        Thesaurus Term(s)
Character-defining moments   Denial of religion
Character-defining moments   Forced labor
Character-defining moments   (Acts of defiance) AND (humiliation and harassment)
Character-defining moments   Decisions on flight
Making a difference          Aid

8 teachers, working in groups of 4

Relevance Criteria

Number of mentions, by criterion

Criterion                 All (N=703)   Think-Aloud Relevance Judgment (N=300)   Query Form. (N=248)
Topicality                535 (76%)     219                                      234
Richness                  39 (5.5%)     14                                       0
Emotion                   24 (3.4%)     7                                        0
Audio/Visual Expression   16 (2.3%)     5                                        0
Comprehensibility         14 (2%)       1                                        10
Duration                  11 (1.6%)     9                                        0
Novelty                   10 (1.4%)     4                                        2

6 Scholars, 1 teacher, 1 film producer, working individually

Topicality

[Figure: bar chart of total mentions (axis 0 to 140) for each topicality facet: Object, Time Frame, Organization/Group, Subject, Event/Experience, Place, Person.]

6 Scholars, 1 teacher, 1 movie producer, working individually

Search Architecture

[Diagram components: Automatic Search, Boundary Detection, Interactive Selection, Content Tagging, Speech Recognition, Query Formulation.]

MALACH Test Collection

[Diagram components: Interviews, Comparable Collection, Speech Recognition, Boundary Detection, Content Tagging, Topic Statements, Query Formulation, Automatic Search, Ranked Lists, Relevance Judgments, Evaluation, Mean Average Precision.]

4,000 English Interviews

• ASR training: 800
• Test collection: 247
• Available now for expansion: 467
• Available by Dec 2005: 2,486

10,000 hours, full-description manual indexing

9,947 segments, ~400 words each (total: 625 hours)

<DOCNO>VHF00017-062567.005</DOCNO>

<KEYWORD> Warsaw (Poland), Poland 1935 (May 13) - 1939 (August 31), awareness of political or military events, schools </KEYWORD>

<PERSON> Sophie Perutz, Henry Hemar </PERSON>

<SUMMARY> AH talks about the college she attended before the war. She mentions meeting her husband. She discusses young peoples' awareness of the political events that preceded the outbreak of war. </SUMMARY>

<SCRATCHPAD> graduated HS, went to college 1 year, professional college hotel management; met future husband, knew that they'd end up together; sister also in college, nice social life, lots of company, not too serious; already got news from Czechoslovakia, Sudeten, knew that Poland would be next but what could they do about it, very passive; just heard info from radio and press </SCRATCHPAD>

<ASRTEXT> no no no they did no not not uh i know there was no place to go we didn't have family in a in other countries so we were not financially at the at extremely went so that was never at plano of my family it is so and so that was the atmosphere in the in the country prior to the to the war i graduate take the high school i had one year of college which was a profession and that because that was already did the practical trends f so that was a study for whatever management that eh eh education and this i i had only one that here all that at that time i met my future husband and that to me about any we knew it that way we were in and out together so and i was quite county there was so whatever i did that and this so that was the person that lived my sister was it here is first year of of colleagues and and also she had a very strongly this antisemitic trend and our parents there was a nice social life young students that we had open house always pleasant we had a lot of that company here and and we were not too serious about that she we got there we were getting the they already did knew he knew so from czechoslovakia from they saw that from other part and we knew the in that that he is uhhuh the hitler spicy we go into this year this direction that eh poland will be the next country but there was nothing that we would do it at that time so he was a very very he says belong to any any organizations especially that the so we just take information from the radio and from the dress </ASRTEXT>
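A record like the one above can be pulled apart with a few lines of code. The sketch below assumes only what the sample shows: six tagged fields, each appearing once, with commas separating the entries inside KEYWORD and PERSON. The function names are illustrative, and the comma assumption would break for any thesaurus term that itself contains a comma.

```python
import re

# Field names taken from the sample record; no others are assumed.
FIELDS = ("DOCNO", "KEYWORD", "PERSON", "SUMMARY", "SCRATCHPAD", "ASRTEXT")


def parse_segment(record: str) -> dict:
    """Return the fields found in one segment record, whitespace-normalized."""
    fields = {}
    for name in FIELDS:
        match = re.search(rf"<{name}>(.*?)</{name}>", record, flags=re.DOTALL)
        if match:
            fields[name] = " ".join(match.group(1).split())
    return fields


def split_list_field(value: str) -> list:
    """Split a comma-separated field such as PERSON into its entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


sample = (
    "<DOCNO>VHF00017-062567.005</DOCNO>\n"
    "<PERSON> Sophie Perutz, Henry Hemar </PERSON>"
)
parsed = parse_segment(sample)
people = split_list_field(parsed["PERSON"])
```

Keeping the manual fields (KEYWORD, PERSON, SUMMARY, SCRATCHPAD) separate from ASRTEXT makes it straightforward to index each source on its own and compare them.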

Topic Construction

• 280 topical requests, in folders at VHF
– From scholars, teachers, broadcasters, …
• 50 selected for use in the collection
– Recast in TREC topic format
– Some needed to be “broadened”
• 30 assessed during Summer 2003
– 28 yielded at least 5 relevant segments

An Example Topic

Number: 1148

Title: Jewish resistance in Europe

Description: Provide testimonies or describe actions of Jewish resistance in Europe before and during the war.

Narrative: The relevant material should describe actions of only- or mostly Jewish resistance in Europe. Both individual and group-based actions are relevant. Type of actions may include survival (fleeing, hiding, saving children), testifying (alerting the outside world, writing, hiding testimonies), fighting (partisans, uprising, political security). Information about undifferentiated resistance groups is not relevant.

Assessment Strategy

• Exhaustive (Cranfield) is not scalable
• Pooled (TREC) is not yet possible
– Requires a diverse set of ASR and IR systems
– Will be used at CLEF 2005 Speech Retrieval track
• Search-guided (TDT) was viable
– Iterate topic research/search/assessment
– Augment with review, adjudication, reassessment
– Requires an effective interactive search system
– 28 topics: 821 hours/3 months/4 assessors

Defining Topical “Relevance”

• “Classic” relevance (to “food in Auschwitz”)
– Direct: Knew food was sometimes withheld
– Indirect: Saw undernourished people
• Additional relevance types
– Context: Intensity of manual labor
– Comparison: Food situation in a different camp
– Pointer: Mention of a study on the subject

Recording Judgments

Average: 3.2 minutes per judgment

• 14 topics independently assessed
– Assessors later met to resolve differences
• 14 topics assessed and then reviewed
– Decisions of the reviewer were final

Mapping to Binary Relevance

Degree   Overall   Direct   Indirect   Context   Comparison   Pointer
0        1088      1964     3440       2807      3280         3390
1        433       87       19         61        80           18
2        666       250      70         371       105          76
3        853       702      86         285       147          112
4        603       640      28         119       31           47

3,643 adjudicated judgment pairs

“Relevant”

Number of judgments, by type and degree of relevance
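One way to collapse graded, typed judgments like these into a binary label is to pick a set of relevance types and a minimum degree. The defaults below (Direct or Indirect at degree 2 or higher) are an assumption inferred from the "Direct+Indirect 2/3/4" and "Dir+Ind 2-4" labels elsewhere in the deck; the dictionary layout and function name are hypothetical.

```python
def is_relevant(judgment: dict, types=("direct", "indirect"), min_degree=2) -> bool:
    """Return True if any selected relevance type reaches the minimum degree.

    `judgment` maps a relevance type name to its degree on the 0-4 scale;
    a type that is absent is treated as degree 0.
    """
    return any(judgment.get(kind, 0) >= min_degree for kind in types)


# A segment judged Direct=3 counts as relevant under the default mapping.
strong_direct = is_relevant({"direct": 3, "indirect": 0})

# Context alone does not, because "context" is not among the selected types.
context_only = is_relevant({"direct": 1, "context": 4})

# A stricter variant: Direct only, degree 3 or higher.
strict = is_relevant({"direct": 2}, types=("direct",), min_degree=3)
```

Making the types and threshold parameters keeps it easy to rerun an evaluation under alternative definitions of "relevant".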

Assessor Agreement

[Figure: kappa (axis 0.0 to 1.0) by relevance type. Counts shown in parentheses on the chart: Overall (2122), Direct (1592), Indirect (184), Context (775), Comparison (283), Pointer (235).]

14 topics, 4 assessors in 6 pairings, 1806 judgments

44% topic-averaged overlap for Direct+Indirect 2/3/4 judgments
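For readers who want to reproduce agreement figures of this kind, here is a minimal sketch of Cohen's kappa for one pair of assessors, plus a set overlap measure. The slides do not define "overlap"; intersection over union is assumed here because it is the usual choice in retrieval evaluation. Function names are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two assessors on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each assessor labeled at random with their own rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both assessors used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def overlap(relevant_a: set, relevant_b: set) -> float:
    """Size of the intersection divided by the size of the union (assumed definition)."""
    union = relevant_a | relevant_b
    return len(relevant_a & relevant_b) / len(union) if union else 1.0


kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
shared = overlap({"s1", "s2", "s3"}, {"s2", "s3", "s4"})
```

A topic-averaged figure would compute `overlap` per topic and then take the mean across topics.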

Outline: First experiments


ASR-Based Search

[Figure: mean average precision (axis 0.00 to 0.10) for four systems: Inquery, Character n-grams, Okapi, Okapi + Query Expansion.]

Title queries, adjudicated judgments
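Mean average precision is the measure used throughout these results. A minimal sketch of the standard (uninterpolated) definition follows; it assumes each ranked list contains no duplicate segment identifiers, and the function and variable names are illustrative rather than taken from any evaluation tool.

```python
def average_precision(ranked_ids: list, relevant_ids: set) -> float:
    """Mean of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, segment_id in enumerate(ranked_ids, start=1):
        if segment_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: dict, qrels: dict) -> float:
    """Average the per-topic average precision over every topic in qrels."""
    return sum(
        average_precision(runs.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    ) / len(qrels)


runs = {"t1": ["a", "b", "c", "d"], "t2": ["x"]}
qrels = {"t1": {"a", "c"}, "t2": {"y"}}
map_score = mean_average_precision(runs, qrels)
```

Averaging over topics rather than pooling all judgments gives each topic equal weight, regardless of how many relevant segments it has.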


Comparing Index Terms

[Figure: mean average precision (axis 0 to 0.5) by index source: ASR, Scratchpad, ThesTerm, Summary, Metadata, +Persons.]

Comparing ASR and Metadata

[Figure: per-topic average precision (axis 0.0 to 1.0), ASR versus Metadata.]

Failure Analysis


[Figure: per-topic bar chart of "ASR % of Metadata" (axis 0% to 100%). Query terms are marked by category: somewhere in ASR results (bold on the slide if they occur in <35 segments), in ASR lexicon, or only in metadata.]

Query terms shown, one topic per line:
• wit(ness) eichmann
• jew(s) volkswagen
• labor camp ig farben
• slave labor telefunken aeg
• minsk ghetto underground
• wallenberg eichmann
• bomb birkenau
• sonderkommando auschwitz
• liber(ation) buchenwald dachau
• jewish kapo(s)
• kindertransport
• ghetto life
• fort ontario refugee camp
• jewish partisan(s) poland
• jew(s) shanghai
• bulgaria save jew(s)

What Causes the Difference?

• Hypothesis 1: Good human indexers
– Maybe people don’t speak the query terms
– Human indexers can still detect the topic
• Hypothesis 2: Weak ASR language model
– ASR does best on “newspaper terms”: {Bulgaria, partisans} >> {Auschwitz, sonderkommando}
– Mixture: 200 hours in-domain + gigaword corpus



Searching Manual Transcripts

[Figure: average precision (axis 0.0 to 1.0) for ASR, ASR+Rel+Top10, and Metadata on two topics: jewish kapo(s) and fort ontario refugee camp.]

English Speech Recognition for the MALACH Project

MALACH NSF Site Visit 06/14/2004

Ideas Explored

Each line gives the challenge, then the algorithm, then the degree of success:
• Names, places, foreign words: Metadata (PIQ) adapted vocabularies (Huge)
• Automated transcriptions of large volumes of data based on speaker-dependent models: Automatic acoustic segmentation and speaker clustering (High)
• Pronunciation variations: Syllable-centric models (Moderate)
• Accents, age: Accent-specific lexicon, error analysis (Moderate)
• Exploit large volume of speaker-specific data: Confidence-measure based incremental adaptation (Moderate)
• Noise robustness: Spectral subtraction (MMSE filtering), PLP, MVDR, audio-visual models (Small)
• Exploit structure of the interview: Topic-based language models (Small)


Results on 15 interviews

Named Entity Word Error Rate (NE WER)

Vocabulary Size   Static Vocabulary NE WER (%)   Metadata-Adapted Vocabulary NE WER (%)   Relative Gain (%)
30K               66.2                           32.3                                     51.2
60K               62.1                           36.8                                     40.7
90K               61.4                           38.8                                     36.8

Overall Word Error Rate (WER)

Vocabulary Size   Static Vocabulary WER (%)   Metadata-Adapted Vocabulary WER (%)
30K               40.1                        37.6
60K               39.4                        36.7
90K               39.2                        36.5

Halved the Named Entity WER!

Key Result: Use smaller, metadata-adapted vocabularies
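The Relative Gain column follows from the two NE WER columns as the reduction in error divided by the static-vocabulary error. The short sketch below recomputes it from the figures reported on this slide; the function and variable names are illustrative.

```python
def relative_gain(static_wer: float, adapted_wer: float) -> float:
    """Percentage reduction in error rate relative to the static vocabulary."""
    return 100.0 * (static_wer - adapted_wer) / static_wer


# (static NE WER, metadata-adapted NE WER) as reported for each vocabulary size.
ne_wer = {"30K": (66.2, 32.3), "60K": (62.1, 36.8), "90K": (61.4, 38.8)}

gains = {
    size: round(relative_gain(static, adapted), 1)
    for size, (static, adapted) in ne_wer.items()
}
```

Rounded to one decimal place, the computed values match the reported 51.2, 40.7, and 36.8, with the largest gain at the smallest vocabulary size.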

Natural Language Processing Components: Topic Segmentation, Categorization


Categorization Using Automatic Speech Recognition

kNN, test: 332 segments, 216 categories with 10+ training samples

Document-based best F [%], by training data (columns) and test transcripts (rows):

                     Human, 178 hours   ASR, ~178 hours   ASR, 580 hours
Test: human          37.7               37.2              39.2
Test: ASR, 42% WER   33.2               34.2              36.8

Training ASR word error rate: ~47%

Equal performance for human and ASR transcripts

Improvement with additional training data



Segment with Categories

what can you tell me about the holidays in your house

they were very very nice and my father was very religious and everything was kept like it's suppose to be you know like it's written in the Torah …

and there was a extra room special if a very poor man came and he stayed overnight there was a room for him to sleep there …

my mother mainly she thought that this is a very big mitzvah to do you know because there in in where I come from there was a lot of poor people not from our town from out of town but use to come and sometimes they couldn't make it back home so they slept over and sometime they stayed over Shabbats …

my mother helped in the business and we had a nanna that took care of us and we also had a maid in the house …

my mother only did the cooking …

my father with his second wife didn't have children for twenty years he lived with her and they didn't have any children …

then he married my mother and with my mother they had four children …

Human indexing: Jewish customs and observance; family life; socioeconomic status; Czechoslovakia 11/11/1918 - 3/14/1939

kNN: Jewish customs and observance; family life; extended family members; family homes; Poland 11/11/1918 - 8/31/1939

Precision: 2/5, Recall: 2/4, F = 44%
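The precision, recall, and F figures for this segment can be checked by treating the two category lists as sets. The sketch assumes the four-term list is the human indexing and the five-term list is the kNN output, which is the reading consistent with the stated 2/5 and 2/4; the function name is illustrative.

```python
def precision_recall_f1(assigned: set, reference: set) -> tuple:
    """Set-based precision, recall, and balanced F for one segment."""
    correct = len(assigned & reference)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


human = {
    "Jewish customs and observance",
    "family life",
    "socioeconomic status",
    "Czechoslovakia 11/11/1918 - 3/14/1939",
}
knn = {
    "Jewish customs and observance",
    "family life",
    "extended family members",
    "family homes",
    "Poland 11/11/1918 - 8/31/1939",
}

precision, recall, f1 = precision_recall_f1(knn, human)
```

Two of the five kNN categories match the human indexing, giving precision 0.4, recall 0.5, and F of about 0.44.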

Category Expansion

• Training: 3,199 training segments, each pairing spoken words (hand transcribed) with thesaurus terms
• Test: spoken words (ASR transcript) from test segments pass through kNN categorization, which produces thesaurus terms for the index
• F = 0.19 (microaveraged)

[Figure: mean average precision (axis 0.00 to 0.10) as the combination weight moves from 0.0 to 1.0 between ASR words and thesaurus terms; the labeled value is 0.0941.]

Title queries, linear score combination, adjudicated judgments
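A linear score combination of two indexes can be sketched as a weighted sum of per-segment scores. The slides do not say how scores were normalized or which end of the weight axis corresponds to which index, so in this sketch a weight of 1.0 means ASR words only, scores are assumed to be on comparable scales, and a segment missing from one result list scores 0.0 there. All names are illustrative.

```python
def combine_scores(asr_scores: dict, thesaurus_scores: dict, weight: float) -> list:
    """Rank segments by weight * ASR score + (1 - weight) * thesaurus score."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    segment_ids = set(asr_scores) | set(thesaurus_scores)
    combined = {
        sid: weight * asr_scores.get(sid, 0.0)
        + (1.0 - weight) * thesaurus_scores.get(sid, 0.0)
        for sid in segment_ids
    }
    # Highest combined score first; ties broken by segment id for stable output.
    return sorted(combined, key=lambda sid: (-combined[sid], sid))


asr = {"s1": 0.9, "s2": 0.1}
thesaurus = {"s2": 0.8, "s3": 0.5}

asr_only = combine_scores(asr, thesaurus, weight=1.0)
thesaurus_only = combine_scores(asr, thesaurus, weight=0.0)
```

Sweeping `weight` over a grid and scoring each resulting ranking is one way to produce a curve like the one on this slide.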

ASR-Based Search

[Figure: mean average precision (axis 0.00 to 0.10) for Inquery, Character n-grams, Okapi, Okapi + Query Expansion, Okapi + Category Expansion, and Okapi + QE + CE. Annotations on the chart: +27%; average of 3.4 relevant segments in top 20.]

Sensitivity Analysis

[Figure: mean average precision (axis 0.00 to 0.12) for Inquery and Okapi+QueryExp+CatExp under seven relevance definitions: Dir+Ind 2-4, Direct 2-4, Indirect 2-4, Overall 2-4, Direct 3-4, Indirect 3-4, Overall 3-4.]



Human-Assigned Segment Boundary

... because the roads were crowded with with army units going back and forth you know .. and you also were off you had to walk no on the main road because you were afraid you were going to be picked up for work .. that's what some did they came to Loetche and some people were picked up and held four weeks for work .. when they came home they told us

we came we came home was was about the time of Succoth .. you know the city was deserted there was a they were already taking people to work .. when we came home we couldn't recognize the city .. my parents first of all they confiscated everything .. they told us to get out of the orchard .. they took whatever they wanted they took over the whole ranch ...

--- segment boundary ---

on the way

arrival



Probabilistic Models for Segmentation

Model features
• semantic: left-right window similarity
• lexical: “key” words and phrases (yes: “tell me”, “back to”; no: “did they”, “and there”)
• prosodic: silence duration, rate of speech
• structural: position in the file, clause length

…when they came home they told us

we came home was about the time of Succoth…
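The "left-right window similarity" feature can be illustrated with a bag-of-words cosine between the words just before and just after a candidate boundary; a low value suggests a topic shift. The slides do not specify the similarity measure or window size, so both are assumptions here, as are the function and parameter names.

```python
import math
from collections import Counter


def window_similarity(words: list, boundary: int, window: int = 50) -> float:
    """Cosine similarity of word counts left and right of a candidate boundary."""
    left = Counter(words[max(0, boundary - window):boundary])
    right = Counter(words[boundary:boundary + window])
    dot = sum(left[w] * right[w] for w in left.keys() & right.keys())
    norm = math.sqrt(sum(c * c for c in left.values())) * math.sqrt(
        sum(c * c for c in right.values())
    )
    return dot / norm if norm else 0.0


words = "a b a b c d c d".split()

# No shared vocabulary across position 4: similarity is 0, a likely boundary.
across_shift = window_similarity(words, boundary=4, window=4)

# Identical vocabulary on both sides of position 2: similarity is close to 1.
within_topic = window_similarity(words, boundary=2, window=2)
```

In a probabilistic segmenter this value would be one input among the lexical, prosodic, and structural features listed on the slide.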



Segmentation Using Automatic Speech Recognition

Equal Error Rate [%] (Miss Rate = False Alarm Rate), by training data (columns) and test transcripts (rows):

                     Human, 178 hours   ASR, ~178 hours   ASR, 580 hours
Test: human          24.2               24.1              24.1
Test: ASR, 42% WER   24.8               24.9              23.5

Training ASR word error rate: ~47%

Equal performance for human and ASR transcripts

Modest improvement with additional ASR training data

CLEF CL-SR Evaluation

• Test collection release
– Available on research license from ELDA
• Packaged as a standard IR collection
– ASR transcripts / known topic boundaries
– Contrastive metadata
• Training topics (with relevance judgments)
• 25 double-blind evaluation topics
– Runs due June 1 2005, results returned Aug 1
• Plan to add Czech in 2006

What Have We Learned?

• User studies help guide test collection design
– Named entities are important to scholars
– Age at time of experience is important to teachers
• Test collections guide component development
– Dynamic ASR lexicon cuts NE error rate in half
• Text classification seems to be helping
– Presently depends on lexical overlap w/thesaurus

The MALACH Team

• Shoah Foundation
– Sam Gustman
• Cambridge University
– Bill Byrne
• Johns Hopkins
– Jim Mayfield (APL)
• Charles University
– Jan Hajic
• Univ of West Bohemia
– Josef Psutka
• IBM TJ Watson
– Bhuvana Ramabhadran
– Michael Picheny
– Martin Franz
– Nanda Kambhatla
• University of Maryland
– Doug Oard (IS)
– Dagobert Soergel (IS)
– David Doermann (CS)
– Bonnie Dorr (CS)
– Philip Resnik (Linguistics)

Some Things to Think About

• Privacy protection
– Working with real data has real consequences
• Are fixed segments the right retrieval unit?
– Or is it good enough to know where to start?
• What will it cost to tailor an ASR system?
– $100K to $1 million per application?
• Is ASR fast enough to really scale up?
– 0.1 to 10 machine-hours per hour of speech

For More Information

• The MALACH project
– http://www.clsp.jhu.edu/research/malach
• CLEF-2005 evaluation
– http://www.clef-campaign.org
• NSF/DELOS Spoken Word Access Group
– http://www.dcs.shef.ac.uk/spandh/projects/swag

Backup Slides

The Need for Scalable Solutions

[Figure: collection sizes in millions of hours, on a log scale from 0.0001 to 1000: Phone calls in a day, Webcasts in a year, British Library, Shoah Foundation, SingingFish, SpeechBot, TDT.]

196,000 Annotated Segments

Rows run in interview-time order:

Location-Time   Subject                          Person
Berlin-1939     Employment                       Josef Stein
Berlin-1939     Family life                      Gretchen Stein Anna Stein
Dresden-1939    Schooling                        Gunter Wendt Maria
Dresden-1939    Relocation Transportation-rail

+ Segment summaries + Indexer’s notes

Comparing Index Terms

[Figure: mean average precision (axis 0.0 to 0.5) for Title and Full queries by index source: ASR, Notes, ThesTerm, Summary, Metadata, Metadata+ASR.]

Topical relevance, adjudicated judgments, Inquery

Density of Relevant Segments

[Figure: relevance judgments per topic, as number of segments (axis 0 to 800) split into Relevant and Not Relevant; labeled topics: fort ontario refugee camp, jewish kapo(s).]

Collection size: 9,947 segments

Adjudicated judgments

Effect of Judgment Differences

[Figure: scatter plot of mean average precision with adjudicated judgments against mean average precision with single-assessor judgments, both axes 0.0 to 0.5.]

18 experiments, 4 reversals (<0.01 absolute, <4% relative)

Various conditions and systems

Searching Czech

[Diagram components: Czech Audio, Czech Audio (15 min./interview), ASR System, Czech Transcripts, Language Model, Bilingual Dictionary, kNN Classification, Cross Language Retrieval System.]
