March 30, 2005 University of Kentucky
Using Oral History to Learn About Searching
Spontaneous Conversational Speech
Douglas W. Oard
College of Information Studies and Institute for Advanced Computer Studies
University of Maryland, College Park
Outline
• Spoken word collections
• The MALACH Project
• Building an IR test collection
• First experiments
• Some things to think about
A Web of Speech?
                                               Web in 1995              Speech in 2004
Storage (words per $)                          300K                     1.5M
Internet Backbone (simultaneous users)         250K                     30M
“Last Mile” (download time)                    1 second (no graphics)   Streaming
Display Capability (computers/US population)   10%                      100%
Search Systems                                 Lycos, Yahoo             SpeechBot, SingingFish
Spoken Word Collections
• Broadcast programming
– News, interview, talk radio, sports, entertainment
• Scripted stories
– Books on tape, poetry reading, theater
• Spontaneous storytelling
– Oral history, folklore
• Incidental recording
– Speeches, oral arguments, meetings, phone calls
Outline
• Spoken word collections
The MALACH Project
• Building an IR test collection
• First experiments
• Some things to think about
Shoah Foundation Collection
• Substantial scale
– 116,000 hours; 52,000 interviews; 32 languages
• Spontaneous conversational speech
– Accents, elderly, emotional, …
• Accessible
– $100 million collection and digitization investment
• Manually indexed (10,000 hours)
– Segmented, thesaurus terms, people, summaries
• Users
– A department working full time on dissemination
Interview Excerpt
• Audio characteristics
– Accented (this one is unusually clear)
– Separate channels for interviewer / interviewee
• Dialog structure
• Interviewers have different styles
• Content characteristics
– Domain-specific terms
– Named entity mentions and relationships
The MALACH Project
Speech Technology: English, Czech, Russian, Slovak
Language Technology: Topic Segmentation, Categorization, Extraction, Translation
Interactive Search Systems: Interface Development, User Studies
Search Technology: Test Collection
English ASR Accuracy
[Figure: word error rate (%), axis 0 to 100, plotted over time from Jan-02 to Jan-05]
Training: 200 hours from 800 speakers
Outline
• Spoken word collections
• The MALACH Project
Building an IR test collection
• First experiments
• Some things to think about
Who Uses the Collection?
Discipline
• History
• Linguistics
• Journalism
• Material culture
• Education
• Psychology
• Political science
• Law enforcement
Products
• Book
• Documentary film
• Research paper
• CDROM
• Study guide
• Obituary
• Evidence
• Personal use
Based on analysis of 280 written requests
Observational Studies
8 independent searchers
– Holocaust studies (2)
– German Studies
– History/Political Science
– Ethnography
– Sociology
– Documentary producer
– High school teacher
8 teamed searchers
– All high school teachers
Thesaurus-based search
Rich data collection
– Intermediary interaction
– Semi-structured interviews
– Observational notes
– Think-aloud
– Screen capture
Qualitative analysis
– Theory-guided coding
– Abductive reasoning
Clarify themes and define activities
Search and view testimony
Powerful testimonies give teachers ideas on what to discuss in the classroom (topic) and how to introduce it (activity).
Group discussions clarify themes and define activities, which hone teachers’ criteria.
“The brainstorming really guided my search today, and I felt like I finally had a big enough chunk of time on this to really find something, but I need about a couple hundred more hours … Yesterday in my search, I just felt like I was kind of going around in the dark. But that productive writing session really directed my search, even though I stayed with [the same testimony] the whole time.”
8 teachers, working in groups of 4
Thesaurus-Based Search
Theme                        Thesaurus Term(s)
Character-defining moments   Denial of religion
Character-defining moments   Forced labor
Character-defining moments   (Acts of defiance) AND (humiliation and harassment)
Character-defining moments   Decisions on flight
Making a difference          Aid
8 teachers, working in groups of 4
Relevance Criteria
Number of Mentions
Criterion                 All          Think-Aloud
                          (N=703)      Relevance Judgment (N=300)   Query Form. (N=248)
Topicality                535 (76%)    219                          234
Richness                  39 (5.5%)    14                           0
Emotion                   24 (3.4%)    7                            0
Audio/Visual Expression   16 (2.3%)    5                            0
Comprehensibility         14 (2%)      1                            10
Duration                  11 (1.6%)    9                            0
Novelty                   10 (1.4%)    4                            2
6 Scholars, 1 teacher, 1 film producer, working individually
Topicality
[Figure: bar chart of total mentions (axis 0 to 140) by facet: Object, Time Frame, Organization/Group, Subject, Event/Experience, Place, Person]
6 Scholars, 1 teacher, 1 movie producer, working individually
Search Architecture
[Diagram components: Automatic Search, Boundary Detection, Interactive Selection, Content Tagging, Speech Recognition, Query Formulation]
MALACH Test Collection
[Diagram components: Automatic Search, Boundary Detection, Speech Recognition, Query Formulation, Topic Statements, Ranked Lists, Evaluation, Relevance Judgments, Mean Average Precision, Interviews, Comparable Collection, Content Tagging]
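The evaluation box in this diagram scores ranked lists against relevance judgments using mean average precision. A minimal sketch of that measure follows, assuming the usual uninterpolated definition; the function names, topic IDs, and segment IDs are hypothetical and not taken from the MALACH tooling.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Uninterpolated average precision for one topic."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, segment_id in enumerate(ranked, start=1):
        if segment_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant segments.
            precision_sum += hits / rank
    # Relevant segments that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of per-topic average precision over the judged topics."""
    scores = [average_precision(runs.get(topic, []), relevant)
              for topic, relevant in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two topics with made-up segment IDs.
runs = {"T1": ["s1", "s2", "s3", "s4"], "T2": ["s9", "s8"]}
qrels = {"T1": {"s1", "s3"}, "T2": {"s8"}}
print(mean_average_precision(runs, qrels))
```

Averaging per topic first means that a topic with few relevant segments counts as much as one with many.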
4,000 English Interviews
– ASR training: 800
– Test collection: 247
– Available now for expansion: 467
– Available by Dec 2005: 2,486
10,000 hours, full-description manual indexing
9,947 segments, ~400 words each (total: 625 hours)
<DOCNO>VHF00017-062567.005</DOCNO>
<KEYWORD> Warsaw (Poland), Poland 1935 (May 13) - 1939 (August 31), awareness of political or military events, schools </KEYWORD>
<PERSON> Sophie Perutz, Henry Hemar </PERSON>
<SUMMARY> AH talks about the college she attended before the war. She mentions meeting her husband. She discusses young peoples' awareness of the political events that preceded the outbreak of war. </SUMMARY>
<SCRATCHPAD> graduated HS, went to college 1 year, professional college hotel management; met future husband, knew that they'd end up together; sister also in college, nice social life, lots of company, not too serious; already got news from Czechoslovakia, Sudeten, knew that Poland would be next but what could they do about it, very passive; just heard info from radio and press </SCRATCHPAD>
<ASRTEXT> no no no they did no not not uh i know there was no place to go we didn't have family in a in other countries so we were not financially at the at extremely went so that was never at plano of my family it is so and so that was the atmosphere in the in the country prior to the to the war i graduate take the high school i had one year of college which was a profession and that because that was already did the practical trends f so that was a study for whatever management that eh eh education and this i i had only one that here all that at that time i met my future husband and that to me about any we knew it that way we were in and out together so and i was quite county there was so whatever i did that and this so that was the person that lived my sister was it here is first year of of colleagues and and also she had a very strongly this antisemitic trend and our parents there was a nice social life young students that we had open house always pleasant we had a lot of that company here and and we were not too serious about that she we got there we were getting the they already did knew he knew so from czechoslovakia from they saw that from other part and we knew the in that that he is uhhuh the hitler spicy we go into this year this direction that eh poland will be the next country but there was nothing that we would do it at that time so he was a very very he says belong to any any organizations especially that the so we just take information from the radio and from the dress </ASRTEXT>
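A segment record like the one above can be split into its fields with a few lines of code. This is a minimal sketch, assuming every field is a non-nested pair of upper-case tags as in the example; `parse_segment` is a hypothetical helper, not part of any released collection tooling, and the sample record is abbreviated from the slide.

```python
import re
from typing import Dict

# Matches <TAG> ... </TAG> pairs; DOTALL lets a field span several lines.
FIELD_PATTERN = re.compile(r"<([A-Z]+)>(.*?)</\1>", re.DOTALL)


def parse_segment(record: str) -> Dict[str, str]:
    """Return a {tag: text} mapping with whitespace normalized."""
    return {tag: " ".join(body.split())
            for tag, body in FIELD_PATTERN.findall(record)}


record = """<DOCNO>VHF00017-062567.005</DOCNO>
<PERSON> Sophie Perutz, Henry Hemar </PERSON>
<SUMMARY> AH talks about the college she attended before the war. </SUMMARY>"""

fields = parse_segment(record)
people = [name.strip() for name in fields["PERSON"].split(",")]
print(fields["DOCNO"], people)
```

Keeping the fields separate makes it easy to index ASR text and manual metadata independently, which is what the later comparisons between index terms rely on.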
Topic Construction
• 280 topical requests, in folders at VHF
– From scholars, teachers, broadcasters, …
• 50 selected for use in the collection
– Recast in TREC topic format
– Some needed to be “broadened”
• 30 assessed during Summer 2003
– 28 yielded at least 5 relevant segments
An Example Topic
Number: 1148
Title: Jewish resistance in Europe
Description: Provide testimonies or describe actions of Jewish resistance in Europe before and during the war.
Narrative: The relevant material should describe actions of only- or mostly Jewish resistance in Europe. Both individual and group-based actions are relevant. Type of actions may include survival (fleeing, hiding, saving children), testifying (alerting the outside world, writing, hiding testimonies), fighting (partisans, uprising, political security). Information about undifferentiated resistance groups is not relevant.
Assessment Strategy
• Exhaustive (Cranfield) is not scalable
• Pooled (TREC) is not yet possible
– Requires a diverse set of ASR and IR systems
– Will be used at CLEF 2005 Speech Retrieval track
• Search-guided (TDT) was viable
– Iterate topic research/search/assessment
– Augment with review, adjudication, reassessment
– Requires an effective interactive search system
– 28 topics: 821 hours/3 months/4 assessors
Defining Topical “Relevance”
• “Classic” relevance (to “food in Auschwitz”)
– Direct: Knew food was sometimes withheld
– Indirect: Saw undernourished people
• Additional relevance types
– Context: Intensity of manual labor
– Comparison: Food situation in a different camp
– Pointer: Mention of a study on the subject
Recording Judgments
Average: 3.2 minutes per judgment
• 14 topics independently assessed
– Assessors later met to resolve differences
• 14 topics assessed and then reviewed
– Decisions of the reviewer were final
Mapping to Binary Relevance
Number of judgments, by type and degree of relevance
Degree   Overall   Direct   Indirect   Context   Comparison   Pointer
0        1088      1964     3440       2807      3280         3390
1        433       87       19         61        80           18
2        666       250      70         371       105          76
3        853       702      86         285       147          112
4        603       640      28         119       31           47
3,643 adjudicated judgment pairs
[Slide callout: “Relevant”]
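Collapsing the graded, typed judgments into a binary label can be sketched as below. The rule is an assumption: a segment counts as relevant when its Direct or Indirect degree is 2 or higher, which matches the "Direct+Indirect 2/3/4" wording used elsewhere in these slides but is not spelled out on this one. The function name and dictionary layout are hypothetical.

```python
from typing import Dict, Sequence


def is_relevant(judgment: Dict[str, int],
                types: Sequence[str] = ("direct", "indirect"),
                min_degree: int = 2) -> bool:
    """Binary relevance: any selected type judged at min_degree or above.

    Assumed rule, not confirmed by the slides. Degrees run from 0 to 4.
    """
    return any(judgment.get(t, 0) >= min_degree for t in types)


# Hypothetical judgments for three segments.
print(is_relevant({"direct": 0, "indirect": 3}))   # indirect evidence only
print(is_relevant({"direct": 1, "context": 4}))    # context alone is ignored
print(is_relevant({"direct": 2}, types=("direct",), min_degree=3))
```

Making the types and the threshold parameters allows the same judgments to be re-scored under stricter or looser definitions of relevance.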
Assessor Agreement
[Figure: bar chart of kappa (axis 0.0 to 1.0) by relevance type, with counts in parentheses: Overall (2122), Direct (1592), Indirect (184), Context (775), Comparison (283), Pointer (235)]
14 topics, 4 assessors in 6 pairings, 1806 judgments
44% topic-averaged overlap for Direct+Indirect 2/3/4 judgments
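Agreement between a pair of assessors is reported here as kappa. A minimal sketch of Cohen's kappa for two assessors follows; that the slide uses Cohen's variant is an assumption, and the binary labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two assessors on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if each assessor labeled at random with
    # their own label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary judgments (1 = relevant) for eight segments.
assessor_1 = [1, 1, 0, 0, 1, 0, 1, 0]
assessor_2 = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(assessor_1, assessor_2))
```

In this toy case the assessors agree on 6 of 8 segments, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.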
Outline
• Spoken word collections
• The MALACH Project
• Building an IR test collection
First experiments
• Some things to think about
ASR-Based Search
[Figure: bar chart of mean average precision (axis 0.00 to 0.10) for Inquery, Character n-grams, Okapi, Okapi + Query Expansion]
Title queries, adjudicated judgments
Comparing Index Terms
[Figure: bar chart of mean average precision (axis 0 to 0.5) for ASR, Scratchpad, ThesTerm, Summary, Metadata, +Persons]
Title queries, adjudicated judgments
Comparing ASR and Metadata
[Figure: average precision (axis 0.0 to 1.0) by topic for ASR and Metadata; topics: 1188, 1630, 1187, 1628, 1446, 1330, 1345, 1181, 1225, 14312, 1192, 1551, 1414, 1623, 1605, 1179]
Title queries, adjudicated judgments
Failure Analysis
Title queries, adjudicated judgments
Query term legend: Somewhere in ASR Results (bold occur in <35 segments); in ASR Lexicon; Only in Metadata
Query terms:
wit(ness) eichmann
jew(s) volkswagen
labor camp ig farben
slave labor telefunken aeg
minsk ghetto underground
wallenberg eichmann
bomb birkenau
sonderkommando auschwitz
liber(ation) buchenwald dachau
jewish kapo(s)
kindertransport
ghetto life
fort ontario refugee camp
jewish partisan(s) poland
jew(s) shanghai
bulgaria save jew(s)
[Figure: bar chart of ASR % of Metadata (axis 0% to 100%) by topic: 1179, 1605, 1623, 1414, 1551, 1192, 14312, 1225, 1181, 1345, 1330, 1446, 1628, 1187, 1188, 1630]
What Causes the Difference?
• Hypothesis 1: Good human indexers
– Maybe people don’t speak the query terms
– Human indexers can still detect the topic
• Hypothesis 2: Weak ASR language model
– ASR does best on “newspaper terms”
• {Bulgaria, partisans} >> {Auschwitz, sonderkommando}
– Mixture: 200 hours in-domain + gigaword corpus
Searching Manual Transcripts
[Figure: average precision (axis 0.0 to 1.0) by topic for ASR, ASR+Rel+Top10, and Metadata; labeled topics: jewish kapo(s), fort ontario refugee camp]
Title queries, adjudicated judgments
English Speech Recognition for the MALACH Project
36 MALACH NSF Site Visit 06/14/2004
Ideas Explored
Challenge | Algorithm | Success
Names, Places, Foreign words | Metadata (PIQ) adapted vocabularies | Huge
Automated transcriptions of large volumes of data based on speaker-dependent models | Automatic Acoustic Segmentation and Speaker Clustering | High
Pronunciation Variations | Syllable-centric Models | Moderate
Accents, Age | Accent-Specific Lexicon, Error Analysis | Moderate
Exploit large-volume of speaker-specific data | Confidence-measure based incremental adaptation | Moderate
Noise Robustness | Spectral subtraction (MMSE filtering), PLP, MVDR, Audio Visual Models | Small
Exploit structure of the interview | Topic-based Language Models | Small
Results on 15 interviews
Named Entity Word Error Rate (NE WER)
Vocabulary Size   Static Vocabulary NE WER (%)   Metadata-adapted Vocabulary NE WER (%)   Relative Gain (%)
30K               66.2                           32.3                                     51.2
60K               62.1                           36.8                                     40.7
90K               61.4                           38.8                                     36.8
Overall Word Error Rate (WER)
Vocabulary Size   Static Vocabulary WER (%)   Metadata-adapted Vocabulary WER (%)
30K               40.1                        37.6
60K               39.4                        36.7
90K               39.2                        36.5
Halved the Named Entity WER!
Key Result: Use smaller, metadata-adapted vocabularies
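The relative gain column can be reproduced from the two named-entity WER columns. The sketch below assumes relative gain means the reduction in WER as a percentage of the static-vocabulary WER; the function name is hypothetical, and the numbers are the ones on the slide.

```python
def relative_gain(static_wer: float, adapted_wer: float) -> float:
    """Percentage reduction in WER relative to the static vocabulary."""
    return 100.0 * (static_wer - adapted_wer) / static_wer


# (static NE WER, metadata-adapted NE WER) by vocabulary size, from the slide.
ne_wer = {"30K": (66.2, 32.3), "60K": (62.1, 36.8), "90K": (61.4, 38.8)}

gains = {size: round(relative_gain(static, adapted), 1)
         for size, (static, adapted) in ne_wer.items()}
for size, gain in gains.items():
    print(size, gain)
```

Under that definition the computed values agree with the slide's 51.2, 40.7, and 36.8, and the 30K vocabulary is the only one where the named-entity WER is actually cut by more than half.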
Natural Language Processing Components: Topic Segmentation, Categorization
Categorization Using Automatic Speech Recognition
kNN, test: 332 segments, 216 categories with 10+ training samples
Document-based best F [%] (training ASR Word Error Rate ~ 47%)
                      Training: Human, 178 hours   Training: ASR, ~178 hours   Training: ASR, 580 hours
Test: human           37.7                         37.2                        39.2
Test: ASR, 42% WER    33.2                         34.2                        36.8
Equal performance for human and ASR transcripts
Improvement with additional training data
Segment with Categories
what can you tell me about the holidays in your house
they were very very nice and my father was very religious and everything was kept like it's suppose to be you know like it's written in the Torah …
and there was a extra room special if a very poor man came and he stayed overnight there was a room for him to sleep there …
my mother mainly she thought that this is a very big mitzvah to do you know because there in in where I come from there was a lot of poor people not from our town from out of town but use to come and sometimes they couldn't make it back home so they slept over and sometime they stayed over Shabbats …
my mother helped in the business and we had a nanna that took care of us and we also had a maid in the house …
my mother only did the cooking …
my father with his second wife didn't have children for twenty years he lived with her and they didn't have any children …
then he married my mother and with my mother they had four children …
Human Indexing: Jewish customs and observance; family life; socioeconomic status; Czechoslovakia 11/11/1918 - 3/14/1939
kNN: Jewish customs and observance; family life; extended family members; family homes; Poland 11/11/1918 - 8/31/1939
Precision: 2/5, Recall: 2/4, F = 44%
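The precision, recall, and F figures for this segment follow from treating the two category lists as sets. A minimal sketch, assuming the four-category list is the human indexing and the five-category list is the kNN output (inferred from the 2/5 and 2/4 denominators), and that F is the balanced harmonic mean:

```python
from typing import Set, Tuple


def precision_recall_f(assigned: Set[str], reference: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and balanced F for one segment."""
    correct = len(assigned & reference)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


human = {"Jewish customs and observance", "family life",
         "socioeconomic status", "Czechoslovakia 11/11/1918 - 3/14/1939"}
knn = {"Jewish customs and observance", "family life",
       "extended family members", "family homes",
       "Poland 11/11/1918 - 8/31/1939"}

p, r, f = precision_recall_f(knn, human)
print(p, r, round(100 * f))
```

Two of the five kNN categories match the human indexing, and two of the four human categories are recovered, which gives the 44% reported on the slide.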
Category Expansion
[Diagram: 3,199 training segments pair Spoken Words (hand transcribed) with Thesaurus Terms; kNN Categorization maps the Spoken Words (ASR transcript) of test segments to Thesaurus Terms, which feed the Index]
F=0.19 (microaveraged)
[Figure: mean average precision (axis 0.00 to 0.10) as a function of the linear combination weight (0.0 to 1.0) between ASR Words and Thesaurus Terms; labeled value 0.0941]
Title queries, linear score combination, adjudicated judgments
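A linear score combination of two retrieval runs, one over ASR words and one over automatically assigned thesaurus terms, might look like the sketch below. The weighting direction, the treatment of missing scores as zero, and all names and scores are assumptions made for illustration; the slides do not specify them, and real scores would normally be normalized to a common range first.

```python
from typing import Dict, List


def combine(asr_scores: Dict[str, float],
            thesaurus_scores: Dict[str, float],
            weight: float) -> List[str]:
    """Rank segments by weight * ASR score + (1 - weight) * thesaurus score."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    segment_ids = set(asr_scores) | set(thesaurus_scores)
    combined = {
        seg: weight * asr_scores.get(seg, 0.0)
        + (1.0 - weight) * thesaurus_scores.get(seg, 0.0)
        for seg in segment_ids
    }
    # Highest combined score first; ties broken by segment ID for stability.
    return sorted(combined, key=lambda seg: (-combined[seg], seg))


# Hypothetical, already normalized scores for three segments.
asr = {"s1": 0.9, "s2": 0.2}
thesaurus = {"s2": 0.8, "s3": 0.5}
print(combine(asr, thesaurus, 0.5))
print(combine(asr, thesaurus, 1.0))
```

Sweeping `weight` from 0.0 to 1.0 and scoring each ranking against the relevance judgments would produce a curve of the kind plotted on this slide.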
ASR-Based Search
[Figure: bar chart of mean average precision (axis 0.00 to 0.10) for Inquery, Character n-grams, Okapi, Okapi + Query Expansion, Okapi + Category Expansion, Okapi + QE + CE; callout: +27%]
Title queries, adjudicated judgments
Average of 3.4 relevant segments in top 20
Sensitivity Analysis
[Figure: bar chart of mean average precision (axis 0.00 to 0.12) for Inquery and Okapi+QueryExp+CatExp under seven relevance definitions: Dir+Ind 2-4, Direct 2-4, Indirect 2-4, Overall 2-4, Direct 3-4, Indirect 3-4, Overall 3-4]
Human-Assigned Segment Boundary
on the way
... because the roads were crowded with with army units going back and forth you know .. and you also were off you had to walk no on the main road because you were afraid you were going to be picked up for work .. that's what some did they came to Loetche and some people were picked up and held four weeks for work .. when they came home they told us
--- segment boundary ---
arrival
we came we came home was was about the time of Succoth .. you know the city was deserted there was a they were already taking people to work .. when we came home we couldn't recognize the city .. my parents first of all they confiscated everything .. they told us to get out of the orchard .. they took whatever they wanted they took over the whole ranch ...
Probabilistic Models for Segmentation
Model features
• semantic: left-right window similarity
• lexical: “key” words and phrases (yes: “tell me”, “back to”; no: “did they”, “and there”)
• prosodic: silence duration, rate of speech
• structural: position in the file, clause length
…when they came home they told us
we came home was about the time of Succoth…
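The semantic feature, left-right window similarity, can be illustrated with a bag-of-words cosine between the words just before and just after a candidate boundary. This is a sketch under stated assumptions: the slides do not say how the similarity is computed, so the cosine measure, the window width, and the toy tokens are all hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def window_similarity(tokens: List[str], position: int, width: int) -> float:
    """Similarity of the `width` tokens before and after `position`."""
    left = Counter(tokens[max(0, position - width):position])
    right = Counter(tokens[position:position + width])
    return cosine(left, right)


# Hypothetical token stream that shifts topic after the third word.
tokens = ["road", "army", "road", "home", "city", "home"]
print(window_similarity(tokens, 3, 3))  # no shared words across the boundary
print(window_similarity(tokens, 2, 2))  # one shared word
```

A low value at a position suggests a vocabulary shift there; in a probabilistic model it would be one feature among the lexical, prosodic, and structural ones listed above.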
Segmentation Using Automatic Speech Recognition
Equal Error Rate [%] (Miss Rate = False Alarm Rate), training ASR Word Error Rate ~ 47%
                      Training: Human, 178 hours   Training: ASR, ~178 hours   Training: ASR, 580 hours
Test: human           24.2                         24.1                        24.1
Test: ASR, 42% WER    24.8                         24.9                        23.5
Equal performance for human and ASR transcripts
Modest improvement with additional ASR training data
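The equal error rate reported here is the operating point where the miss rate equals the false alarm rate. A minimal sketch of estimating it from boundary scores follows, assuming higher scores mean "boundary" and taking the threshold where the two rates are closest; the scores and labels are hypothetical.

```python
from typing import List


def equal_error_rate(scores: List[float], labels: List[int]) -> float:
    """Approximate EER by sweeping thresholds over the observed scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both boundary and non-boundary examples")
    best_gap, best_rate = None, None
    for threshold in sorted(set(scores)):
        # A true boundary scored below the threshold is a miss.
        misses = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
        # A non-boundary scored at or above the threshold is a false alarm.
        false_alarms = sum(1 for s, y in zip(scores, labels)
                           if not y and s >= threshold)
        miss_rate = misses / positives
        false_alarm_rate = false_alarms / negatives
        gap = abs(miss_rate - false_alarm_rate)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_rate = (miss_rate + false_alarm_rate) / 2
    return best_rate


# Hypothetical boundary scores; label 1 marks a true segment boundary.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(equal_error_rate(scores, labels))
```

With few examples the two rates may never be exactly equal, so the sketch averages them at the closest threshold.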
CLEF CL-SR Evaluation
• Test collection release
– Available on research license from ELDA
• Packaged as a standard IR collection
– ASR Transcripts / known topic boundaries
– Contrastive metadata
• Training topics (with relevance judgments)
• 25 Double-blind evaluation topics
– Runs due June 1 2005, results returned Aug 1
• Plan to add Czech in 2006
What Have We Learned?
• User studies help guide test collection design
– Named entities are important to scholars
– Age at time of experience is important to teachers
• Test collections guide component development
– Dynamic ASR lexicon cuts NE error rate in half
• Text classification seems to be helping
– Presently depends on lexical overlap w/thesaurus
The MALACH Team
• Shoah Foundation
– Sam Gustman
• Cambridge University
– Bill Byrne
• Johns Hopkins
– Jim Mayfield (APL)
• Charles University
– Jan Hajic
• Univ of West Bohemia
– Josef Psutka
• IBM TJ Watson
– Bhuvana Ramabhadran
– Michael Picheny
– Martin Franz
– Nanda Kambhatla
• University of Maryland
– Doug Oard (IS)
– Dagobert Soergel (IS)
– David Doermann (CS)
– Bonnie Dorr (CS)
– Philip Resnik (Linguistics)
Some Things to Think About
• Privacy protection
– Working with real data has real consequences
• Are fixed segments the right retrieval unit?
– Or is it good enough to know where to start?
• What will it cost to tailor an ASR system?
– $100K to $1 million per application?
• Is ASR fast enough to really scale up?
– 0.1 to 10 machine-hours per hour of speech
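To make the scaling question concrete, the quoted range of 0.1 to 10 machine-hours per hour of speech can be applied to the 116,000-hour collection size given earlier in the deck. Combining those two figures is an illustration rather than a calculation from the slides, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # one machine running continuously


def machine_hours(speech_hours: float, machine_hours_per_speech_hour: float) -> float:
    """Total compute time needed to recognize a collection once."""
    return speech_hours * machine_hours_per_speech_hour


collection_hours = 116_000  # collection size quoted earlier in the deck

low = machine_hours(collection_hours, 0.1)
high = machine_hours(collection_hours, 10)
print(f"{low:,.0f} to {high:,.0f} machine-hours")
print(f"{low / HOURS_PER_YEAR:.1f} to {high / HOURS_PER_YEAR:.1f} machine-years")
```

The hundredfold spread between the two ends of the range is the difference between a job for a single machine and one that needs a sizable cluster.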
For More Information
• The MALACH project
– http://www.clsp.jhu.edu/research/malach
• CLEF-2005 evaluation
– http://www.clef-campaign.org
• NSF/DELOS Spoken Word Access Group
– http://www.dcs.shef.ac.uk/spandh/projects/swag
Backup Slides
The Need for Scalable Solutions
[Figure: collection sizes in millions of hours (log axis, 0.0001 to 1000) for Phone calls in a day, Webcasts in a year, British Library, Shoah Foundation, SingingFish, SpeechBot, TDT]
196,000 Annotated Segments
Location-Time   Subject                           Person
Berlin-1939     Employment                        Josef Stein
Berlin-1939     Family life                       Gretchen Stein Anna Stein
Dresden-1939    Schooling                         Gunter Wendt Maria
Dresden-1939    Relocation Transportation-rail
[Axis label: interview time]
+ Segment summaries + Indexer’s notes
Comparing Index Terms
[Figure: bar chart of mean average precision (axis 0.0 to 0.5) for ASR, Notes, ThesTerm, Summary, Metadata, Metadata+ASR, with Full and Title query series]
Topical relevance, adjudicated judgments, Inquery
Density of Relevant Segments
[Figure: relevance judgments per topic, number of segments (axis 0 to 800), split into Relevant and Not Relevant; labeled topics: fort ontario refugee camp, jewish kapo(s)]
Collection size: 9,947 segments
Adjudicated judgments
Effect of Judgment Differences
[Figure: scatter plot of mean average precision with adjudicated judgments (axis 0.0 to 0.5) against mean average precision with single-assessor judgments (axis 0.0 to 0.5)]
18 experiments
4 reversals (<0.01 absolute, <4% relative)
Various conditions and systems
Searching Czech
[Diagram components: Czech Audio, Czech Audio (15 min./interview), ASR System, Czech Transcripts, Language Model, Bilingual Dictionary, kNN Classification, Cross Language Retrieval System]