![Page 1: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/1.jpg)
Audio Retrieval
LBSC 708A
Session 11, November 20, 2001
Philip Resnik
![Page 2: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/2.jpg)
Agenda
• Questions
• Group thinking session
• Speech retrieval
• Music retrieval
![Page 3: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/3.jpg)
Shoah Foundation Collection
• 52,000 interviews
  – 116,000 hours (13 years)
  – 32 languages
• Full description cataloging
  – 14,000 term thesaurus
  – 4,000 interviews for $8 million
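A quick check of the figures on this slide, as a sketch. The 24 x 365 hours-per-year conversion and the constant per-interview cost used for the extrapolation are assumptions, not numbers from the slide.

```python
# Back-of-the-envelope arithmetic for the Shoah Foundation collection figures.
HOURS_PER_YEAR = 24 * 365  # assumption: continuous playback, non-leap year

total_interviews = 52_000
total_hours = 116_000
cataloged_interviews = 4_000
cataloging_cost_usd = 8_000_000

# How long it would take to listen to everything back to back.
years_of_audio = total_hours / HOURS_PER_YEAR

# Cost of full-description cataloging per interview.
cost_per_interview = cataloging_cost_usd / cataloged_interviews

# Extrapolation to the whole collection (assumes the unit cost stays constant).
projected_full_cost = cost_per_interview * total_interviews

print(f"{years_of_audio:.1f} years of continuous audio")
print(f"${cost_per_interview:,.0f} per cataloged interview")
print(f"${projected_full_cost:,.0f} to catalog all interviews at that rate")
```

At that rate, manual cataloging of the full collection would cost on the order of $100 million, which is the motivation for automatic, content-based indexing.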
![Page 4: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/4.jpg)
Audio Retrieval
• We have already discussed three approaches
  – Controlled vocabulary indexing
  – Ranked retrieval based on associated captions
  – Social filtering based on other users’ ratings
• Today’s focus is on content-based retrieval
  – Analogue of content-based text retrieval
![Page 5: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/5.jpg)
Audio Retrieval
• Retrospective retrieval applications
  – Search music and nonprint media collections
  – Electronic finding aids for sound archives
  – Index audio files on the web
• Information filtering applications
  – Alerting service for a news bureau
  – Answering machine detection for telemarketing
  – Autotuner for a car radio
![Page 6: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/6.jpg)
The Size of the Problem
• 30,000 hours in the Maryland Libraries
  – Unique collections with limited physical access
• 116,000 hours in the Shoah collection
• Millions of hours of streaming audio each year
  – Becoming available worldwide on the web
• Broadcast news (audio/video)
  – Ex. Television archive
![Page 7: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/7.jpg)
![Page 9: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/9.jpg)
Audio Genres
• Speech-centered
  – Radio programs
  – Telephone conversations
  – Recorded meetings
• Music-centered
  – Instrumental, vocal
• Other sources
  – Alarms, instrumentation, surveillance, …
![Page 10: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/10.jpg)
Detectable Speech Features
• Content
  – Phonemes, one-best word recognition, n-best
• Identity
  – Speaker identification, speaker segmentation
• Language
  – Language, dialect, accent
• Other measurable parameters
  – Time, duration, channel, environment
![Page 11: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/11.jpg)
How Speech Recognition Works
• Three stages
  – What sounds were made?
    • Convert from waveform to subword units (phonemes)
  – How could the sounds be grouped into words?
    • Identify the most probable word segmentation points
  – Which of the possible words were spoken?
    • Based on likelihood of possible multiword sequences
• All three stages are learned from training data
  – Using hill climbing to fit a “Hidden Markov Model”
![Page 12: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/12.jpg)
Using Speech Recognition
[Diagram: three processing stages (Phone Detection, Word Construction, Word Selection). Other labels: phone n-grams, phone lattice, words, transcription dictionary, language model, one-best transcript, word lattice.]
![Page 13: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/13.jpg)
ETHZ Broadcast News Retrieval
• Segment broadcasts into 20-second chunks
• Index phoneme n-grams
  – Overlapping one-best phoneme sequences
  – Trained using native German speakers
• Form phoneme trigrams from typed queries
  – Rule-based system for “open” vocabulary
• Vector space trigram matching
  – Identify ranked segments by time
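A minimal sketch of vector space trigram matching, for illustration only. The slide does not say how trigrams are weighted, so raw counts with cosine similarity are an assumption, and the phoneme strings and segment start times below are made up.

```python
import math
from collections import Counter

def trigram_counts(phonemes):
    """Count every overlapping 3-phoneme window in a phoneme sequence."""
    return Counter(tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_segments(query_phonemes, segments):
    """Return (score, start_seconds) pairs, best match first."""
    query_vec = trigram_counts(query_phonemes)
    scored = [(cosine(query_vec, trigram_counts(ph)), start)
              for start, ph in segments]
    return sorted(scored, key=lambda pair: -pair[0])

# Hypothetical 20-second segments: (start time in seconds, recognized phonemes).
segments = [
    (0, "dh ax n uw z".split()),
    (20, "m ae n ih jh m ax n t".split()),
    (40, "m ae n".split()),
]
ranking = rank_segments("m ae n ih jh".split(), segments)
for score, start in ranking:
    print(f"{start:>3}s  {score:.3f}")
```

Because matching happens on trigrams rather than whole words, a segment can score well even when the recognizer gets part of a word wrong.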
![Page 14: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/14.jpg)
Phoneme Trigrams
• Manage -> m ae n ih jh
  – Dictionaries provide accurate transcriptions
    • But valid only for a single accent and dialect
  – Rule-based transcription handles unknown words
• Index every overlapping 3-phoneme sequence
  – m ae n
  – ae n ih
  – n ih jh
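The indexing step above can be sketched in a few lines. The one-entry pronunciation dictionary is a toy stand-in (an assumption); a real system would use a full dictionary or transcription rules.

```python
def phoneme_trigrams(phonemes):
    """Return every overlapping 3-phoneme window, in order."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

# Toy pronunciation dictionary (hypothetical; only the slide's example word).
PRONUNCIATIONS = {"manage": ["m", "ae", "n", "ih", "jh"]}

trigrams = phoneme_trigrams(PRONUNCIATIONS["manage"])
for gram in trigrams:
    print(" ".join(gram))
```

A word with n phonemes yields n - 2 trigrams, so words shorter than three phonemes contribute nothing to the index.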
![Page 16: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/16.jpg)
Cambridge Video Mail Retrieval
• Added personal audio (and video) to email
  – But subject lines still typed on a keyboard
• Indexed most probable phoneme sequences
![Page 17: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/17.jpg)
Cambridge Video Mail Retrieval
• Translate queries to phonemes with dictionary
  – Skip stopwords and words with 3 phonemes
• Find no-overlap matches in the lattice
  – Queries take about 4 seconds per hour of material
• Vector space exact word match
  – No morphological variations checked
  – Normalize using most probable phoneme sequence
• Select from a ranked list of subject lines
![Page 18: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/18.jpg)
![Page 19: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/19.jpg)
![Page 20: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/20.jpg)
Contrast of Approaches
• Rule-based transcription
  – Potentially error-prone
  – Broad coverage, handles unknown words
• Dictionary-based transcription
  – Good for smaller settings
  – Accurate
• Both susceptible to the problem of variability
![Page 21: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/21.jpg)
BBN Radio News Retrieval
![Page 22: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/22.jpg)
Comparison with Text Retrieval
• Detection is harder
  – Speech recognition errors
• Selection is harder
  – Date and time are not very informative
• Examination is harder
  – Linear medium is hard to browse
  – Arbitrary segments produce unnatural breaks
![Page 23: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/23.jpg)
Speaker Identification
• Gender
  – Classify speakers as male or female
• Identity
  – Detect speech samples from same speaker
  – To assign a name, need a known training sample
• Speaker segmentation
  – Identify speaker changes
  – Count number of speakers
![Page 24: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/24.jpg)
A Richer View of Speech
• Speaker identification
  – Known speaker and “more like this” searches
  – Gender detection for search and browsing
• Topic segmentation via vocabulary shift
  – More natural breakpoints for browsing
• Speaker segmentation
  – Visualize turn-taking behavior for browsing
  – Classify turn-taking patterns for searching
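One way to picture topic segmentation via vocabulary shift is the rough sketch below. It works on recognized text rather than audio, and the window size, the Jaccard overlap measure, the threshold, and the sample transcript are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Vocabulary overlap between two word sets (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def vocabulary_shift_boundaries(sentences, window=2, threshold=0.1):
    """Propose a topic break before sentences[i] when the words used in the
    `window` sentences before i barely overlap with those from i onward."""
    bags = [set(s.lower().split()) for s in sentences]
    boundaries = []
    for i in range(window, len(bags) - window + 1):
        left = set().union(*bags[i - window:i])
        right = set().union(*bags[i:i + window])
        if jaccard(left, right) < threshold:
            boundaries.append(i)
    return boundaries

# Hypothetical recognizer output: three sentences of politics, then weather.
transcript = [
    "senate passed budget bill",
    "budget vote split senate",
    "senate leaders praised budget vote",
    "storm brought heavy rain",
    "heavy rain flooded coast",
    "coast storm damage rising",
]
breaks = vocabulary_shift_boundaries(transcript)
print(breaks)
```

Breakpoints found this way fall between stories rather than at arbitrary fixed intervals, which is what makes them more natural for browsing.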
![Page 25: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/25.jpg)
Other Possibly Useful Features
• Channel characteristics
  – Cell phone, landline, studio mike, ...
• Accent
  – Another way of grouping speakers
• Prosody
  – Detecting emphasis could help search or browsing
• Non-speech audio
  – Background sounds, audio cues
![Page 26: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/26.jpg)
Competing Demands on the Interface
• Query must result in a manageable set
  – But users prefer simple query interfaces
• Selection interface must show several segments
  – Representations must be compact, but informative
• Rapid examination should be possible
  – But complete access to the recordings is desirable
![Page 27: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/27.jpg)
Iterative Prototyping Strategy
• Select a user group and a collection
• Observe information seeking behaviors
  – To identify effective search strategies
• Refine the interface
  – To support effective search strategies
• Integrate needed speech technologies
• Evaluate the improvements with user studies
  – And observe changes to effective search strategies
![Page 28: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/28.jpg)
The VoiceGraph Project
• Exploring rich queries
  – Content-based, speaker-based, structure-based
• Multiple cues in the selection interface
  – Turn-taking, gender, query terms
• Flexible examination
  – Text transcript, audio skims
![Page 29: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/29.jpg)
Depicting Turn Taking Behavior
• Time is depicted from left to right
• Speakers separated vertically within a depiction
• Depictions stacked vertically in rank order
• Actual recordings are more complex
[Figure: four turn-taking depictions, labeled 1 to 4]
![Page 30: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/30.jpg)
![Page 31: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/31.jpg)
![Page 32: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/32.jpg)
Bootstrapping the Prototype
• Select a user population and a collection
  – Journalists and historians
  – Broadcast news from the 1960s and 1970s
• Mock up an interface
  – Pilot study to see if we’re on the right track
• Integrate “back end” speech processing
  – Recognition, identification, segmentation, ...
• Observe information seeking behaviors
![Page 33: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/33.jpg)
New Zealand Melody Index
• Index musical tunes as contour patterns
  – Rising, descending, and repeated pitch
  – Note duration as a measure of rhythm
• Users sing queries using words or la, da, …
  – Pitch tracking accommodates off-key queries
• Rank order using approximate string match
  – Insert, delete, substitute, consolidate, fragment
• Display title, sheet music, and audio
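A sketch of how a tune could be reduced to a contour pattern. The MIDI note numbers, the key of C, and the half-semitone tolerance are assumptions; the sketch also ignores note duration, which the index uses as a measure of rhythm.

```python
def melodic_contour(pitches, tolerance=0.5):
    """Encode a note sequence as '*' followed by U (up), D (down), R (repeat)."""
    if not pitches:
        return ""
    symbols = ["*"]  # '*' stands for the first note
    for previous, current in zip(pitches, pitches[1:]):
        step = current - previous
        if abs(step) <= tolerance:
            symbols.append("R")
        elif step > 0:
            symbols.append("U")
        else:
            symbols.append("D")
    return "".join(symbols)

# "Three Blind Mice" opening as MIDI note numbers in C major (an assumption):
# E D C, E D C, G F F E, G F F E
three_blind_mice = [64, 62, 60, 64, 62, 60, 67, 65, 65, 64, 67, 65, 65, 64]
contour = melodic_contour(three_blind_mice)
print(contour)

# Singing the whole tune three semitones sharp leaves the contour unchanged.
transposed = melodic_contour([p + 3 for p in three_blind_mice])
print(transposed == contour)
```

Because only the direction of each pitch change is kept, a query sung in the wrong key still maps to the same string.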
![Page 34: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/34.jpg)
Contour Matching Example
• “Three Blind Mice” is indexed as:
  – *DDUDDUDRDUDRD
    • * represents the first note
    • D represents a descending pitch (U is ascending)
    • R represents a repetition (detectable split, same pitch)
• My singing produces:
  – *DDUDDUDRRUDRR
• Approximate string match finds 2 substitutions
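The match above can be reproduced with a plain edit distance, shown here as a sketch. It covers only insert, delete, and substitute; the consolidate and fragment operations mentioned for the melody index are left out. The second tune in the toy index is invented.

```python
def edit_distance(indexed, sung):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    previous = list(range(len(sung) + 1))
    for i, a in enumerate(indexed, start=1):
        current = [i]
        for j, b in enumerate(sung, start=1):
            current.append(min(
                previous[j] + 1,             # delete a symbol from the indexed tune
                current[j - 1] + 1,          # insert a symbol from the sung query
                previous[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

# Toy index: the first contour is from the slide, the second is hypothetical.
index = {
    "Three Blind Mice": "*DDUDDUDRDUDRD",
    "Hypothetical Tune": "*UUDUURDDUUDRU",
}
query = "*DDUDDUDRRUDRR"

ranked = sorted(index, key=lambda title: edit_distance(index[title], query))
best_distance = edit_distance(index[ranked[0]], query)
print(ranked[0], best_distance)
```

Ranking by distance rather than requiring an exact match is what lets an imperfect singer still find the tune.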
![Page 35: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/35.jpg)
![Page 36: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/36.jpg)
Muscle Fish Audio Retrieval
• Compute 4 acoustic features for each time slice
  – Pitch, amplitude, brightness, bandwidth
• Segment at major discontinuities
  – Find average, variance, and smoothness of segments
• Store pointers to segments in 13 sorted lists
  – Use a commercial database for proximity matching
    • 4 features, 3 parameters for each, plus duration
  – Then rank order using statistical classification
• Display file name and audio
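A sketch of where the 13 values per segment come from: 4 features times 3 statistics, plus duration. The slide does not define "smoothness", so the mean absolute change between slices is used here purely as a stand-in, and the frame values and slice length are invented.

```python
from statistics import fmean, pvariance

FEATURES = ("pitch", "amplitude", "brightness", "bandwidth")

def smoothness(values):
    """Stand-in definition (an assumption): mean absolute change between slices."""
    if len(values) < 2:
        return 0.0
    return fmean(abs(b - a) for a, b in zip(values, values[1:]))

def segment_vector(frames, slice_seconds):
    """Summarize one segment as 4 features x 3 statistics, plus its duration."""
    vector = []
    for name in FEATURES:
        series = [frame[name] for frame in frames]
        vector += [fmean(series), pvariance(series), smoothness(series)]
    vector.append(len(frames) * slice_seconds)  # duration in seconds
    return vector

# Two hypothetical time slices of a segment, 10 ms each.
frames = [
    {"pitch": 220.0, "amplitude": 0.5, "brightness": 1500.0, "bandwidth": 400.0},
    {"pitch": 224.0, "amplitude": 0.7, "brightness": 1700.0, "bandwidth": 420.0},
]
vector = segment_vector(frames, slice_seconds=0.01)
print(len(vector), vector[:3])
```

Each of the 13 positions corresponds to one sorted list, so a query segment can be compared against stored segments one dimension at a time.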
![Page 38: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/38.jpg)
Summary
• Limited audio indexing is practical now
  – Audio feature matching, answering machine detection
• Present interfaces focus on a single technology
  – Speech recognition, audio feature matching
  – Matching technology is outpacing interface design
![Page 39: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/39.jpg)
![Page 40: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/40.jpg)
October 1, 2001 LBSC 708R
Speech-Based Retrieval Systems
Douglas W. Oard
College of Library and Information Services
University of Maryland
![Page 41: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/41.jpg)
The Size of the Problem
• 30,000 hours in the Maryland Libraries
  – Unique collections with limited physical access
• Over 100,000 hours in the National Archives
  – With new material arriving at an increasing rate
• Millions of hours broadcast each year
  – Over 2,500 radio stations are now Webcasting!
![Page 42: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/42.jpg)
Outline
• Retrieval strategies
• Some examples
• Comparing speech and text retrieval
• Speech-based retrieval interface design
![Page 43: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/43.jpg)
Global Internet Audio
[Chart: over 2,500 Internet-accessible radio and television stations, divided into English and other languages; data labels 1062 and 1438. Source: www.real.com, Mar 2001]
![Page 44: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/44.jpg)
Shoah Foundation Collection
• 52,000 interviews
  – 116,000 hours (13 years)
  – 32 languages
• Full description cataloging
  – 14,000 term thesaurus
  – 4,000 interviews for $8 million
![Page 45: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/45.jpg)
Speech Retrieval Approaches
• Controlled vocabulary indexing
• Ranked retrieval based on associated text
• Automatic feature-based indexing
• Social filtering based on other users’ ratings
![Page 46: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/46.jpg)
Supporting the Search Process
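[Diagram: the search process around an IR system. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Items passed between stages: Query, Ranked List, Document. Feedback loops: Query Reformulation and Relevance Feedback, Source Reselection. Other labels: Nominate, Choose, Predict.]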
![Page 47: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/47.jpg)
![Page 49: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/49.jpg)
ETH Zurich Radio News Retrieval
![Page 50: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/50.jpg)
BBN Radio News Retrieval
![Page 51: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/51.jpg)
AT&T Radio News Retrieval
![Page 52: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/52.jpg)
MIT “Speech Skimmer”
![Page 53: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/53.jpg)
Cambridge Video Mail Retrieval
![Page 54: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/54.jpg)
![Page 55: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/55.jpg)
CMU Television News Retrieval
![Page 56: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/56.jpg)
Comparison with Text Retrieval
• Detection and ranking are harder
  – Because of speech recognition errors
• Selection is harder
  – Useful titles are sometimes hard to obtain
  – Date and time alone may not be informative
• Examination is harder
  – Browsing is harder in strictly linear media
![Page 57: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/57.jpg)
A Richer View of Speech
• Speaker identification
  – Known speakers
  – Gender labeling
  – “More like this” searches
• Topic segmentation
  – Find natural breakpoints for browsing
• Speaker segmentation
  – Extract turn-taking behavior
![Page 58: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/58.jpg)
Visualizing Turn-Taking
![Page 59: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/59.jpg)
Other Available Features
• Channel characteristics
  – Cell phone, landline, studio mike, ...
• Cultural factors
  – Language, accent, speaking rate
• Prosody
  – Emphasis detection
• Non-speech audio
  – Background sounds, audio cues
![Page 60: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/60.jpg)
Competing Demands on the Interface
• Query must result in a manageable set
  – But users prefer simple query interfaces
• Selection interface must show several segments
  – Representations must be compact, but informative
• Rapid examination should be possible
  – But complete access to the recordings is desirable
![Page 61: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/61.jpg)
The VoiceGraph Project
• Exploring rich queries
  – Content-based, speaker-based, structure-based
• Multiple cues in the selection interface
  – Turn-taking, gender, query terms
• Flexible examination
  – Text transcript, audio skims
![Page 62: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/62.jpg)
Pilot Study
• Student focus groups
  – 15 from Journalism, 3 from Library Science
• Preliminary drawing exercise
• Static screen shots and mock-ups
• Focused discussion
• User satisfaction questionnaire
• Structured interviews with domain experts
  – Journalism and Library Science faculty
![Page 63: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/63.jpg)
Pilot Study Results
• Graphical speech representations appear viable
  – Expected to be useful for high level browsing
    • When coupled with text transcripts and audio replay
  – Some training will be needed
• Suggested improvements
  – Adjust result set spacing to facilitate rapid selection
  – Identify categories (monologue, conversation, …)
    • Potentially useful for search or browsing
![Page 64: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik](https://reader030.vdocuments.us/reader030/viewer/2022032605/56649e745503460f94b74878/html5/thumbnails/64.jpg)
For More Information
• Speech-based information retrieval
  – http://www.clis.umd.edu/dlrg/speech/
• The VoiceGraph project
  – http://www.clis.umd.edu/dlrg/voicegraph/