Audient: An Acoustic Search Engine
Student: Ted Leath
Supervisor: Prof. Paul Mc Kevitt
School of Computing and Intelligent Systems
Faculty of Engineering
University of Ulster, Magee
Aims and Objectives
• Development of Audient as a speech-centric, non-lexical search engine capable of handling multimodal queries for retrieving spoken audio information
• Explore the efficacy of using standards-based phonogrammic streams as an internal data representation for storing, indexing, searching and retrieving spoken audio information
• Compare the performance of optional compound strategies for the abstraction and refinement of standards-based phonogrammic streams
• Design, implement, refine and test Audient
• Demonstrate research results, comparing Audient with other existing system architectures
Literature Review
• Information Retrieval
• Automatic Speech Recognition and Spoken Document Retrieval
• Current and previous research in SDR systems
• Public access SDR systems
• Commercial ASR and audio mining products
• Sub-word based approaches to SDR
• Transcripts, annotation and phonogrammic streams
• Speech and non-speech audio
Information Retrieval
• Typical Information Retrieval (IR) tasks involve the retrieval of relevant information items from various types of documents by matching a user request or query.
• IR covers not only text documents but also media such as images, video and audio. Audio recordings of speech can be referred to as spoken documents.
Automatic Speech Recognition
• ASR attempts to mimic the human capacity for recognising speech by enabling a computer to identify spoken words and/or sub-word units. Most current ASR systems are lexical in nature, and conceptually follow the processes of encoding and decoding introduced in the figure below:
[Figure: ASR encoding and decoding processes (adapted from Young et al., 2002)]
Spoken Document Retrieval
• A significant amount of research has been conducted in SDR, and performance evaluations like the Text REtrieval Conference (TREC) have encouraged development and the sharing of information. A diagram representing a typical TREC SDR process is reproduced below:
[Figure: typical TREC SDR process (Garofolo et al., 2000)]
SDR Systems
• CMU Informedia I, Informedia II and Sphinx Projects (Hauptmann and Witbrock, 1997)
• Video Mail Retrieval and Multimedia Document Retrieval projects (Jones et al., 1997, Spärck Jones et al., 2001)
• SCAN (Choi et al., 1998 and Choi et al., 1999)
• THISL and Abbot (Abberley et al., 1998, Abbot, 1999)
• Taiscéalaí (Smeaton et al., 1998)
Public Access SDR Systems
• SpeechBot (Quinn, 2000, Van Thong et al., 2001)
• National Public Radio (NPR) Online (NPR, 2000, NPR Archives, 2004)
• SpeechFind and The National Gallery of the Spoken Word (Hansen et al., 2004, Zhou and Hansen, 2002)
Commercial ASR and Audio Mining Products
• BBN Rough ‘n’ Ready (Kubala et al., 1999)
• Nexidia Fast-Talk and Convera RetrievalWare (Clements et al., 2001a, Clements et al., 2001b)
• ScanSoft (Network Speech, 2004, Embedded Speech, 2004, MediaIndexer, 2004, NaturallySpeaking, 2005, AudioMining, 2005, Xmode, 2004)
• Virage AudioLogger (Virage, 2004)
• Nuance (Nuance, 2005)
• AT&T SCANMail (Hirschberg et al., 2001 and SCANMail, 2003)
• Microsoft Speech Server (MSS, 2005)
Sub-word Based Approaches to SDR
• Wechsler (Wechsler, 1998)
• Ng, K. (Ng, 2000)
• Glavitsch and Schäuble (Glavitsch and Schäuble, 1992)
• Ng, C. (Ng, 2001)
Other sub-word research efforts include Larson (2001) and Moreau et al. (2004).
Phonogrammic Streams
Orthographical representations of phonemic streams. This abstraction is ancient, and partially inherent in the English alphabet.
Egyptian hieroglyphs with semantic and phonetic value. Ref. http://www.omniglot.com/writing/egyptian.htm
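As a minimal sketch of the idea of a phonogrammic stream, text can be mapped to phoneme symbols so that words which sound alike collapse to the same representation. The sketch assumes ARPAbet symbols in the style of the CMU Pronouncing Dictionary with stress markers omitted. The table `PRONUNCIATIONS` and the function `text_to_stream` are illustrative names and are not part of Audient.

```python
# Hypothetical miniature pronunciation table (ARPAbet, stress markers omitted).
# A real system would draw on a full dictionary or a recogniser instead.
PRONUNCIATIONS = {
    "she": ["SH", "IY"],
    "sea": ["S", "IY"],
    "c": ["S", "IY"],
    "sells": ["S", "EH", "L", "Z"],
    "cells": ["S", "EH", "L", "Z"],
    "shells": ["SH", "EH", "L", "Z"],
}


def text_to_stream(text):
    """Map each known word to its phoneme symbols and concatenate them."""
    stream = []
    for word in text.lower().split():
        word = word.strip(".,;:!?")
        # Raises KeyError for any word outside the illustrative table.
        stream.extend(PRONUNCIATIONS[word])
    return stream
```

Under these assumptions, "She sells sea shells" and "She cells C shells" yield the same stream, which is the property a non-lexical search relies on.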
Transcription
• 1-best transcriptions
• N-best transcriptions
• Lattices or graphs
[Figures: example recognition output, SILENCE HARD ROCK SILENCE (Fundamentals, 2005)]
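The three forms of recogniser output listed above can be related in a small sketch: a lattice holds alternative hypotheses, and the 1-best and N-best transcriptions are its highest-scoring paths. The lattice below reuses the SILENCE HARD ROCK SILENCE example. The alternative words, the scores and the names `LATTICE`, `all_paths` and `n_best` are assumptions made purely for illustration.

```python
# Hypothetical word lattice: node -> list of (next_node, label, log_score).
# The alternatives HEARD and ROCKS and all scores are invented for this sketch.
LATTICE = {
    0: [(1, "SILENCE", 0.0)],
    1: [(2, "HARD", -0.5), (2, "HEARD", -1.5)],
    2: [(3, "ROCK", -0.25), (3, "ROCKS", -2.0)],
    3: [(4, "SILENCE", 0.0)],
    4: [],
}


def all_paths(node=0, final=4):
    """Yield (labels, total_log_score) for every path from node to final."""
    if node == final:
        yield [], 0.0
        return
    for next_node, label, score in LATTICE[node]:
        for labels, total in all_paths(next_node, final):
            yield [label] + labels, score + total


def n_best(n):
    """Return the n highest-scoring label sequences, best first."""
    ranked = sorted(all_paths(), key=lambda path: path[1], reverse=True)
    return [labels for labels, _ in ranked[:n]]
```

Here `n_best(1)` gives the 1-best transcription. Enumerating every path is only practical for a toy lattice; larger lattices would need a proper search.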
Annotation - Markup Languages and MPEG-7
• SSML
• VoiceXML
• SALT
• XHTML+Voice profile
All of the above markup languages contain SSML as a subset
• MPEG-7 and spoken content
Non-Speech Audio Retrieval
• MELDEX
• Musipedia (Melodyhound/Tuneserver)
• Sonoda
• Super MBox
• MIRACLE
• SMILE
• Shazam
• Name That Clip
• The Humdrum Toolkit
• Themefinder
• Boogeebot
• Muscle Fish
Humans process speech differently from non-speech acoustic information.
Project Proposal
Audient Architecture
Audient Core Modules
[Figure: Audient core modules and data flows. Recoverable labels are listed below.]
Modules:
• Queries and Table Input
• Phonemic Recognition and Abstraction
• Stream to Speech
• Text to Stream
• Create Translation Table
• Phonogrammic Streams, Location, Temporal Information and Indexing
Data flows:
• Text Query, Speech Query
• Digitised Audio Stream; Digitised Audio Stream and Location
• Phonogrammic Stream; Converted Phonogrammic Stream; Phonogrammic Stream, Location and Temporal Information
• Phonogrammic Match Request, Phonogrammic Match Answer, Phonogrammic Query Result
• Synthetic Speech, Audio Stream Replay
• Text, Text for Translation, Text Translation Information, Text Translation Table
• Text Table Component, Phonogrammatic Table Component, Phonogrammic Translation
• Location and Temporal Reference
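One way to picture the match request and match answer labelled in the architecture is a lookup of a query stream against stored streams that carry location and temporal information. This is a minimal sketch under assumptions: exact matching of contiguous symbols stands in for whatever matching Audient actually uses, and the names `IndexedStream`, `match` and `EXAMPLE_INDEX`, the file path and the timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IndexedStream:
    location: str   # where the audio lives (hypothetical path or URL)
    phonemes: list  # phonogrammic symbols in order
    times: list     # start time in seconds of each symbol


def match(query, index):
    """Return (location, start_time) for every exact occurrence of query."""
    hits = []
    n = len(query)
    if n == 0:
        return hits  # an empty query matches nothing in this sketch
    for entry in index:
        for i in range(len(entry.phonemes) - n + 1):
            if entry.phonemes[i:i + n] == query:
                hits.append((entry.location, entry.times[i]))
    return hits


# Invented example data: one stored stream with per-symbol start times.
EXAMPLE_INDEX = [
    IndexedStream(
        location="audio/file1.wav",
        phonemes=["SH", "IY", "S", "EH", "L", "Z", "S", "IY"],
        times=[0.0, 0.1, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9],
    )
]
```

A call such as `match(["S", "IY"], EXAMPLE_INDEX)` returns the location and the time offset at which replay of the audio stream could begin. Tolerance to recognition errors would require approximate rather than exact matching.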
Audient Parrots
Phonetic and Temporal Abstraction
[Figure labels: Audio Speech File 1; Phonogrammic Stream; Text to Speech; Audio Speech File 2; Speech Recognition Engine; Compound Strategies (or none); Audient Parrot; Document 1; Document 2]
Audient Parrot
• Reader speaks text to Audient Parrot
• Writer records text from Audient Parrot
• Document 1 and Document 2 are compared
Functional diagram for an Audient Parrot
Determining recognition differences
She sells sea shells by the seashore.
She cells C shels bye the sea shore
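The comparison step can be sketched by treating the two sentences above as Document 1 and Document 2 (an assumption about the slide) and counting word-level differences. Word-level edit distance is a common measure in speech recognition and is used here only as a stand-in; the source does not state which comparison Audient applies. The names `tokenize` and `word_edit_distance` are illustrative.

```python
def tokenize(text):
    """Lower-case the text and strip simple punctuation from each word."""
    return [word.strip(".,;:!?") for word in text.lower().split()]


def word_edit_distance(reference, hypothesis):
    """Levenshtein distance over words: substitutions, insertions, deletions."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # delete a reference word
                current[j - 1] + 1,      # insert a hypothesis word
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


DOCUMENT_1 = "She sells sea shells by the seashore."
DOCUMENT_2 = "She cells C shels bye the sea shore"
```

Calling `word_edit_distance(DOCUMENT_1, DOCUMENT_2)` counts the words that differ even though the two sentences sound nearly identical when spoken, which is the gap between lexical and phonetic comparison that the parrot arrangement is meant to expose.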
Comparison with Previous Work
Software Analysis
• Hidden Markov Model Toolkit (HTK)
• LVCSR and CSLU Toolkit
• Sphinx-2, Sphinx-3, Sphinx-4
• TIMIT
• Linux and C++
• Perl and PHP
• Festival
• The CMU Pronouncing Dictionary
• SSML, VoiceXML, SALT and X+V
• The Apache Web Server
Possible IR and Monitoring Applications
• Indexing, search and retrieval of Internet audio files
• Indexing, search and retrieval of broadcast media
• Services for the blind
• Library services
• Surveillance and intelligence gathering
• Voice mail
• Audio mining and trend analysis (topic detection and tracking)
Possible Philosophical and Cognitive Research Applications
• Artificial self-learning systems
• Philosophical investigations of speech-centric versus text-centric methods
• Research models for cognitive science and consciousness theories
• Examination of behaviourist versus cognitive semantic recognition of speech
Project Schedule
Conclusion
• The introduction of standards-based phonogrammic streams as a fundamental internal data structure
• Support for unconstrained multimodal queries
• The development of new mimetic means for comparative evaluation and demonstration
• The provision of contextual strategies for the refinement of phonogrammic streams
• Movement of the man-machine boundary to allow more effective partitioning of tasks between the human and the machine portions of the system
• Design, implementation and testing of the Audient acoustic search engine