
ARTICLES

HIGH LEVEL KNOWLEDGE SOURCES IN USABLE SPEECH RECOGNITION SYSTEMS

The authors detail an integrated system which combines natural language processing with speech understanding in the context of a problem solving dialogue. The MINDS system uses a variety of pragmatic knowledge sources to dynamically generate expectations of what a user is likely to say.

SHERYL R. YOUNG, ALEXANDER G. HAUPTMANN, WAYNE H. WARD, EDWARD T. SMITH, and PHILIP WERNER

Understanding speech is a difficult problem. The ultimate goal of all speech recognition research is to create an intelligent assistant, who listens to what a user tells it and then carries out the instructions. An apparently simpler goal is the listening typewriter, a device which merely transcribes whatever it hears with only a few seconds delay. The listening typewriter seems simple, but in reality the process of transcription requires almost complete understanding as well. Today, we are still quite far from these ultimate goals. But progress is being made.

One of the major problems in computer speech recognition and understanding is coping with large search spaces. The search space for speech recognition contains all the acoustics associated with words in the lexicon as well as all the legal word sequences. Today, the most widely used recognition systems are based on hidden Markov models (HMM) [2]. In these systems, typically, each word is represented as a sequence of phonemes, and each phoneme is associated with a sequence of states. In general, the search space size increases as the size of the network of states increases. As search space size increases, speech recognition performance decreases. Knowledge can be used to constrain the exponential growth of a search space and hence increase processing speed and recognition accuracy [9, 17].

This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 5167, monitored by the Air Force Avionics Laboratory under contract N00039-85-C-0163. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.

© 1989 ACM 0001-0782/89/0200-0183 $1.50

Currently, the most common approach to constraining search space is to use a grammar. The grammars used for speech recognition constrain legal word sequences. Normally they are used in a strict left to right fashion and embody syntactic and semantic constraints on individual sentences. These constraints are represented in some form of probabilistic or semantic network which does not change from utterance to utterance [16-18].

As we move toward habitable systems and spontaneous speech, the search space problem is greatly magnified. Habitable systems permit users to speak naturally. Grammars for naturally spoken sentences are significantly larger than the small grammars typically used by speech recognition systems. When one considers interjections, restarts and additional natural speech phenomena, the search space problem is further compounded. These problems point to the need for using knowledge sources beyond syntax and semantics to constrain the speech recognition process.

There are many other knowledge sources besides syntax and semantics. Typically, these are clustered into the category of pragmatic knowledge. Pragmatic knowledge includes inferring and tracking plans, using context across clausal and sentence boundaries, determining local and global constraints on utterances and dealing with definite and pronominal reference. Work in the natural language community has shown that pragmatic knowledge sources are important for understanding language. People communicate to accomplish goals, and the structure of the plans to accomplish them is well understood [9, 24, 26, 32, 33]. When speech is used in a structured task such as problem solving, pragmatic knowledge sources are available for constraining search spaces.

In the past, pragmatic dialogue level knowledge sources were used in speech to either correct speech recognition errors [4, 10] or to disambiguate spoken input and perform inferences required for understanding [20, 21, 30, 35]. In these systems, pragmatic knowledge was applied to the output of the recognizer.

In this article we describe an approach for flexibly using contextual constraints to dynamically circumscribe the search space for words in a speech signal. We use pragmatic knowledge to derive constraints about what the user is likely to say next. Then we loosen the constraints in a principled manner. We generate layered sets of predictions which range from very specific to very general. To enable the speech system to give priority to recognizing what a user is most likely to say, each prediction set dynamically generates a grammar which is used by the speech recognizer. The prediction sets are tried in order of most specific first, until an acceptable parse is found. This allows optimum performance when users behave predictably, and displays graceful degradation when they do not. The implemented system (MINDS) uses these layered constraints to guide the search for words in our speech recognizer. For our recognizer, we use a modified version of the SPHINX [19] large vocabulary, speaker independent, continuous speech recognition system. The MINDS spoken language dialogue system developed at Carnegie-Mellon University applies pragmatic knowledge-based constraints as early as possible in the speech recognition process to eliminate incorrect recognition choices and drastically reduce the speech system error rate.

The main problem in speech recognition is the enormous complexity involved in analyzing speech input. Variations in pronunciation, accent, speaker physiology, emphasis and characteristics of the acoustic environment typically produce hundreds of different possible phoneme classifications for each sound. In turn, the many phoneme classifications can result in many possible word hypotheses at each point in the utterance. All of these word choices can then be combined to yield hundreds of sentence candidates for each utterance. The resulting search space is huge. Yet a speech system is required to filter out all incorrect candidates and correctly recognize an utterance in real time. Different approaches have been used in the past to limit the exponential explosion of the search space and trim the computational complexity to a more manageable level.

In an attempt to reduce complexity, the speech recognition problem has been simplified along different dimensions. The first speech recognition systems [18] were tailored to specific speakers only. This reduced much of the speech signal variation due to speaker characteristics such as sex, age, accent and physiological characteristics. The early systems also recognized only a very few words. This reduction in vocabulary eliminated much confusion during recognition, especially if the words were all acoustically distinct. To avoid the problem of slurred and coarticulated words, the early systems required that each word be pronounced separately and that the speaker pause slightly between words. A final simplification of the speech problem was to artificially limit the number of different words that could be used at any one point. Similar to the choices available in a series of menus, the speaker could only use one of a few words at any place in the utterance. This technique depends on the previously uttered sentential context to reduce the so-called branching factor. The effective vocabulary at each point is made much smaller than the overall vocabulary available to the system.

Speech technology has made great strides in the recent past. We are now in a position where we can progress beyond systems that merely type out a sentence which was read to them for demonstration purposes. The speech recognition research focus has shifted toward integrated spoken language systems, which can be used by people trying to accomplish a task. For the rest of this article we will only be concerned with speaker independent, large vocabulary, connected speech recognition systems.

THE NEED TO INTEGRATE SPEECH AND NATURAL LANGUAGE

Speech recognition techniques at the word level are inadequate. Error rates are fairly high even for the best currently available systems. The Carnegie-Mellon SPHINX system [19] is considered to be the best speaker independent connected speech recognition system today. But even the SPHINX system has an error rate of 29.4 percent for speaker independent, 1,000 word connected speech recognition, when recognizing individual words in sentences without using knowledge about syntax or semantics. Clearly we need some forms of higher level knowledge to understand speech better. This is the kind of knowledge that has been used for years by researchers concerned with natural language understanding of typed input.

Just using the modules developed by the typed natural language understanding community is not as simple as it may seem. There are a number of very specific demands on a speech system interface which differ from a typed system interface. Many of the techniques for parsing typed natural language do not adapt well to speech specific problems. The following points highlight some of the unique speech problems not found in typed natural language.

Probability Measures. There is nothing uncertain about what a person has typed. The ASCII characters are transmitted unambiguously to the program. The speech system has to deal with many uncertainties during recognition. At each level of processing, the uncertainties compound. The result of each processing step is usually expressed in terms of some probability or likelihood estimate. Any techniques developed for typed natural language systems need to be adapted to account for different alternatives with different probabilities.

Identifying Words. The speech system usually has many words hypothesized for each word actually spoken. The word identification problem has several components, some of which include:

• Uncertainty of the location of a word in the input. For almost every acoustic event in the utterance, there are words hypothesized that start and end at this event. Many of these words overlap and are mutually exclusive.

• Multiple alternatives at every word location. Even at the correct word location, the system is never certain of which word is the actual input. A list of word candidates is usually hypothesized with different probabilities.

• Word boundary and juncture identification. Often words are coarticulated such that their boundaries become merged and unrecognizable. Some milk is a classic example of a phrase where the two words overlap without a clear boundary.

Phonetic Ambiguity of Words. Many words sound completely alike, and the correct orthographic representation can only be determined from the larger context. This is the case for ice cream and I scream.

Syllable Omissions. Spoken language tends to be terse. Because people are so good at disambiguating the speech signal, speakers unconsciously omit syllables. An example of this is frequently found in the pronunciation of United States, which is reduced to sound like unite states in everyday speech.

Missing Information. The speech system will occasionally fail to recognize the correct word completely. Even though the speaker may have said the word correctly, it cannot be hypothesized from the acoustic evidence. There will be too many other word candidates that receive a better score and the correct word will be left out.

Ungrammatical Input. If mistyping is the kind of phenomenon a standard natural language system has to deal with, speech systems encounter misspoken words and filled pauses like ah and uhm which further complicate recognition [16]. Natural human speech is also more likely to be ungrammatical [6].

The effect of all these differences is that speech recognition systems must deal with many more alternatives and a much larger search space than typed natural language systems. Therefore all techniques that have been developed by the natural language processing community must be restructured if they are to be adapted to speech specific problems. In particular, they must be adapted to deal with the huge search spaces that result from the magnitude of the problem if these knowledge sources are to be used to assist in actual speech recognition.

Uses of Knowledge in Speech Recognition Systems

In the past, speech systems have used a variety of different kinds of knowledge sources to reduce the magnitude of the search space. The following list describes the major information sources used by different speech systems. We restrict ourselves to enumerating the knowledge sources above the level of complete words as they apply to sentences, dialogues, user goals and user focus.

• Word Transition Probabilities. If one wants to use knowledge of more than one word, the obvious solution is to use two words. By analyzing a large set of training sentences, a matrix of word pairs is constructed. The sentences are analyzed individually, without regard to dialogue structure, focus of attention or user goals. The resulting word pair matrix indicates which words can follow an already recognized word. A further extension of this method uses likelihoods of transitions encoded in the matrix instead of just binary values. Not only do we know which word pairs are legal, but we also have an indication of how likely they are. Empirically derived trigrams of words have also been used. Here a matrix is computed which, when given a sequence of the two preceding words, indicates which words can immediately follow at this point. Variations on the word transition probability estimates using Markov modeling techniques have been used by [2]. A minor modification of this approach uses word categories instead of words. Word categories are independent of the actual vocabulary size and require less training data to establish the transition probability matrix. While this approach does well to reduce the amount of search that is required, there is still much information missing in the triplets of allowable words [28, 31]. (A small illustrative sketch of word-pair training appears after this list.)

• Syntactic Grammars. A syntactic grammar first divides all words into different syntactic categories. Instead of using transition probabilities between word pairs or triplets, a syntactic grammar specifies all possible sequences of syntactic word categories for a sentence. Network grammars seem to be the most efficient representation for this type of constraint, since fast processing times are crucial in a speech system. Other grammar parsing representations are not as efficient when faced with the large numbers of candidates in a speech recognition situation. While the grammar may be written in a different notation, it can usually be compiled down to a network for the actual speech processing. The big drawback of these grammars is that they are difficult to construct by hand. They also assume the speaker will produce an utterance which is recognizable by the grammar.

• Semantic Grammars. Semantic grammars have been the most popular form of sentential information encoded in speech recognition systems. The grammar rules are similar to those of syntactic grammars, but words are categorized by a combination of syntactic class and semantic function. Only sentences that are both syntactically well formed as well as meaningful in the context of the application will be recognized by a semantic grammar. Semantic grammars express stronger constraints than syntactic grammars, but also require more rules. These grammars are also easily representable as networks. Compared to the syntactic grammars above, they are even more difficult to construct by hand. Nevertheless, most speech systems which use higher level knowledge have chosen to use semantic grammars as their main sentential knowledge source [5, 17, 18, 21, 36].

Some speech recognition systems emphasized semantic structure while minimizing syntactic dependencies [12, 16]. This approach results in a large number of choices due to the lack of appropriate constraints. The recognition performance therefore suffers due to the increased ambiguities. None of these systems proposed to use any knowledge beyond the constraints within single sentences.

• Thematic Memory. Barnett [3] describes a speech recognition system which uses a notion of history for the last sentences. The system keeps track of previously recognized content words and predicts that they are likely to reoccur. The possibility of using a dialogue structure is mentioned by Barnett, but no results or implementation details are reported. The thematic memory idea was picked up again by Tomabechi and Tomita [29], who demonstrated an actual implementation in a sophisticated frame-based system. Both speech recognition systems use an utterance to activate a context. This context is then transformed into word expectations which prime the speech recognition system for the next utterance.

• History-based Models. Fink, Biermann and others [4, 10] implemented a system that used a dialogue feature to correct errors made by a small vocabulary, commercial speech recognition system. Their module was strictly history-based. It remembered previously recognized meanings (i.e., semantic structures) of sentences as a finite state dialogue network. If the currently analyzed utterance was semantically similar to one of the stored sentence meanings and the system was at a similar state of the dialogue at that time, the stored meaning could be used to correct the recognition of the new utterance. Significant improvements were found in both sentence and word error rates when a prediction from a previous utterance could be applied. However, the history-based expectation was only applied after a word recognition module had processed the speech, in an attempt to correct recognition errors.

• Strategy Knowledge. Strategy knowledge was applied as a constraint in the voice chess application of Hearsay-I [25]. The task domain was defined by the rules of chess. The "situational semantics of the conversation" were given by the current board position. Depending on these, a list of legal moves could be formulated which represented plausible hypotheses of what a user might say next. In addition, a user model was defined to order the moves in terms of the goodness of a move. From these different knowledge sources, an exhaustive list of all possible sentences was derived. This list of sentences was then used to constrain the acoustic-phonetic speech recognition. Hearsay-I went too far in its restriction of the search: a classic anecdote tells of the door slamming during a demonstration and the system recognizing the sentence "Pawn to Queen 4". Hearsay-I applied its constraints in an extremely limited domain and overly restricted what could be said. Nevertheless, the principles of using a user model, task semantics and situational semantics are valid.

• Natural Language Back-Ends. Several speech recognition systems claim to have dialogue, discourse or pragmatic components [20, 21, 30, 35]. However, most of these systems only use this knowledge just like any typed natural language understanding system would. The speech input is processed by a speech recognition module which uses all its constraints up through the level of semantic grammars to arrive at a single best sentence candidate. This sentence is then transformed into the appropriate database query, anaphoric references are resolved, elliptic utterances are completed and the discourse model is updated. All these higher level procedures are applied after the sentence is completely recognized by the speech front-end. There is no interaction between the natural language processing modules and the speech recognizer.
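As a concrete illustration of the word-pair idea from the first item in this list, the short Python sketch below estimates bigram transition probabilities from a handful of training sentences. The miniature corpus and the function are invented for illustration; they are not part of SPHINX, MINDS or any of the systems cited above.

    from collections import defaultdict

    def train_bigrams(sentences):
        # Count word-pair transitions in a training corpus and convert the
        # counts into conditional probabilities P(next word | previous word).
        counts = defaultdict(lambda: defaultdict(int))
        for sentence in sentences:
            words = ["<s>"] + sentence.lower().split() + ["</s>"]
            for w1, w2 in zip(words, words[1:]):
                counts[w1][w2] += 1
        probs = {}
        for w1, followers in counts.items():
            total = sum(followers.values())
            probs[w1] = {w2: n / total for w2, n in followers.items()}
        return probs

    # A made-up miniature training corpus for a naval database task.
    corpus = [
        "what is the mission for england",
        "what is the mission for gridley",
        "show me the capabilities of gridley",
    ]
    bigrams = train_bigrams(corpus)
    # Words allowed to follow "the", with their estimated likelihoods
    # (here roughly 2/3 "mission" and 1/3 "capabilities").
    print(bigrams["the"])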

Natural Language Research

There has been much research on discourse, focus, planning, inference and problem solving strategies in the natural language processing community. Some of the research was not directly carried out in the context of natural language systems, but describes methods for representation and analysis of these issues. We will briefly review the key principles which influenced the design of the MINDS spoken language dialogue system.

Plans. The utility of tracking plans and goals in a story has been well established. A number of researchers [1, 22] have described the utility of identifying a speaker's goals to disambiguate natural language sentences and to provide helpful system responses. Similarly, Cohen and Perrault [8] have developed a plan-based dialogue model for understanding indirect speech acts. A program called PAM [32] showed how an understanding of a person's goal can explain an action of the person. The goal and subsequent actions can also be used to infer the particular plan which the person is using to achieve this goal. Additionally, Wilensky [33] developed methods for dealing with competing goals and partial goal failures. The ways in which a plan can be broken down into a hierarchical set of goals and subgoals was originally demonstrated by Sacerdoti [26].

Problem Solving. Newell and Simon [24] were key influences in the study of human problem solving. Among other things, they showed how people constantly break goals into subgoals when solving problems. Their findings, as well as much of the other research done in this area [22], illustrate the function of user goals represented as goal trees, and traversal procedures for goal trees.


Focus. Focus determines a set of relevant concepts for a particular situation. Grosz [13] found that natural language communication is highly structured at the level of dialogues and problem solving. She showed how the notion of a user focus in problem solving dialogues is related to a partitioning of the semantic space. Focus can also provide an indication of how to disambiguate certain input. Additional work by Sidner [27] confirmed the use of focus as a powerful notion in natural language understanding. Sidner successfully used a focus mechanism to restrict the possibilities of referent determination in pronominal anaphora.

Ellipsis. Elliptical utterances are incomplete sentences which rely on previous context to become meaningful. Methods for interpreting elliptical utterances were studied in depth by Frederking [11]. He used a chart-based representation to remember fragments of preceding sentences which were suitable complements for elliptic phrases.

User Domain Knowledge. Chin [7] showed how the user's knowledge of the domain can influence the expectations and behavior of a system. In addition, he described ways in which the user's expertise could be inferred by the system.

THE MINDS SYSTEM

The main problem in speech recognition is the enormous complexity involved in analyzing speech input. As search space size increases, recognition performance decreases and processing time increases. The value of a reduced search space and stronger constraints is well known in the speech recognition community [9, 17, 23]. Reducing the search to only the most promising word candidates by pruning often erroneously eliminates the correct path. By applying knowledge-based constraints as early as possible, one can trim the exponential explosion of the search space to a more manageable size without eliminating correct choices. Earlier we outlined the key knowledge sources incorporated into the MINDS system to provide constraint. Now we briefly overview the entire MINDS system and enumerate the primary innovations of this new approach. The approach employs knowledge based constraints to reduce the exponential growth of the search space the speech recognizer must analyze.

To demonstrate our new approach in speech recognition, we have built MINDS, a Multi-modal, INteractive Dialog System [14, 38]. It allows a user to speak, type and point during a problem solving session with the system. The system outputs information in a variety of media. It produces a natural language answer to the user's question as well as displaying relevant information on multiple raster display screens. These screens display the current information in various contexts emphasizing different aspects of the dialogue history and the current world situation. The MINDS system operates in real time.

MINDS works in a resource management domain, featuring information obtained from a database of facts about ships in the United States Navy. The basic problem scenario involves a damaged ship performing a particular task. The system user must determine the impact of the damage on the mission and then determine whether the damaged ship should continue in degraded condition or should be replaced by a different ship. If a replacement ship is being considered, the user must locate a ship with similar capabilities whose reassignment will have minimal impact on other mission operations. An excerpt from a sample interaction transcript can be found in Figure 1.

For the purposes of this article, MINDS can be viewed as a speaker-independent continuous speech recognition system that uses dialogue knowledge, user goals, plans and focus to understand what was said in its naval logistics problem solving domain. The system uses this higher level knowledge of dialogues, in addition to a representation of the user's domain knowledge, to predict what the current user will talk about next. The predictions drastically reduce the search space before the sentence and word detection modules even begin to analyze the speech input.

In very general terms, we can describe the main operations of the MINDS system as a continuous loop. First, the system generates a set of predictions based on the last user query, the database response and the state of the dialogue. Then, the predictions are translated into a semantic grammar.

List the casualty reports for ships in the Persian Gulf?

Gridley is in the Persian Gulf. Harpoon missile is disabled. What's its mission?

Gridley's mission is escort. Show me Gridley's required capabilities.

Harpoon missile, Phalanx gun, SPS-10 radar are required capabilities.

Let me see all Leahy-class cruisers in the Persian Gulf.

England is a Leahy-class ship. It is a cruiser. It is in the Persian Gulf. Gridley is a Leahy-class ship. It is a cruiser. It is in the Persian Gulf.

What is the mission for England? England's mission is transport.

Priority? England's mission importance is 1.

What are her capabilities? It has Harpoon missile. It has Phalanx gun. It has SPS-10 radar. It has torpedo tubes. It has SM-2 missile.

Does England have any CASREPS? Nothing found in the database.

FIGURE 1. An Excerpt of a Typical Dialogue with the MINDS System


The active lexicon is restricted to cover only words which are part of the predictions. The next user query is parsed using the dynamically created grammar and lexicon, and the user's question is displayed for verification. The database response to the user's request is then presented in the output modalities.
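The cycle just described can be summarized in a short control-flow sketch. Every name below is a hypothetical stand-in for a MINDS module (prediction generation, grammar compilation, recognition, database query, presentation); it shows the shape of the loop, not the system's actual interfaces.

    def interaction_cycle(state, compile_grammar, recognize, query_db, present):
        # One pass of the loop described above: predict, restrict the grammar
        # and lexicon, recognize, display for verification, answer, update.
        # Every argument is a hypothetical stand-in for a MINDS module.
        concepts = state["predictions"]               # expectations for the next query
        grammar, lexicon = compile_grammar(concepts)  # dynamically built grammar/lexicon
        question = recognize(grammar, lexicon)        # parse the spoken utterance
        present("You asked: " + question)             # echo for user verification
        answer = query_db(question)
        present(answer)
        state["history"].append((question, answer))   # feeds the next round of predictions
        return state

    # Minimal stand-ins, purely to exercise the control flow.
    state = {"predictions": {"ship-name", "capability"}, "history": []}
    compile_grammar = lambda c: (sorted(c), {"does", "spark", "have", "radar"})
    recognize = lambda grammar, lexicon: "does spark have radar"
    query_db = lambda q: "Spark has SPS-10 radar."
    interaction_cycle(state, compile_grammar, recognize, query_db, print)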

Innovations of the MINDS System

The MINDS system represents a radical departure from the principles of most other speech recognition systems. The key innovations of the MINDS system include the following:

1. Use of a combination of knowledge sources, including discourse and dialogue knowledge, problem solving knowledge, pragmatics, user domain knowledge, goal representation, as well as task semantics and syntax, in an integrated system.

2. All the knowledge is used predictively. Instead of applying the knowledge to correct an error or resolve ambiguities after they occur, the knowledge is applied in a predictive way to constrain all possibilities as they are generated.

3. The constraints generated by the system are immediately applied to the low-level speech processing to reduce the search space. We use the predictive constraints to eliminate large portions of the search space for the earliest acoustic-phonetic analysis.

4. In case the predictions fail, the system provides a principled way of recovery when constraints are violated. If the constraints are satisfied, recognition is more accurate and faster. However, if some of our predictions are violated the system does not break; it degrades gracefully.

5. In addition to speech input, the MINDS system allows pointing and clicking as well as typed modes of interaction. The user may use the mouse to select objects displayed graphically on the screen. For those users who are uncomfortable speaking to a computer, anything that could be spoken can also be typed on a keyboard.

MINDS exploits knowledge about users' domain knowledge, problem solving strategy, goals and focus, as well as the general structure of a dialogue, to constrain speech recognition down to the signal processing level. Pragmatic knowledge sources are used predictively to circumscribe the search space for words in the speech signal [14, 37]. In contrast to other systems, we do not correct misrecognition errors after they happen, but apply our constraints as early as possible during the analysis of an utterance. Our approach uses predictions derived from the problem-solving dialogue situation to limit the search space at the lower levels of speech processing. At each point in the dialogue, we predict a set of concepts that may be used in the next utterance. This list of concepts is combined with a set of syntactic networks for possible sentence structures. The result is a dynamically constructed semantic network grammar, which reflects all the constraints derived from all our knowledge sources. To avoid getting trapped by predictions which are not fulfilled, we generate them at different levels of specificity. When the parser then analyzes the spoken utterance, the dynamic network allows only a very restricted set of word choices at each point. This reduces the amount of search necessary and cuts down on the possibility of recognition errors due to ambiguity and confusion between words. The bottom line is that the MINDS system uses as much knowledge as possible to achieve accurate speech recognition and help the user complete his task efficiently.

The Predictive Use of Knowledge

Predictions are derived from what we know about the current state of the dialogue. This knowledge is then refined to constrain what we expect the user will actually say. In some sense predictions cover everything we expect to happen, and exclude events which are unlikely to happen. To be able to create predictions, we have three very important data structures in the system:

A knowledge base of domain concepts. In this data structure we represent all objects and their attributes in the domain. The representation uses a standard frame language which provides the capability to express inheritance and multiple relations between frames. The domain concepts also represent everything that can be expressed by the user. Each possible utterance will map into a combination of domain concepts which constitute the meaning of that utterance. This representation of meaning is also used to generate the database queries from the utterance.

Hierarchical goal trees. The goal trees represent a hierarchy of all possible abstract goals a user may have during the dialogue. The goal trees are composed of individual goal nodes, structured as AND-OR trees. Each goal node is characterized by the possible subgoals it can be decomposed into and a set of domain concepts involved in trying to achieve this goal. The concepts associated with a goal tend to be restricted from previous dialogue context. These restrictions on concept expansions are dynamically computed for each concept. The computation is based upon principles for inheriting and propagating constraints based upon their embedding; these principles are often referred to as local and global focus in the natural language literature. The goal tree not only defines the goals, subgoals and domain concepts, but also the traversal options available to the user. A goal node's associated concepts can be optional or required, single use or multiple use. If a concept is optional, it is possible but not necessary for a user to apply this concept in the current problem-solving step. If a concept is defined as multi-usable, then a user could refer to it several times in different utterances during the current problem solving step.


A User Model. A user model represents the domain concepts and the relations between domain concepts which a user knows about. These models are represented as control structures which are associated with goals in the goal tree. The control structures express which goals may be exclusive because the user can infer the information in one goal once the other, exclusive goal has been completed. Other goals may be optional because the user is unfamiliar with the domain concept or its potential importance in deriving a solution to the current problem. Additionally, the control structures contain probabilistic orderings for conjunctive subgoals. Hence, the user model provides the system with potential traversal options which are more restrictive than the traversal options provided by the other knowledge sources.

These three complex data structures are currently only constructible by hand, based on a detailed and careful analysis of the problem-solving task itself.
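One plausible way to encode a goal-tree node and the associated user-model control information is sketched below. The field names and the example goals follow the description above but are invented for illustration; they are not the actual MINDS representations.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Concept:
        # A domain concept attached to a goal node, with the usage flags from
        # the text: optional vs. required, single vs. multiple use.
        name: str
        required: bool = False
        multiple_use: bool = False

    @dataclass
    class GoalNode:
        # A node in the AND-OR goal tree: an AND node needs all of its
        # subgoals, an OR node is satisfied by any one of them.
        name: str
        node_type: str = "AND"
        subgoals: List["GoalNode"] = field(default_factory=list)
        concepts: List[Concept] = field(default_factory=list)

    @dataclass
    class UserModel:
        # Control information layered on top of the goal tree for one user.
        exclusive_goals: List[Tuple[str, str]] = field(default_factory=list)
        optional_goals: List[str] = field(default_factory=list)
        # Probabilistic orderings for the conjunctive subgoals of a named goal.
        subgoal_orderings: List[Tuple[str, Tuple[str, ...], float]] = field(
            default_factory=list)

    # A hypothetical fragment of the damage-assessment part of the tree.
    assess_damage = GoalNode(
        "assess-damage", "AND",
        subgoals=[
            GoalNode("identify-damaged-ship",
                     concepts=[Concept("ship-name", required=True)]),
            GoalNode("check-required-capabilities",
                     concepts=[Concept("capability", multiple_use=True)]),
        ])

    novice_user = UserModel(
        exclusive_goals=[("check-mission-priority", "check-mission-impact")],
        optional_goals=["check-repair-schedule"],
        subgoal_orderings=[("assess-damage",
                            ("identify-damaged-ship",
                             "check-required-capabilities"), 0.8)])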

We will now try to explain how the knowledge is used by the MINDS system to track the progress of a user during a problem-solving session. We will also show how predictions in the form of domain concepts are generated during a dialogue.

When an input utterance and its database response have been processed, we first try to determine which goal states were targeted by the present interaction. Determination of activated goal states is by no means unambiguous. During one interaction cycle, several goal states may be completed and many new goal states may be initiated. Similarly, it is possible that an assumed goal state is not being pursued by the user. To deal with these ambiguities, we use a number of algorithms. Goals that have just been completed by this interaction and that are consistent with previous plan steps are preferred. Based on the information we have available, we select the most likely plan step to be executed next. If the current goal is not complete, then our most likely plan step will attempt to complete the current goal. If the current goal is satisfied, we identify the next goal states to which a user could transit. The result of tracking the goal states is a list of potential goals and subgoals which a user will try to complete in the current utterance. Additionally, since there are always many active goals which may or may not be hierarchically embedded, we also maintain a list of all active goals. Hence, the procedures described above are used for determining the best, most likely goal state a user will transit to. To generate our most restrictive predictions, we then restrict the most likely goal state further by taking the constraints from the user model. The next prediction layer ignores the user model and is derived only from the best, most likely next goal. Finally, additional less restrictive sets of predictions are derived from currently active goals which are at higher levels of the goal tree. This procedure continues until all active goals are incorporated into a prediction set. The goals all have an associated list of concepts in the task domain. These are the concepts a user will refer to when trying to satisfy the current goal.
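The layering just described might be realized roughly as follows. The goal representation, the restriction function and the example concepts are assumptions made for the sketch; the actual MINDS procedure is more elaborate.

    def build_prediction_layers(most_likely_goal, active_goals, user_restrict=None):
        # Construct layered prediction sets, most specific first, roughly as
        # described above.  Goals are plain dicts with a "concepts" list, and
        # user_restrict is an optional function narrowing a concept set by the
        # user model; this whole interface is invented for the sketch.
        base = set(most_likely_goal["concepts"])
        layers = []
        if user_restrict is not None:
            layers.append(user_restrict(set(base)))   # most specific: goal + user model
        layers.append(set(base))                      # next layer: the goal alone
        cumulative = set(base)
        for goal in active_goals:                     # successively more general layers
            cumulative |= set(goal["concepts"])
            layers.append(set(cumulative))
        return layers

    # Hypothetical goals during a damage-assessment dialogue.
    likely = {"name": "check-capabilities",
              "concepts": ["ship-name", "hull-number", "capability"]}
    active = [{"name": "assess-damage", "concepts": ["casualty-report", "ship-name"]},
              {"name": "solve-problem", "concepts": ["mission", "location"]}]
    restrict = lambda cs: cs - {"hull-number"}        # this user never uses hull numbers

    for level, layer in enumerate(build_prediction_layers(likely, active, restrict)):
        print(level, sorted(layer))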

For example, in a goal state directed at assessing a ship's damage, we expect the ship's name to appear frequently in both user queries and system statements. We also expect the user to refer to the ship's capabilities. The predicted sentence structures should allow questions about the features of a ship like "Does its sonar still work?", "Display the status of all radars for the Spark" and "What is Badger's current speed?"

Some domain concepts which are active at a goal tree node during a particular dialogue phase have been partially restricted by previous goal states. The representation of the domain concepts associated with goal nodes provides a mechanism to specify what prior goal can restrict the current concept. These restrictions may come either from the user's utterances or from the system responses. Thus each goal state not only has a list of active domain concepts, but also a set of concepts whose values were partially determined by an earlier goal state.

In our example, once we know which ship was damaged, we can be sure all statements in the damage assessment phase will refer to the name of that ship or its hull number only. In addition to the knowledge mentioned earlier, we also restrict what kinds of anaphoric referents are available at each goal node. The possible anaphoric referents are determined by user focus. From the current goal or subgoal state, focus identifies previously mentioned dialogue concepts and answers which are relevant at this point. These concepts are expectations of the referential content of anaphora in the next utterance.

Continuing our example, it does not make sense to refer to a ship as “it” before the ship’s name has been mentioned at least once. We also do not expect the use of anaphoric “it” if we are currently talking about a group of several potential replacement ships.

Elliptic utterances are predicted when we expect the user to ask about several concepts of the same type after having seen a query for the first concept.

If the users have just asked about the damage to the sonar equipment of a ship, and we expect them to query about damage to the radar equipment, we must include the expectation for an elliptic utterance about radar in our predictions.
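As a small illustration of how focus restricts anaphora, the sketch below predicts a pronominal referent only when a single relevant entity is in focus. The representation of focus as a list of entity/concept pairs is assumed purely for the example and is not taken from MINDS.

    def allowed_anaphora(focus_stack, goal_concepts):
        # Determine which previously mentioned entities may be referred to
        # pronominally in the next utterance: only entities that are in focus
        # and relevant to the currently active goal.
        referents = [entity for entity, ctype in focus_stack if ctype in goal_concepts]
        # Predict a singular pronoun only when exactly one candidate is in
        # focus; with several candidate ships, "it" would be ambiguous.
        return referents if len(referents) == 1 else []

    focus = [("Spark", "ship-name"), ("Persian Gulf", "location")]
    print(allowed_anaphora(focus, {"ship-name", "capability"}))   # ['Spark']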

Expanding Predictions into Networks

After the dialogue tracking module has identified the set of concepts which could be referred to in the next utterance, we need to expand these into possible sentence fragments. Since these predicted concepts are abstract representations, they must be translated into word sequences which signify the appropriate conceptual meaning. For each concept, we have precompiled a set of possible surface forms which can be used in an actual utterance. In effect, we reverse the classic understanding process by un-parsing the conceptual representation into all possible word strings which can denote the concept. A predicted concept can be quickly un-parsed into all its possible semantic network grammar subnets.

In addition to the individual concepts, which usually expand into noun phrases, we also have a complete semantic network grammar that has been partitioned into subnets. Each subnet expresses a complete sentence. A subnet defines allowable syntactic surface forms to express a particular semantic content. For example, all ways of asking for the capabilities of ships are grouped together into subnets. The semantic network is further partitioned into separate subnets for elliptical utterances, and subnets for anaphora. The semantic grammar subnets are precompiled to allow direct access for processing efficiency. The terminal nodes in the networks are word categories instead of the words themselves, so no recompilation is necessary as new lexical items in existing categories are added to or removed from the lexicon.

The final expansion of predictions brings together the partitioned semantic networks and the predicted concepts which were translated into their surface forms. Through a set of cross-indices, we intersect all predicted concept expressions with all the predicted semantic networks. This operation dynamically generates one combined semantic network grammar which embodies all the dialogue level and sentence level constraints. This dynamically created network grammar is used by the parser to process an input utterance.

To illustrate this point, let us assume that the frigate "Spark" has somehow been disabled. We expect the user to ask for its capabilities next. The dialogue tracking module predicts the "shipname" concept restricted to the value "Spark" and any of the "ship-capabilities" concepts. Single anaphoric reference to the ship is also expected, but ellipsis is not meaningful at this point. The current damage assessment dialogue phase allows queries about features of a single ship.

During the expansion of the predicted concepts, we find the word nets such as "the ship," "this ship," "the ship's," "this ship's," "it," "its," "Spark" and "Spark's." We also find the word nets for the capabilities such as "all capabilities," "radar," "sonar," "Harpoon," "Phalanx," etc. We then intersect these with the sentential forms allowed during this dialogue phase. Thus we obtain the nets for phrases like "Does {it, Spark, this ship, the ship} have {Phalanx, Harpoon, radar, sonar}," "What {capabilities, radar, sonar} does {the ship, this ship, it, Spark} have," and many more. This semantic network now represents a maximally constrained grammar at this particular point in the dialogue.
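The cross-indexing step might look roughly like the following, where subnets are reduced to flat word sets for readability. The subnets, concepts and surface forms are invented examples; the real compilation operates on full network grammars rather than word lists.

    def build_dynamic_grammar(predicted_concepts, subnets, surface_forms):
        # Cross-index the predicted concepts with the precompiled sentence
        # subnets: keep only subnets whose concepts are all predicted, and
        # restrict each kept subnet's lexicon to the predicted surface forms.
        active = {}
        for name, net in subnets.items():
            if set(net["concepts"]) <= predicted_concepts:
                words = set(net["function_words"])
                for concept in net["concepts"]:
                    words |= surface_forms[concept]
                active[name] = words
        return active

    # Invented subnets and surface forms, loosely following the "Spark" scenario.
    subnets = {
        "ask-capability": {"concepts": ["ship-ref", "capability"],
                           "function_words": ["does", "have", "what"]},
        "ask-location":   {"concepts": ["ship-ref", "location"],
                           "function_words": ["where", "is"]},
    }
    surface_forms = {"ship-ref": {"Spark", "it", "the ship", "this ship"},
                     "capability": {"radar", "sonar", "Harpoon", "Phalanx"},
                     "location": {"Persian Gulf"}}

    grammar = build_dynamic_grammar({"ship-ref", "capability"}, subnets, surface_forms)
    print(sorted(grammar))                    # only 'ask-capability' survives
    print(sorted(grammar["ask-capability"]))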

Recognizing Speech Using Dynamic Networks

We use the SPHINX system [19] as the basis for our recognizer. SPHINX samples input speech in centisecond frames. Based on the LPC cepstrum coefficients, each frame is then mapped into one of 256 prototype vectors. Vector-quantized speech is also used to train Hidden Markov Models (HMMs) for phonemes. The HMMs are trained from a corpus of approximately 4200 sample utterances. Each word is represented in the dictionary as a single sequence of phonemes. The models for words are pre-compiled by concatenating the HMMs for each phoneme in a word. During recognition, SPHINX performs a time-synchronous beam search known as the Viterbi algorithm, matching word models against the input.

In the MINDS system, we use the active set of semantic networks to control word transitions instead of the word-pair constraints normally used by the SPHINX system. The search begins at the set of initial words for all active subnets. This set includes only currently active words from the dynamically created lexicon for this utterance. As the search matches a word from the input utterance, it transits along the arc in the grammar represented by that word. A score is assigned to each path in the beam, indicating how well the input is matching the HMMs in the path. Paths falling below a threshold score are pruned. The dynamically created semantic network is used to allow only legal word transitions. The network does not affect the score of a path but simply restricts the words which can continue a particular path. If no string of words is found which matches the HMMs better than a certain threshold score, a different grammar and lexicon from a more general set of predictions must be used to re-process the utterance. After the spoken input has been processed, the word string with the best score is passed back to the system for parsing.
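A drastically simplified, word-level picture of this search is sketched below: the dynamic network restricts which words may extend a path, scores accumulate, and paths far below the best one are pruned. Real SPHINX decoding is frame-synchronous over HMM states; the per-step word scores here are toy stand-ins for the acoustic match.

    import math

    def beam_search(step_scores, network, start_words, beam=10.0):
        # Word-level beam search: paths are extended only along transitions
        # allowed by the dynamically created network, scores accumulate word
        # log-likelihoods, and paths more than `beam` below the best path are
        # pruned.  step_scores[t][w] is an assumed per-word score at step t.
        paths = {(w,): step_scores[0].get(w, -math.inf) for w in start_words}
        for t in range(1, len(step_scores)):
            new_paths = {}
            for path, score in paths.items():
                for nxt in network.get(path[-1], []):          # legal continuations only
                    s = score + step_scores[t].get(nxt, -math.inf)
                    if s > new_paths.get(path + (nxt,), -math.inf):
                        new_paths[path + (nxt,)] = s
            best = max(new_paths.values(), default=-math.inf)
            paths = {p: s for p, s in new_paths.items() if s >= best - beam}  # prune
        if not paths:
            return None, -math.inf
        return max(paths.items(), key=lambda kv: kv[1])

    # Toy dynamic network and per-step word scores.
    network = {"does": ["it", "spark"], "it": ["have"], "spark": ["have"],
               "have": ["radar", "sonar"]}
    scores = [{"does": -1.0, "what": -2.0},
              {"it": -1.5, "spark": -0.5},
              {"have": -0.8},
              {"radar": -0.7, "sonar": -1.2}]
    print(beam_search(scores, network, start_words=["does", "what"]))
    # best path: ('does', 'spark', 'have', 'radar') with score -3.0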

When Predictions Fail

There are a number of assumptions built into the use of predictions. If a user conforms to our model of a problem-solving dialogue, the advantages are clear. However, we must consider the case when some assumptions are violated. There are two points to consider when predictions fail: we must first be able to identify the situation of failed predictions and then find a way to recover.

In the MINDS system, the first point is accomplished without extra work. When the user speaks an utterance which was not predicted, the speech recognition component usually fails to produce a complete parse. The spoken words do not match the predicted words and receive low probability scores. This may not always be as easy in other recognition systems.

As a mechanism for recovery from failed predictions, the MINDS system always produces several sets of predictions for each utterance. These sets of predictions range from very specific to very general. For the most specific predictions, the system uses all the possible constraints. Each successive set of predictions then becomes more general. The number of levels of constraint relaxation depends on the goal tree structure at that point. Predictions are made more general by assuming additional goal nodes are active. Eventually we reach a level of prediction constraints which is identical to the constraints provided by the full semantic grammar with all possible words. We can now parse any syntactically and semantically legal utterance, disregarding all dialogue considerations. Beyond that we can only relax the constraints to a point where any word can be followed by any other word. This would be necessary if a user spoke an utterance that was not covered by the grammar. In this case, we must rely on heuristics during the semantic interpretation of the utterance to provide a correct meaning and database query. Details of this procedure are described in [38].

When the speech recognition module fails to parse at a particular level of constraints, the next set of predictions is used to reparse the same utterance until a successful parse is obtained. If the user is cooperative and within our predictions, recognition accuracy will be high and response time immediate. As the system backs up over several levels of constraints, the search space of the recognition module becomes larger and processing time increases while accuracy drops. However, the system never experiences a complete loss of continuity when predictions are violated.
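The back-off behavior amounts to a small loop over the prediction layers, as in the sketch below; the decoder and the acceptance threshold are hypothetical stand-ins for running the recognizer under a given grammar.

    import math

    def recognize_with_backoff(utterance, prediction_layers, decode, threshold=-10.0):
        # Reprocess the utterance with successively more general prediction
        # layers until some word string scores above the acceptance threshold.
        # decode(utterance, grammar) is assumed to return (words, score).
        for level, grammar in enumerate(prediction_layers):
            words, score = decode(utterance, grammar)
            if words is not None and score >= threshold:
                return words, level                 # accepted at this constraint level
        return None, len(prediction_layers)         # not even the loosest layer parsed

    # Toy decoder: succeeds only when the grammar covers every spoken word.
    def toy_decode(utterance, grammar):
        spoken = utterance.split()
        if all(w in grammar for w in spoken):
            return utterance, -1.0 * len(spoken)
        return None, -math.inf

    layers = [{"does", "it", "have", "radar"},                    # most specific
              {"does", "it", "spark", "have", "radar", "sonar"}]  # more general
    print(recognize_with_backoff("does spark have sonar", layers, toy_decode))
    # -> ('does spark have sonar', 1): only the more general layer succeeded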

EVALUATION OF PROGRESS

Many systems developed by researchers in the artificial intelligence community lack a rigorous evaluation. While the individual systems may incorporate brilliant ideas, it is rarely shown that they are in some way better than other systems based on a different approach. If the research in a field is to make progress, that progress must be made visible and measurable.

In the field of speech understanding one clear measure of success is recognition accuracy. Recognition accuracy can be measured in terms of word accuracy, sentence accuracy as well as semantic accuracy. Word accuracy is defined here as the number of words that were recognized correctly divided by the number of words that were spoken. In addition to the number of correct words, we also record the number of insertions of extra words by the recognizer. This number is otherwise not reflected in the percentage of correct words. Recognition accuracy thus takes into account deleted words and word substitutions (i.e., "its" was spoken but "his" was recognized). Error rate, on the other hand, reflects insertions, deletions, and substitutions. If the speech recognizer makes minor errors in recognizing an utterance, but the underlying meaning of the utterance is preserved, the utterance is considered to be recognized semantically accurately even though some words were incorrect. Semantic accuracy therefore is the percentage of sentences with correct meaning. In our system, a sentence is considered semantically correct if the recognition produces the correct database query.
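These definitions can be made concrete with a standard edit-distance alignment between the spoken and the recognized word strings, as sketched below. This is the conventional way such figures are computed; it is not necessarily the exact scoring program used for the experiments reported here.

    def align_errors(reference, hypothesis):
        # Align the spoken (reference) and recognized (hypothesis) word strings
        # with a standard edit distance and count substitutions, deletions and
        # insertions.  dp[i][j] holds (total errors, subs, dels, ins).
        ref, hyp = reference.split(), hypothesis.split()
        dp = [[(i + j, 0, i, j) if i == 0 or j == 0 else None
               for j in range(len(hyp) + 1)]
              for i in range(len(ref) + 1)]
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                e, s, d, n = dp[i - 1][j - 1]
                best = (e + sub, s + sub, d, n)            # match or substitution
                e, s, d, n = dp[i - 1][j]
                best = min(best, (e + 1, s, d + 1, n))     # deletion
                e, s, d, n = dp[i][j - 1]
                best = min(best, (e + 1, s, d, n + 1))     # insertion
                dp[i][j] = best
        _, subs, dels, ins = dp[len(ref)][len(hyp)]
        return subs, dels, ins

    spoken = "does it have sonar"
    recognized = "does his have the sonar"        # one substitution, one insertion
    subs, dels, ins = align_errors(spoken, recognized)
    n_spoken = len(spoken.split())
    word_accuracy = (n_spoken - subs - dels) / n_spoken   # correct words / spoken words
    error_rate = (subs + dels + ins) / n_spoken           # insertions are counted here
    print(subs, dels, ins, word_accuracy, error_rate)     # 1 0 1 0.75 0.5
    # Semantic accuracy would additionally check whether the resulting
    # database query is still the correct one.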

To test the ability of the MINDS system to reduce search space and improve speech recognition performance, we performed two experiments. The first experiment assessed the search space reduction caused by predictive use of all pragmatic knowledge sources. The second experiment measured the improvement in recognition accuracy resulting from the use of layered predictions. Both studies used a test set of data which was independent from the training data used to develop the system. This means that the utterances and dialogues processed by the system to obtain the experimental results had not been seen previously by the system or the developers.

Our test data consisted of 10 problem solving scenarios. These were adapted versions of three actual transcripts of naval personnel solving problems caused by a disabled vessel. The personnel must determine whether to delay a mission, find a replacement vessel or schedule a repair for a later date. They use a database to find the necessary problem solving information. In addition, we created seven additional scenarios by paraphrasing the original three. The test scenarios contained an average of nine sentences with an average of eight words each. An excerpt of a dialogue sequence is given in Figure 1.

The training data consisted of five different problem solving scenarios from transcripts of naval personnel performing the same basic task. The training scenarios were used for developing the user models. Dialogue phases, goals and problem solving plans were derived from an abstract description of the stages and options available to a problem solver. The abstract plan descriptions had been provided by the Navy.

Since our database was different from the one used in gathering the original transcripts, we were forced to adapt all scenarios. Lexical items which were unknown to our system were substituted with known words. Ship names, locations, capabilities, mission requirements, etc. were changed to be consistent with our database. We feel these adaptations had minimal impact on the integrity of the data and did not alter the problem solving structure of the task. The lexicon for this domain contained 1,000 words.

Reduction of Search Space and Perplexity
Since the magnitude of the search space is such a critical factor in speech recognition, one measure of success is the reduction in search space provided by a system. To measure the constraint imposed by the knowledge sources we use two measures: perplexity and search space reduction. Perplexity is an information theoretic measure that is widely used in speech systems to characterize the constraint provided by a grammar. Perplexity is the geometric mean of the number of nodes that can be visited at each point during the processing of an utterance. In our case, we use the semantic network grammars to calculate the number of word alternatives the system has to consider. Test set perplexity is computed specifically for actual utterances: after we compute the alternatives for a word in the utterance, we assume the system recognizes this word correctly and continue by computing only the alternatives that directly follow this word in the grammar. A more detailed justification of this measure is given in [17]. The size of the search space is calculated by raising the sentence perplexity value to the number of words in the sentence.
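As a rough illustration of these two measures (a sketch under assumptions, not the MINDS implementation), suppose a function successors(prev_word) returns the set of words the semantic grammar network allows after prev_word; test set perplexity and per-sentence search space can then be computed as follows:

```python
import math

def test_set_perplexity(sentences, successors, start_symbol="<s>"):
    """Geometric mean of the number of word alternatives offered at each
    word position of the actual test sentences.  After each position the
    spoken word is assumed to be recognized correctly, and only the
    alternatives that follow it in the grammar are counted next."""
    log_sum, n_words = 0.0, 0
    for words in sentences:
        prev = start_symbol
        for w in words:
            log_sum += math.log(len(successors(prev)))
            n_words += 1
            prev = w
    return math.exp(log_sum / n_words)

def search_space_size(perplexity, n_words):
    # The search space for a sentence: perplexity raised to its word count.
    return perplexity ** n_words
```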

Our first experiment was designed to test the perplexity and search space reduction resulting from applying pragmatic constraints. To measure the reduction in perplexity and search space, we collected test set perplexity measurements for each of the parsed sentences under two conditions. The first condition represented the constraints provided by the complete semantic grammar networks with the full vocabulary available. The second condition measured perplexity for the most specific set of predictions that could be applied. The estimate for the second condition is the perplexity obtained by merging the successful prediction level with all of the more specific but unsuccessful levels of constraints. Otherwise, the results would be misleading whenever the predictions were not fulfilled.
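The bookkeeping for the second condition might look like the following sketch; it is our own illustration of the merging rule just described, and the layer representation and success test are assumptions rather than MINDS internals:

```python
def layer_covers(successors, words, start_symbol="<s>"):
    # A prediction layer "succeeds" for an utterance if it allows every
    # word transition the speaker actually used (illustrative criterion).
    prev = start_symbol
    for w in words:
        if w not in successors(prev):
            return False
        prev = w
    return True

def merged_successors(layers, words):
    """layers: successor functions ordered from most specific to most general.
    The alternatives of the first layer that covers the utterance are merged
    with those of every more specific layer that was tried and failed, so
    unfulfilled predictions cannot make the measured perplexity look better."""
    tried = []
    for layer in layers:
        tried.append(layer)
        if layer_covers(layer, words):
            break
    return lambda prev: set().union(*(layer(prev) for layer in tried))
```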

As seen in Table I, by applying our best constraints, test set perplexity was reduced by an order of magnitude, from 279.2 to 17.8, while search spaces decreased by roughly 10 orders of magnitude.

TABLE I. Reduction in Branching Factor and Search Space

                      Without predictions    With predictions
Test Set Perplexity   279.2                  17.8
Search Space          3.81 x 10^19           1.01 x 10^9
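For orientation (our arithmetic, not the authors'), the full-grammar search space in Table I is consistent with raising the test set perplexity to the average sentence length of about eight words reported above:

```python
# Rough consistency check with the ~8-word average sentence length:
279.2 ** 8    # ~3.7e19, the same order of magnitude as 3.81 x 10^19
17.8 ** 8     # ~1.0e10; the table's 1.01 x 10^9 is smaller, presumably
              # because the published figures were computed per sentence
```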

Improvements in Recognition Accuracy
To evaluate the effect of using predictions on recognition performance, we used 10 speakers (8 male, 2 female) who had never before spoken to the system. To assure a controlled environment for these evaluations, each speaker read 20 sentences from the test scenarios adapted from the Navy transcripts. Each of these utterances was recorded. The speech recordings were then analyzed by the MINDS system under two conditions. The first condition ignored all constraints except those provided by the complete semantic grammar; in other words, all possible meaningful sentences were acceptable at all times. The second condition used the MINDS system with the most specific set of predictions appropriate for the utterance.

To prevent confounding of the experiment by misrecognized words, the system did not use its normal speech recognition result to change state. Instead, after producing the speech recognition result, the system read the correct recognition from a file containing the correct set of utterances. Thus, the system always changed state according to a correct analysis of the utterance.
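Schematically, the evaluation loop ran as follows; this is our reconstruction of the procedure just described, and every function name is a placeholder rather than a MINDS interface:

```python
def run_condition(recordings, reference_transcripts, use_predictions):
    """Score one condition over the recorded test utterances.  Dialogue
    state is always advanced from the reference transcript, so an early
    misrecognition cannot distort the predictions used for later utterances."""
    scores = []
    state = initial_dialogue_state()                              # placeholder
    for recording, reference in zip(recordings, reference_transcripts):
        if use_predictions:
            constraints = most_specific_predictions(state)        # layered predictions
        else:
            constraints = full_semantic_grammar()                 # all meaningful sentences
        hypothesis = recognize(recording, constraints)            # placeholder recognizer
        scores.append(score_utterance(reference, hypothesis))     # word/semantic accuracy
        state = update_state(state, reference)                    # reference, not hypothesis
    return scores
```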

The results can be found in Table II. The system performed significantly better with the predictions: the error rate decreased from 17.9 percent to 3.5 percent. Perhaps just as important is the nature of the individual errors. In the condition with the most specific successful predictions, almost all of the errors (insertions and deletions) were made on the word "the." Another large proportion of errors consisted of substituting the word "his" for the word "its." Furthermore, none of the errors in the "with predictions" condition resulted in an incorrect database query. Hence, semantic accuracy was 100 percent on this sample of 200 spoken sentences.

TABLE II. Recognition Performance

                      Without predictions    With predictions
Test Set Perplexity   242.4                  18.3
Word Accuracy         82.1%                  97.0%
Insertions            0.0%                   0.5%
Semantic Accuracy     85%                    100%
Deletions             8.5%                   1.6%
Substitutions         9.4%                   1.4%
Error Rate            17.9%                  3.5%
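As a quick check on Table II (our arithmetic): the error rate is the sum of the insertion, deletion, and substitution rates, and word accuracy is 100 percent minus the deletion and substitution rates, since inserted words do not lower it:

```python
round(0.0 + 8.5 + 9.4, 1), round(0.5 + 1.6 + 1.4, 1)      # error rates: (17.9, 3.5)
round(100 - (8.5 + 9.4), 1), round(100 - (1.6 + 1.4), 1)  # word accuracy: (82.1, 97.0)
```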

CONCLUSIONS
It is obvious that the MINDS system represents only a beginning in the integration of speech recognition with natural language processing. We have shown how one can apply various forms of dialogue level knowledge to reduce the complexity of a speech recognition task. Our experiments demonstrated the effectiveness of the added constraints on the recognition accuracy of the speech system. We have also demonstrated that specific predictions can fail and that the system will recover gracefully using our mechanism for gradually relaxing constraints.

For this domain, we hand-coded all the goal trees and grammars into the knowledge sources of the system. For larger domains and vocabularies it would be desirable to automate the process of deriving the goal trees and grammars during interactions with the initial users. Much more work is needed on automatic modeling of human problem solving processes based on empirical observation.

We do not claim that these exact results should be obtainable in any domain or any task. Rather, it was our intent to demonstrate the usefulness of dialogue level knowledge for speech recognition. Future spoken language systems dealing with larger domains and very large vocabularies will be well advised to consider incorporating the kinds of mechanisms described in this article.

Acknowledgements. We are indebted to Raj Reddy, who chaperoned this research effort.

REFERENCES
1. Allen, J.F., and Perrault, C.R. Analyzing intention in utterances. Artif. Intell. 15, 3 (1980), 143-178.
2. Bahl, L.R., Jelinek, F., and Mercer, R.L. A maximum likelihood approach to continuous speech recognition. IEEE Trans. Patt. Anal. and Mach. Intell. 5, 2 (1983), 179-190.
3. Barnett, J. A vocal data management system. IEEE Trans. Audio and Electroacoustics AU-21, 3 (June 1973), 185-186.
4. Biermann, A., Rodman, R., Ballard, B., Betancourt, T., Bilbro, G., Deas, H., Fineman, L., Fink, P., Gilbert, K., Gregory, L., and Heidlage, F. Interactive natural language problem solving: A pragmatic approach. In Proceedings of the Conference on Applied Natural Language Processing (Santa Monica, Calif., Feb. 1-3, 1983), pp. 180-191.
5. Borghesi, L., and Favareto, C. Flexible parsing of discretely uttered sentences. COLING-82, Association for Computational Linguistics (Prague, July 1982), pp. 37-48.
6. Chapanis, A. Interactive human communication: Some lessons learned from laboratory experiments. In Shackel, B., Ed., Man-Computer Interaction: Human Factors Aspects of Computers and People. Sijthoff and Noordhoff, Rockville, Md., 1981, pp. 65-114.
7. Chin, D.N. Intelligent Agents as a Basis for Natural Language Interfaces. Ph.D. dissertation, Computer Science Division (EECS), University of California, Berkeley, 1988. Report No. UCB/CSD 88-396.
8. Cohen, P.R., and Perrault, C.R. Elements of a plan-based theory of speech acts. Cog. Sci. 3 (1979), 177-212.
9. Erman, L.D., and Lesser, V.R. The Hearsay-II speech understanding system: A tutorial. In W.A. Lea (Ed.), Trends in Speech Recognition. Prentice-Hall, Englewood Cliffs, N.J., 1980.
10. Fink, P.E., and Biermann, A.W. The correction of ill-formed input using history-based expectation with applications to speech understanding. Comput. Ling. 12, 1 (1986), 13-36.
11. Frederking, R.E. Natural Language Dialogue in an Integrated Computational Model. Ph.D. dissertation, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, 1986. Tech. Rep. CMU-CS-86-178.
12. Gatward, R.A., Johnson, S.R., and Connolly, J.H. A natural language processing system based on functional grammar. Speech Input/Output: Techniques and Applications, Institution of Electrical Engineers, 1986, pp. 125-128.
13. Grosz, B.J. The representation and use of focus in dialogue understanding. Stanford Research Institute, Stanford, CA, 1977.
14. Hauptmann, A.G., Young, S.R., and Ward, W.H. Using dialog-level knowledge sources to improve speech recognition. In Proceedings of AAAI-88, The 7th National Conference on Artificial Intelligence, American Association for Artificial Intelligence, Saint Paul, MN, 1988, pp. 729-733.
15. Hauptmann, A.G., and Rudnicky, A.I. Talking to computers: An empirical investigation. International J. Man-Machine Studies (1988, in press).
16. Hayes, P.J., Hauptmann, A.G., Carbonell, J.G., and Tomita, M. Parsing spoken language: A semantic caseframe approach. In Proceedings of COLING-86, Association for Computational Linguistics, Bonn, Germany, August 1986.
17. Kimball, O., Price, P., Roucos, S., Schwartz, R., Kubala, F., Chow, Y.-L., Haas, A., Kramer, M., and Makhoul, J. Recognition performance and grammatical constraints. In Proceedings of the DARPA Speech Recognition Workshop, Science Applications International Corporation Report Number SAIC-86/1546, 1986, pp. 53-59.
18. Lea, W.A. (Ed.). Trends in Speech Recognition. Prentice-Hall, Englewood Cliffs, N.J., 1980.
19. Lee, K.-F. Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The Sphinx System. Ph.D. dissertation, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, 1988. Tech. Rep. CMU-CS-88-148.
20. Levinson, S.E., and Rabiner, L.R. A task-oriented conversational mode speech understanding system. Bibliotheca Phonetica 12 (1985), 149-196.
21. Levinson, S.E., and Shipley, K.L. A conversational-mode airline information and reservation system using speech input and output. The Bell System Technical Journal 59 (1980), 119-137.
22. Litman, D.J., and Allen, J.F. A plan recognition model for subdialogues in conversation. Cog. Sci. 11, 2 (1987), 163-200.
23. Lowerre, B., and Reddy, R. The Hearsay Speech Understanding System. In W.A. Lea (Ed.), Trends in Speech Recognition. Prentice-Hall, Englewood Cliffs, N.J., 1980.
24. Newell, A., and Simon, H.A. Human Problem Solving. Prentice-Hall, Englewood Cliffs, N.J., 1972.
25. Reddy, R., and Newell, A. Knowledge and its representation in a speech understanding system. In Knowledge and Cognition, Gregg, L.W., Ed. L. Erlbaum Associates, Potomac, Md., 1974, pp. 256-282.
26. Sacerdoti, E.D. Planning in a hierarchy of abstraction spaces. Artif. Intell. 5, 2 (1974), 115-135.
27. Sidner, C.L. Focusing for interpretation of pronouns. Amer. J. Comput. Ling. 7, 4 (Oct.-Dec. 1981), 217-231.
28. Stern, R.M., Ward, W.H., Hauptmann, A.G., and Leon, J. Sentence parsing with weak grammatical constraints. ICASSP-87, 1987, pp. 380-383.
29. Tomabechi, H., and Tomita, M. The integration of unification-based syntax/semantics and memory-based pragmatics for real-time understanding of noisy continuous speech input. In Proceedings of AAAI-88, The 7th National Conference on Artificial Intelligence, American Association for Artificial Intelligence, Saint Paul, MN, 1988, pp. 724-728.
30. Walker, D.E. SRI research on speech recognition. In W.A. Lea (Ed.), Trends in Speech Recognition. Prentice-Hall, Englewood Cliffs, N.J., 1980.
31. Ward, W.H., Hauptmann, A.G., Stern, R.M., and Chanak, T. Parsing spoken phrases despite missing words. ICASSP-88, 1988.
32. Wilensky, R. Understanding Goal-Based Stories. Ph.D. dissertation, Yale University, Sept. 1978.
33. Wilensky, R. Planning and Understanding. Addison-Wesley, Reading, Mass., 1983.
34. Winograd, T. Language as a Cognitive Process, Volume I: Syntax. Addison-Wesley, Reading, Mass., 1982.
35. Wolf, J.J., and Woods, W.A. The HWIM Speech Understanding Systems. In W.A. Lea (Ed.), Trends in Speech Recognition. Prentice-Hall, Englewood Cliffs, N.J., 1980.
36. Woods, W.A., Bates, M., Brown, G., Bruce, B., Cook, C., Klovstad, J., Makhoul, J., Nash-Webber, B., Schwartz, R., Wolf, J., and Zue, V. Speech understanding systems: Final technical report. Tech. Rep. 3438, Bolt, Beranek, and Newman, Inc., Cambridge, Mass., 1976.
37. Young, S.R., Hauptmann, A.G., and Ward, W.H. An integrated speech and natural language dialog system: Using dialog knowledge in speech recognition. Tech. Rep. CMU-CS-88-128, Department of Computer Science, Carnegie-Mellon University, April 1988.
38. Young, S.R., and Ward, W.H. Towards habitable systems: Use of world knowledge to dynamically constrain speech recognition. 2nd Symposium on Advanced Man-Machine Interfaces through Spoken Language, Hawaii, Nov. 1988 (submitted).

ABOUT THE AUTHORS:

SHERYL R. YOUNG is a research faculty member of the Computer Science Department at Carnegie Mellon University. She has a B.A. in math and psychology from the University of Michigan and a Ph.D. in cognitive science/psychology from the University of Colorado.

ALEXANDER G. HAUPTMANN is working toward a Ph.D. in computer science at CMU. He has a B.A. and M.A. in psychology from Johns Hopkins University and a Diploma (M.A.) in computer science from the Technische Universitaet Berlin in West Germany.

WAYNE H. WARD is a research associate in the CMU Computer Science Department. He has a B.A. in mathematical science from Rice University and a Ph.D. in psychology from the University of Colorado. Authors' present address: Young, Hauptmann and Ward, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213.

EDWARD T. SMITH is president of Greenfield Educational Software, 1014 Flemington St., Pittsburgh, PA 15217. His research interests include small educational simulations for grade and high school students.

PHILIP WERNER is a software developer for MAD Intelligent Systems in Cambridge, Massachusetts. He has an M.Sc. in cognitive science from the University of Edinburgh and held a research assistantship while attending CMU as a graduate student.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
