JUPITER Data Collection and Analysis
Joseph Polifroni, James Glass and Sally Lee
We have been actively collecting data within
the JUPITER domain since the beginning of
1997. As was done in previous domains, we
first developed a prototype JUPITER system
and used it to collect spontaneous speech
using a Wizard paradigm, with a human
typist in the loop and subjects brought into
the lab and given scenarios to solve. At the
same time, we solicited read speech using
both our Web-based data collection facility
and a phone number that subjects could call
to read from pre-distributed lists.
Once these data had been collected, we
were able to train a recognizer and move on
to system-based data collection. We cur-
rently have a toll-free number, available
24 hours a day, that subjects can call to
obtain weather information; the number has
been disseminated exclusively by word of
mouth and the SLS Group's homepage. The
utterances collected from this facility are
also used as training data. The toll-free
number has been a particularly powerful
method for collecting data from a variety of
subjects in a short period of time. We feel
that these calls accurately reflect the way
users want to interact with such systems.
Data Collection and Transcription

In recent months, we have been receiving
approximately 17 calls per day. In order to
ensure that these data are ready to use as
quickly as possible for both speech recogni-
tion and natural language development, we
have been processing incoming data on a
daily basis. Every morning a script automati-
cally sends email containing a list of the
previous day's calls. These calls are then
manually transcribed, usually over the
course of the following day. The transcribed
calls are then bundled into sets containing
approximately 500 utterances and are added
to the training corpus as they become
available.
Over the past year, we have continued
to refine our transcription tool, which was
originally developed for orthographic
transcription of read speech. We used a Tcl/
Tk interface to provide an editable window
where the transcriber listens to utterances,
corrects existing transcriptions, and adds
specialized markings for noise, partial words,
etc. The initial transcription for each
utterance was the orthography hypothesized
by the recognizer during the call. The
transcriber could also identify the talker as
male, female, or child using the transcrip-
tion tool.
The transcription tool uses a lexicon to
check the quality of orthographic transcrip-
tions. This feature is useful for finding
typographical errors before they are saved. If
a word does not appear in the lexicon, the
transcriber is warned and given the option
of either adding the word to the lexicon or
changing it in the orthography file. The use
of the lexicon also allowed us to monitor
the growth of new words in the domain.
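To make the mechanism concrete, here is a minimal Python sketch of such a lexicon check; the file format and the markings for noise and partial words are illustrative assumptions, not the actual conventions of our Tcl/Tk tool.

    # Sketch of a lexicon-based check on an orthographic transcription.
    # The "[noise]" marking and the trailing "-" for partial words are
    # assumed conventions for illustration only.

    def load_lexicon(path):
        """Read one known word per line into a set."""
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}

    def unknown_words(transcription, lexicon):
        """Return tokens not found in the lexicon, skipping markings."""
        flagged = []
        for token in transcription.lower().split():
            if token.startswith("[") or token.endswith("-"):
                continue  # special markings are not spell-checked
            if token not in lexicon:
                flagged.append(token)
        return flagged

    lexicon = {"what", "is", "the", "weather", "in", "boston"}
    print(unknown_words("what is the wether in boston [noise]", lexicon))
    # -> ['wether']: the transcriber either fixes the typo or adds the word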
Corpus Analysis

A breakdown of the current utterances in
our JUPITER corpus is shown in Table 1. The
live data were collected between May 1997
and July 1998. The read speech data
collection contained a list of 50 utterances
which users were asked to read. The wizard
data collection effort involved subjects
attempting to solve several different
scenarios and thus produced a considerable
number of utterances per user. Note that
the wizard data was the most time consuming
to collect. In contrast, the live data
collected via the toll-free number tends to
have many fewer utterances per call, as users
are typically interested in weather informa-
tion for one particular location. In addition,
the queries by real users tend to be some-
what shorter, although this statistic is
probably skewed by utterances which do not
contain an actual query due to a false
triggering of the endpoint detector.
A breakdown of the live data shows that
just over 80% of users are male speakers,
with females comprising approximately 18%
of the data, and children the remainder.
A portion of the data was from non-native
speakers, although the system performed
adequately on speakers whose dialect or
accent did not differ too much from general
American English. Callers with strong
accents, however, have been thus far
excluded from training or test sets for both
the speech recognition and natural language
components. These data constituted
approximately 7% of the calls and 14% of
the data, and will be useful for future study.
A very small fraction (0.1%) of the utter-
ances included talkers speaking in a foreign
language (e.g., Spanish, French, German,
and Chinese).
The signal quality of the data varied
substantially depending on the handset, line
conditions, and background noise. Speaker-
phones were used in approximately 5% of
the calls, as judged by the presence of
multiple talkers in an utterance, although
this figure is clearly an underestimate. Only
a small fraction of the data (0.5%) was
estimated to be from cellular phones.
Over 11% of the data contained
significant noises. About half of this noise
was due to cross-talk from other speakers,
while the other half was due to non-speech
noises. The most common identifiable non-
speech noise was due to the user hanging up
the phone at the end of a recording (e.g.,
after saying good bye). Other distinguish-
able sources of noise included (in descending
order of occurrence) TV, music, phone
rings, and touch tones. These data were also
excluded from training and testing of the
speech recognition component, although
cleaned-up orthographies were used for
natural language training.
There were a number of spontaneous
speech effects present in the recorded data.
Over 6% of the data included filled pauses
(e.g., uh, um, etc.) which were explicitly
modeled as words in the recognizer, since
they had consistent pronunciations, and
occurred in predictable places in utterances.
Filled pauses were removed from the
sentence hypotheses sent to the natural
language component. Utterances containing
partial words constitute another 6% of the
data, although approximately two thirds of
these were due to clipping at the beginning
or end of an utterance. The remaining
artifacts were contained in less than 2% of
the data and included phenomena such as
(in descending order of occurrence) laughter,
throat clearing, mumbling, shouting,
coughing, breathing, sighing, sneezing, etc.
The latter data and all clipped data were
also excluded from training or testing of the
speech recognition component.

Table 1. Breakdown of the JUPITER corpus by collection method.

Calls   Utts    Type     Av Words/Utt   Utt/Call
13168   74697   live     5.6            5.7
25      956     wizard   7.5            38.2
71      3519    read     9.0            49.6
3772    25175   total    6.1            6.7
The data from foreign speakers, data
containing cross-talk or other noises, and
data containing clipped speech and other
spontaneous phenomena (excluding filled
pauses) combined to account for approxi-
mately one quarter of all recorded data. To
date, these data have not been used for
training or testing the speech recognizer,
despite the fact that the system often
performs quite well on these data during an
actual call. The reason we have avoided
these data to date is the concern that
the recognizer will produce poor alignments
during training, which would ultimately
contaminate the acoustic-phonetic models.
We hope to test this hypothesis and
examine these data more closely in the near
future.
Natural Language Processing in the JUPITER Domain
Stephanie Seneff and Joseph Polifroni

Parsing Linguistic Content

Within the JUPITER system, we have begun to
explore the possibility of using natural
language systems to parse textual informa-
tion recovered from various Web sites, in
order to facilitate selection of the appropri-
ate message to speak in response to specific
user requests. Weather forecasts are
obtained from multiple sites on the World
Wide Web, whose information content
complements one another. If the system fully
understands the information it is presenting
to the user, it will be able to select subsets
that are relevant to the question, avoiding
boring the user with long monologues on
irrelevant distractions. It will also be able to
recognize when two sources are providing
overlapping information, and can select one
or the other based on quality consider-
ations. Finally, by representing all linguistic
data in a language-transparent semantic
frame format, the notion of translating the
weather reports into other languages
becomes far more feasible. An example
interaction with JUPITER displaying some of
its dialogue capabilities and understanding of
specific weather information is shown in
Figure 2.
Figure 2. Example spoken dialogue between JUPITER and a user.

J: Hi! Welcome to JUPITER, the MIT Lab for Computer Science weather information system. You may hang up or ask for help at any time. How can I help you?
U: CAN YOU TELL ME WHAT’S THE WEATHER LIKE IN BOSTON TODAY
J: In Boston this afternoon. Partly sunny early. Then becoming cloudy with 30 percent chance of showers toward evening. Tonight: rain. Possibly heavy at times. Near 100 percent chance of rain. Can I help you with something else?
U: YES WHAT ABOUT SAN FRANCISCO
J: In San Francisco today: patchy fog near the ocean morning. Otherwise mostly sunny. Tonight: areas of coastal fog and low clouds. Otherwise fair. Is there something else I can do for you?
U: YES HOW COLD IS IT IN DALLAS TEXAS TODAY
J: In Dallas today: high upper 40s. Tonight: lows near 32. What else would you like to know?
U: I’D LIKE TO KNOW WHETHER IT’S SNOWING IN DETROIT
J: I’m sorry, I didn’t understand you. Please rephrase your query.
U: WHAT IS THE WEATHER IN DETROIT
J: In Detroit this afternoon: occasional light rain and drizzle. Tonight: rain showers likely. Becoming snow showers by late tonight. 70 percent chance of precipitation.
While most of our sources contain
scripted phrases that are predictable in
format, the National Weather Service
(NWS) provides manually generated
forecasts that are quite rich in linguistic
variation. Some example sentences are
shown in Table 2. Although these data are
challenging from the standpoint of parse
coverage, they give a rich description of the
weather, including very descriptive accounts
of conditions over time and locale, predic-
tions of amounts of precipitation, advisories
for hurricanes, floods, etc. Thus, we believe
it to be worth the effort to process them.
After the first three months of develop-
ing a grammar for the NWS data, the
system had achieved high coverage (over
99%) on incoming weather reports. Any
sentences that fail to parse can be rephrased
by the system developer to retain an
equivalent meaning. The grammar can later
be expanded to encompass a broader base.
This is an ongoing process. By requiring a
full parse and hand-editing sentences that
fail to parse, we can guarantee a high
reliability in the understanding and
regeneration. We feel it is important to
assure, as much as possible, that the system
rarely provides incorrect or garbled informa-
tion, which would be much more likely with
any robust parsing strategies. The current
grammar contains over 1100 unique words,
with about 800 categories, about half of
which are pre-terminal.
JUPITER is updated several times a day, by
polling the Web for any changes in predic-
tions. An automatic procedure parses the
data into semantic frames [1], and a second
process sorts them into categories based on
the meaning. In general, each sentence is
indexed under generic terms such as
“weather,” “temperature,” and “advisory,”
and under more specific terms such as
“rain,” “fog,” “snow,” and “hurricane,” as
well as by date (“today,” “tomorrow,” etc.),
and by location. To retrieve the answer to a
particular user request, the system first
retrieves the indices of the relevant sen-
tences in the weather report via an SQL
query filtering on the particulars of the user
request, and then orders them sequentially.
Finally, each of the semantic frames is
paraphrased in turn, to compose a verbal
response. Delays are minimal, since the
system has preprocessed all current informa-
tion into semantic frames in a local cache.
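The indexing and retrieval machinery can be sketched in a few lines of Python; the category names follow the ones mentioned above, while the in-memory index standing in for the SQL database is an illustrative simplification.

    from collections import defaultdict

    # Index of preprocessed report sentences: each sentence id is filed
    # under (location, date, topic) keys; frames holds the semantic frames.
    index = defaultdict(list)
    frames = {}

    def add_sentence(sid, frame, location, date, topics):
        frames[sid] = frame
        for topic in topics:  # generic ("weather") plus specific ("rain")
            index[(location, date, topic)].append(sid)

    def retrieve(location, date, topic):
        """Return matching frames, ordered as in the original report."""
        return [frames[sid] for sid in sorted(index[(location, date, topic)])]

    add_sentence(0, {"pred": "rain", "chance": 30}, "boston", "today",
                 ["weather", "rain"])
    add_sentence(1, {"topic": "temperature", "high": 48}, "boston", "today",
                 ["weather", "temperature"])
    print(retrieve("boston", "today", "rain"))
    # -> [{'pred': 'rain', 'chance': 30}]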
We have begun an effort to port the
system to three other languages besides
English: German, Mandarin Chinese, and
Spanish. For delivery of the information in
a foreign language, generation tables are
being prepared that mirror the tables for
English, but specify alternative phrase
ordering rules and vocabulary entries. For
each of these languages, a native speaker
who is fluent in English is preparing the
corresponding GENESIS generation rules [2].

Table 2. Some examples of sentences from the National Weather Service web site.

"Increasing clouds and becoming breezy with showers likely and a chance of thunderstorms by the afternoon"

"Near record low temperatures with a low from upper 20s north portions to near 40 south sections near Del Rio"

"The National Weather Service has continued the excessive heat warning for highly urbanized areas for Friday"

"Some storms may be severe with damaging winds and isolated tornadoes"
German is a particularly difficult
language due to its extensive use of inflec-
tional endings. In particular, it was neces-
sary to augment GENESIS with a more
sophisticated ability to deal with case, which
can be assigned in the vocabulary file by
prepositions and verbs. In addition, we
needed to be able to specify the correct
inflectional endings for nouns and adjec-
tives as a function of case, gender, and
number.
We “closed the loop” on a complete
Spanish system this year, using acoustic
models from the GALAXIA system, developed
in 1996. Although much work remains in
all aspects of the system, we were pleased to
have a system that would accept Spanish
input and speak a Spanish reply (using the
Spanish version of TruVoice, a software
speech synthesizer from Centigram Corpo-
ration). Parsing of Spanish input was
achieved using TINA, which needed no
modification to system code to produce
semantic frames from Spanish input.
Although Spanish has more complicated
number and gender agreement than
English, GENESIS had no trouble producing
grammatically correct utterances in Spanish.
The additional work that we had done to
accommodate German, with an even more
complex agreement system incorporating
case, meant that the mechanisms in GENESIS
for making such agreement had been
thoroughly exercised.
Figure 3 gives an example of a semantic
frame for the sentence, “2 to 4 inches
snowfall accumulation by morning,” along
with the corresponding paraphrases in three
languages. Note that the preposition “by”
has been interpreted in the semantic frame
as denoting a time expression, allowing the
appropriate translation of this diversely
realized preposition.
Figure 3. Example sentence and semantic frame along with paraphrases in English, German and Spanish.

clause: weather_event
  topic: accumulation
    name: snowfall
  pred: amount
    topic: value, name: 2
    pred: to_value
      topic: value, name: 4, units: inches
  pred: by_time
    pred: time_interval
      topic: time_of_day, name: morning

Input:   2 to 4 inches snowfall accumulation by morning
English: snowfall 2 to 4 inches by morning
German:  Schneefall 2 bis 4 Inch bis am Morgen
Spanish: nevada 2 a 4 pulgadas antes de la mañana
N-best Selection Based on Natural Language Understanding

The recognizer for JUPITER produces an N-
best list of candidate utterances, which are
passed to the natural language system for
final selection. We are exploring new
methods for selection, which include an
efficient strategy that parses a word graph
regenerated from the N-best list, and
alternative robust-parsing strategies,
including an aggressive one that, upon total
parse failure, simply detects dominant
content words in the N-best list and
operates analogously to a word-spotting
algorithm. We are investigating ways to
combine the linguistic and the recognizer
scores, and have also incorporated a
mechanism to influence selection based on
the response to the preceding query. For
instance, when the system reads off a list of
cities in a particular region, it gives priority
to any candidate hypotheses that contain
one of those cities.
Dialogue Modelling

We have discovered several interesting issues
with regard to dialogue modelling to
accommodate users’ requests, and we are
becoming increasingly aware of the benefits
in letting real users influence the design of
the interaction. One of the critical aspects
of any conversational interface is the need
to inform the user of the scope of the
system’s knowledge. JUPITER only knows
approximately 500 cities, and users need to
be directed to select relevant available data
when their explicit request yields no
information. Even for the cities it knows,
JUPITER does not necessarily have the same
knowledge for all cities.
JUPITER has a separate geography table
organized hierarchically, enabling users to
ask questions such as “What cities do you
know about in the Caribbean?” This table is
also used to provide a means of summariz-
ing a result that is too lengthy to present
fully. For example, if the user asks where it
will be snowing in the United States, there
may be a long list of cities expecting snow.
The system then climbs a geographical
hierarchy until the list is reduced to a
readable size. For example, JUPITER might list
the states where it is snowing, or it might be
required to reduce the set even further to
broad regions such as “New England.”
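A minimal sketch of this climbing procedure, with a toy hierarchy and an assumed readability threshold:

    # Toy geography hierarchy: child -> parent. The threshold of two
    # readable items is an assumption for illustration.
    PARENT = {"Boston": "Massachusetts", "Worcester": "Massachusetts",
              "Hartford": "Connecticut", "Massachusetts": "New England",
              "Connecticut": "New England"}

    def summarize(places, max_items=2):
        """Climb the hierarchy until the answer list is short enough."""
        while len(set(places)) > max_items:
            parents = [PARENT.get(p) for p in set(places)]
            if None in parents:  # reached the top of the hierarchy
                break
            places = parents
        return sorted(set(places))

    snowing = ["Boston", "Worcester", "Hartford"]
    print(summarize(snowing))                # -> ['Connecticut', 'Massachusetts']
    print(summarize(snowing, max_items=1))   # -> ['New England']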
We have implemented a BIANCA-style
control strategy for JUPITER (cf. page 29),
which has made it relatively easy to visualize
the control flow, and to increase the
complexity of the dialogue capabilities. We
are in the process of regularizing the top-
level shell program for JUPITER and PEGASUS
towards a goal of providing a generic
Domain Server shell that can accelerate the
development of any future knowledge
domains.
Evaluation

JUPITER is both an on-line system that
receives an average of 500 calls per month
and a research system that is used as a
testbed in areas such as displayless interac-
tion, virtual browsing, information on
demand, and translingual content manage-
ment. As we have used JUPITER to further
our research agenda, we have become
concerned with having reliable evaluation
metrics for all JUPITER system components, to
ensure that new releases are adequately
evaluated before becoming part of the on-
line JUPITER system. Each separate compo-
nent of JUPITER is in a state of active develop-
ment; furthermore, the components
interact with each other (e.g., items in focus
are communicated from the JUPITER back-
end to the natural language component for
use in filtering incoming queries). Before
incorporating any changed version of an
individual component into the system, it is
necessary that we measure the effects of the
changes, both in the particular component
being updated and in the entire system. In
addition, to verify that we are making
progress with each update, we need a
consistent metric that can be applied across
time and across system components, in
addition to metrics for each individual
component. Over the past year, we have
refined a methodology for comparing
semantic frames, a meaning representation
produced by the natural language compo-
nent and used by many other system
components.
In order to compare semantic frames,
we must first establish a baseline set of
semantic frames that can be considered
“correct.” To do this, the evaluation
program is run in a mode that creates a
reference file of utterances, with their
associated semantic frame representations
and their system-generated paraphrases.
This reference file can be created with or
without incorporating discourse context,
enabling the evaluation to be conducted
under either condition.
Once a reference has been created,
system modifications can be tested against
this reference by running the evaluation
program in comparison mode. Since we
have adopted a hierarchical structure for the
semantic frame, any comparison must
recurse through both the reference and new
semantic frames and compare the individual
sub-frames that make up each. In addition,
not all of the differences between semantic
frames are critical for proper understanding
(e.g., the difference between an indefinite
and a definite article does not affect queries
to the JUPITER back-end), so the comparison
must know what is critical for understand-
ing and what can be ignored. Because
changes in the rules that govern parsing or
discourse inheritance can have unforeseen
consequences, this program can be used to
flag changes, whether intentional or inadvertent,
that were introduced as a consequence of
system modifications. When a
new version of the system is ready for
release, a new reference set is created for the
same set of test utterances.
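The heart of the comparison is a recursion over sub-frames that skips differences known not to affect understanding. A minimal sketch, assuming frames are represented as nested dictionaries and with an invented set of ignorable keys:

    # Keys whose differences do not affect queries to the back-end.
    # The single entry here is an illustrative assumption.
    IGNORE = {"article"}   # e.g., definite vs. indefinite article

    def frames_match(ref, hyp):
        """Recursively compare two frames, ignoring non-critical keys."""
        if not isinstance(ref, dict) or not isinstance(hyp, dict):
            return ref == hyp
        keys = (set(ref) | set(hyp)) - IGNORE
        return all(frames_match(ref.get(k), hyp.get(k)) for k in keys)

    ref = {"clause": "identify", "topic": {"name": "flight", "article": "def"}}
    hyp = {"clause": "identify", "topic": {"name": "flight", "article": "indef"}}
    print(frames_match(ref, hyp))   # -> True: the difference is not critical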
To detect any newly introduced genera-
tion errors, we maintain a system-generated
paraphrase for each frame in the reference
set. The outputs of the newest version of
the generation component can then be
compared against the stored templates, and
a human evaluator is invoked to judge
whether all observed changes are intended.
We have also used the semantic frame
measure to evaluate our N-best selection
procedure, which incorporates both
rejection and robust parsing. We have
explored several different strategies for N-
best selection, and found it productive to
compare the effectiveness of different
strategies by matching semantic frames,
both between two different hypothesized
solutions and between each solution and
the reference transcription. For example, we
have used automatic procedures based on
matching semantic frames as knowledge
sources for refining the parameter selection
process for rejection.
To determine if this automatic meaning
evaluation can be used to test something
that was previously done by human
evaluators, we conducted an experiment in
which we compared the automatic semantic
frame comparison algorithm for under-
standing against performance results
obtained from a subjective evaluation of log
files recorded from the original dialogue
with the user. These two methods appeared
to give very similar ratings, suggesting that
our automatic techniques are valid.
References

[1] S. Seneff, "TINA: A Natural Language System for Spoken Language Applications," Computational Linguistics, vol. 18, no. 1, pp. 61-86, 1992.

[2] J. Glass, J. Polifroni, and S. Seneff, "Multilingual Language Generation Across Multiple Domains," Proc. International Conference on Spoken Language Processing, pp. 983-986, Yokohama, Japan, September 1994.
Spontaneous Speech Recognition in the JUPITER Domain
James Glass
The main focus of speech recognition
activity over the past eighteen months has
been on the JUPITER telephone-based
weather information domain. Beginning in
February and March 1997, we created an
initial corpus of approximately 3,600 read
utterances, augmented with over 1,000
utterances collected in a wizard environ-
ment. These data were used to create an
initial version of a recognizer which could
be used to automatically collect data from
users from a toll-free number. This system
was deployed in late April 1997, and has
provided a continuous source of data since
that time. These data have been used to
periodically improve the acoustic and
language models used by the recognizer.
Over the course of telephone data collec-
tion for this year, we have been able to
reduce the word error rate by a factor of
three. The following sections describe the
various components of the recognizer in
more detail.
Vocabulary

The vocabulary used by the JUPITER system
has evolved over the course of the year as
periodic analyses were made of the growing
corpus. By the summer of 1997 the vocabu-
lary had stabilized to 1345 words, including
over 500 cities and 150 countries. Table 3
shows a breakdown of the various types of
words in the current vocabulary. Note that
over half of the vocabulary contains
geography-related words.
The design of the geography vocabulary
was based on the cities for which we were
able to provide weather information, as well
as commonly requested cities. Other words were
incorporated based on frequency of usage
and whether or not the word could be used
in a query which the natural language
component could understand. The 1345
words had an out-of-vocabulary (OOV) rate
of 3.6% on a 1445 utterance test set. Table 4
shows some of the most frequently occur-
ring OOV words as of the end of 1997,
along with some example usages of the
word. The current vocabulary will be
expanded to incorporate many of these
words in the coming year.
Since the recognizer made use of a
bigram grammar in the forward Viterbi
pass, several multi-word units were incorpo-
rated into the vocabulary to allow for
greater long-distance constraint. An added
benefit of the multi-word units was to allow
for specific pronunciation modelling, such
as the sequence “going to” or “give me”
being produced as “gonna” or “gimme”
respectively. Common contractions such as
“what’s” were represented as multi-word
units (e.g., “what_is”) to reduce language
model complexity, and since these words
were often a source of transcription error
anyway. Additional multi-word candidates
were identified using a mutual information
criterion which looked for word sequences
which were likely to occur together. Table 5
shows examples of multi-word units found
in the JUPITER vocabulary.
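The mutual information criterion can be sketched as follows; the toy corpus and the count and score thresholds are illustrative assumptions.

    import math
    from collections import Counter

    corpus = [["what", "is", "the", "weather", "in", "boston"],
              ["give", "me", "the", "forecast"],
              ["give", "me", "the", "weather", "in", "boston"]]

    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n = sum(unigrams.values())

    def mutual_information(w1, w2):
        """Pointwise MI: log2 of p(w1,w2) / (p(w1) p(w2))."""
        return math.log2(bigrams[(w1, w2)] * n
                         / (unigrams[w1] * unigrams[w2]))

    for (w1, w2), count in bigrams.items():
        if count > 1 and mutual_information(w1, w2) > 2.0:
            print(f"{w1}_{w2}")   # e.g., give_me, in_boston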
Table 3. Categorical breakdown of JUPITER vocabulary.

Type        Size   Examples
geography   770    Boston, Alberta, France, Africa
basic       408    I, what, January, tomorrow
weather     167    temperature, snow, sunny
Language Modelling

A class bigram language model was used in
the forward Viterbi search, while a class
trigram model was used in the backwards
A* search to produce the 10-best outputs for
the natural language component. A set of
nearly 200 classes were used to improve the
robustness of the bigram. The majority of
the classes involved grouping cities by state
or country (foreign), in order to encourage
agreement between city and state. In cases
where a city occurred in multiple states or
countries, separate entries were added to the
lexicon (e.g., Springfield Illinois vs. Spring-
field Massachusetts). Artificial sentences
were created in order to provide complete
coverage of all of the cities in the vocabu-
lary. Other classes were created semi-
automatically using a relative entropy metric
to find words which shared similar condi-
tional probability profiles. Table 6 shows
examples of word classes.
Since filled pauses (e.g., uh, um)
occurred both frequently and predictably
(e.g., start of sentence), they were incorpo-
rated explicitly into the vocabulary, and
modelled by the bigram and trigram. These
words were stripped out of word hypotheses
before being passed to the natural language
component. When tested on a 1445
utterance test set the word-class bigram and
trigram had perplexities of 16.6 and 15.4,
respectively. These are slightly lower than
the respective word bigram and trigram
perplexities of 17.2 and 16.4. Note that the
class bigram also improved the speed of the
recognizer as it had 25% fewer connections
to consider during the search.
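For concreteness, perplexity under such a class bigram can be computed as below, since p(w_i | w_{i-1}) factors into a class bigram term and a word-given-class term; the classes and probabilities here are made up for illustration.

    import math

    word_class = {"weather": "WEATHER", "in": "IN",
                  "boston": "CITY", "denver": "CITY"}
    class_bigram = {("WEATHER", "IN"): 0.5, ("IN", "CITY"): 0.9}
    word_given_class = {"weather": 1.0, "in": 1.0,
                        "boston": 0.4, "denver": 0.3}

    def log_prob(words):
        """Sum of log2 p(w_i | w_{i-1}) under the class bigram."""
        total = 0.0
        for prev, word in zip(words, words[1:]):
            p = (class_bigram[(word_class[prev], word_class[word])]
                 * word_given_class[word])
            total += math.log2(p)
        return total

    test = ["weather", "in", "boston"]
    perplexity = 2 ** (-log_prob(test) / (len(test) - 1))
    print(f"{perplexity:.2f}")   # geometric-mean inverse word probability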
Acoustic Modelling

Words in the lexicon were modelled with a
set of 69 labels, consisting of the standard
TIMIT labels augmented with a few larger
sonorant sequences such as vowel+/r/,
which have previously been found to
improve the performance of our segment-
based recognizer. A small set of pronuncia-
tion rules were applied to the lexicon to
create a pronunciation graph. The majority
of the rules applied to cross-word effects
such as multiple possible realizations of stop
consonant sequences and geminate
consonants.

Table 4. Examples of out-of-vocabulary words.

Word       Count   Usage Examples
up         29      hang up, come up, screwed up
pasadena   28
who        23      who are you, who created you
sure       19      sure, I am not sure
hey        17      hey this is neat
why        17      why did you tell me Sunday
point      16      dew point
el nino    14      do you know anything about el nino

Table 5. Examples of multi-word units.

can_you     when_is      never_mind
do_you      what_about   clear_up
excuse_me   what_are     heat_wave
give_me     what_will    pollen_count
going_to    how_about    warm_up
you_are     I_would      wind_chill
Acoustic models consisted of both
segment-based models and boundary-based
diphone models. The 69 context-indepen-
dent segment-models used a set of 40
features consisting of duration, MFCC
averages computed over segment thirds, and
derivatives computed at segment end-points.
The 742 diphone models used a set of 50
features consisting of 8 MFCC averages
centered around a hypothesized boundary.
A set of pooled principal components were
used to whiten the feature space for both
sets of models.
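One plausible way to assemble such a segment feature vector is sketched below; the 13-dimensional MFCCs are an assumption, and the end-point derivative features are omitted for brevity, so the dimensions work out to duration plus three averaged thirds.

    import numpy as np

    def segment_features(mfcc):
        """mfcc: (n_frames, 13) array for one hypothesized segment.
        Returns duration plus MFCC averages over segment thirds."""
        thirds = np.array_split(mfcc, 3)              # split along time
        averages = [t.mean(axis=0) for t in thirds]   # 3 x 13 = 39 values
        return np.concatenate([[len(mfcc)]] + averages)

    segment = np.random.randn(30, 13)        # a 30-frame segment
    print(segment_features(segment).shape)   # -> (40,)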
Experiments

Over the course of this reporting period
several different recognizers have been
trained and evaluated. Figure 4 shows the
results of a mid-December performance
evaluation, which plots the performance of
all recognizers on the same set of data
recorded in the fall of 1997. All testing has
been performed on the subset of collected
data considered to be within domain, and
excludes utterances with out-of-vocabulary
words, clipped speech, cross-talk, and other
kinds of noise. The within-domain utter-
ances typically correspond to 70 to 75% of
the recorded data. Note that in practice, the
recognizer is often able to correctly process
these excluded utterances.

Table 6. Example word classes used in the JUPITER domain.

raining snowing
cold hot warm
humid windy
extended general
advisories warnings
conditions forecast report
humidity temperature
As shown in Figure 4, the performance
has consistently improved over time as the
recognizers have been able to incorporate
more sophisticated language and acoustic
models due to increased amounts of
training data. At the end of April the
recognizer was trained primarily on read
speech and wizard-based spontaneous
speech. This laboratory trained system had
word-error rates of 7% on read-speech and
10% on spontaneous speech collected from
group members. However, the error rates
initially tripled on incoming data from real
users. Over the course of the year both word
and sentence error rates have been reduced
by a factor of three, and are now in the 10%
and 26% range, respectively. Note that the
spontaneous error rates are considerably
lower for experienced users (3% word error
and 10% sentence error rates, respectively).
Future Work

Although the performance of the system has
improved considerably over the course of
the year, there remains a considerable
number of interesting research topics to be
pursued in the future. The system to date
has used a pooled speaker model for all
acoustic modelling. It should be possible to
achieve gains through speaker normalization
and short-term speaker adaptation, and also
through better adaptation to the channel
conditions of individual phone calls.
Adaptation may also be useful to help
improve performance on non-native
speakers. Speakers with strong foreign
accents have thus far been excluded from
training and testing, and therefore often
experience increased error rates.

Figure 4. Word and sentence error rates (%) over the course of 1997, from April through November.
The data collection efforts have
produced a gold-mine of spontaneous
speech effects which are often a source of
both recognition and understanding errors.
For example, partial words typically cause
problems for the speech recognizer. Another
source of recognition errors is out-of-
vocabulary words, which are often cities not
covered in the vocabulary. These phenom-
ena can potentially be modelled better by
both acoustic and language models. The use
of dynamic vocabulary and language models
may also help alleviate some of the un-
known city problems. Finally, we plan to
explore a more fully integrated recognizer,
to try and impose additional constraint
from the language understanding compo-
nent.
Confidence Scoring for Speech Understanding
Christine Pao, Philipp Schmid, and James Glass
Most speech recognition systems determine
the most likely sequence of words for a
given utterance. However, it is not always
desirable to make use of these word
hypotheses, especially when they are
incorrect! In a conversational system for
example, it is useful to also generate a
confidence score to indicate whether a word
sequence is actually correct. In this work, we
report on our initial efforts to produce
confidence scores for utterance rejection in
the JUPITER telephone-based weather
information domain.
An examination of data for the JUPITER
domain shows that 25-30% of the data
contain out-of-vocabulary words, clipped
speech, partial words, non-speech (e.g.,
silence, laughter), or other background
noise. Since it is not clear whether these
data can be correctly recognized or under-
stood by the system, they have been
excluded from all system evaluations,
although they clearly must be handled by
the deployed system. Our approach has
been to sort the JUPITER data into two
classes: data that the system can understand
and those which it cannot. We first devel-
oped methods which could reliably annotate
the data automatically, then developed
features and classifiers which could distin-
guish between the two classes.
Automatic Annotation

We began the annotation process by first
classifying subsets of the data by hand. We
then used this manually annotated data to
develop and evaluate an automatic annota-
tion process. The automatic annotation
process we developed compares the seman-
tic frame generated from the correct
transcription with the frames corresponding
to the N-best word hypotheses produced by
the recognizer. The utterance is marked as
correctly understood if the semantic
representation of the first hypothesis which
parses matches the semantic frame gener-
ated from the correct transcription. If there
is no match or the transcription cannot be
parsed (has no semantic representation), the
utterance is marked as not understood.
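A minimal sketch of this decision rule, with a toy stand-in for the parser:

    def parse(text):
        """Toy stand-in for the natural language component: returns a
        semantic frame for utterances it covers, None otherwise."""
        known = {"what is the weather in boston": {"city": "boston"},
                 "weather in boston": {"city": "boston"}}
        return known.get(text)

    def annotate(transcription, nbest):
        """Mark an utterance understood if the first parsable N-best
        hypothesis yields the same frame as the transcription."""
        reference = parse(transcription)
        if reference is None:
            return "not understood"   # transcription itself has no parse
        for hypothesis in nbest:
            frame = parse(hypothesis)
            if frame is not None:     # first hypothesis that parses
                return ("understood" if frame == reference
                        else "not understood")
        return "not understood"

    print(annotate("what is the weather in boston",
                   ["what is the weather and boston", "weather in boston"]))
    # -> understood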
The automatic annotation agreed with
the manual annotation in 1858 out of 2051
cases (90.6% agreement). When the 193
disagreements were examined, 154 were
resolved in favor of the automatic annota-
tion and 39 in favor of the manual annota-
tion. Assuming the cases where manual and
automatic annotation are in agreement were
correctly marked, the automatic annotation
is 98.1% accurate. According to the
automatic annotation, the train, develop-
ment, and test sets used for later experiments
had correct understanding rates of 63.6%,
56.6%, and 64.9% respectively. Note that
these results are considerably lower than the
near 80% understanding rates we achieve
on the within-vocabulary data subset, and
reflect the presence of the various phenom-
ena we mentioned earlier.
Feature Selection

The next step was to develop a set of
features which would be relevant to
recognition/understanding confidence.
There were three classes of features which
we investigated. The first included scores
generated by the recognizer such as the
acoustic and language model scores. The
second was based on the N-best list, and
included the number of sentence hypoth-
eses, word frequency statistics, and semantic
distances between individual sentence
hypotheses. The third class was based on
how many of the N-best hypotheses were
completely or partially understood by the
system. A total of 47 features were consid-
ered for classification.
Our first application for confidence
scoring was a classification task: separating
utterances likely to be understood (AC-
CEPT) by the system from utterances not
likely to be understood (REJECT). Fisher
linear discriminant analysis (LDA) was used
to select the best feature set for this task.
The feature sets were created iteratively. On
each iteration, N feature sets from the
previous iteration were each augmented
with one additional feature from M unused
features. The N*M new feature sets were
scored using LDA classification on a
development set, and the top N feature sets
were retained for the next iteration. The
LDA threshold for each classifier was set to
maintain a false rejection rate of 2% on the
development set. The procedure terminated
when no additional improvement was
found.
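The search is a beam-style greedy procedure. A sketch using scikit-learn's LDA on synthetic data (the beam width and the use of raw accuracy in place of the 2% false-rejection operating point are simplifications):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 6)); y_train = rng.integers(0, 2, 200)
    X_dev = rng.normal(size=(100, 6)); y_dev = rng.integers(0, 2, 100)

    def score(features):
        """Fit LDA on the training set; score on the development set."""
        lda = LinearDiscriminantAnalysis()
        lda.fit(X_train[:, features], y_train)
        return lda.score(X_dev[:, features], y_dev)

    beam, best = [[]], (0.0, [])
    while True:
        candidates = [(score(fs + [f]), fs + [f])
                      for fs in beam
                      for f in range(X_train.shape[1]) if f not in fs]
        if not candidates:
            break
        candidates.sort(reverse=True)
        beam = [fs for _, fs in candidates[:3]]   # keep the top N sets
        if candidates[0][0] <= best[0]:
            break                                 # no further improvement
        best = candidates[0]
    print(best)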
The best feature set achieved 60%
correct rejection on the development and
test sets using a set of 14 features automati-
cally selected using the Fisher criterion. The
false rejection rate on the test set increased
slightly to 3%. Figure 5 shows the probabil-
ity distributions for the accept and reject
classes of the development set.

Figure 5. Distribution of the Fisher linear discriminant for the development set, showing the accept and reject classes and the rejection threshold.
Decision Tree Classification

One problem with the Fisher classifier is
that the different features are combined
into a single measure, making it more
difficult to understand the importance of
and interaction between the individual
features. We decided it would be informa-
tive to see how individual features could be
used to partition the feature space. The
basic idea was to split off sets of tokens,
where the split set had a high percentage of
accept or reject tokens (i.e., high purity).
Regression trees were created by
searching for the feature vector which could
split a node (i.e., data subset) to meet purity
and size requirements. After a node had
been split, all remaining nodes which did
not meet the purity requirement were
merged together for subsequent splitting. If
there was no split which met both the
purity and minimum size requirements, the
purity constraints were relaxed and the
search was repeated. A development set was
used for cross-validation purposes.
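A sketch of the purity-driven split search over single-feature thresholds, with arbitrary purity and size requirements:

    def purity(labels):
        """Fraction of the majority class among accept/reject labels."""
        return max(labels.count("accept"),
                   labels.count("reject")) / len(labels)

    def best_split(data, min_size=3, min_purity=0.9):
        """data: list of (feature_vector, label) tokens. Search every
        axis and threshold for an acceptably pure, large enough split."""
        best = None
        for axis in range(len(data[0][0])):
            for threshold in sorted({x[axis] for x, _ in data}):
                side = [label for x, label in data if x[axis] <= threshold]
                if len(side) >= min_size and purity(side) >= min_purity:
                    candidate = (purity(side), len(side), axis, threshold)
                    best = max(best, candidate) if best else candidate
        return best   # (purity, size, axis, threshold), or None

    data = [((0.1,), "reject"), ((0.2,), "reject"), ((0.3,), "reject"),
            ((0.8,), "accept"), ((0.9,), "accept")]
    print(best_split(data))   # -> (1.0, 3, 0, 0.3)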
While the resulting regression tree
structure could be used directly as a
classifier for utterance rejection, there was a
greater degradation in performance when
moving from development to test sets, when
compared to our initial Fisher classifier
experiments. However, we did find the
regression trees useful for identifying
features which were useful for confidence
scoring.
PEGASUS: Flight Departure/Arrival/Gate Information System
Stephanie Seneff, Joseph Polifroni and Philipp Schmid
Given the success of our JUPITER telephone-
only system for weather reports, we have
decided to develop a similar system for
information about the dynamic status of
flights. This is particularly timely because
several sites have recently been introduced
on the Web that maintain such informa-
tion. Basically, each major U.S. carrier
(United, American, Delta) offers its own
site, and there are also generic sites such as
thetrip.com and Microsoft's Expedia. We
believe that the flight domain is signifi-
cantly more challenging than weather in
terms of inheritance needs and dialogue
management.
We currently have an end-to-end system
in place, although its performance is not yet
adequate for general utilization. The
recognizer vocabulary currently contains
about 600 words, and the class bigram has
been trained on a small set of sentences that
were created manually. The system is able
to answer questions about the expected
departure or arrival time of specific flights,
and can also answer on-time requests and
provide gate information for some of the
carriers. We have built into it a dialogue
management system based on the BIANCA
dialogue manager that supports mixed-
initiative dialogue control, with PEGASUS
asking the user directly for information
necessary to access the various Web sites. As
is usually the case for our systems, the user
is not required to answer the system's
questions. An example dialogue (text input)
in PEGASUS is given in Table 7.
The PEGASUS system has some difficulty
in accurately recognizing flight numbers.
We have implemented a mechanism to filter
the N-best list based on any flight numbers
that were explicitly listed in a prior re-
sponse. This is implemented using the
same mechanism that was used to filter on
lists of cities provided by JUPITER, and is
another example of dialogue context
influencing recognition output. We believe
this mechanism will be important for the
overall success of the system.
Context tracking is a fairly challenging
aspect of the PEGASUS domain. Users often
assume that attributes from the past should
carry over to the current query, but it is
often ambiguous as to how much of the
context to retain. For example, if the user
asks, "Is there a United flight?", should any
prior restrictions on departure time be
remembered? When the user provides an
airline and flight number in a new query,
source and destination should be retained
only if they are consistent, but they are necessary
when the flight exists as multiple legs. Users
can also refer pronominally to linguistically
embedded entities, as in the sequence:
"What is the arrival gate of United flight
nine?", followed by "Is it on time?" Alterna-
tively, obligatory roles can be elided, as in
"What is the arrival gate?" where an implied
flight-in-focus is not mentioned at all.
Aspects of inheritance have appeared in
PEGASUS that have led us to expand the
capabilities of the generic discourse
component.
We believe PEGASUS will soon be
sufficiently robust to enter a phase of
wizard-based data collection. In parallel, we
will collect read speech using our Web-based
telephone data collection facility. Once
there is a corpus of a few thousand sen-
tences, we can train the recognizer on these
data, and hopefully the system will then be
adequate for preliminary release to the
public, via a toll-free number.
USER: Are there any flights from Chicago to Dallas this morning
PEGASUS: Can you provide an approximate departure time or arrival time?
USER: Leaving around nine
PEGASUS: I have found 5 flights from Chicago to Dallas that leave around 9 o'clock a m. American Airlines flight 2209, American Airlines flight 2311, United Airlines flight 1781, American Airlines flight 2213, and United Airlines flight 359. Please select one.
USER: Will the third one be on time
PEGASUS: United Airlines flight 1781 from Chicago is estimated to arrive in Dallas Fort Worth at 11:09 am local time, 8 minutes ahead of schedule. It's currently 150 miles north east of Dallas Fort Worth, flying at 24000 feet with a speed of 502 miles per hour. Is there something else I can do for you?
USER: How about the first one
PEGASUS: American Airlines flight 2209 from Chicago landed in Dallas Fort Worth at 10:12 am local time. Is there something else?
USER: What gate did it arrive at
PEGASUS: American Airlines flight 2209 is arriving in Dallas Fort Worth at gate 11. What else?
USER: Nothing more thank you
PEGASUS: Thanks for using Pegasus! Have a nice day!
Table 7. Sample dialogue (typed input) between a user and the new PEGASUS system.
Using Aggregation to Improve the Performance of Mixture Gaussian Acoustic Models
T.J. Hazen and Andrew Halberstadt
This work [1] investigates the use of
aggregation as a means of improving the
performance and robustness of mixture
Gaussian models. This technique produces
models that are more accurate and more
robust to different test sets than traditional
cross-validation using a development set. In
[1], a theoretical justification for this
technique is presented, along with experi-
mental results in phonetic classification,
phonetic recognition, and word recognition
tasks on the TIMIT and Resource Manage-
ment databases. In speech classification and
recognition tasks error rate reductions of up
to 12+AFw-% were observed using this
technique. A method for pruning the
aggregated models is also presented. This
pruning method reduces computational
requirements while retaining some of the
benefits of aggregation.
Aggregation of probabilistic classifiers is
performed by averaging the outputs of a set
of independently trained classifiers. The
proof in [1] demonstrates that an aggregate
classifier is guaranteed to exhibit an error
metric which is equal to or better than the
average error metric of the individual
classifiers used during aggregation. This
technique is not restricted to mixture
Gaussian classifiers, but rather is valid for
any type of probabilistic classifier. The
proof is also completely independent of the
test data being presented to the classifier.
Thus, the method is robust because it
improves performance regardless of the test
set being used.
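In its simplest form, aggregation is just an average of the class posteriors produced by the independently trained models; the numbers below are illustrative.

    import numpy as np

    def aggregate(posteriors):
        """Average the (n_classes,) posterior vectors of N classifiers."""
        return np.mean(posteriors, axis=0)

    # Three independently trained classifiers scoring the same segment:
    p1 = np.array([0.6, 0.3, 0.1])
    p2 = np.array([0.4, 0.5, 0.1])
    p3 = np.array([0.7, 0.2, 0.1])
    combined = aggregate([p1, p2, p3])
    print(combined, combined.argmax())   # -> [0.567 0.333 0.1] and class 0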
To demonstrate the effectiveness of
aggregation, Table 8 presents results
obtained on three different tasks, phonetic
classification, phonetic recognition, and
word recognition. All of the experiments
utilize the SUMMIT [2] speech recognition
system. This system utilizes mixture
Gaussian acoustic models to score segment-
based phonetic units. For each experiment,
24 individual sets of acoustic models were
independently trained. These 24 individual
trials were then utilized to create sets of
independent models for each aggregation
trial. Thus, the models from 24 individual
training trials are used to create 12 2-fold
aggregation trials, 8 3-fold aggregation
trials, etc., until only one trial of 24-fold
aggregation is performed. In each case,
aggregation was found to produce signifi-
cant improvements over standard training
techniques, with observed error rate
reductions of up to 12%. Aggrega-
tion performs particularly well when the
standard training techniques produce
models that are overfit to the training data.

Table 8. Average accuracy of M trials of N-fold model aggregation for the tasks of context-independent phonetic classification on TIMIT (CI-PC1 and CI-PC2), context-dependent phonetic recognition on TIMIT (CD-PR), and context-dependent word recognition on Resource Management (CD-WR).

                        Average Error Rate (%)
                  M=24     M=6      M=1
Task    Test Set  N=1      N=4      N=24     % Error Reduction
CI-PC1  dev       21.2     20.0     19.6     7.1
CI-PC1  core      22.1     20.7     20.2     8.3
CI-PC2  core      23.2     21.3     20.4     12.0
CD-PR   dev       28.1     27.3     27.0     3.6
CD-PR   core      29.3     28.4     28.1     4.0
CD-WR   test      4.5      4.2      4.0      12.0
There are a large number of possible
generalizations and extensions of this work.
In a sense, model aggregation is equivalent
to linear interpolation of models, where the
interpolation is performed using equal
weights. Thus, any situation where models
are interpolated can be viewed as a form of
aggregation. This encompasses common
techniques such as the interpolation of
speaker-independent and gender-specific
models, or the interpolation of context-
independent and context-dependent
models. In recent experiments using SUMMIT,
aggregation was fruitfully applied to models
trained using different segmentations of the
speech signal. Aggregation is attractive
because of its simplicity, and because
empirical evidence indicates its effectiveness
in reducing the error rate in typical speech
classification and recognition tasks.
References

[1] T. J. Hazen and A. K. Halberstadt, "Using Aggregation to Improve the Performance of Mixture Gaussian Acoustic Models," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Seattle, WA, May 1998.

[2] J. Glass, J. Chang, and M. McCandless, "A Probabilistic Framework for Feature-based Speech Recognition," Proc. International Conference on Spoken Language Processing, pp. 2277-2280, Philadelphia, PA, September 1996.
BIANCA: A Dialogue Management Engine for PEGASUS
Philipp Schmid, Stephanie Seneff and Joseph Polifroni
The purpose of a dialogue manager is to
provide for each dialogue turn in a human-
computer interaction the system's half of the
conversation, including verbal, tabular, and
pictorial responses. In addition to answering
the user's request, the dialogue manager can
also initiate clarification sub-dialogues to
resolve ambiguities stemming either from
recognition errors (e.g., indicated by low
confidence scores for individual words),
from insufficient user-provided information
to answer the request, or from semantic
considerations (London, England or
London, Kentucky). The dialogue manager
needs to implement policies to deal with a
variety of dialogue situations: how to
present the answer to the user, especially in
the case of display-less system interaction
where only a limited amount of information
can be provided to the user per turn, or
what to do if the requested information is
not available. Additionally, it should
provide dialogue-context-dependent
assistance upon request or detection of
difficulties in the interaction with the user.
In previous experience, the develop-
ment of dialogue management components
for conversational systems has been slow
and cumbersome, partly because of the
complex nature of the dialogue activities
and difficulties in visualizing the control
flow of the processing steps which are often
embedded in nested if-then-else statements
and/or nested function calls. Each change
to the control flow required a
recompilation, and visualization was done
typically using source-code debuggers. In
addition, a separate executable was needed
for each domain (e.g., JUPITER, PEGASUS and
CityGuide).
We have recently begun the develop-
ment of a new dialogue management engine
called BIANCA. The goal was to be able to
describe the dialogue plan in terms of
elementary functions acting on state
variables. Hence BIANCA is driven by external
tables containing the domain-dependent
specifications expressed in a high-level
specification language. As a side effect of
this new engine, we are able to use a visual
debugger that clearly illustrates the control
flow (see Figure 6).

Figure 6. Graphical user interface for BIANCA, showing the debugger, rules, functions, and state variables (src, dest, airline, flight#, sql, nfound, atime, agate), each marked as "don't care," "exists/has value," or "not set."
There are two important high-level
concepts in BIANCA: variables and functions.
In the case of the flight arrival and depar-
ture information domain (PEGASUS), the set
of variables (key/value pairs) captures three
essential types of state: (a) the user's
intentions (e.g., "What is the arrival gate?",
"Will the flight be on time?"), (b) the values
of variables essential to the domain (flight
number or airline name), and (c) variables
describing the state of the interaction with
external data sources (e.g., the number of
items retrieved from a database). The values
of these variables implicitly define the state
of the dialogue. Initial values are derived
from the input frame-in-context (using
GENESIS). The functions perform important
elementary tasks such as database retrieval
or clarification requests. In the process of
execution they change the state of one or
more of the variables. The order in which
these functions are executed is determined
by a set of rules (pre-conditions) that are
expressed as boolean expressions using the
keys and values of the variables (see Table 9
for an example of variables and rules for a
small PEGASUS domain). Each function can
be triggered by one or more rules and one
or more post-conditions can hold true (e.g.,
either zero, one, less than three, or at least
three flights were retrieved from the data-
base). If, in addition to the mandatory pre-
conditions, we can also specify all the post-
conditions and their respective probabili-
ties, we can statically determine the optimal
dialogue strategy (e.g., which question to ask
the user in order to narrow down the
number of flights under consideration).
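A minimal sketch of such a table-driven control loop, mirroring the PEGASUS fragment of Table 9 below; the function bodies are illustrative stand-ins for the real database retrieval and generation steps.

    # Each rule pairs a pre-condition over the state variables with a
    # function that updates the state; bodies are illustrative stand-ins.

    def compose_query(state):
        state["query"] = ("select arrival_gate from flights where "
                          f"airline={state['airline']} and "
                          f"number={state['flight_number']}")

    def evaluate_query(state):
        state["num_found"] = 1        # stand-in for database retrieval
        state["gate"] = "78"

    def speak_gate(state):
        state["reply"] = (f"Flight {state['flight_number']} is expected "
                          f"to arrive at gate {state['gate']}.")

    RULES = [
        (lambda s: "airline" in s and "flight_number" in s
                   and "query" not in s, compose_query),
        (lambda s: "query" in s and "num_found" not in s, evaluate_query),
        (lambda s: s.get("num_found") == 1 and "reply" not in s, speak_gate),
    ]

    state = {"airline": "UAL", "flight_number": "9", "arrival_gate": "which"}
    while "reply" not in state:       # fire rules until a reply is produced
        function = next(f for cond, f in RULES if cond(state))
        function(state)
    print(state["reply"])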
In parallel to the above effort, we are
reorganizing the generic domain-server code
to support a shared data structure that
captures all the important information
necessary to complete each dialogue turn.
This structure is intended to provide a
universal framework for dialogue manage-
ment, at least in the context of database-
access activities. The dialogue management
operates in essentially an object-oriented
formalism, in that all functions operate only
on the contents of the dialogue manage-
ment structure. Any high-level functionality
that is essential for all domains is
provided in a server-shell, which should
expedite the process of porting to a new
domain. We envision ultimately a single
domain-server with separate libraries of
functions specializing in specific domains
such as JUPITER and PEGASUS.
Variables:

  airline, flight_number, from_airport, to_airport, query, num_found, arrival_gate

Rules:

  airline & flight_number               --> compose_query      --> query
  airline & !flight_number              --> need_flight_number --> reply
  query & !num_found                    --> evaluate_query     --> num_found
  num_found > 1                         --> speak_multiple     --> reply
  num_found = 0                         --> speak_no_info      --> reply
  num_found = 1 & arrival_gate "which"  --> speak_gate         --> reply

User: "Which gate will United flight nine arrive at?"

Initial configuration:

  airline: "UAL"
  flight_number: "9"
  arrival_gate: "which"

execute compose_query:

  airline: "UAL"
  flight_number: "9"
  arrival_gate: "which"
  query: "select arrival_gate from * where airline=UAL and flight_number=9"

execute evaluate_query:

  airline: "UAL"
  flight_number: "9"
  arrival_gate: "which"
  query: "select arrival_gate from * where airline=UAL and flight_number=9"
  num_found: 1

execute speak_gate:

  airline: "UAL"
  flight_number: "9"
  arrival_gate: "which"
  query: "select arrival_gate from * where airline=UAL and flight_number=9"
  num_found: 1
  reply: "United flight nine is expected to arrive in San Francisco at gate 78"
Table 9. PEGASUS example.
ANGIE-Based Pronunciation Server
Aarati Parmar and Stephanie Seneff
Research in extending the ANGIE system has
been progressing along four fronts: 1) using
ANGIE for sophisticated prosodic modelling,
2) using ANGIE in a variety of ways in speech
recognition tasks, 3) acquiring a large
lexicon labelled according to ANGIE’s
conventions through semi-automatic means,
and 4) developing a pronunciation server
based on the ANGIE formalism. The first
three efforts are being conducted as student
theses [1,2,3] (cf. pages 40, 52, and 62). The first
fourth project was advanced mainly through
the efforts of Aarati Parmar in a summer
staff position, with assistance from Helen
Meng and Stephanie Seneff.
The goal of the summer project was to
develop an ANGIE server to provide pronun-
ciations and/or syllabifications for English
words. The pronunciations could be
provided either in a convention appropriate
for SUMMIT or the standard ANGIE phonemic
convention, and as a side-effect ANGIE’s
“morph” analysis [1] could be returned as
well. The training data for the pronuncia-
tion server were acquired from available
resources, using JUPITER's vocabulary as a
model for SUMMIT conventions, and using
the on-line COMLEX lexicon (more than
63,000 words, excluding proper nouns and
abbreviations) for providing letter-to-sound
rules. The conventions for COMLEX are
different from those of either SUMMIT or
ANGIE, so a set of rules defining a mapping
from COMLEX to ANGIE aided the acquisi-
tion process. We also attempt to utilize our
predefined “morph” lexicon to enhance
accuracy.
Table 10 shows some examples of
different transcription conventions for
words containing the morph unit “-tion” in
SUMMIT, ANGIE, and COMLEX. Notice that “-
tion” is realized as “sh ax n” in SUMMIT, as
“sh! en” in ANGIE, and as “S.In” in
COMLEX.
The operational flow is a series of
backoff steps as follows:
i) Given a word spelling, first look to see
if it already exists in our dictionary. If so,
return the pronunciation.
ii) Otherwise, look up the word-morph
inventory (we currently have 48K words
decomposed into morphs) to get a morph
decomposition. If lookup fails, parse with
letter terminals into a hypothesized morph
sequence.
iii) Having acquired the morph se-
quence, look up the morph pronunciations
file (each morph is labelled with a phonetic
transcription in this file, using SUMMIT
conventions). Then concatenate the morph
pronunciations together to get a full word
pronunciation.
iv) Apply rewrite rules to handle
phonological variations at morph bound-
aries, e.g., deletable stops in SUMMIT or
pluralization rules.
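A skeletal rendering of this backoff chain in Python, with toy dictionaries standing in for the word dictionary, the word-morph inventory, and the morph pronunciation file (the letter-terminal parser of step ii is reduced to a stub, and the rewrite rules of step iv are omitted):

    WORD_DICT = {"boston": "b ao s t ax n"}
    WORD_MORPHS = {"connections": ["con-", "nec+", "-tion", "-s"]}
    MORPH_PRON = {"con-": "k ax n", "nec+": "n eh kd",
                  "-tion": "sh ax n", "-s": "z"}

    def parse_letters(word):
        """Stub for the letter-terminal parse used as a last resort."""
        raise NotImplementedError(word)

    def pronounce(word):
        word = word.lower()
        if word in WORD_DICT:                    # (i) dictionary lookup
            return WORD_DICT[word]
        morphs = WORD_MORPHS.get(word)           # (ii) morph inventory,
        if morphs is None:                       #      else letter parse
            morphs = parse_letters(word)
        prons = [MORPH_PRON[m] for m in morphs]  # (iii) concatenate
        return " ".join(prons)                   # (iv) rewrite rules omitted

    print(pronounce("connections"))   # -> k ax n n eh kd sh ax n z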
The pronunciation provided by ANGIE
for each morph in the morph lexicon has
been transformed from ANGIE format to
appropriate canonical SUMMIT phonemes
using a set of automatically generated ANGIE-
phoneme-to-SUMMIT-phoneme mapping
rules. The basis for such generation, as
mentioned above, is the 600-word vocabu-
lary in the JUPITER domain. By utilizing the
intermediate morph representation, we
guarantee consistency in the pronunciations
for alternate words that share the same
morph label.
This project has not yet been com-
pleted, as we expect the server to be
especially useful for generating pronuncia-
tions for novel proper nouns such as place
names. We believe it should be possible to
train it on a large set of place names
obtained semiautomatically from the Web.
Table 10. Examples of lexical entries from SUMMIT, COMLEX, and ANGIE, showing differing conventions. Note: "!" stands for onset position, and "+" is a stress marker.

SUMMIT:  connections     k ax n eh kd sh ax n z
         restriction     r ax s t r ih kd sh ax n

COMLEX:  additions       .xd'IS.Inz
         interrogations  .Int+Er.xg'eS.In

ANGIE:   corruption      k! er r! ah+ p sh! en
         reflection      r! iy f! l eh+ k sh! en
References

[1] A. Parmar, A Semi-Automatic System for the Syllabification and Stress Assignment of Large Lexicons, M.Eng. thesis, MIT Department of Electrical Engineering and Computer Science, May 1997.

[2] R. Lau, Subword Lexical Modelling for Speech Recognition, Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, March 1998.

[3] G. Chung, Hierarchical Duration Modelling for a Speech Recognition System, MIT Department of Electrical Engineering and Computer Science, May 1997.