
ORIGINAL ARTICLE

Interactions between human–human multi-threaded dialogues and driving

Andrew L. Kun · Alexander Shyrokov · Peter A. Heeman

Received: 14 March 2011 / Accepted: 17 July 2011

© Springer-Verlag London Limited 2012

Pers Ubiquit Comput, DOI 10.1007/s00779-012-0518-1

A. L. Kun (corresponding author) · A. Shyrokov
Electrical and Computer Engineering Department, University of New Hampshire, Kingsbury Hall, Durham, NH 03824, USA
e-mail: [email protected]
A. Shyrokov
e-mail: [email protected]

P. A. Heeman
Biomedical Engineering Department, Center for Spoken Language Understanding, Oregon Health and Science University, Beaverton, OR 97006, USA
e-mail: [email protected]

Abstract In-car devices with speech user interfaces are proliferating. How can we build these interfaces such that they allow human–computer interactions with multiple devices to overlap in time, but without interfering with the driving task? We suggest that interface design can be inspired by the way people deal with this problem in human–human dialogues, and we propose discovering human dialogue behaviors of interest through experiments. In this paper, we discuss how to design an appropriate human–human dialogue scenario for such experiments. We also report on one human–human experiment, in terms of the dialogue behaviors found and the impact on the verbal tasks and on driving. Finally, we offer design considerations based on the results of the study.

Keywords Speech user interfaces · Multi-threaded dialogue · Spoken task · Driving simulator

1 Introduction

While driving, people still want to interact with a computer to accomplish various tasks such as getting navigation information, selecting music, or reading email or text messages. Interacting with a graphical user interface to accomplish these tasks can be dangerous, as it requires taking the driver's hands and eyes away from the manual-visual task of driving. Speech interaction with a computer does not require the use of the hands and eyes, with the possible exception of operating a push-to-talk button. Thus, speech is a viable human–computer interaction mode in a vehicle.

However, using speech as an interaction mode while driving needs to be done with care. For example, a number of researchers found that conversing on a mobile phone degrades driving performance, even when the phone is used in hands-free mode [1]. In our own work, we found evidence that certain characteristics of a speech user interface, for example, a low recognition rate [2], can negatively influence driving performance.

Speech interfaces have been used for performing a single task at a time, where the user finishes with one task before moving on to the next one. However, real-time tasks might require the user's interactions on different tasks to overlap in time. For instance, a police officer might need to be alerted to a nearby accident while accessing a database during a traffic stop, or a driver might need driving instructions while reviewing appointments in a calendar. We refer to the speech interaction about each individual task as a dialogue thread and say that together they constitute a multi-threaded dialogue [3]. The dialogue threads can overlap with each other in time. Our long-term goal is to build a spoken dialogue system that allows the user to complete task-oriented multi-threaded spoken dialogues without negatively impacting performance on the driving task.

To build such a system, we need to know what types of dialogue behaviors the system should engage in. We propose that we should identify these behaviors by observing the way humans manage dialogues between drivers and one or more remote or co-present conversants. The behaviors can be characterized by different utterance types, pauses, and speaking rates, as well as higher-level dialogue reasoning. We can also expect to find patterns in how conversants alter their speech as the driving difficulty changes. The hypothesis that we can use behaviors observed in human–human dialogues between a driver and a co-present conversant is supported by evidence that the presence of a passenger, and thus very likely conversing with a passenger, reduces the probability of a collision [4].

In this paper, we focus on laying the groundwork for in-car speech user interfaces capable of carrying out multi-threaded spoken dialogues with drivers. After an overview of related research, we first propose a set of considerations that researchers should take into account when designing spoken tasks aimed at exploring human–human dialogue behaviors. Next, we report on the results of an experiment exploring human–human behaviors in multi-threaded dialogues while one of the conversants is operating a simulated vehicle. We discuss the effects of driving on dialogue behaviors and the effects of dialogue on driving. Finally, we discuss design implications for in-car speech systems as well as future research directions.¹

¹ Most of this work is derived from the second author's Ph.D. dissertation [5].

2 Background

In our previous work, we studied how people manage multi-threaded dialogues. As part of that work, we explored different verbal tasks to use. In our first experiments, we had participants interact with an actual spoken dialogue system [6]. Due to the complexity involved in building a functional system, participants completed simple tasks with the computer, including addition, circular rotation of number sequences, discovery of short letter sequences, and category-matching word detection. However, participants did not find these tasks engaging, and the resulting dialogues did not seem to capture the complexity of behaviors we expected to see in more realistic tasks. In some of the pilot experiments, we tried to motivate participants by telling them they were playing a game and that their goal was to solve as many tasks as possible; however, even this did not seem to help.

To make the tasks more engaging and realistic, we turned our attention to human–human dialogues. We gave conversants an ongoing task in which they had to work together to form a poker hand [3]. Each conversant had three cards, and they took turns drawing and discarding a fourth card. Conversants could not see each other, nor could they see the cards in each other's hands. They communicated via headsets and used speech to share what cards they had and what poker hand to try for. Periodically, one of the conversants was prompted to solve a real-time task, that of determining whether the other conversant had a certain picture displayed on her screen. The urgency of the real-time task was an experimental variable: conversants were given either 10, 25, or 40 s to complete it. To make the task engaging, conversants received points for each completed poker hand and each picture task. We found that this setup elicited both rich collaboration for the card game [7] and interesting task management. The problem is that this setup, with the ongoing task having a minor manual-visual component, is not representative of the types of tasks that are of interest to us, in which the ongoing task is exclusively verbal and the user is also engaged in the separate manual-visual task of driving.

For our next study, we used a navigation problem as the ongoing task [8], inspired by the Map Task experiments [9]. One conversant (the driver) operated a simulated vehicle and a second (the dispatcher) helped the driver navigate city streets. The conversants could not see each other and communicated via headsets. Unknown to the dispatcher, some of the city streets were blocked by construction barrels, and so the driver was unable to follow some of the dispatcher's instructions. The conversants thus collaborated to find an alternate route. Periodically, the driver was prompted to initiate a short real-time task with the dispatcher. As in the poker-playing task, the prompt included information about the urgency of the real-time task. Although this setup elicited rich task management behavior, participants did not seem to build up discourse context as they conversed. This is because the verbal component of the navigation task is more like a series of separate short real-time tasks. Thus, these verbal tasks might lack the difficulty that longer verbal tasks might pose.

Villing et al. [10] performed human–human multi-threaded dialogue experiments in a real car on city roads. The driver and a passenger were given a navigation task and a memory task. The participants were not restricted in how to complete these tasks. Video recordings of the participants and the road were taken. The authors found that drivers and passengers used specific Swedish cue phrases in only 17% and 12% of topic shifts, respectively. In further research, Lindstrom et al. [11] looked at speech disfluency rates as a function of cognitive load. The authors found that when the driver is under high cognitive load, the passenger's disfluency rate decreases. This indicates that the passenger makes an attempt to be particularly clear and concise when he perceives that the driver is in a difficult situation. In subsequent work, Villing [12] found that passengers do not interrupt or resume dialogue threads when the driver is subject to high driving-induced workload. This line of research utilized multiple modalities for driver-passenger communication and focused on natural language features. In contrast, our research focuses on a single modality of interaction between the participants and has more structured tasks.

Vollrath [13], in work on single-threaded dialogues, investigated the influence of spoken tasks on driving performance by examining a number of different studies. He used Wickens' multiple resource model [14] as the framework to process the data from the studies. One of his conclusions was that, to minimize the effects of verbal tasks on driving, the tasks must be simple and short. However, tasks that are of interest to users are often neither short nor simple. We are interested in learning how humans manage complex dialogues and applying the lessons learned to designing safe in-car human–computer interaction.

3 Designing spoken tasks

Based on our experiences with the studies described in Sect. 2, we set forth the following requirements [15] for verbal tasks to be used in human–human experiments in general and in our experiment in particular:

• Tasks are engaging. Real tasks that users want to perform will undoubtedly be engaging and will have the potential to divert users' attention from driving [14, 16].

• Tasks are relatively complex and require both participants to participate. Tasks that can be accomplished with one of the participants speaking little, or not at all, would provide little data to evaluate different dialogue behaviors in human–human spoken interaction.

• Tasks allow scoring participant performance. This will allow us to quantitatively evaluate participant performance under different driving task difficulty levels and when testing the impact of different dialogue behaviors.

• Tasks have identifiable discourse structure. This will allow us to easily analyze how dialogue conventions interact with discourse structure. In particular, tasks should give rise to adjacency pairs, such as question–answer pairs, as these are common in human–machine spoken interaction; therefore, learning more about this particular type of interaction will be valuable.

Wickens' multiple resource model [14] indicates that verbal tasks should not interfere with driving significantly. However, studies such as Strayer and Johnston's [16] clearly show that verbal tasks that are engaging have the potential to divert the user's attention from the driving task. This effect presents a challenge to Wickens' multiple resource model. Hence, as a fifth consideration, experimenters need to consider the extent of interference between the driving task and the verbal tasks.

4 Experimental setup

We now describe our study that explored multi-threaded human–human dialogues. In this study, pairs of participants (a driver and a dispatcher) were engaged in two spoken tasks, and one of the participants (the driver) also operated a simulated vehicle. One spoken task was the ongoing task, and it was periodically interrupted by another spoken task. The interruptions forced participants to switch between different dialogue threads. The two spoken tasks follow the requirements from the previous section.

The driver operated a high-fidelity driving simulator (DriveSafety DS-600c), with a 180° field of view, realistic sounds and vibrations, a full-width car cab, and a tilting motion platform that simulates acceleration and braking effects. We recorded driving performance measures such as lane position and steering wheel angle. We also recorded the drivers' eye gaze direction and pupil diameter using a Seeing Machines faceLab 4.6 stereoscopic eye tracker mounted on the dashboard.

As shown in Fig. 1, the participants could not see each other and communicated solely through headsets, and we recorded their dialogues. The dialogues were supervised by the experimenter to enforce time limits on the verbal tasks.

[Fig. 1: Driver and dispatcher]

The experiment was completed by 16 pairs of participants (32 participants). Each pair was formed by two people who had never met each other before. Participants were between 18 and 38 years of age (the average age was 24 years), and 28% were female.

4.1 Driving task

The primary task of the drivers was to follow a vehicle while driving responsibly. They drove on two-lane, 7.2-m-wide roads in daylight. The lead vehicle travelled at 89 km/h (55 mph), and it was positioned 20 m in front of the participant. There was also a vehicle 20 m behind the participant's car. No other traffic was present on the road.

The difficulty of the driving task is influenced by factors such as road type, traffic volume, and the visual demand of route following. In this research, we only manipulated the visual demand of route following. This can be accomplished by controlling the radius of curves [17]: sharper curves correspond to increased visual demand. We created routes with six straight and six curvy road segments, with straight and curvy segments alternating. Each segment was long enough for participants to complete an ongoing and an interrupting task, allowing us to evaluate task-switching behavior for the given visual demand level.

4.2 Ongoing spoken task

Our ongoing spoken task was a parallel version of twenty questions (TQ). In TQ, the questioner tries to guess a word the answerer has in mind. The questioner can only ask yes/no questions, until she is ready to guess the word. In our version, the two conversants switch roles after each question–answer pair is completed. The commonality between their roles allows us to contrast their behaviors. Words to guess were limited to a list of household items (hair dryer, TV, etc.). In order to minimize learning effects and to make the dialogue pace more realistic, we trained participants to use a question tree to guess the objects. For example, each item was in one of three rooms, thus the first fact to be established was the room where the object was located.

In parallel twenty questions (PTQ), when a participant in one TQ game ventured a guess (right or wrong), that game stopped, but the other TQ game continued until the other participant also ventured a guess. Furthermore, a PTQ game stopped if the time limit was exceeded. Thus, a PTQ game can result in (a) both participants guessing their word correctly, (b) one participant guessing correctly and the other guessing wrong, (c) one participant guessing correctly and the other running out of time, (d) one participant guessing wrong and the other running out of time, (e) both participants guessing wrong, or (f) both participants running out of time. These outcomes can be translated into scores for the individual TQ games of (1) guessed correctly, (2) guessed wrong, or (3) timed out. Also, as we wanted to focus on task switches and not on negotiations about which conversant should start, the PTQ game was always started with the dispatcher asking a question first. The words to be guessed were presented to the participants visually. We showed words to the driver just above the dashboard, which minimizes interference with driving.
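The mapping from the joint PTQ outcomes (a)–(f) to individual TQ scores is mechanical, and a short sketch may make it concrete. The code below is our own illustration, not from the paper; the string labels and the Python representation are assumptions.

```python
# A minimal sketch of the PTQ outcome space described above. The
# representation (outcome strings, label text) is ours, not the paper's.

from itertools import product

# Each individual twenty-questions (TQ) game ends one of three ways.
TQ_OUTCOMES = ("correct", "wrong", "timeout")

def ptq_outcome(driver_result: str, dispatcher_result: str) -> str:
    """Label the joint PTQ outcome (cases a-f) from the two TQ results."""
    results = sorted([driver_result, dispatcher_result],
                     key=TQ_OUTCOMES.index)
    labels = {
        ("correct", "correct"): "a: both guessed correctly",
        ("correct", "wrong"):   "b: one correct, one wrong",
        ("correct", "timeout"): "c: one correct, one timed out",
        ("wrong", "wrong"):     "e: both guessed wrong",
        ("wrong", "timeout"):   "d: one wrong, one timed out",
        ("timeout", "timeout"): "f: both timed out",
    }
    return labels[tuple(results)]

# Enumerating all combinations reproduces the six cases (a)-(f).
for d, p in product(TQ_OUTCOMES, repeat=2):
    print(f"driver={d:7s} dispatcher={p:7s} -> {ptq_outcome(d, p)}")
```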

4.3 Interrupting spoken task

Our interrupting task was a version of the last letter word game (LL). In our version of this game, a participant utters a word that starts with the last vowel or consonant of the word uttered by the other participant. For example, the first participant might say "page" and the second says "earn" or "gear." Interruptions were cued by displaying a letter to one of the participants, along with a progress bar indicating the time remaining to complete the task. The participant who received the cue then told the other participant to begin the LL game and with what letter. Participants had 30 s to name three words each. After completing this task, participants resumed the PTQ game.
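To make the rule precise, here is a small validity check for LL responses. This is a hypothetical helper of ours, written under the assumption (consistent with the "page" example) that "last vowel or consonant" means the last vowel and the last consonant appearing anywhere in the previous word.

```python
# A sketch (our own, hypothetical implementation) of the last-letter
# (LL) rule: the next word must start with the last vowel or the last
# consonant of the previous word.

VOWELS = set("aeiou")

def valid_next_word(previous: str, candidate: str) -> bool:
    """Check the LL-game rule for a candidate response."""
    prev = previous.lower()
    last_vowel = next((c for c in reversed(prev) if c in VOWELS), None)
    last_consonant = next((c for c in reversed(prev)
                           if c.isalpha() and c not in VOWELS), None)
    allowed = {c for c in (last_vowel, last_consonant) if c}
    return candidate.lower()[0] in allowed

# "page": last vowel 'e', last consonant 'g', so both responses are legal.
assert valid_next_word("page", "earn")
assert valid_next_word("page", "gear")
assert not valid_next_word("page", "time")
```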

4.4 Interruption timing

Each pair of conversants drove on 12 road segments, playing one PTQ game with one LL game interrupting it during each segment. We manipulated interruption timing, as this might impact dialogue behaviors, driving, and performance on the spoken tasks. Using a set protocol, the experimenter initiated an interruption during the PTQ game after one, two, or three question–answer pairs for the driver (and the dispatcher, who went first). Each participant was interrupted six times, twice for each of the three interruption timings. There was a random delay of 0–10 s before the initiated interruption was displayed on the participant's screen. This ensured that the experimenter did not introduce a bias into the procedure. The delay range was based on pilot experiments that indicated that it takes about 10 s to complete a question–answer pair.
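As a rough illustration of this protocol, the sketch below simulates when a cue would appear on screen. The 10 s per question–answer pair comes from the pilot estimate above; the function and variable names are our own assumptions.

```python
# A sketch of the cue-timing protocol: the experimenter initiates the
# interruption after a set number of question-answer (QA) pairs, and
# the cue is displayed after an additional uniform random 0-10 s delay.

import random

QA_PAIR_SECONDS = 10.0  # pilot estimate for one question-answer pair

def cue_display_time(qa_pairs_before_cue: int, rng: random.Random) -> float:
    """Seconds from game start until the interruption cue appears."""
    initiated_at = qa_pairs_before_cue * QA_PAIR_SECONDS
    # The random delay decouples the on-screen cue from the moment the
    # experimenter acts, so the experimenter cannot bias its timing.
    return initiated_at + rng.uniform(0.0, 10.0)

rng = random.Random(42)
for qa_pairs in (1, 2, 3):  # the three interruption timings
    print(qa_pairs, "QA pairs ->", round(cue_display_time(qa_pairs, rng), 1), "s")
```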

5 Results

The participants played a total of 384 twenty questions games (16 participant pairs × 12 segments per pair × 2 parallel games per segment = 384 games). Ideally, they would have switched to 192 last letter games (16 participant pairs × 12 segments per pair × 1 game per segment = 192 games). However, 15 interruptions were introduced after the targeted participant had completed the ongoing task (three such cases for drivers, 12 for dispatchers). Overall, we recorded 9.3 h of speech interactions with synchronized simulator and eye-tracker data. The driving and eye-tracker data were collected over 800 km travelled.


5.1 Eliciting rich dialogue behavior

The verbal interactions were engaging, as demonstrated by the example in Table 1. In this example, the two conversants successfully completed both TQ games and the interrupting LL game. As can be seen, the PTQ game runs from utterance 1 through utterance 10. During the third round of question–answer pairs, at utterance 11, the dispatcher introduces the last letter game and tells the driver to say a word that begins with B. The LL game finishes in utterance 17, after which the driver signals it is over with the word "Okay." The dispatcher then helps resume the PTQ by saying "Your turn to ask." In utterance 28, the driver guesses correctly, finishing one of the TQ games, and in utterance 30, the dispatcher also guesses correctly and ends the second TQ game.²

Table 1 Example ongoing and interrupting tasks

Utt. #  Speaker  Utterance                        Task
1       Disp.    Is it in the kitchen?            TQ
2       Driver   No.                              TQ
3       Driver   Does it have sharp edges?        TQ
4       Disp.    No.                              TQ
5       Disp.    Is it in the bathroom?           TQ
6       Driver   No.                              TQ
7       Driver   Does it produce heat?            TQ
8       Disp.    No.                              TQ
9       Disp.    Is it on the ceiling?            TQ
10      Driver   No.                              TQ
11      Disp.    Letter, word beginning with B.   LL
12      Driver   Ball.                            LL
13      Disp.    Like.                            LL
14      Driver   Kite.                            LL
15      Disp.    Time.                            LL
16      Driver   Move.                            LL
17      Disp.    Voice.                           LL
18      Driver   Okay.                            End LL
19      Disp.    Your turn to ask.                Resume TQ
20      Driver   Does it have a door?             TQ
21      Disp.    Yes.                             TQ
…                                                 TQ
28      Driver   Is it the refrigerator?          TQ
29      Disp.    Yes.                             TQ
30      Disp.    Is it the TV?                    TQ
31      Driver   Yes.                             TQ

² Some subject pairs, including the one from the dialogue excerpt in Table 1, decided that the dispatcher's first utterance would also reveal the location of the dispatcher's object. Thus, utterance 1 in Table 1 also communicates that the dispatcher's object is in the kitchen. In our study, three of the 16 participant pairs negotiated this convention during the training period.

Figure 2 shows how the TQ games ended. We see that the driver and dispatcher were usually able to correctly guess the object. TQ games were clearly challenging to both participants, but not unreasonably so.

[Fig. 2: Game outcomes]

5.2 Managing multi-threaded dialogues

Next, we explored how people manage switches between dialogue threads. In our study, participants switched from the ongoing task to the interrupting task and back. Switches to the interrupting task were cued by a letter displayed for one of the participants. Thus, we investigated the relationship between when interruptions are cued and when participants act on the cue. We also investigated how they signal the beginning and the end of the interrupting task and how they resume the ongoing task.

5.2.1 Responding to interrupting task cue

We tabulated all of the interruptions in terms of when they were cued to the participant and when the participant actually interrupted. We examined the cues and switches in terms of whether they were during a question, between the question and the answer, during the answer, or after the answer and before the next question. The results are shown in Fig. 3.

[Fig. 3: Interruption cue and switch to LL game]

First, the number of cued interruptions is not the same for each region. This is because the cues were semi-randomly distributed, and so the number of cues in each region reflects the amount of time that participants spent in the region. In fact, we see that the region during the answer is the smallest, which is due to the short length of answers in TQ, usually "yes" and "no." Second, and more importantly, we see that participants did not make the same number of interruptions in each region as the number cued. If participants tended to wait a fixed amount of time to interrupt, the interruptions should also be randomly distributed between the regions based on the lengths of the regions, and so would result in approximately the same number of switches as there are cues. This is definitely not the case. In fact, we see that, relative to the number of cues, participants tended to make more interruptions after an answer and tended to avoid making interruptions during a question or answer. This suggests that participants timed their interruptions so as to not interrupt an utterance and, to a lesser extent, to not interrupt a question–answer pair. This result is similar to our earlier finding in the navigation domain [8] and is consistent with our findings on larger discourse structures [18]. The reason for this is that interrupting in the middle of an utterance will probably result in that utterance needing to be repeated. Interrupting in the middle of a question–answer pair results in both conversants needing to remember that there is a question pending, and not being sure whether it should be repeated. Thus, this switching behavior is probably helping to minimize the impact of multi-threaded dialogues on driving.
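For readers who want to replicate this tabulation, a minimal sketch follows. The event log, region names, and values are hypothetical; only the shape of the analysis (counting cues and switches per dialogue region) mirrors what is described above.

```python
# A sketch of the cue/switch tabulation: classify each interruption cue
# and each actual switch by where it falls in the QA structure.

from collections import Counter

REGIONS = ("during_question", "between_q_and_a", "during_answer", "after_answer")

# Hypothetical event log: (cue_region, switch_region) per interruption.
events = [
    ("during_question", "after_answer"),
    ("during_answer", "after_answer"),
    ("after_answer", "after_answer"),
    ("between_q_and_a", "between_q_and_a"),
]

cues = Counter(cue for cue, _ in events)
switches = Counter(sw for _, sw in events)
for region in REGIONS:
    print(f"{region:16s} cued={cues[region]:2d} switched={switches[region]:2d}")
```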

5.2.2 Signaling the beginning of the interrupting task

Participants could have used cue words to signal switches between the ongoing and the interrupting tasks. However, in our study, this happened only four times (out of 177 interruptions). This is probably due to the game design: the last letter game includes an explicit initiation, as the participant who is cued has to tell the other what the first letter is. Thus, participants simply started the game, as in the example in Table 1, where the dispatcher just says "Letter, word beginning with B."

5.2.3 Signaling the end of the interrupting task

In contrast to beginning an LL game, ending an LL game often required signaling, as both participants have to verbally agree that the game is indeed complete. The behaviors we observed at the end of LL games were as follows:

• Explicit confirmation, such as "We are done" or "That's my three";
• Discussion, such as "Are we done?";
• Wrong confirmation, such as "We are done, oh, I have another word";
• Implicit confirmation, such as "Okay";
• No confirmation (simply resuming the ongoing task).

The relative frequencies of these behaviors are shown in Fig. 4. Explicit and implicit confirmations were most common, and drivers and dispatchers engaged in very similar behaviors, possibly accommodating to each other's style (more on accommodation in Sect. 5.3).

[Fig. 4: Behaviors at the end of LL games]

These results contrast with those found in our earlier work, in which poker playing was interrupted by picture identification [18]. In that corpus, we did not find any explicit-confirmation or discussion behavior to close the interruption task. The reason for this is that the picture identification task has a simple structure of a question followed by an answer. When the second person gives the answer, both participants know the game is finished. There is no question that they will return to the poker-playing task. With the last letter game, participants might lose track of the number of words said, or think that the other might have. The contrast between these two multi-task setups suggests that speakers reason about the need for dialogue behaviors in terms of reducing confusion about task switches.

5.2.4 Resuming the ongoing task

In switching back to the TQ game, we found that participants would sometimes actively restore the context of the task. We observed the following resumptive behaviors:

• Summary, such as "I was in the living room";
• Question, such as "Was I in the living room?";
• Reminder, such as "Yours has a door";
• Other, which does not fit within the three types above;
• No context restoration.

[Fig. 5: Resumptive behaviors for TQ game]

As can be seen in Fig. 5, approximately 25% of switches back to TQ were initiated with a resumptive behavior. This rate is similar to that observed in switching back to the poker-playing game in [18]. The similar rate is probably a coincidence, as the rate of resumptive behavior probably depends on the length and complexity of the interruption and the amount of task context that is being restored. These two aspects differ dramatically between the two multi-task setups, with the poker playing having a large context but a simple interruption, which is somewhat the opposite of the PTQ and LL games.

It is interesting to note that we did not see any significant difference in task-switching behavior depending on the road conditions, on how far into the ongoing task the interruption was cued, or between the driver and the dispatcher. This might be due to insufficient data, or because the verbal tasks were not difficult enough, or simply because human speakers are not capable of adjusting these task-switching behaviors to take into account differences in cognitive load.

5.3 Accommodation in dialogue behavior

We also explored whether the speech from each pair of speakers correlated with each other. Figures 6 and 7 show the average duration of a pause before asking a question or naming a word for all participant pairs. Pauses also include any initial fillers, such as "um" and "uh," as well as any false starts. Hence, inter-turn pauses capture the amount of time that participants needed before actually starting the content of their utterance. Even though it is clear that different participants have different pause durations, there is a significant correlation between the participants for the ongoing task (r(16) = 0.502, p = 0.048) and the interrupting task (r(16) = 0.840, p < 0.001). Furthermore, as shown in Fig. 8, the correlation between driver and dispatcher speaking rates during the interrupting task is also significant (r(16) = 0.821, p < 0.001). These plots suggest that participants adapted their speech to each other, which is consistent with the findings of Oviatt et al. [19].

[Fig. 6: Pause before asking a question in twenty questions for different participant pairs]
[Fig. 7: Pause before naming a word in last letter game for different participant pairs]
[Fig. 8: Speaking rate during last letter game for different participant pairs]
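The per-pair correlation analysis can be reproduced with a few lines of standard statistics code. The sketch below uses made-up pause values; only the analysis shape (Pearson correlation across participant pairs) follows the paper.

```python
# A sketch of the accommodation analysis: correlate per-pair mean pause
# durations of drivers and dispatchers. The data values are hypothetical.

from scipy.stats import pearsonr

# Hypothetical mean pause (s) before asking a TQ question, one value per pair.
driver_pause     = [1.2, 0.9, 1.5, 2.0, 1.1, 1.7, 0.8, 1.4]
dispatcher_pause = [1.0, 1.1, 1.4, 1.8, 1.0, 1.9, 0.9, 1.3]

r, p = pearsonr(driver_pause, dispatcher_pause)
print(f"r = {r:.3f}, p = {p:.3f}")  # accommodation would show up as r > 0
```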


The question of who accommodated to whose behavior remains open. In this experiment, dispatchers have a lower cognitive load, as they do not have a manual-visual task. Thus, we hypothesize that they are the ones accommodating to the driver's behavior. Under this hypothesis, drivers modulate the timing parameters of their own speech in order to signal to dispatchers the speed at which they can comfortably play a game. Dispatchers recognize the modulation and use similar timings in their own speech.

5.4 Effects of interruption timing

We next evaluated how the timing of the interruption influenced the twenty questions game outcome. We focused on games with correct guesses and timeouts, since dialogue behaviors in these games might be different from the ones observed in games that resulted in wrong guesses. We also focused on games in which the interruption happened after two, three, and four question–answer pairs for a given participant. We designate these interruptions Early, Middle, and Late, respectively.

We found that, while dispatchers guessed correctly in 88% of games that were interrupted early, drivers were only able to do so in 64% of their games, a significant difference (t(15) = 2.13, p < 0.05). Drivers' worse performance could be the result of the following factors: longer pauses before asking a question or before responding, more verbose questions (although not answers, since these were primarily short yes/no responses), and slower speaking rate. The data in Fig. 9 indicate that longer pauses before asking a question are a significant factor.

[Fig. 9: Pause before asking a question in the ongoing task]

The timing of the interruption had a significant effect on the average pauses before asking a question for drivers (F(2,13) = 4.86, p < 0.027) but not for dispatchers. Specifically, drivers had significantly longer pauses during the games with early interruptions, and these longer pauses must have occurred after the interruption. There was also a significant effect of longer pauses before answering questions in games with early interruptions. However, the size of this effect is much smaller than the size of the pause-before-question effect. The duration and speaking rate of questions were the same for all conditions. Therefore, the pause before asking a question was a major reason why drivers timed out in more games with early interruptions.

Why did early interruptions apparently result in longer pauses before questions for drivers? One reason could be that early interruptions somehow lessened the driver's sense of urgency compared to the middle and late interruptions. Another reason might be that, in games with early interruptions, drivers had a difficult time restoring the TQ game context after switching back from the LL game. In transcribing TQ dialogues, we considered initial fillers (e.g. "ummm") and initial disfluencies (e.g. "is it… is it in the kitchen") as part of pauses. It is possible that the rate of utterances with fillers and disfluencies is larger in games with early interruptions, which might indicate difficulty in restoring context. This difficulty in restoring the context might result in the driver needing more questions than the dispatcher to arrive at the correct answer, resulting in a higher rate of timeouts for the driver.

5.5 Effects of spoken tasks on driving

In the previous sections, we explored various dialogue behaviors in human multi-threaded dialogues when one of the conversants is also driving. In this section, we turn our attention to the crucial question of how spoken dialogues can affect driving.

Figure 10 shows the average lane position variance on different road segments before, during, and after interruptions. Statistical analysis revealed that there is a significant difference in the lane position variance when comparing measurements before, during, and after interruptions (F(2,30) = 10.0, p < 0.001), for curvy roads (F(2,30) = 6.3, p < 0.005) and for straight roads (F(2,30) = 6.3, p < 0.005). Post hoc comparison showed that on curvy roads, the lane position variance is larger during (p < 0.002) and after (p < 0.007) interruptions than before, with no significant difference between variances during and after interruptions (p < 0.175). Similarly, on straight roads, lane position variance is larger during (p < 0.002) and after (p < 0.005) than before interruptions, with no difference between variances during and after interruptions (p < 0.225).

[Fig. 10: Lane position variance]

It seems that the lane position variance on curvy and straight roads was affected similarly by the presence of interruptions (in both cases, driving performance worsened during the interruption). We attribute this effect to the increased cognitive load during and after the interrupting task.
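As a simplified sketch of this kind of comparison, the code below contrasts lane-position variance before, during, and after interruptions. The paper's analysis was a repeated-measures ANOVA with post hoc tests; for brevity, this sketch uses a plain one-way ANOVA on made-up per-pair values, so it is an approximation of the analysis shape, not the actual procedure.

```python
# Compare lane-position variance across the three interruption phases.
# All values are hypothetical; a one-way ANOVA stands in for the
# repeated-measures analysis used in the paper.

from scipy.stats import f_oneway

# Hypothetical mean lane-position variance (m^2) per participant pair.
before = [0.020, 0.025, 0.018, 0.022, 0.024, 0.019]
during = [0.031, 0.034, 0.029, 0.035, 0.030, 0.033]
after  = [0.029, 0.031, 0.027, 0.033, 0.028, 0.030]

F, p = f_oneway(before, during, after)
print(f"F = {F:.2f}, p = {p:.4f}")
```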

Our conclusion that cognitive load increased during and after the interruption is supported by participants' subjective evaluations of the tasks. After the experiment, participants rated their agreement with a number of preferential statements using a Likert scale. There were two statements that showed how participants perceived the difficulty of the spoken tasks: "Twenty Questions game was difficult" (the ongoing task) and "Last letter word game was difficult" (the interrupting task). Statistical analysis showed that the drivers and the dispatchers rated the interrupting task as significantly more difficult than the ongoing task (F(1,15) = 6.25, p < 0.002). Physiological estimates of cognitive load lend further support to this conclusion. Specifically, in other work on the same data [20], we demonstrated that on curvy segments, physiological measures based on the task-evoked pupillary response [21] track the performance measure of lane position variance shown in Fig. 10. Clearly, our participants felt that the tasks were of different difficulty levels, and this is consistent with the performance and physiological measures. It could be that the participants expected the interrupting task to be more difficult and paid more attention to it. For drivers, this might have come at the cost of somewhat neglecting the driving task. Due to the design of our study, we cannot determine whether this is due to the LL game itself or to the fact that the LL game interrupts the PTQ game. We feel that at least some of the difficulty stems from introducing an interruption and requiring participants to switch between tasks.

6 Design implications

In the introduction, we stated that using speech as a mode of interaction while driving has to be done with care. The results of our human–human study indicate that this general statement will hold for the case of multi-threaded human–computer dialogues. Our results offer clear evidence that these dialogues can negatively impact cognitive load in general, and driving performance (a measure of cognitive load) in particular. Specifically, we found that switching from an ongoing to an interrupting verbal task is associated with increased cognitive load. Furthermore, the cognitive load might not return to lower levels immediately after switching back to the ongoing task. Designers should carefully evaluate whether, under various driving conditions, switching from one verbal task to another (and back) might lead to increased driver cognitive load and decreased driving performance.

In exploring multi-threaded dialogue behaviors, we found evidence that humans plan task switches between verbal tasks so as to reduce cognitive load. When switching back to a task, they might also restore the context of that task. However, the type and frequency of the utterances in these switches are task dependent. Without these utterances, the cognitive load of completing the verbal tasks might be too high. Consequently, drivers might abandon the verbal tasks, or their driving performance might suffer. Designers should explore ways in which to endow computers with behaviors that support dialogue switches. Our results provide a starting point for exploring appropriate computer behaviors.

Our results also indicate that humans use accommodation in multi-threaded dialogues between a driver and a remote conversant. It is likely that the remote conversant (our dispatcher) is accommodating to the driver. Designers should explore ways in which a computer can detect different driver behaviors, such as pauses in speech, and mimic them in order to help reduce the driver's cognitive load. It is also possible that the computer can trigger accommodation behavior in the driver in order to reduce his or her cognitive load. If, for example, the computer detects high driver cognitive load using physiological or driving performance measures, it can modulate the timing parameters of its own speech and expect the driver to match them. This can result in slowing down the pace of the verbal tasks, and thus in reduced overall driver cognitive load.
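A speculative sketch of this accommodation-triggering idea follows. All names, thresholds, and the mapping from load to speech timing are hypothetical; the paper does not specify an implementation, only the design direction.

```python
# A sketch of load-adaptive speech pacing: if the estimated driver load
# is high, the system slows its own pacing, expecting the driver to
# converge toward it (accommodation). Parameters are illustrative.

def system_speech_params(driver_load: float) -> dict:
    """Map an estimated cognitive-load score in [0, 1] to speech timing."""
    # Stretch inter-turn pauses and slow the speaking rate as load rises.
    inter_turn_pause_s = 0.5 + 1.5 * driver_load
    speaking_rate_wpm = 180 - 60 * driver_load
    return {"pause_s": round(inter_turn_pause_s, 2),
            "rate_wpm": round(speaking_rate_wpm)}

# The load score might come from pupil diameter or lane-keeping variance.
for load in (0.1, 0.5, 0.9):
    print(load, system_speech_params(load))
```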

7 Conclusions

With the number of in-car devices growing rapidly, it is clear that designers will want to allow interactions between drivers and these devices to overlap in time, resulting in multi-threaded dialogues. In this paper, we aimed to lay the groundwork for developing in-car speech interfaces capable of handling such dialogues while allowing the driver to maintain adequate driving performance. We hypothesize that such an interface can be built by learning from the way humans (safely) handle in-car dialogues.

Thus, we proceeded to design a human–human study exploring dialogue behaviors while one conversant is operating a simulated vehicle. To this end, we first proposed requirements that the design of the verbal tasks for such a study should follow. We created two tasks that satisfy these requirements and found that they can be used to elicit a variety of dialogue behaviors between the conversants.

Evidence from driving performance measures, subjective measures, and pupil-diameter-based physiological measures indicates that engaging in multi-threaded dialogues can increase driver cognitive load. This supports the results of Vergauwe et al. [22] indicating that verbal and manual-visual processes compete for some shared resources. On the other hand, we found evidence that the driver and the remote conversant (the dispatcher) engaged in dialogue behavior (accommodation) that might allow them to collaboratively reduce the driver's cognitive load.

While we observed a variety of dialogue behaviors, we expect that in real-world dialogues in vehicles, human conversants employ a much wider array of behaviors in order to adapt to the ever-changing requirements of the road. We believe that the reason these behaviors were not observed in our study is that the options at the conversants' disposal were limited by the relative simplicity of the dialogue structures. After all, there is only so much one can do to change the flow of the last letter game! As indicated by Drews et al. [23], many other studies have a similar limitation. For future studies, this observation points to the need to employ a wider range of tasks in terms of their complexity.

Acknowledgments This work was funded by the NSF under grant IIS-0326496 and by the US Department of Justice under grants 2006DDBXK099, 2008DNBXK221, 2009D1BXK021, and 2010DDBXK226.

References

1. Horrey WJ, Wickens CD (2006) Examining the impact of cell phone conversations on driving using meta-analytic techniques. Hum Factors 48(1):196–205
2. Kun AL, Paek T, Medenica Z (2007) The effect of speech interface accuracy on driving performance. In: Proceedings of Interspeech, Antwerp, Belgium
3. Heeman PA, Yang F, Kun AL, Shyrokov A (2005) Conventions in human–human multi-threaded dialogues: a preliminary study. In: Proceedings of the international conference on intelligent user interfaces, San Diego, CA
4. Rueda-Domingo T, Lardelli-Claret P, de Luna-del-Castillo JD, Jimenez-Moleon JJ, Garcia-Martin M, Bueno-Cavanillas A (2004) The influence of passengers on the risk of the driver causing a car collision in Spain: analysis of collisions from 1990 to 1999. Accident Anal Prevent 36:481–489
5. Shyrokov A (2010) Human–human multi-threaded spoken dialogs in the presence of driving. Doctoral dissertation, University of New Hampshire
6. Shyrokov A (2006) Setting up experiments to test a multi-threaded speech user interface. Technical report ECE.P54.2006.1, University of New Hampshire
7. Toh SL, Yang F, Heeman PA (2006) An annotation scheme for agreement analysis. In: Proceedings of the 9th international conference on spoken language processing (ICSLP-06), Pittsburgh, PA
8. Shyrokov A, Kun A, Heeman P (2007) Experimental modeling of human–human multi-threaded dialogues in the presence of a manual-visual task. In: Proceedings of the 8th SIGdial workshop on discourse and dialogue, Antwerp, Belgium
9. Anderson AH, Bader M, Bard EC, Boyle E, Doherty G, Garrod S, Isard S, Kowtko J, McAllister J, Miller J, Sotillo C, Thompson H, Weinert R (1991) The HCRC map task corpus. Lang Speech 34(4):351–366
10. Villing J, Holtelius C, Larsson S, Lindstrom A, Seward A, Aberg N (2008) Interruption, resumption and domain switching in in-vehicle dialogue. In: Proceedings of GoTAL, 6th international conference on natural language processing, Gothenburg, Sweden
11. Lindstrom A, Villing J, Larsson S (2008) The effect of cognitive load on disfluencies during in-vehicle spoken dialogue. In: Proceedings of Interspeech, Brisbane, Australia
12. Villing J (2010) Now, where was I? Resumption strategies for an in-vehicle dialogue system. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden
13. Vollrath M (2007) Speech and driving: solution or problem? Intell Transp Syst 1:89–94
14. Wickens CD (2002) Multiple resources and performance prediction. Theor Issues Ergonom Sci 3(2):159–177
15. Kun AL, Shyrokov A, Heeman PA (2010) Spoken tasks for human–human experiments: towards in-car speech user interfaces for multi-threaded dialogue. In: Proceedings of the 2nd international conference on automotive user interfaces and interactive vehicular applications (AutomotiveUI '10), Pittsburgh, PA
16. Strayer DL, Johnston WA (2001) Driven to distraction: dual-task studies of simulated driving and conversing on a cellular phone. Psychol Sci 12(6):462–466
17. Tsimhoni O, Green P (1999) Visual demand of driving curves determined by visual occlusion. In: Proceedings of the vision in vehicles 8 conference, Boston, MA
18. Yang F, Heeman PA, Kun AL (2011) An investigation of interruptions and resumptions in multi-tasking dialogues. Comput Linguist 37(1)
19. Oviatt S, Darves C, Coulston R (2004) Toward adaptive conversational interfaces: modeling speech convergence with animated personas. Trans Comput Hum Interact 11:300–328
20. Palinko O, Kun AL, Shyrokov A, Heeman P (2010) Estimating cognitive load using remote eye tracking in a driving simulator. In: Proceedings of the eye tracking research and applications conference, Austin, TX
21. Beatty J (1982) Task-evoked pupillary responses, processing load and the structure of processing resources. Psychol Bull 91(2):276–292
22. Vergauwe E, Barrouillet P, Camos V (2010) Do mental processes share a domain-general resource? Psychol Sci 21(3):384–390
23. Drews FA, Pasupathi M, Strayer DL (2008) Passenger and cell phone conversations in simulated driving. J Exp Psychol Appl 14(4):392–400
