Three’s a Crowd? Effects of a Second Human on Vocal Accommodation with a Voice Assistant

Eran Raveh, Ingo Siegert, Ingmar Steiner, Iona Gessinger, Bernd Möbius

September 19, 2019

Universität des Saarlandes

Vocal accommodation

Here defined as: mutual phonetic changes occurring over time during an interaction

⇑ Instantiation of the communication accommodation theory (CAT) (Giles et al., 1991; Gallois and Giles, 2015)

Occurs naturally in various human-human interaction (HHI) settings (e.g., Bailly and Lelong, 2010; Babel et al., 2014)

Voice accommodation in HCI

Although human-computer interaction (HCI) currently lacks the mutuality of accommodation, effects can still be found (e.g., Staum Casasanto et al., 2010; Benuš et al., 2018)

Moreover, accommodation can be partially simulated in computers (e.g., Levitan et al., 2016; Raveh et al., 2018)

⇒ Demonstrated in both HHI and HCI separately. But –

Do people accommodate differently and/or to different extents when interacting with humans and computers simultaneously?

Previous study

Do participants show different vocal behavior when talking to alternating computer and human addressees? Yes (Raveh et al., 2019)

[Figure: density of pitch by context (DD, HD) and speaker (Alexa, confederate, participant)]

Interactions with significant difference

pitch (f0)    intensity    articulation rate (AR)
74 %          89 %         13 %

Accommodative and non-accommodative temporal patterns emerged in both conditions.

Voice Assistant Conversation Corpus (VACC) (Siegert et al., 2018)

27 (14 female) German native speakers interacting with a confederate and an Amazon Echo Dot device (with Alexa voice)

2 scenarios carried out in solo and confederate conditions: Calendar – schedule a meeting; Quiz – answer trivia questions

27 × 2 × 2 = 108 interactions; ∼13500 utterances; > 17 h

[Diagram: solo condition with participant and Alexa; confederate condition with participant, confederate, and Alexa]

Method

Dataset

All 108 interactions from VACC

Turn times and speaker taken from annotations

Excluded turns annotated as cross-talk, off-talk, laughter, etc.

Analyses

Distributional – two-sample t-test of the measures between computer-directed speech in solo and confederate conditions

Categorical – check the contribution of additional factors

(Temporal – compare trend changes of the measures over time)
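The distributional comparison can be sketched as follows. The Method slide names a two-sample t-test, while the results table cites a two-sample Wilcoxon test at α = 0.05; this sketch uses a rank-sum test with a normal approximation and no tie correction of the variance. The function names and the pitch values are hypothetical, and this is an illustration under those assumptions, not the authors' analysis code.

```python
from math import sqrt
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Tied values receive average ranks; the variance
    is not tie-corrected, which is an assumption of this sketch.
    """
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def percent_significant(pairs, alpha=0.05):
    """Percentage of (solo, confederate) sample pairs with p < alpha."""
    hits = sum(1 for solo, conf in pairs if rank_sum_test(solo, conf)[1] < alpha)
    return 100 * hits / len(pairs)


# Hypothetical per-slice mean pitch values of one participant's
# computer-directed speech in the two conditions.
solo = [180, 182, 185, 190, 188, 186, 184, 191, 189, 187]
conf = [200, 205, 202, 207, 210, 204, 206, 203, 208, 209]
z, p = rank_sum_test(solo, conf)
print(f"z = {z:.2f}, significant at 0.05: {p < 0.05}")
```

Applying `percent_significant` to one pair per participant and feature would give percentages of the kind reported in the distributional results.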

Features

Target features

Fundamental frequency (f0) – mean pitch

Intensity – mean intensity

Articulation rate (AR) – syllables to phonation time ratio (De Jong and Wempe, 2009)
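As a small illustration of the AR definition, the sketch below divides a syllable count by phonation time and contrasts it with speech rate, which divides by total duration including pauses. The function names and numbers are hypothetical; syllable counts and phonation times would come from a syllable-nuclei detector such as the cited Praat script.

```python
def articulation_rate(n_syllables, phonation_time):
    """Syllables per second of phonation (pauses excluded)."""
    if phonation_time <= 0:
        raise ValueError("phonation_time must be positive")
    return n_syllables / phonation_time


def speech_rate(n_syllables, total_duration):
    """Syllables per second of total duration (pauses included)."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    return n_syllables / total_duration


# Hypothetical slice: 8 syllables, 2.0 s of phonation, 2.5 s in total.
print(articulation_rate(8, 2.0))  # 4.0 syllables/s
print(speech_rate(8, 2.5))        # 3.2 syllables/s
```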

Feature extraction

Turns were sliced into non-overlapping segments of 2 s duration (+ remainder), separated by speaker

Features were measured in each slice separately
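The slicing step can be sketched as below. It assumes that "+ remainder" means the leftover part of a turn shorter than 2 s becomes its own final slice, and that turns arrive as (speaker, start, end) tuples in seconds; both are assumptions of this sketch, and the function names are hypothetical.

```python
def slice_turn(start, end, slice_len=2.0):
    """Split the interval [start, end) into consecutive, non-overlapping
    slices of slice_len seconds; the last slice holds the remainder."""
    slices = []
    k = 0
    while start + k * slice_len < end:
        lo = start + k * slice_len
        slices.append((lo, min(lo + slice_len, end)))
        k += 1
    return slices


def slice_turns(turns, slice_len=2.0):
    """Slice every turn and collect the slices separately per speaker."""
    by_speaker = {}
    for speaker, start, end in turns:
        by_speaker.setdefault(speaker, []).extend(slice_turn(start, end, slice_len))
    return by_speaker


# Hypothetical annotated turns.
turns = [("participant", 0.0, 5.0), ("Alexa", 5.5, 7.0)]
print(slice_turns(turns))
```

Each feature would then be measured on every slice independently, so a 5 s turn contributes three measurements per feature.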

Previous study: [image] vs. [image]

This study: [image] vs. [image]

Results – distributional

              any order (%)    solo first (%)    conf. first (%)
pitch (f0)    67               72                60
intensity     67               76                56
AR            30               31                28

Table: Percentage of solo-confederate conversation pairs with a significant difference between the distributions of the features in them (two-sample Wilcoxon test; α = 0.05).

Results – additional factors

Total number of features: 324 (108 × 3) ⇒ 14% converged (agg. > mean + sd)

[Figure: factors sex (male, female), condition (solo, confederate), task (Quiz, Calendar), and order (first, second), by number of overall converged features (0, 1, 2)]

sex ⇒ 24% difference (p < 0.01)

condition ⇒ 20% difference (p < 0.05)

sex + solo condition ⇒ 30% difference (p = 0.01)
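The convergence criterion "agg. > mean + sd" can be illustrated with a short sketch. It assumes that "agg." is one aggregated accommodation value per feature and interaction, that the mean and sd are taken over all such values, and that the sample standard deviation is meant; the function name and the values are hypothetical.

```python
from statistics import mean, stdev


def converged(agg_values, strict=True):
    """Flag values above mean + sd (strict) or above the mean (relaxed)."""
    threshold = mean(agg_values) + (stdev(agg_values) if strict else 0.0)
    return [v > threshold for v in agg_values]


# Hypothetical aggregated accommodation values.
agg = [1, 2, 4, 2, 1, 10, 4, 0]
print(sum(converged(agg, strict=False)), "of", len(agg), "above the mean")
print(sum(converged(agg)), "of", len(agg), "above mean + sd")
```

The strict threshold keeps only clear outliers, which is why far fewer features count as converged than under a mean-only threshold.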

Motivation and application

With voice-activated products and conversation-based services, multi-party interactions are more likely to occur

Conversation is a complex activity that involves many components, which humans learn naturally but which are hard to wire into computers

⇒ Modeling different human behaviors can improve conversational TTS by producing both dynamic and responsive speech output

Summary

Large overall difference in phonetic accommodation between human-computer and human-human-computer interactions

Higher degrees of accommodation due to precedence:
more distributional differences when soloing first
more aggregated accommodation in first task

Additional factors: females generally accommodated more; more accommodation in solo; no influence of the performed task

⇓ Three IS a crowd – but its effect varies

Open questions

1 Would the effects be stronger with an accommodative system?

2 Could the system’s gender affect the user’s behavior?

Thank you

VACC: http://www.iikt.ovgu.de/en/Research+Groups/MDS/Research/VACC.html

References

Molly Babel, Grant McGuire, Sophia Walters, and Alice Nicholls. Novelty and social preference in phonetic accommodation. Laboratory Phonology, 5(1):123–150, February 2014. doi:10.1515/lp-2014-0006.

Gérard Bailly and Amélie Lelong. Speech dominoes and phonetic convergence. In Interspeech, pages 1153–1156, Makuhari, Chiba, Japan, September 2010. URL https://www.isca-speech.org/archive/interspeech_2010/i10_1153.html.

Štefan Benuš, Marian Trnka, Eduard Kuric, Lukáš Marták, Agustín Gravano, Julia Hirschberg, and Rivka Levitan. Prosodic entrainment and trust in human-computer interaction. In International Conference on Speech Prosody, pages 220–224, Poznan, Poland, June 2018. doi:10.21437/SpeechProsody.2018-45.

Nivja H De Jong and Ton Wempe. Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods, 41(2):385–390, May 2009. doi:10.3758/BRM.41.2.385.

Cindy Gallois and Howard Giles. Communication accommodation theory. In Karen Tracy, Cornelia Ilie, and Todd Sandel, editors, The International Encyclopedia of Language and Social Interaction, pages 1–18. Wiley, 2015. doi:10.1002/9781118611463.wbielsi066.

Howard Giles, Nikolas Coupland, and Justine Coupland. Accommodation theory: Communication, context, and consequence. In Howard Giles, Justine Coupland, and Nikolas Coupland, editors, Contexts of Accommodation: Developments in Applied Sociolinguistics, pages 1–68. Cambridge University Press, 1991. doi:10.1017/CBO9780511663673.001.

Rivka Levitan, Stefan Benus, Ramiro H Gálvez, Agustín Gravano, Florencia Savoretti, Marian Trnka, Andreas Weise, and Julia Hirschberg. Implementing acoustic-prosodic entrainment in a conversational avatar. In Interspeech, pages 1166–1170, San Francisco, CA, USA, September 2016. doi:10.21437/Interspeech.2016-985.

Eran Raveh, Ingmar Steiner, Iona Gessinger, and Bernd Möbius. Studying mutual phonetic influence with a web-based spoken dialogue system. In Alexey Karpov, Oliver Jokisch, and Rodmonga Potapova, editors, 20th International Conference on Speech and Computer (Specom), volume 11096 of Lecture Notes in Artificial Intelligence, pages 552–562. Springer, September 2018. doi:10.1007/978-3-319-99579-3_57. URL https://arxiv.org/abs/1809.04945.

Eran Raveh, Ingmar Steiner, Ingo Siegert, Iona Gessinger, and Bernd Möbius. Comparing phonetic changes in computer-directed and human-directed speech. In 30th Conference on Electronic Speech Signal Processing (ESSV), pages 42–49, Dresden, Germany, March 2019.

Ingo Siegert, Julia Krüger, Olga Egorow, Jannik Nietzold, Ralph Heinemann, and Alicia Lotz. Voice Assistant Conversation Corpus (VACC): A multi-scenario dataset for addressee detection in human-computer-interaction using Amazon’s ALEXA. In Workshop on Language and Body in Real Life & Multimodal Corpora, Miyazaki, Japan, May 2018. URL http://lrec-conf.org/workshops/lrec2018/W20/pdf/13_W20.pdf.

Laura Staum Casasanto, Kyle Jasmin, and Daniel Casasanto. Virtually accommodating: Speech rate accommodation to a virtual interlocutor. In 32nd Annual Meeting of the Cognitive Science Society (CogSci 2010), pages 127–132, Portland, OR, USA, August 2010. URL http://csjarchive.cogsci.rpi.edu/proceedings/2010/papers/0020/.

Additional factors – non-strict

[Figure: factors sex (male, female), condition (solo, confederate), task (Quiz, Calendar), and order (first, second), by number of overall converged features (0, 1, 2, 3)]

Additional factors – full analysis

Percentages of differences between the categories of each factor. Total number of features: 324 (108 interactions × 3 features)

                      conv. features    ∆Sex     ∆Order    ∆Condition    ∆Task
relaxed (mean)        47%               6%       0%        2%            4%
strict (mean + sd)    14%               24%**    2%        20%*          2%

* = p < 0.05    ** = p < 0.01

Temporal analysis

Changes in participant’s part in overall accommodation (bottom half)

[Figure: pitch trends over slice number (1 slice = 2 seconds) in the solo condition and the confederate condition; speakers: participant, Alexa; change categories: divergence, no change / synchrony, convergence]

(1) $\mathrm{change}_t = -\Delta_{t,t-1}\,\lvert S_{\mathrm{part}} - S_{\mathrm{Alexa}} \rvert$

(2) $\mathrm{accomm}(\mathrm{participant})_t = \mathrm{change}_t - \Delta_{t,t-1}\,S_{\mathrm{Alexa}}$
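A minimal sketch of equations (1) and (2), read literally: it assumes that Δ_{t,t−1} X means X_t − X_{t−1} and that S_part and S_Alexa are per-slice feature values (for example mean pitch) aligned by slice index. Under that reading, a positive change value means the distance between the speakers shrank (convergence). The function name and the values are hypothetical.

```python
def accommodation_series(s_part, s_alexa):
    """Return (change, accomm) for t = 1 .. n-1.

    change_t = -(|S_part - S_Alexa|_t - |S_part - S_Alexa|_{t-1})
    accomm_t = change_t - (S_Alexa_t - S_Alexa_{t-1})
    """
    if len(s_part) != len(s_alexa):
        raise ValueError("series must have equal length")
    dist = [abs(p - a) for p, a in zip(s_part, s_alexa)]
    change = [-(dist[t] - dist[t - 1]) for t in range(1, len(dist))]
    accomm = [
        change[t - 1] - (s_alexa[t] - s_alexa[t - 1])
        for t in range(1, len(dist))
    ]
    return change, accomm


# Hypothetical per-slice pitch values.
s_part = [200, 190, 185]
s_alexa = [170, 170, 175]
change, accomm = accommodation_series(s_part, s_alexa)
print(change)  # distance shrinks by 10 at each step
print(accomm)  # the second step is reduced by Alexa's own change of 5
```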

Previous study – temporal

[Figure: pitch by slice number (sliceNo) in the DD and HD contexts; speakers: Alexa, confederate, participant]

Web-based system


Responsive SDS

ASR: Automatic speech recognition
NLU: Natural language understanding
DM: Dialogue management
NLG: Natural language generation
TTS: Text-to-speech synthesis
ASP: Additional speech processing

[Diagram: components connected by links labeled audio, text, semantics, signal, and features]

Online/offline paths
