Three’s a Crowd? Effects of a Second Human on Vocal Accommodation with a Voice Assistant Eran Raveh, Ingo Siegert, Ingmar Steiner, Iona Gessinger, Bernd Möbius September 19, 2019 UNIVERSITÄT DES SAARLANDES


Page 1: Three's a Crowd? Effects of a Second Human on Vocal

Three’s a Crowd? Effects of a Second Human on Vocal Accommodation with a Voice Assistant

Eran Raveh, Ingo Siegert, Ingmar Steiner, Iona Gessinger, Bernd Möbius

September 19, 2019

UNIVERSITÄT DES SAARLANDES

Vocal accommodation

here, defined as

Mutual phonetic changes occurring over time during an interaction

⇑ Instantiation of the communication accommodation theory (CAT) (Giles et al., 1991; Gallois and Giles, 2015)

Occurs naturally in various human-human interaction (HHI) settings (e.g., Bailly and Lelong, 2010; Babel et al., 2014)

Voice accommodation in HCI

Although human-computer interaction (HCI) currently lacks the mutuality of accommodation, effects can still be found (e.g., Staum Casasanto et al., 2010; Benuš et al., 2018)

Moreover, accommodation can be partially simulated in computers (e.g., Levitan et al., 2016; Raveh et al., 2018)

⇒ Demonstrated in both HHI and HCI separately. But –

Do people accommodate differently and/or to a different extent when interacting with humans and computers simultaneously?


Previous study

Do participants show different vocal behavior when talking to alternating computer and human addressees? (Raveh et al., 2019)

[Figure: density of pitch (x-axis: pitch, 100 to 200; y-axis: density, 0.00 to 0.12) by context (DD, HD) and speaker (Alexa, confederate, participant)]

Do participants show different vocal behavior when talking to alternating computer and human addressees? Yes (Raveh et al., 2019)

Interactions with significant difference:

pitch (f0)   intensity   articulation rate (AR)
74 %         89 %        13 %

Accommodative and non-accommodative temporal patterns emerged in both conditions.

Voice Assistant Conversation Corpus (VACC) (Siegert et al., 2018)

27 (14 female) German native speakers interacting with a confederate and an Amazon Echo Dot device (w/ Alexa voice)

2 scenarios carried out in solo and confederate conditions: Calendar – schedule a meeting; Quiz – answer trivia questions

27×2×2 = 108 interactions; ∼13500 utterances; > 17 h

Solo condition: Participant, Alexa
Confederate condition: Participant, Confederate, Alexa


Method

Dataset
All 108 interactions from VACC
Turn times and speaker taken from annotations
Excluded turns annotated as cross-talk, off-talk, laughter, etc.

Analyses
Distributional – two-sample t-test of the measures between computer-directed speech in solo and confederate conditions
Categorical – check the contribution of additional factors
(Temporal – compare trend changes of the measures over time)
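The dataset preparation described on the Method slide can be pictured roughly as follows. This is only a minimal sketch: the `Turn` record, its field names, and the exact label strings are hypothetical, since the slide does not show the VACC annotation format.

```python
from dataclasses import dataclass

# Hypothetical label strings; the actual VACC annotation scheme may differ.
EXCLUDED_LABELS = {"cross-talk", "off-talk", "laughter"}


@dataclass
class Turn:
    speaker: str      # e.g. "participant", "confederate", "alexa"
    start: float      # turn start time in seconds
    end: float        # turn end time in seconds
    label: str = ""   # annotation label, empty for regular speech


def usable_turns(turns):
    """Keep turns of positive duration that carry no excluded label."""
    return [t for t in turns
            if t.label not in EXCLUDED_LABELS and t.end > t.start]


turns = [
    Turn("participant", 0.0, 3.1),
    Turn("alexa", 3.4, 6.0),
    Turn("participant", 6.2, 7.0, label="laughter"),
    Turn("confederate", 7.5, 9.0, label="cross-talk"),
    Turn("participant", 9.2, 12.5),
]
kept = usable_turns(turns)
```

In this toy input, three of the five turns remain after filtering.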


Features

Target features
Fundamental frequency (f0) – mean pitch
Intensity – mean intensity
Articulation rate (AR) – syllables to phonation time ratio (De Jong and Wempe, 2009)

Feature extraction
Turns were sliced into non-overlapping segments of 2 s duration (+ remainder), separated by speaker
Features were measured in each slice separately
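The slicing step and the AR ratio can be illustrated with a short sketch. The function names are made up for illustration, and the actual pitch, intensity, and syllable measurements (which the slide attributes to a Praat-based procedure) are not reproduced here.

```python
import math


def slice_turn(start, end, slice_len=2.0):
    """Split [start, end) into consecutive non-overlapping slices.

    Every slice is `slice_len` seconds long except possibly the last,
    which holds the remainder of the turn.
    """
    n = math.ceil((end - start) / slice_len)
    return [(start + i * slice_len, min(start + (i + 1) * slice_len, end))
            for i in range(n)]


def articulation_rate(n_syllables, phonation_time):
    """Ratio of syllable count to phonation time (in seconds)."""
    if phonation_time <= 0:
        raise ValueError("phonation time must be positive")
    return n_syllables / phonation_time


# A 5-second turn yields two full slices and a 1-second remainder.
slices = slice_turn(0.0, 5.0)   # [(0.0, 2.0), (2.0, 4.0), (4.0, 5.0)]
rate = articulation_rate(10, 2.5)
```

Indexing the slices by an integer counter rather than accumulating a running time avoids floating-point drift over long turns.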


[Figure: two comparisons, labeled "Previous study" and "This study"]


Results – distributional

              any order (%)   solo first (%)   conf. first (%)
pitch (f0)    67              72               60
intensity     67              76               56
AR            30              31               28

Table: Percentage of solo-confederate conversation pairs with a significant difference between the distributions of the features in them (two-sample Wilcoxon test; α = 0.05).
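One way to picture how such percentages come about is to test each solo-confederate pair and count the pairs below α. The sketch below uses a Wilcoxon rank-sum test with a normal approximation, written with the standard library only; it is an illustration under that assumption, not the authors' implementation, and the sample values are invented.

```python
from statistics import NormalDist


def rank_sum_p(x, y):
    """Two-sided p-value of the Wilcoxon rank-sum test.

    Uses the normal approximation; ties get average ranks, and no tie
    or continuity correction is applied.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg = (i + 1 + j) / 2.0          # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = (n1 * n2 * (n1 + n2 + 1) / 12.0) ** 0.5
    z = (w - mean) / sd
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def percent_significant(pairs, alpha=0.05):
    """Percentage of (solo, confederate) pairs with p < alpha."""
    hits = sum(1 for solo, conf in pairs if rank_sum_p(solo, conf) < alpha)
    return 100.0 * hits / len(pairs)


# Invented per-slice pitch values for two conversation pairs.
low = [110, 112, 115, 118, 120, 121, 125, 128]
high = [130, 132, 135, 138, 140, 142, 145, 150]
share = percent_significant([(low, high), (low, low)])
```

For small samples an exact test would be preferable to the normal approximation used here.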


Results – additional factors

Total number of features: 324 (108 × 3) ⇒ 14% converged (agg. > mean + sd)

sex ⇒ 24% difference (p < 0.01)
condition ⇒ 20% difference (p < 0.05)
sex + solo condition ⇒ 30% difference (p = 0.01)

[Figure: number of overall converged features (0, 1, 2) across the factors sex (male, female), condition (solo, confederate), task (Quiz, Calendar), and order (first, second)]

Motivation and application

With voice-activated products and conversation-based services, multi-party interactions are more likely to occur

Conversation is a complex activity, which involves many components that are learned naturally by humans but are hard to wire into computers

⇒ Modeling different human behaviors can improve conversational TTS by producing both dynamic and responsive speech output


Summary

Large overall difference in phonetic accommodation between human-computer and human-human-computer interactions

Higher degrees of accommodation due to precedence:
more distributional differences when soloing first
more aggregated accommodation in first task

Additional factors: females generally accommodated more; more accommodation in solo; no influence of the performed task

⇓ Three IS a crowd – but its effect varies

Open questions

1. Would the effects be stronger with an accommodative system?
2. Could the system’s gender affect the user’s behavior?


Thank you


References

Molly Babel, Grant McGuire, Sophia Walters, and Alice Nicholls. Novelty and social preference in phonetic accommodation. Laboratory Phonology, 5(1):123–150, February 2014. doi:10.1515/lp-2014-0006.

Gérard Bailly and Amélie Lelong. Speech dominoes and phonetic convergence. In Interspeech, pages 1153–1156, Makuhari, Chiba, Japan, September 2010. URL https://www.isca-speech.org/archive/interspeech_2010/i10_1153.html.

Štefan Benuš, Marian Trnka, Eduard Kuric, Lukáš Marták, Agustín Gravano, Julia Hirschberg, and Rivka Levitan. Prosodic entrainment and trust in human-computer interaction. In International Conference on Speech Prosody, pages 220–224, Poznan, Poland, June 2018. doi:10.21437/SpeechProsody.2018-45.

Nivja H De Jong and Ton Wempe. Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods, 41(2):385–390, May 2009. doi:10.3758/BRM.41.2.385.

Cindy Gallois and Howard Giles. Communication accommodation theory. In Karen Tracy, Cornelia Ilie, and Todd Sandel, editors, The International Encyclopedia of Language and Social Interaction, pages 1–18. Wiley, 2015. doi:10.1002/9781118611463.wbielsi066.

Howard Giles, Nikolas Coupland, and Justine Coupland. Accommodation theory: Communication, context, and consequence. In Howard Giles, Justine Coupland, and Nikolas Coupland, editors, Contexts of Accommodation: Developments in Applied Sociolinguistics, pages 1–68. Cambridge University Press, 1991. doi:10.1017/CBO9780511663673.001.

Rivka Levitan, Stefan Benus, Ramiro H Gálvez, Agustín Gravano, Florencia Savoretti, Marian Trnka, Andreas Weise, and Julia Hirschberg. Implementing acoustic-prosodic entrainment in a conversational avatar. In Interspeech, pages 1166–1170, San Francisco, CA, USA, September 2016. doi:10.21437/Interspeech.2016-985.

Eran Raveh, Ingmar Steiner, Iona Gessinger, and Bernd Möbius. Studying mutual phonetic influence with a web-based spoken dialogue system. In Alexey Karpov, Oliver Jokisch, and Rodmonga Potapova, editors, 20th International Conference on Speech and Computer (Specom), volume 11096 of Lecture Notes in Artificial Intelligence, pages 552–562. Springer, September 2018. doi:10.1007/978-3-319-99579-3_57. URL https://arxiv.org/abs/1809.04945.

Eran Raveh, Ingmar Steiner, Ingo Siegert, Iona Gessinger, and Bernd Möbius. Comparing phonetic changes in computer-directed and human-directed speech. In 30th Conference on Electronic Speech Signal Processing (ESSV), pages 42–49, Dresden, Germany, March 2019.

Ingo Siegert, Julia Krüger, Olga Egorow, Jannik Nietzold, Ralph Heinemann, and Alicia Lotz. Voice Assistant Conversation Corpus (VACC): A multi-scenario dataset for addressee detection in human-computer-interaction using Amazon’s ALEXA. In Workshop on Language and Body in Real Life & Multimodal Corpora, Miyazaki, Japan, May 2018. URL http://lrec-conf.org/workshops/lrec2018/W20/pdf/13_W20.pdf.

Laura Staum Casasanto, Kyle Jasmin, and Daniel Casasanto. Virtually accommodating: Speech rate accommodation to a virtual interlocutor. In 32nd Annual meeting of the Cognitive Science Society (CogSci 2010), pages 127–132, Portland, OR, USA, August 2010. URL http://csjarchive.cogsci.rpi.edu/proceedings/2010/papers/0020/.

Additional factors – non-strict

[Figure: number of overall converged features (0, 1, 2, 3) across the factors sex (male, female), condition (solo, confederate), task (Quiz, Calendar), and order (first, second)]

Additional factors – full analysis

Percentages of differences between the categories of each factor.
Total number of features: 324 (108 interactions times 3 features)

                   conv. features   ∆Sex    ∆Order   ∆Condition   ∆Task
relaxed (mean)     47%              6%      0%       2%           4%
strict (mean+sd)   14%              24%**   2%       20%*         2%

* = p < 0.05   ** = p < 0.01
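The relaxed and strict criteria can be read as two thresholds on an aggregated accommodation score per feature. The sketch below assumes that the mean and standard deviation are taken over all scores and that the sample standard deviation is meant; neither detail is stated on the slide, and the scores are invented.

```python
from statistics import mean, stdev


def converged(scores, strict=True):
    """Flag each aggregated score as converged.

    relaxed: score > mean of all scores
    strict:  score > mean + one (sample) standard deviation
    """
    threshold = mean(scores) + (stdev(scores) if strict else 0.0)
    return [s > threshold for s in scores]


def percent_converged(scores, strict=True):
    """Percentage of scores counted as converged under the criterion."""
    flags = converged(scores, strict=strict)
    return 100.0 * sum(flags) / len(flags)


# Invented aggregated accommodation scores, for illustration only.
scores = [1, 2, 3, 6, 8]
relaxed_share = percent_converged(scores, strict=False)
strict_share = percent_converged(scores, strict=True)
```

As in the table, the strict criterion always flags a subset of what the relaxed one flags, so its share of converged features is lower or equal.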

Temporal analysis

Changes in participant’s part in overall accommodation (bottom half)

[Figure: pitch trends over slice number (1 slice = 2 seconds) in the solo condition and the confederate condition, for the speakers participant and Alexa; changes are marked as divergence, no change / synchrony, or convergence]

(1) $\mathrm{change}_t = -\Delta_{t,t-1}\,\lvert S_{\mathrm{part}} - S_{\mathrm{Alexa}} \rvert$

(2) $\mathrm{accomm}(\mathrm{participant})_t = \mathrm{change}_t - \Delta_{t,t-1}\,S_{\mathrm{Alexa}}$
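Equations (1) and (2) can be traced on a toy series. The sketch reads Δ with indices t, t−1 as the difference between a quantity at slice t and at slice t−1, and S as a speaker's feature value per slice; both readings are assumptions, and the pitch values are invented.

```python
def first_diff(values):
    """Difference between each value and its predecessor."""
    return [b - a for a, b in zip(values, values[1:])]


def change(s_part, s_alexa):
    """Eq. (1): negated change in the absolute participant-Alexa distance.

    Positive values mean the two speakers moved closer together.
    """
    dist = [abs(p - a) for p, a in zip(s_part, s_alexa)]
    return [-d for d in first_diff(dist)]


def accomm_participant(s_part, s_alexa):
    """Eq. (2): change_t minus Alexa's own change from t-1 to t."""
    return [c - da
            for c, da in zip(change(s_part, s_alexa), first_diff(s_alexa))]


# Invented per-slice pitch values.
s_part = [200, 190, 185]
s_alexa = [180, 180, 182]
ch = change(s_part, s_alexa)                # [10, 7]
acc = accomm_participant(s_part, s_alexa)   # [10, 5]
```

Here the distance shrinks from 20 to 10 to 3, so both change values are positive; in the second step part of the shrinkage is subtracted again because Alexa's own value moved by 2.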

Previous study – temporal

[Figure: pitch (100 to 300) over slice number (sliceNo, 0 to 300) in the DD and HD contexts, for the speakers Alexa, confederate, and participant]

Web-based system

Responsive SDS

ASR: Automatic speech recognition
NLU: Natural language understanding
NLG: Natural language generation
DM: Dialogue management
TTS: Text-to-speech synthesis
ASP: Additional speech processing

[Figure: diagram of the components above, with connections labeled audio, text, semantics, signal, and features]

Online/offline paths