Three’s a Crowd? Effects of a Second Human on Vocal Accommodation with a Voice Assistant
Eran Raveh, Ingo Siegert, Ingmar Steiner, Iona Gessinger, Bernd Möbius
September 19, 2019
Universität des Saarlandes
Vocal accommodation
Here defined as mutual phonetic changes occurring over time during an interaction
⇑ Instantiation of the communication accommodation theory (CAT) (Giles et al., 1991; Gallois and Giles, 2015)
Occurs naturally in various human-human interaction (HHI) settings (e.g., Bailly and Lelong, 2010; Babel et al., 2014)
Voice accommodation in HCI
Although human-computer interaction (HCI) currently lacks the mutuality of accommodation, effects can still be found (e.g., Staum Casasanto et al., 2010; Benuš et al., 2018)
Moreover, accommodation can be partially simulated in computers (e.g., Levitan et al., 2016; Raveh et al., 2018)
⇒ Demonstrated in both HHI and HCI separately. But –
Do people accommodate differently and/or to a different extent when interacting with humans and computers simultaneously?
Previous study
Do participants show different vocal behavior when talking to alternating computer and human addressees? Yes (Raveh et al., 2019)
[Figure: density of pitch (100–200) by context (DD, HD) and speaker (Alexa, confederate, participant)]
Interactions with significant difference:
pitch (f0)    intensity    articulation rate (AR)
74 %          89 %         13 %
Accommodative and non-accommodative temporal patterns emerged in both conditions.
Voice Assistant Conversation Corpus (VACC) (Siegert et al., 2018)
27 (14 female) German native speakers interacting with a confederate and an Amazon Echo Dot device (w/ Alexa voice)
2 scenarios carried out in solo and confederate conditions: Calendar – schedule a meeting; Quiz – answer trivia questions
27 × 2 × 2 = 108 interactions; ∼13500 utterances; > 17 h
[Diagram: solo condition (participant, Alexa) and confederate condition (participant, confederate, Alexa)]
Method
Dataset
All 108 interactions from VACC
Turn times and speaker taken from annotations
Excluded turns annotated as cross-talk, off-talk, laughter, etc.
Analyses
Distributional – two-sample t-test of the measures between computer-directed speech in solo and confederate conditions
Categorical – check the contribution of additional factors
(Temporal – compare trend changes of the measures over time)
Features
Target features
Fundamental frequency (f0) – mean pitch
Intensity – mean intensity
Articulation rate (AR) – ratio of syllables to phonation time (De Jong and Wempe, 2009)
Feature extraction
Turns were sliced into non-overlapping segments of 2 s duration (+ remainder), separated by speaker
Features were measured in each slice separately
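The slicing step can be sketched as follows. This is a minimal illustration, not the authors' extraction script: the function names `slice_turn` and `articulation_rate` are hypothetical, and it assumes that the remainder of a turn forms its own, shorter final slice.

```python
import math
from typing import List, Tuple


def slice_turn(start: float, end: float, slice_len: float = 2.0) -> List[Tuple[float, float]]:
    """Split one turn [start, end] into consecutive, non-overlapping slices.

    Assumption: the remainder becomes a final slice shorter than slice_len.
    """
    if end <= start:
        return []
    n_slices = math.ceil((end - start) / slice_len)
    slices = []
    for i in range(n_slices):
        s = start + i * slice_len
        slices.append((s, min(s + slice_len, end)))
    return slices


def articulation_rate(n_syllables: int, phonation_time: float) -> float:
    """Articulation rate as the ratio of syllables to phonation time (in seconds)."""
    if phonation_time <= 0:
        raise ValueError("phonation_time must be positive")
    return n_syllables / phonation_time


# A 5-second turn yields two full slices and one 1-second remainder.
print(slice_turn(0.0, 5.0))
```

Each slice would then be passed to whatever tool measures f0, intensity, and AR; that measurement step is outside this sketch.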
[Figure: interaction setup of the previous study vs. this study]
Results – distributional
              any order (%)   solo first (%)   conf. first (%)
pitch (f0)    67              72               60
intensity     67              76               56
AR            30              31               28
Table: Percentage of solo-confederate conversation pairs with a significant difference between the distributions of the features in them (two-sample Wilcoxon test; α = 0.05).
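How such a percentage could be computed is sketched below. The slides name both a two-sample t-test (Method) and a two-sample Wilcoxon test (table caption); this sketch follows the caption. It is a stdlib-only approximation under stated assumptions: the p-value uses the normal approximation of the rank-sum statistic with average ranks for ties and no tie or continuity correction, and the names `rank_sum_p` and `percent_significant` are hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple


def rank_sum_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided p-value of the two-sample Wilcoxon rank-sum test.

    Assumption: normal approximation, no tie or continuity correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and share their average rank.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12.0) ** 0.5
    z = (u - mean_u) / sd_u
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def percent_significant(pairs: List[Tuple[Sequence[float], Sequence[float]]],
                        alpha: float = 0.05) -> float:
    """Share (in %) of (solo, confederate) pairs whose distributions differ."""
    hits = sum(1 for solo, conf in pairs if rank_sum_p(solo, conf) < alpha)
    return 100.0 * hits / len(pairs)


# Toy data: one clearly different pair, one identical pair.
low = [100.0 + i for i in range(20)]
high = [200.0 + i for i in range(20)]
print(percent_significant([(low, high), (low, low)]))
```

For real analyses a statistics library with exact or tie-corrected p-values would be preferable; the hand-rolled version only keeps the example self-contained.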
Results – additional factors
Total #features: 324 (108 × 3) ⇒ 14% converged (agg. > mean + sd)
[Figure: number of overall converged features (0, 1, 2) by factor: sex (male, female), condition (solo, confederate), task (Quiz, Calendar), order (first, second)]
sex ⇒ 24% difference (p < 0.01)
condition ⇒ 20% difference (p < 0.05)
sex + solo condition ⇒ 30% difference (p = 0.01)
Motivation and application
With voice-activated products and conversation-based services, multi-party interactions are more likely to occur
Conversation is a complex activity that involves many components which humans learn naturally but which are hard to build into computers
⇒ Modeling different human behaviors can improve conversational TTS by producing speech output that is both dynamic and responsive
Summary
Large overall difference in phonetic accommodation between human-computer and human-human-computer interactions
Higher degrees of accommodation due to precedence: more distributional differences when soloing first; more aggregated accommodation in the first task
Additional factors: female participants generally accommodated more; more accommodation in the solo condition; no influence of the performed task
⇓
Three IS a crowd – but its effect varies
Open questions
1. Would the effects be stronger with an accommodative system?
2. Could the system’s gender affect the user’s behavior?
Thank you
References I
Molly Babel, Grant McGuire, Sophia Walters, and Alice Nicholls. Novelty and social preference in phonetic accommodation. Laboratory Phonology, 5(1):123–150, February 2014. doi:10.1515/lp-2014-0006.
Gérard Bailly and Amélie Lelong. Speech dominoes and phonetic convergence. In Interspeech, pages 1153–1156, Makuhari, Chiba, Japan, September 2010. URL https://www.isca-speech.org/archive/interspeech_2010/i10_1153.html.
Štefan Benuš, Marian Trnka, Eduard Kuric, Lukáš Marták, Agustín Gravano, Julia Hirschberg, and Rivka Levitan. Prosodic entrainment and trust in human-computer interaction. In International Conference on Speech Prosody, pages 220–224, Poznan, Poland, June 2018. doi:10.21437/SpeechProsody.2018-45.
Nivja H De Jong and Ton Wempe. Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods, 41(2):385–390, May 2009. doi:10.3758/BRM.41.2.385.
Cindy Gallois and Howard Giles. Communication accommodation theory. In Karen Tracy, Cornelia Ilie, and Todd Sandel, editors, The International Encyclopedia of Language and Social Interaction, pages 1–18. Wiley, 2015. doi:10.1002/9781118611463.wbielsi066.
Howard Giles, Nikolas Coupland, and Justine Coupland. Accommodation theory: Communication, context, and consequence. In Howard Giles, Justine Coupland, and Nikolas Coupland, editors, Contexts of Accommodation: Developments in Applied Sociolinguistics, pages 1–68. Cambridge University Press, 1991. doi:10.1017/CBO9780511663673.001.
References II
Rivka Levitan, Stefan Benus, Ramiro H Gálvez, Agustín Gravano, Florencia Savoretti, Marian Trnka, Andreas Weise, and Julia Hirschberg. Implementing acoustic-prosodic entrainment in a conversational avatar. In Interspeech, pages 1166–1170, San Francisco, CA, USA, September 2016. doi:10.21437/Interspeech.2016-985.
Eran Raveh, Ingmar Steiner, Iona Gessinger, and Bernd Möbius. Studying mutual phonetic influence with a web-based spoken dialogue system. In Alexey Karpov, Oliver Jokisch, and Rodmonga Potapova, editors, 20th International Conference on Speech and Computer (Specom), volume 11096 of Lecture Notes in Artificial Intelligence, pages 552–562. Springer, September 2018. doi:10.1007/978-3-319-99579-3_57. URL https://arxiv.org/abs/1809.04945.
Eran Raveh, Ingmar Steiner, Ingo Siegert, Iona Gessinger, and Bernd Möbius. Comparing phonetic changes in computer-directed and human-directed speech. In 30th Conference on Electronic Speech Signal Processing (ESSV), pages 42–49, Dresden, Germany, March 2019.
Ingo Siegert, Julia Krüger, Olga Egorow, Jannik Nietzold, Ralph Heinemann, and Alicia Lotz. Voice Assistant Conversation Corpus (VACC): A multi-scenario dataset for addressee detection in human-computer-interaction using Amazon’s ALEXA. In Workshop on Language and Body in Real Life & Multimodal Corpora, Miyazaki, Japan, May 2018. URL http://lrec-conf.org/workshops/lrec2018/W20/pdf/13_W20.pdf.
Laura Staum Casasanto, Kyle Jasmin, and Daniel Casasanto. Virtually accommodating: Speech rate accommodation to a virtual interlocutor. In 32nd Annual Meeting of the Cognitive Science Society (CogSci 2010), pages 127–132, Portland, OR, USA, August 2010. URL http://csjarchive.cogsci.rpi.edu/proceedings/2010/papers/0020/.
Additional factors – non-strict
[Figure: number of overall converged features (0, 1, 2, 3) by factor: sex (male, female), condition (solo, confederate), task (Quiz, Calendar), order (first, second)]
Additional factors – full analysis
Percentages of differences between the categories of each factor. Total number of features: 324 (108 interactions × 3 features)
threshold           conv. features   ∆Sex    ∆Order   ∆Condition   ∆Task
relaxed (mean)      47%              6%      0%       2%           4%
strict (mean+sd)    14%              24%**   2%       20%*         2%
* = p < 0.05, ** = p < 0.01
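The two thresholds in the table can be illustrated with a short sketch. It assumes one aggregated accommodation value per feature and uses the sample standard deviation; neither the aggregation itself nor the choice of sample versus population sd is specified on the slides, and the function name `converged_flags` is hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def converged_flags(agg_values: Sequence[float], strict: bool = True) -> List[bool]:
    """Mark features whose aggregated value exceeds the threshold.

    strict=True  -> threshold is mean + sd (sample sd, an assumption)
    strict=False -> threshold is the mean (relaxed criterion)
    """
    m = mean(agg_values)
    threshold = m + stdev(agg_values) if strict else m
    return [v > threshold for v in agg_values]


# Toy values, not data from the study.
values = [1.0, 2.0, 3.0, 5.0, 6.0, 13.0]
for label, strict in (("relaxed", False), ("strict", True)):
    flags = converged_flags(values, strict=strict)
    print(label, f"{100.0 * sum(flags) / len(flags):.0f}% converged")
```

As in the table, the strict criterion always marks a subset of what the relaxed one marks, so its share of converged features can only be equal or lower.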
Temporal analysis
Changes in the participant’s part in overall accommodation (bottom half)
[Figure: pitch trends by slice number (1 slice = 2 seconds) in the solo condition and the confederate condition; speakers: participant, Alexa; change categories: divergence, no change / synchrony, convergence]
(1) $\mathrm{change}_t = -\Delta_{t,t-1}\,\lvert S_{\mathrm{part}} - S_{\mathrm{Alexa}} \rvert$
(2) $\mathrm{accomm}(\mathrm{participant})_t = \mathrm{change}_t - \Delta_{t,t-1}\,S_{\mathrm{Alexa}}$
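Equations (1) and (2) can be traced with a small sketch. It reads Δ(t, t−1) as the difference between the value at slice t and the value at slice t−1, and it assumes that both speakers have one feature value per slice index; both readings are assumptions, and `temporal_accommodation` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def temporal_accommodation(s_part: Sequence[float],
                           s_alexa: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (change, accomm) for slices t = 1 .. n-1.

    change_t = -( |S_part - S_Alexa| at t  minus  the same distance at t-1 )
    accomm_t = change_t - ( S_Alexa at t  minus  S_Alexa at t-1 )
    A positive change_t means the distance shrank (convergence).
    """
    if len(s_part) != len(s_alexa):
        raise ValueError("both series need one value per slice")
    change, accomm = [], []
    for t in range(1, len(s_part)):
        dist_now = abs(s_part[t] - s_alexa[t])
        dist_prev = abs(s_part[t - 1] - s_alexa[t - 1])
        c = -(dist_now - dist_prev)
        change.append(c)
        accomm.append(c - (s_alexa[t] - s_alexa[t - 1]))
    return change, accomm


# Participant pitch moves toward a constant Alexa pitch: pure participant convergence.
print(temporal_accommodation([200.0, 190.0, 180.0], [150.0, 150.0, 150.0]))
```

When only Alexa moves toward the participant, for example `[200.0, 200.0]` against `[150.0, 160.0]`, the change is positive but the participant's part comes out as zero, which matches the intent of separating the two contributions.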
Previous study – temporal
[Figure: pitch (100–300) by slice number (sliceNo, 0–300) in the DD and HD contexts; speakers: Alexa, confederate, participant]
Web-based system
Responsive SDS
ASR: Automatic speech recognition
NLU: Natural language understanding
DM: Dialogue management
NLG: Natural language generation
TTS: Text-to-speech synthesis
ASP: Additional speech processing
[Diagram: the components are connected by links labeled audio, text, semantics, signal, and features]
Online/offline paths