Comparing Synthesized versus Pre-Recorded Tutor Speech
in an Intelligent Tutoring Spoken Dialogue System
Kate Forbes-Riley and Diane Litman and Scott Silliman and Joel Tetreault
Learning Research and Development Center, University of Pittsburgh
Outline
Overview
System and Corpora
Evaluation Metrics and Methodology
Results
Conclusions, Future Work
Overview: Motivation
Intelligent tutoring systems adding speech capabilities
(e.g. LISTEN Reading Tutor, SCoT, AutoTutor)
Enhance communication richness, increase effectiveness
Question: What is the relationship between the quality of the speech technology and system effectiveness?
Is pre-recorded tutor voice (costly, inflexible, human) more effective than synthesized tutor voice (cheaper, flexible, non-human)?
If not, put effort elsewhere in system design!
Overview: Recent Work (mixed results)
Math Tutor System: (Non-)Visual tutor with prerecorded voice always rated higher and yielded deeper learning. (Atkinson et al., 2005)
Instructional Plan Tutor System: Pre-recorded voice always rated more engaging. Non-visual tutor: prerecorded voice yields more motivation. Visual tutor: synthesized voice yields more motivation. (Baylor et al., 2003)
Smart-Home System: More natural-sounding voice preferred. Characteristics (effort, pleasantness) more important than type (Moller et al., 2006)
Overview: Our Study
Two Tutoring System Versions: pre-recorded tutor voice, synthesized tutor voice (non-visual tutor)
Evaluate Effectiveness: student learning, system usability, dialogue efficiency across corpora (subsets)
Hypothesis: more human-sounding voice will perform better
Results: tutor voice quality has only a minor impact
Does not impact learning
May impact usability and efficiency: in certain corpora subsets, pre-recorded preferred; in others, synthesized
Intelligent Tutoring Spoken Dialogue System
• Back-end: text-based Why2-Atlas system (VanLehn et al., 2002)
2 ITSPOKE 2005 Corpora
Pre-Recorded voice: paid voice talent, 5.85 hours of audio, 25 hours of time (at $120/hr)
Synthesized voice: Cepstral text-to-speech system voice of “Frank” for $29.95
Corpus #Students #Dialogues
PR 28 140
SYN 29 145
Example of 2 ITSPOKE Tutor Voices
TUTOR TURN: Right. Let's now analyze what happens to the keys. So what are the forces acting on the keys after they are released? Please, specify their directions (for instance, vertically up).
ITSPOKE Pre-Recorded Tutor Voice (PR)
ITSPOKE Synthesized Tutor Voice (SYN)
Experimental Procedure
Paid subjects w/o college physics recruited via UPitt ads:
Read a small background document
Took a pretest
Worked 5 training problems (dialogues) with ITSPOKE
Took a posttest
Took a User Satisfaction Survey
ITSPOKE User Satisfaction Survey
S1. It was easy to learn from the tutor.
S2. The tutor interfered with my understanding of the content.
S3. The tutor believed I was knowledgeable.
S4. The tutor was useful.
S5. The tutor was effective in conveying ideas.
S6. The tutor was precise in providing advice.
S7. The tutor helped me to concentrate.
S8. It was easy to understand the tutor.
S9. I knew what I could say or do at each point in the conversations with the tutor.
S10. The tutor worked the way I expected it to.
S11. Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly.
ALMOST ALWAYS (5), OFTEN (4), SOMETIMES (3), RARELY (2), ALMOST NEVER (1)
Evaluation Metrics
Student Learning Gains
SLG (standardized gain): posttest score – pretest score
NLG (normalized gain): (posttest score – pretest score) / (1 – pretest score)
Dialogue Efficiency
TOT (time on task): total time over all 5 dialogues (min.)
System Usability
S# (S1 – S11): score for each survey statement
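The two learning-gain metrics above can be sketched in a few lines of Python (function names are ours, not from the paper; test scores are assumed to be normalized to [0, 1]):

```python
def standardized_gain(pretest, posttest):
    """SLG: raw difference between posttest and pretest scores."""
    return posttest - pretest

def normalized_gain(pretest, posttest):
    """NLG: gain as a fraction of the room left for improvement.

    Undefined for a perfect pretest score (division by zero).
    """
    if pretest == 1.0:
        raise ValueError("NLG is undefined when the pretest score is 1.0")
    return (posttest - pretest) / (1.0 - pretest)

# Example: a student scoring 0.4 before tutoring and 0.7 after
# realizes half of the possible improvement (SLG = 0.3, NLG = 0.5).
slg = standardized_gain(0.4, 0.7)
nlg = normalized_gain(0.4, 0.7)
```

NLG rescales the gain by how much the student could still improve, so a high-pretest student is not penalized for having little headroom.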
Evaluation Methodology
For each of 14 evaluation metrics, compute a 2-tailed t-test over the student set in each corpus, for 13 student sets:
All students (PR and SYN)
Students who may be more susceptible to tutor voice quality, based on 3 criteria:
Highest/High/Low/Lowest Time on Task
Highest/High/Low/Lowest Pretest Score
Highest/High/Low/Lowest Word Error Rate
• High/Low Partition: criterion median in corpora
• Highest/Lowest Partition: cutoffs above/below median
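The High/Low partition above can be sketched as a median split on the criterion (e.g. pretest score). This is a minimal illustration with invented student IDs and scores; the slides do not specify how ties at the median were assigned, so here ties go to High:

```python
from statistics import median

def high_low_partition(scores):
    """Split students at the criterion median: High >= median, Low < median."""
    m = median(scores.values())
    high = {s for s, v in scores.items() if v >= m}
    low = set(scores) - high
    return high, low

# Hypothetical pretest scores for five students
pretest = {"s1": 0.30, "s2": 0.45, "s3": 0.50, "s4": 0.62, "s5": 0.80}
high, low = high_low_partition(pretest)
# median is 0.50, so s3, s4, s5 land in High and s1, s2 in Low
```

The Highest/Lowest sets would use stricter cutoffs above and below this median, shrinking the groups to the students most extreme on the criterion (exact thresholds are not given in the slides).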
Student Learning Results
No significant difference (p < .05) in learning gains (SLG or NLG) for any of the 13 student sets
No trend for a significant difference (p < .10) in learning gains (SLG or NLG) for any of the 13 student sets
Students learned significantly in both conditions (p=.000)
Dialogue Efficiency Results
Metric  Student Set      PR Mean    SYN Mean   p
TOT     Highest Pretest  100.9 (6)  121.9 (9)  .09
Most knowledgeable SYN students may take more time to read transcript (PR most efficient)
PR voice marginally slower than SYN voice (e.g. in our example, PR = 13 seconds, SYN = 10 seconds)
System Usability Results (1)
Metric  Student Set  PR Mean    SYN Mean   p
S3      All          3.50 (28)  3.00 (29)  .05
S3      Highest TOT  3.40 (10)  2.64 (11)  .06
S3      High WER     3.43 (14)  2.64 (14)  .07
S3. The tutor believed I was knowledgeable: more human-like qualities attributed to more human voice (PR preferred)
System Usability Results (2)
Metric  Student Set  PR Mean    SYN Mean   p
S11     High WER     1.86 (14)  2.64 (14)  .08
S11     Highest WER  1.50 (6)   2.71 (7)   .06
S11. Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly: more consistent with experience, not too human (SYN preferred)
System Usability Results (3)
Metric  Student Set  PR Mean    SYN Mean   p
S2      Low WER      2.36 (14)  1.93 (14)  .08
S2. The tutor interfered with my understanding of the content: when voice & WER human-like, students notice inflexible NLU/G (SYN preferred)
Summary
Evaluated impact of pre-recorded vs. synthesized tutor voice on system effectiveness in ITSPOKE
Student Learning Results: no impact
Dialogue Efficiency Results: little impact
Highest Pretest students took less time with PR (trend)
System Usability: little impact, mixed voice preference
All, High Word Error Rate, and Highest TOT students felt PR believed them more knowledgeable (sig., trends)
Low Word Error Rate students felt SYN interfered less (trend)
High(est) Word Error Rate students preferred SYN for regular use (trend)
Conclusions and Future Work
Tutor voice quality has minimal impact in ITSPOKE
Why is text-to-speech sufficient in the ITSPOKE context?
Transcript dilutes impact of voice
Students have time to get used to voice
Future Work:
Show transcript after tutor speech, or not at all
Extend survey: how often is transcript read / how much effort to understand voice (Moller et al., 2006)
Try using other voices