text versus speech: a comparison of tagging input modalities for camera phones
DESCRIPTION
Speech and typed text are two common input modalities for mobile phones. However, little research has compared them in their ability to support annotation and retrieval of digital pictures on mobile devices. In this paper, we report the results of a month-long field study in which participants took pictures with their camera phones and had the choice of adding annotations using speech, typed text, or both. Subsequently, the same subjects participated in a controlled experiment where they were asked to retrieve images based on annotations, as well as retrieve annotations based on images, in order to study the ability of each modality to effectively support users' recall of the previously captured pictures. Results demonstrate that each modality has advantages and shortcomings for the production of tags and retrieval of pictures. Several guidelines are suggested for designing tagging applications for portable devices.
TRANSCRIPT
Text vs. Speech: A Comparison of Tagging Input Modalities for Camera Phones
Research & Development
Mauro Cherubini, Xavier Anguera, Nuria Oliver, and Rodrigo de Oliveira
people do not want to tag their pictures
intro → hypotheses → methodology → results → implications
research question:
Assuming that users are willing to input at least one tag, which input
modality best supports the production of tags and the retrieval of pictures?
hypothesis 1
Speech is preferred to text as an annotation mechanism on mobile
phones (objective measure)
Support: - Mitchard and Winkles (2002)
hypothesis 1-bis
Speech annotations are preferred by users even if this means spending more time on the task (subjective measure)
Support: - Perakakis and Potamianos (2008)
hypothesis 2
The longer the tag, the larger the advantage of voice over text for
annotating pictures on mobile phones
Support: - Hauptmann and Rudnicky (1990)
hypothesis 3
Retrieving pictures on mobile phones with speech is not faster than with text
(objective measure)
Support: - Mills et al. (2000)
the user study
field study (4 weeks)
controlled experiment
T1 - T2 - T3 - T4
3 experimental conditions:
a. Speech only
b. Text only
c. Speech and Text
MAMI
features of MAMI
• processing is done entirely on the mobile phone
• speech is not transcribed
• to compare the waveforms of the audio tags, MAMI uses the Dynamic Time Warping (DTW) algorithm
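As a rough illustration of how untranscribed audio tags can be matched with Dynamic Time Warping, here is a minimal sketch. It is not MAMI's implementation: the plain lists of numbers standing in for audio features, the absolute-difference cost, and the names `dtw_distance` and `nearest_tag` are all assumptions made for this example.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW cost between two 1-D sequences (illustrative only)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # repeat b[j-1] against a new element of a
                cost[i][j - 1],      # repeat a[i-1] against a new element of b
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


def nearest_tag(query, stored):
    """Return the name of the stored tag with the lowest DTW cost to the query."""
    return min(stored, key=lambda name: dtw_distance(query, stored[name]))


# Hypothetical feature sequences for two stored audio tags.
stored_tags = {"garden": [1, 2, 3, 2], "stairs": [5, 5, 6, 7]}
# A slower utterance of "garden": same shape, one value stretched in time.
match = nearest_tag([1, 2, 2, 3, 2], stored_tags)
```

Because DTW lets one sequence stretch against the other, two utterances of the same tag spoken at different speeds can still align at low cost, which is why no transcription step is needed.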
task 1: remember the tag
stimulus retrieval
Pictures taken during the field trial
task 2: remember the context
stimulus retrieval
TASK 2 PICTURE 1
three little bushes Garden Tree Stairs
task 3: remember the picture
stimulus retrieval
Text Audio tags were converted into
textual tags and vice versa
task 4: remember the sequence
assignment retrieval
TASK 4
Three pictures among the oldest and three pictures among the newest.
metrics
• time to completion
• false positives
• retrieval errors
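The slides do not define these metrics, so the following is only a sketch of how they might be computed per trial. The `Trial` record, its field names, and the definitions used (a false positive is any selected item other than the target; a retrieval error is a trial in which the target was never selected) are assumptions, not the study's actual procedure.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical log record for one retrieval trial."""
    start_s: float   # time the stimulus was shown, in seconds
    end_s: float     # time the participant finished, in seconds
    target: str      # item the participant was asked to retrieve
    selected: list   # items the participant actually picked, in order


def time_to_completion(trial):
    """Elapsed seconds between stimulus onset and the end of the trial."""
    return trial.end_s - trial.start_s


def false_positives(trial):
    """Count of selected items that are not the target (assumed definition)."""
    return sum(1 for item in trial.selected if item != trial.target)


def is_retrieval_error(trial):
    """True when the target was never selected (assumed definition)."""
    return trial.target not in trial.selected


# Example: one wrong pick before finding the right picture.
example = Trial(start_s=0.0, end_s=12.5, target="img_07",
                selected=["img_03", "img_07"])
```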
results H1
results H1-bis
All participants in the BOTH group felt that tagging with text was more effective than tagging with voice.
Voice: 3.33 [0.81], Text: 4.34 [0.81] (mean [SD]; scale: 1 = completely agree, 5 = completely disagree)
results H2
results H3
results H3 - continued
take away 1: speech is not a given
the advantage of audio as an input modality for tagging pictures on mobile phones is not a given
why?
1. retrieval precision
2. privacy
take away 2: input mistakes
we address text input mistakes immediately; by contrast, mistakes in audio recordings are
addressed less frequently
take away 3: memory
speech does not help users memorize the tags
implication 1: allow multiple modalities
© Pixar, 2008
implication 2: enable audio inspection
implication 3: enable modality synesthesia
© Disney, 1940
end: thanks
http://www.i-cherubini.it/mauro/blog/ http://research.tid.es/multimedia/