text versus speech: a comparison of tagging input modalities for camera phones

Text vs. Speech: A Comparison of Tagging Input Modalities for Camera Phones

Research & Development

Mauro Cherubini, Xavier Anguera, Nuria Oliver, and Rodrigo de Oliveira

people do not want to tag their pictures

intro → hypotheses → methodology → results → implications

research question:

Assuming that users are willing to input at least one tag, which input modality best supports the production of tags and the retrieval of pictures?


hypothesis 1

Speech is preferred to text as an annotation mechanism on mobile phones (objective measure)

Support: - Mitchard and Winkles (2002)


hypothesis 1-bis

Speech annotations are preferred by users even if this means spending more time on the task (subjective measure)

Support: - Perakakis and Potamianos (2008)


hypothesis 2

The longer the tag, the larger the advantage of voice over text for annotating pictures on mobile phones

Support: - Hauptmann and Rudnicky (1990)


hypothesis 3

Retrieving pictures on mobile phones with speech is not faster than with text (objective measure)

Support: - Mills et al. (2000)


the user study


field study (4 weeks)

controlled experiment

T1 - T2 - T3 - T4

3 experimental conditions:

a. Speech only

b. Text only

c. Speech and Text


MAMI


features of MAMI

•  processing is done entirely on the mobile phone

•  speech is not transcribed

•  to compare the waveforms of the audio tags, MAMI uses the Dynamic Time Warping algorithm
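To make the last bullet concrete, here is a minimal sketch of textbook Dynamic Time Warping used to match an untranscribed audio tag against stored ones. It is not MAMI's implementation: the slides do not say which audio features, local distance, or path constraints MAMI uses, so the 1-D numeric sequences, the absolute-difference cost, and the names `dtw_distance` and `best_match` are all assumptions made for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW cost between two numeric sequences (assumed 1-D features)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: absolute difference (an assumption, not MAMI's metric)
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def best_match(query, stored_tags):
    """Return the name of the stored tag whose sequence is closest to the query."""
    return min(stored_tags, key=lambda name: dtw_distance(query, stored_tags[name]))
```

Because the alignment may repeat samples, a tag spoken more slowly (for example `[1, 2, 2, 3]` against a stored `[1, 2, 3]`) still matches at zero cost, which is why DTW suits comparing recordings without transcribing them.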

task 1: remember the tag


stimulus retrieval

Pictures taken during the field trial

task 2: remember the context


stimulus retrieval

TASK 2 PICTURE 1

three little bushes Garden Tree Stairs

task 3: remember the picture


stimulus retrieval

Audio tags were converted into textual tags and vice versa

task 4: remember the sequence


assignment retrieval

TASK 4

Three pictures among the oldest and three pictures among the newest.

metrics


•  time to completion

•  false positives

•  retrieval errors

results H1


results H1-bis

All participants in the BOTH group felt that tagging with text was more effective than tagging with voice.

Voice: 3.33 [0.81], Text: 4.34 [0.81] (Mean [SD]; scale: 1 = completely agree, 5 = completely disagree)


results H2


results H3


results H3 - continued

take away 1: speech is not a given

the advantage of audio as an input modality for tagging pictures on mobile phones is not a given

why?

1. retrieval precision

2. privacy


take away 2: input mistakes

we address text input mistakes immediately; mistakes in audio recordings, by contrast, are addressed less frequently


take away 3: memory

speech does not help users memorize the tags


implication 2: enable audio inspection


end: thanks


http://www.i-cherubini.it/mauro/blog/ http://research.tid.es/multimedia/
