© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.
Multimodal Interaction in Speak4it
Patrick Ehlen, AT&T
Rethink Possible

This talk will discuss…
• Multimodal interaction approaches
  – mode choice
  – mode integration
• Grounding (It’s context!)
• Grounding in multimodal local search

What is multimodal interaction?
• The most common implementation of “multimodal interaction” – mode choice
• Let people use more than one mode of input or output
  – Input: Graphical UI or voice (ASR)
  – Output: Visual (graphics) or voice (TTS)
  – Interact using one mode at a time

Another approach…
• mode integration
  – Use more than one mode at the same time
  – Provide simultaneous information using different channels
  – Combine information from different modes into one interpretation
Example: “Italian restaurants near here”
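Combining modes into one interpretation can be pictured as a function that merges a spoken query with a touch gesture. The sketch below is a deliberately simple, rule-based illustration. The class names, fields, and the precedence order (spoken place, then gesture, then GPS) are assumptions made for this example and do not describe how Speak4it is implemented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

LatLon = Tuple[float, float]


@dataclass
class SpeechInput:
    """Hypothetical parse of an utterance."""
    topic: str                        # e.g. "italian restaurants"
    place_name: Optional[str] = None  # set only if a place was spoken


@dataclass
class GestureInput:
    """Hypothetical summary of a touch gesture on the map."""
    kind: str         # "point", "line", or "area"
    centroid: LatLon  # representative coordinate of the gesture


@dataclass
class Interpretation:
    """One combined reading of what the user asked for."""
    topic: str
    location_source: str  # "verbal", "gesture", or "physical"
    location: Union[str, LatLon]


def integrate(speech: SpeechInput,
              gesture: Optional[GestureInput],
              gps: LatLon) -> Interpretation:
    """Merge speech and gesture using an assumed fixed precedence."""
    if speech.place_name is not None:
        return Interpretation(speech.topic, "verbal", speech.place_name)
    if gesture is not None:
        return Interpretation(speech.topic, "gesture", gesture.centroid)
    return Interpretation(speech.topic, "physical", gps)


if __name__ == "__main__":
    spoken = SpeechInput(topic="italian restaurants")
    touch = GestureInput(kind="point", centroid=(37.7749, -122.4194))
    print(integrate(spoken, touch, gps=(40.7128, -74.0060)))
```

A fixed precedence like this is only a stand-in. The rest of the talk argues that which location counts as grounded has to be learned from context.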

Advantages…
• It’s natural (underspecification is the norm)
• Adapt to environment
• Speech can be shorter and simpler and/or communicate more complex information
• Complete tasks more quickly

Advantages…
• Some content is better communicated by modes other than speech (e.g., gesturing to communicate spatial information)
• Information from different modes can complement one another and resolve ambiguities (“mutual compensation”)

History of research prototypes
  – MATCH (Johnston et al. 2002)
  – AdApt (Gustafson et al. 2000)
  – SmartKom mobile (Wahlster 2006)
  – Multimodal Interactive Maps (Oviatt 1997)

The Next Big Thing?
• New technologies (touch screens, GPS, accelerometer data, video-based recognition) will spur an evolution in multimodal interface design
  – Beyond mode choice to mode integration
• Speak4it℠ – only commercially available product we know of that performs multimodal integration at the semantic level
  – Available for free on iPhone, iPad, iPod touch

Multimodal interaction in Speak4it
• Speak4it gesture inputs
  – point, line, area (drawn with finger)
  – when user hits ‘Speak/Draw’ button, map display becomes a drawing canvas
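To make the three gesture types concrete, here is a minimal sketch of how a finger trace could be sorted into point, line, or area. The heuristic and both thresholds (`point_radius`, `closure_ratio`) are assumptions chosen for illustration, not the recognizer Speak4it uses.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # screen coordinates in pixels


def classify_trace(trace: List[Point],
                   point_radius: float = 10.0,
                   closure_ratio: float = 0.2) -> str:
    """Label an ink trace as "point", "line", or "area".

    Assumed heuristic:
    - a trace whose bounding box is tiny is a point (a tap);
    - a trace that ends near where it started is a closed area;
    - anything else is a line.
    """
    if not trace:
        raise ValueError("trace must contain at least one point")

    xs = [p[0] for p in trace]
    ys = [p[1] for p in trace]
    extent = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    if extent <= point_radius:
        return "point"

    # Total drawn length versus the gap between first and last sample.
    path_length = sum(math.dist(a, b) for a, b in zip(trace, trace[1:]))
    gap = math.dist(trace[0], trace[-1])
    return "area" if gap <= closure_ratio * path_length else "line"


if __name__ == "__main__":
    print(classify_trace([(100, 100), (102, 101)]))
    print(classify_trace([(0, 0), (50, 0), (100, 0)]))
    print(classify_trace([(0, 0), (100, 0), (100, 100), (0, 100), (2, 3)]))
```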

Multimodal integration provides more headaches for designers
• Problems:
  – More ‘dimensions’ of context
  – Demands more focus on “common ground” and aspects of knowledge that have already been grounded with users (Clark 1996)

What is grounding?
• Mutual knowledge: Things that all parties in a conversation know, and know that other parties in the conversation also know – shared physically, linguistically, or via community
• When people introduce references, either verbally or by other means, they are grounding those references
• In dialogue, grounding helps to determine what people say, and what they don’t say
  – What we do or don’t say reveals a lot about aspects of context we believe are already shared

Grounding in telephony queries
  – Search queries are very basic dialogue
    • Single exchange of query & response
  – Telcos have dealt with these queries for a long time…
Example exchange:
  Operator: “What listing please?”
  Caller: “Cable Car Pizza”
  Operator: “Here’s that number…”

Grounding in telephony queries
  – 411 systems assumed an implicit grounded location because phones had a fixed location (tied to area code)
    • To refer to another location, you called a different area code
    • The area code provided a source of mutual knowledge about the grounded location in a query
Example exchange:
  Operator: “What listing please?”
  Caller: “Cable Car Pizza in San Francisco”
  Operator: “Please call 415-555-1212”

Then phones lost their tethers (and their implicit grounding mechanisms)…
  – With mobile phones, not as much shared knowledge about location
  – Location became “part of the conversation” again
  – Spoken query dialogue systems:
    • Google-411, Bing-411, 800-Yellowpages
    • Phone apps
    • etc.
Example exchange:
  System: “What city and state?”
  Caller: “San Francisco, California”
  System: “What listing?”
  Caller: “Cable Car Pizza”

Evidence of grounding problems found in Speak4it logs
• Frequency of specific locations in queries: 18%
    • “police department in jessup maryland”
    • “office depot linden boulevard”
• Most are unlocated:
    • “gas station”
    • “saigon restaurant”
• Location grounding breaking down:
    • “Serendipity”
    • … followed shortly by
    • “Serendipity Dallas Texas”
• Corrections:
    • “Starbucks Cape Girardeau”
    • … followed six minutes later by
    • “Lowes”
    • … then right away
    • “Lowes Cape Girardeau”
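The repair pattern in these logs (an unlocated query quickly re-issued with a place name appended) is easy to look for mechanically. The sketch below shows one way to flag such pairs. The log format, the timestamps in the example, and the 120-second window are assumptions for illustration. Only the query strings come from the slide.

```python
from typing import List, Tuple

LogEntry = Tuple[float, str]  # (seconds since session start, query text)


def find_location_repairs(log: List[LogEntry],
                          max_gap: float = 120.0) -> List[Tuple[str, str]]:
    """Return (original, repaired) pairs of consecutive queries.

    A pair counts as a repair when the second query arrives within
    max_gap seconds and consists of the first query's words followed
    by at least one extra word (assumed to be the added location).
    """
    repairs = []
    for (t1, q1), (t2, q2) in zip(log, log[1:]):
        first, second = q1.lower().split(), q2.lower().split()
        quick = 0 <= t2 - t1 <= max_gap
        extended = len(second) > len(first) and second[:len(first)] == first
        if first and quick and extended:
            repairs.append((q1, q2))
    return repairs


if __name__ == "__main__":
    # Timestamps are invented; the slide gives only rough timing.
    session = [
        (0.0, "Starbucks Cape Girardeau"),
        (360.0, "Lowes"),
        (365.0, "Lowes Cape Girardeau"),
    ]
    print(find_location_repairs(session))
```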

Location grounding sources in multimodal mobile search
Query: “italian restaurants”
• PHYSICAL: User’s current location (GPS)
• GUI: Location shown on map display
• GESTURE: Where user touched
• VERBAL: Place spoken in prior query (e.g., “Sorry I could not find french restaurants in madison”)
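The four sources can be treated as a set of candidate locations that compete to ground an unlocated query. A minimal sketch of that candidate set follows. The `Context` fields and the function name are hypothetical, introduced only for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple, Union

LatLon = Tuple[float, float]


class Source(Enum):
    PHYSICAL = "physical"  # user's current location (GPS)
    GUI = "gui"            # location shown on the map display
    GESTURE = "gesture"    # where the user touched
    VERBAL = "verbal"      # place spoken in a prior query


@dataclass
class Context:
    """Hypothetical snapshot of what the app knows at query time."""
    gps: Optional[LatLon] = None
    map_center: Optional[LatLon] = None
    last_touch: Optional[LatLon] = None
    prior_spoken_place: Optional[str] = None


def candidate_groundings(
        ctx: Context) -> List[Tuple[Source, Union[LatLon, str]]]:
    """List every grounding source that currently has a value."""
    pairs = [
        (Source.PHYSICAL, ctx.gps),
        (Source.GUI, ctx.map_center),
        (Source.GESTURE, ctx.last_touch),
        (Source.VERBAL, ctx.prior_spoken_place),
    ]
    return [(source, value) for source, value in pairs if value is not None]


if __name__ == "__main__":
    ctx = Context(gps=(43.07, -89.40), prior_spoken_place="madison")
    for source, value in candidate_groundings(ctx):
        print(source.name, value)
```

Choosing among the candidates is the hard part, and it is deliberately left out here.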

Example
“new york, new york”
“pizza restaurants”
✔ <scroll>

Collecting grounding data in the wild
• Gathered ground truth from users when they are “in the wild”
• Present users with a “grounded location disambiguation” screen to collect user-reported intentions
• Display the screen for ~20% of unlocated queries
• Use these data to train a context model and to judge model comparisons
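Showing the disambiguation screen for roughly one in five unlocated queries amounts to a sampling rule. The sketch below uses a hash of a query identifier so the decision is repeatable. The identifier, the function name, and the hashing approach are assumptions for illustration. The source states only the approximate 20% rate.

```python
import hashlib


def should_show_disambiguation(query_id: str,
                               has_location: bool,
                               rate: float = 0.2) -> bool:
    """Decide whether to show the grounded-location disambiguation screen.

    Queries that already name a location are never sampled. For the
    rest, the query id is hashed into [0, 1) and compared with the
    rate, so about `rate` of unlocated queries are selected and the
    same id always gets the same answer.
    """
    if has_location:
        return False
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return bucket < rate


if __name__ == "__main__":
    shown = sum(should_show_disambiguation(f"q-{i}", False)
                for i in range(10_000))
    print(f"shown for {shown / 10_000:.1%} of unlocated queries")
```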

Selected grounded locations (relative to presentation)
[Bar chart. Categories: PHYSICAL, MAP VIEW, VERBAL, GESTURE. Series: Selected. Values: 64.37%, 38.04%, 13.59%, 69.29%]
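Reading “relative to presentation” as “times a source was chosen divided by times it was offered on the disambiguation screen” (an interpretation, not something the slide spells out), the rates could be computed as below. The event format and the sample data are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[Set[str], str]  # (sources offered on screen, source chosen)


def selection_rates(events: Iterable[Event]) -> Dict[str, float]:
    """Per-source rate: selections divided by presentations."""
    presented: Counter = Counter()
    selected: Counter = Counter()
    for offered, choice in events:
        presented.update(offered)  # each offered source counts once
        if choice in offered:
            selected[choice] += 1
    return {src: selected[src] / presented[src] for src in presented}


if __name__ == "__main__":
    # Invented events, not Speak4it data.
    sample = [
        ({"physical", "map"}, "physical"),
        ({"physical", "map", "gesture"}, "gesture"),
        ({"physical", "verbal"}, "physical"),
        ({"physical", "map"}, "map"),
    ]
    for source, rate in sorted(selection_rates(sample).items()):
        print(f"{source}: {rate:.2%}")
```

Normalizing by presentation matters because not every source is available for every query. A gesture, for instance, can be chosen only when the user actually touched the map.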

• Screencast: http://speak4it.com/screencast

Speak4it multimodal architecture
[Architecture diagram]
• Components: Interaction Manager; Multimodal Search Platform; Speech Platform; ASR; Gesture Recognition; NLU; Location Grounding; Geo-coder; Search
• Models and indexes: SLM; NL model; Geo index; Listings index
• Data labels: Multimodal HTTP data stream (speech, text, ink); Results/Requests; Audio; Ink trace; Gestures; Parsed string; Location string; Lat/Lon; Salient Location; Features; Query; Results

Conclusions
• Multimodal UIs will soon move from mode choice to mode integration
• We’ll need richer context models to predict grounding of locations and other references across modes, to align system actions with user expectations
• Mobile voice searchers don’t always consider their “GPS” location as the grounded one; location shown on the map is considered grounded 37% of the time
• User groundings from touch are highly salient

Acknowledgments
• Thanks to Jay Lieske, Clarke Retzer, Brant Vasilieff, Diamantino Caseiro, Junlan Feng, Srinivas Bangalore, Claude Noshpitz, Barbara Hollister, Remi Zajac, Mazin Gilbert, and Linda Roberts for their contributions to Speak4it.