NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing
NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing
Seungyeop Han (University of Washington), Matthai Philipose and Yun-Cheng Ju (Microsoft). Ubicomp 2013.
Speech-Based UIs Are Here
Today: "Siri, …"
Today: "Hey Glass, …"
Tomorrow: "Hey Microwave, …"
Keyphrases Don’t Scale
What time is it?
…
Use Spoken Natural Language
App1
App2: "Next bus to Seattle"
App3: "Tomorrow's weather"
App26: "When is the next meeting", "What time is the next meeting", …
App50: …
Keyphrase Hell
Spoken Natural Language (SNL) Today: First-Party Applications
“Hey, Siri. Do you love me?”
• Personal assistant model
• Large speech engine (20-600 GB)
• Experts mapping speech to a few domains
Speech Recognition
Language Processing
Text: “Hey Siri…” … “I’m not allowed, Seungyeop”
NLify: Scaling Spoken NL Interfaces
1st-party app (e.g., Xbox, Siri): multiple PhDs, tens of developers; on the order of 10 apps
3rd-party app (e.g., Intuit, Spotify): 0 PhDs, 1-3 developers; on the order of 10,000 apps
End-user macro (e.g., ifttt.com): 0 PhDs, 0 developers; on the order of 10,000,000 apps
Goal
Make programming spoken natural language interfaces as easy and robust as programming graphical user interfaces.
Outline
• Motivation / Goal
• System Design
• Demonstration
• Evaluation
• Conclusion
Challenges
• Developers are not SNL experts
• Applications are developed independently
• Cloud-based SNL does not scale as a UI
  – UI capability must not rely on connectivity
  – UI events must have minimal cost
Specifying GUIs
Intuitive definition of a UI handler linking to code
Specifying Spoken Keyphrase UIs
<CommandPrefix>Magic Memo</CommandPrefix>
<Command Name="newMemo">
  <ListenFor>Enter [a] [new] memo</ListenFor>
  <ListenFor>Make [a] [new] memo</ListenFor>
  <ListenFor>Start [a] [new] memo</ListenFor>
  <Feedback>Entering a new memo</Feedback>
  <Navigate Target="/Newmemo.xaml" />
</Command>
...
How does natural language differ from keyphrases?
Difference 1: Local Variation
• Missing words
• Repeated words
• Re-arranged words
• New combinations of phrases
When is the next meeting?
When is next meeting?
When is the next.. next meeting?
When the next meeting is?
What time is the next meeting?
Difference 2: Paraphrases
show me the current time what is the time time what is the current time may i know the time please give time show me the time show me the clock tell me what time it is what is time current time tell what time it is list the time what time
what time it is now show current time what time please show time what is the time now current time please say the time find the current time please what time is it what is current time what time is it tell me time current what's the time tell current time
what time is it now what time is it currently check time the time now tell me the current time what's time time now tell me the time can you please tell me what time it is tell me current time give me the time time please show me the time now
Specifying SNL Systems
Speech Recognition
Language Processing
whattime()  "what time is it?"
Few rules, lots of data:
• Use statistical language models that require little anticipation of local noise
• Use data-driven models that require little domain knowledge
Lots of rules, little data:
• Encode local variation in grammar
• Encode domain knowledge on paraphrases in models (e.g., CRFs)
Exhaustive Paraphrasing by Automated Crowdsourcing
Examples from developers
Handler: whattime()
Description: When you want to know the time
Examples: What time is it now; What's the time; Tell me the time

After crowdsourced amplification:
Handler: whattime()
Description: When you want to know the time
Examples: What time is it now; What's the time; Tell me the time; Current time; Find the current time please; Time now; Give me time; …
[Screenshot: the automatically generated crowdsourcing task, showing the task description, an example, and directions for workers]
Compiling SNL Models
[Pipeline diagram: at dev time, seed examples with slot markers (e.g., ".What is the date @d", ".Tell me the date @d", …) are amplified through an Internet crowdsourcing service into a larger set of amplified examples (".What date is it @d", ".Give me the date @d", ".@d is what date", …); at install time these are compiled into a nearest-neighbor model and SLM statistical models; at run time an utterance such as "Tell me when it's @T=20 min …" passes through SAPI and TF-IDF + NN matching, raising an NLNotifyEvent consumed via the nlwidget.]
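To make the amplification step in this pipeline concrete, here is a minimal Python sketch. It is an illustration only: the sample slot value, the "Directions" wording, and the function names are assumptions, not NLify's actual code; only the "What would you say to the phone to do the described task" prompt comes from the talk. A seed template with a slot marker is rendered into a crowd prompt, and a collected paraphrase is normalized back into a template.

SAMPLE_SLOT_VALUES = {"@d": "tomorrow"}   # hypothetical sample filler for the date slot

def seed_to_prompt(template):
    # Render a seed template as a concrete example sentence for the crowd task.
    sentence = template
    for marker, value in SAMPLE_SLOT_VALUES.items():
        sentence = sentence.replace(marker, value)
    return ("Task: What would you say to the phone to do the described task?\n"
            "Example: \"%s\"\n"
            "Directions: write a different way of asking for the same thing." % sentence.strip())

def paraphrase_to_template(paraphrase):
    # Turn a collected paraphrase back into a template by restoring slot markers.
    template = paraphrase
    for marker, value in SAMPLE_SLOT_VALUES.items():
        template = template.replace(value, marker)
    return template

print(seed_to_prompt("What is the date @d"))
print(paraphrase_to_template("tell me what date tomorrow is"))   # -> "tell me what date @d is"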
SNL Models for Multiple Apps
[Diagram: each application contributes its own amplified examples, e.g. Application 1: ".What is the date @d", ".Tell me the date @d", ".What date is it @d", ".Give me the date @d", ".@d is what date", …; Application 2: ".How much is @com", ".Get me quote for @com", ".What's the price for @com", …; … Application N. These are compiled on the phone into a shared nearest-neighbor model and SLM statistical models used at run time (SAPI, TF-IDF + NN, NLNotifyEvent, nlwidget).]
• Apps developed separately => "late assembly" of models
• Limited time for learning at install time => simple (e.g., NN) models (a toy version is sketched below)
• Users no longer say anything but what they have installed => "natural language shortcut" mental model
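The run-time matching step, TF-IDF features with a nearest-neighbor classifier over the pooled amplified examples, can be illustrated with a minimal, self-contained Python sketch. The class and function names are assumptions for illustration, not NLify's implementation, and real slot handling and the SLM front end are omitted.

import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class TfidfNearestNeighbor:
    def __init__(self, labeled_examples):
        # labeled_examples: (sentence, intent) pairs pooled from all installed apps
        self.docs = [(Counter(tokenize(s)), intent) for s, intent in labeled_examples]
        df = defaultdict(int)
        for counts, _ in self.docs:
            for term in counts:
                df[term] += 1
        n = len(self.docs)
        self.idf = {t: math.log(n / d) + 1.0 for t, d in df.items()}

    def _tfidf(self, counts):
        return {t: c * self.idf.get(t, 0.0) for t, c in counts.items()}

    def classify(self, utterance):
        # Return the intent of the nearest (cosine-most-similar) stored example.
        q = self._tfidf(Counter(tokenize(utterance)))
        q_norm = math.sqrt(sum(w * w for w in q.values())) or 1.0
        best_intent, best_score = None, -1.0
        for counts, intent in self.docs:
            d = self._tfidf(counts)
            d_norm = math.sqrt(sum(w * w for w in d.values())) or 1.0
            score = sum(q.get(t, 0.0) * w for t, w in d.items()) / (q_norm * d_norm)
            if score > best_score:
                best_intent, best_score = intent, score
        return best_intent, best_score

# Toy usage with examples in the style of the evaluation dataset:
model = TfidfNearestNeighbor([
    ("what time is it", "FindTime"),
    ("tell me the time", "FindTime"),
    ("when is the next bus to seattle", "FindNextBus"),
    ("how much is microsoft stock", "FindStockPrice"),
])
print(model.classify("what's the time now"))   # -> ("FindTime", <similarity>)

Because each app only supplies its own example set, "late assembly" amounts to concatenating the per-app labeled examples before building this index at install time.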
Outline
• Motivation / Goal
• System Design
• Demo: SNL interfaces in 4 easy steps
• Evaluation
• Conclusion
1. Add NLify DLL
2. Providing Examples
3. Writing a Handler
4. Adding a GUI Element
Enjoy :)
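Conceptually, the four steps above amount to providing seed examples, mapping them to a handler, and letting the framework dispatch recognized intents to that handler. The following is a purely hypothetical Python sketch of that developer-facing flow; NLify itself is added as a DLL to a Windows Phone app, and the names below (register_intent, dispatch) are invented for illustration, not NLify's API.

HANDLERS = {}
SEED_EXAMPLES = {}

def register_intent(name, examples, handler):
    # Steps 2 and 3: provide seed examples and the handler they map to.
    SEED_EXAMPLES[name] = examples
    HANDLERS[name] = handler

def dispatch(recognized_intent, **slots):
    # Step 4, conceptually: the SNL widget routes a recognized intent to its handler.
    return HANDLERS[recognized_intent](**slots)

def what_time():
    # Ordinary application code, analogous to a GUI event handler.
    from datetime import datetime
    return datetime.now().strftime("%H:%M")

register_intent("FindTime",
                ["What time is it now", "What's the time", "Tell me the time"],
                what_time)
print(dispatch("FindTime"))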
Outline
• Motivation / Goal
• System Design
• Demonstration
• Evaluation
• Conclusion
Evaluation
• How good are SNL recognition rates?
• How does performance scale with commands?
• How do design decisions impact recognition?
• How practical is an on-phone implementation?
• What is the developer experience?
Evaluation Dataset
Domain     Intent & Slots                    Example
Clock      FindTime()                        What time is it?
           FindDate(day)                     What's the date today?
Calendar   CheckNextMtg()                    What's my next meeting?
Bus        FindNextBus(route, dest)          When is the next 20 to Seattle?
Finance    FindStockPrice(company)           How much is Microsoft stock?
           CalculateTip(Money, NumPeople)    How much is the tip for $20 for three people?
Condition  FindWeather(day)                  How is the weather tomorrow?
Contacts   FindOfficeLocation(person)        Where is Janet Smith's office?
           FindGroup(person)                 Which group does Matthai work in?
…
Across 27 different commands, we collected 1,612 paraphrases and 3,505 audio samples.
Evaluation Dataset
Seed: 5 paraphrases/intent, written by the authors (training)
Crowd: ~60 paraphrases/intent, obtained by amplifying the seeds via crowdsourcing at $0.03/paraphrase (training)
Audio: 130 utterances/intent from 20 subjects, asked "What would you say to the phone to do the described task?" with an example (testing)
Overall Recognition Performance
• Absolute recognition rate is good (avg: 85%, std: 7%)
• Significant relative improvement over the Seed-only baseline (69%)
Performance Scales Well with Number of Commands
Design Decisions Impact Recognition Rates
• The more exhaustive the paraphrasing, the better
• A statistical model improves the recognition rate by 16% vs. a deterministic model
[Chart: recognition rate (0-100%) vs. fraction of the training set used (20-100%)]
Feasibility of Running on Mobiles
• NLify is competitive with a large-vocabulary model
• Memory usage is acceptable: maximum memory for 27 intents was 32 MB
• Power consumption is very close to that of the listening loop
Figure 5. Scaling with number of commands.
Figure 6. Incremental benefit from templates.
…not surprising since both the SLM and TF-IDF algorithms that identify intents compete across intents. Third, slot recognition does not vary monotonically with the number of competitors; in fact the particular competitors seem to make a big difference, leading to high variance for each N. On closer examination we determined that even the identity of the competitors does not matter: when certain challenging functions (e.g., 11, 12 and 19) are included, recognition rate for the subset plummets. Larger values of n will likely give a smoother average line. Overall, since slot recognition is performed deterministically bottom up, it does not compete at the language-model level with other commands.
Impact of NLify Features
NLify uses two main techniques to generalize from the seeds provided by the developers to the variety of SNL. To capture broad variation, it supports template amplification as per the UHRS dataset. To support small local noise (e.g., words dropped in the speech engine), it advocates a statistical approach even when the models are run locally on the phone (in contrast, e.g., to recent production systems [5]).
We saw earlier that using the Seed set instead of Seed + UHRS (where Seed has 5 templates per command and UHRS averages 60) lowers recognition from 85% to 69%. Thus UHRS-added templates contribute significantly. To evaluate the incremental value of templates, we measured recognition rates when f = 20, 40, 60 and 80% of all templates were used. We pick the templates arbitrarily for this experiment. The corresponding average recognition rates (across all functions) were 66, 75, 80 and 83%. Figure 6 shows the breakout per function. Three factors stand out: recognition rates improve noticeably between the 80% and 100% configurations, indicating that rates have likely not topped out; improvement is spread across many functions, indicating that more templates are broadly beneficial; and there is a big difference between the 20% and the 80% mark. The last point indicates that even had the developer added an additional dozen seeds, crowdsourcing would still have been beneficial.

Figure 7. Benefit of statistical modeling: (a) intent recognition, (b) slot recognition.
Given that templates may provide good coverage across paraphrases for a command, it is reasonable to ask whether a deterministic model that incorporates all these paraphrases would perform comparably to a statistical one. Given template amplification, is a statistical model really necessary? In the spirit of the Windows Phone 8 Voice Command [5], we created a deterministic grammar for each intent. For robustness toward accidentally omitted words, we made the common words {is, me, the, it, please, this, to, you, for, now} optional in every sentence. We compared recognition performance of this deterministic system with the SLM, both trained on the Seed + UHRS data. Figure 7 shows the results for both intent and slot recognition. Two points are significant. First, statistical modeling does add a substantial boost for both intent (16% incremental) and slot recognition (19%). Second, even though slots are parsed deterministically, their recognition rates improve substantially with SLMs. This is because deterministic parsing is all-or-nothing: the most common failure mode by far is that the incoming sentence does not parse, affecting both slot and intent recognition rates.
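As a rough illustration of that deterministic baseline and its all-or-nothing failure mode, here is a minimal Python sketch. It is an assumption-level approximation, not the actual Voice Command grammar format: the listed common words are treated as optional, and an utterance is recognized only if its remaining words exactly match a template's.

OPTIONAL = {"is", "me", "the", "it", "please", "this", "to", "you", "for", "now"}

def content_words(sentence):
    # Drop the optional common words; everything else must match exactly, in order.
    return [w for w in sentence.lower().split() if w not in OPTIONAL]

TEMPLATES = {"FindTime": ["what time is it", "tell me the time"]}

def recognize(utterance):
    for intent, templates in TEMPLATES.items():
        if any(content_words(t) == content_words(utterance) for t in templates):
            return intent
    return None   # no parse: both intent and slot recognition fail

print(recognize("what time is it please"))   # matches despite the dropped/extra optional words
print(recognize("could you say the time"))   # out-of-template paraphrase -> None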
The experiments thus far assumed that no query was garbage. In practice, users may speak out-of-grammar commands. NLify's parallel garbage model architecture is set up to catch these cases. Without the garbage model, the existing SLM would still reject commands that are egregiously out-of-grammar …
Figure 8. Comparison to a large vocabulary model (average recognition: SLM 85%, LV 80%).
Developer Study w/ 5 Devs
Asked to add NLify to their existing programs
Description                           Sample command                        Original LOC   Time taken
Control a night light                 "turn off the light"                  200            30 mins
Get sentiment on Twitter              "review this"                         2000           30 mins
Query, control location disclosure    "where is Alice?"                     2800           40 mins
Query weather                         "weather tomorrow?"                   3800           70 mins
Query bus service                     "when is the next 545 to Seattle?"    8300           3 days
(+) How well did NLify's capabilities match your needs?
(-) Did the cost/benefit of NLify scale?
(-) How long do you think you can afford to wait for crowdsourcing?
Conclusions
It is feasible to build mobile SNL systems where:
• Developers are not SNL experts
• Applications are developed independently
• All UI processing happens on the phone
Fast, compact, automatically generated models enabled by exhaustive paraphrasing are the key.
For Data and Code
Check Matthai's homepage: http://research.microsoft.com/en-us/people/matthaip/
Or e-mail the authors on/after October 1.