Echoes From Audrey
Voice Assistants: The Sound of Machine Intelligence
Fivesight Research LLC
PO Box 2341
Red Bank, NJ 07701
www.fivesightresearch.com
Fivesight White Paper: Echoes From Audrey December 2017
©2017 Fivesight Research LLC, www.fivesightresearch.com Copying Prohibited Page 1
Table of Contents

What is a Voice Assistant?
    Voice Assistant Solution Stack
Speech Recognition: A Brief History
    The Pre-Modern Era: 1952-1970
    ARPA Speech Understanding Research Program
    Hidden Markov Models
    Siri
    AI and Artificial Neural Networks
Voice Assistant Market Landscape
    Siri
        Background
        Siri Ecosystem
        New Features
        Smart Speaker: HomePod
    Google Assistant
        Background
        Google Assistant Ecosystem
        New Features
        Smart Speaker: Google Home Products
    Amazon Alexa
        Background
        Alexa Ecosystem
        New Features
        Smart Speaker: Echo Products
    The Others: Microsoft, Facebook, Samsung, Baidu
Voice Assistant Market Adoption
    Embedded Segment: Smartphones
    Smart Speakers
Table of Figures

Figure 1: Voice Assistant Market Landscape
Figure 2: Echo Product Line
Figure 3: Respondents Using a Voice Assistant
Figure 4: Primary Smartphone Search Engine
Figure 5: Voice Assistant Usage
Figure 6: Smart Speaker Shipments
About Fivesight Research
Fivesight Research is a boutique research firm specializing in search, voice assistants
and the emerging AI technologies breaking apart the classic “plain old search box”. We
track and analyze the landscape of search through dedicated, ongoing coverage of
Google and the markets in which it operates.
For questions or comments on this report or Fivesight’s broader research program
please contact Joe Buzzanga at j.buzzanga@fivesightresearch.com.
Introduction
In 1952 Bell Labs researchers created the first speech recognition system. They named
it Audrey, and it was capable of understanding 10 spoken digits (“0” to “9”), separated
by pauses and uttered by a designated speaker. Fast forward 59 years to 2011. That
was the year Apple launched Siri on the iPhone, boldly introducing a new type of
intelligent voice application to a mass market. Three years later Amazon shocked the
tech industry by preemptively unveiling Echo, the first voice activated home appliance.
Today, voice assistants have crossed the chasm into the mass market. They are
embedded in iOS and Android devices and power the smart speaker product category
pioneered by Amazon. They have crossed a technology chasm as well: the technology
finally works well enough to be useful. Voice assistants aren’t intelligent in any
meaningful sense, but the speech recognition works well and continues to improve.
Meanwhile, artificial intelligence (AI) research progresses at an accelerating rate with
innovations in algorithms, models and silicon promising to increase the voice assistant’s
cognitive endowment.
It has been a long journey but the voice assistant is just the beginning of an AI
revolution that promises—some might say threatens—to transform industries, markets,
and society as a whole. This white paper analyzes the current voice assistant market
and technology landscape while providing historical context around the evolution of the
core speech recognition technology.
What is a Voice Assistant?
Siri, Google Assistant, and Alexa are examples of products that fall under the somewhat
vaguely defined category of voice assistants. We propose a more formal definition
below, isolating the essential differentiating attributes of these emerging products:
• AI-powered: artificial intelligence technology forms the basis of a voice
assistant, allowing it to simulate aspects of human intelligence including voice
recognition and synthesis, language understanding, and world knowledge.
There is no prescription regarding the type of AI, but most modern systems are
hybrids incorporating machine learning, knowledge graphs and other
techniques.
• General purpose: a voice assistant aspires to provide a wide range of
services. It is general purpose and open-ended, designed to answer questions
and provide services across many domains and tasks. This distinguishes it
from a chatbot, which is typically tuned and tailored for one function (e.g.,
airline reservations or customer service).
• Software application: it is important to distinguish the voice assistant from any
specific hardware implementation or device. The voice assistant is a software
application that can be embodied in many different devices: smartphones,
wearables, smart speakers, automobiles, etc.
• Simulates intelligence: ideally, a voice assistant would pass an unrestricted
Turing test: a human interrogator could ask it anything at all and in each case
would not be able to tell whether the response came from human or machine.
This remains a distant goal. Today a voice assistant succeeds by simulating a
narrow set of capabilities that give the appearance of cognition. For example:
    o Conversational ability: voice recognition, speech synthesis and natural
      language processing (NLP).
    o Knowledge of facts about the world, broadly construed, including but not
      limited to people, places and things.
    o Recognition of usage patterns, allowing it to make informed "guesses"
      that anticipate the user's needs based on the apps they use, the places
      they visit, etc.
    o Personalization: tailoring answers and actions to the user.
    o Prediction: proactively suggesting answers and actions before the user
      asks.

A voice assistant is an AI-powered, general purpose software application that
simulates intelligence through conversational (vocal) interaction, factual
knowledge, predictive abilities and personalization.
Voice Assistant Solution Stack
How does the voice assistant perform its magic? It starts with data. The web is one
source, but not the only one. Personal data (contacts, calendar appointments, etc.),
location data, and other on-device sensor information are fundamental data sources as
well. This device-specific and personal profile data is combined with the web to create
the raw material for the AI technology, which can be cloud based, device based or a
hybrid of the two.
The AI layer exploits the data to build its "intelligence", with components such as world
knowledge (knowledge of facts), speech recognition and NLP. The core speech
recognition and voice control operate in two directions: speech to text (the assistant
understands spoken utterances) and text to speech (the assistant responds with a
human-like voice using speech synthesis).

APIs and SDKs extend the functionality of the voice assistant, making it a platform for
3rd party apps.

The HW/interface layer adapts the voice assistant to diverse device types.

Services and apps are the actual functions exposed to the user and may be developed
by the voice assistant vendor or ecosystem partners. For example: "play a song", "set a
calendar appointment", "turn on the lights", etc.
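The flow through the stack can be sketched end to end. The sketch below is schematic only: every function and name is an invented stand-in for the cloud or on-device AI services and ecosystem apps a real assistant would use, not any vendor's actual API.

```python
# Schematic voice-assistant pipeline: data -> AI layer -> services.
# Every component here is an illustrative stub, not a real vendor API.

def speech_to_text(audio: bytes) -> str:
    """AI layer, direction one: recognize the spoken utterance."""
    return "turn on the lights"                      # stub recognizer

def parse_intent(text: str) -> dict:
    """NLP: map the recognized text to a structured intent."""
    if "lights" in text:
        return {"intent": "smart_home.lights", "action": "on"}
    return {"intent": "unknown"}

def dispatch(intent: dict, services: dict) -> str:
    """Route the intent to a vendor or ecosystem-partner service."""
    handler = services.get(intent["intent"])
    return handler(intent) if handler else "Sorry, I can't help with that."

# A 3rd-party "app" registered against the platform's API/SDK layer.
services = {"smart_home.lights": lambda i: f"Okay, lights {i['action']}."}

reply = dispatch(parse_intent(speech_to_text(b"...")), services)
print(reply)  # Okay, lights on.
```

The point of the sketch is the separation of layers: the recognizer and NLP can improve independently of the services, and the services dictionary is where an ecosystem of 3rd party apps plugs in.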
Speech Recognition: A Brief History
The Pre-Modern Era: 1952—1970
Early researchers were faced with the formidable challenge of creating speech
recognition systems before the advent of sufficiently powerful and widely deployed
general purpose computing technology. The first documented speech recognizer,
excluding novelties and toys, was created by Bell Labs researchers in 1952. They called
it Audrey (Automatic Digit Recognizer) and it was capable of recognizing 10 spoken
digits (“0” to “9”), separated by pauses and uttered by a designated speaker. In other
words, it was speaker dependent, limited to a fixed, predefined vocabulary and able to
recognize only numbers spoken discretely with deliberate boundaries formed by distinct
pauses.
Bell Labs wasn’t the only industrial research institution experimenting with speech
recognition. The other U.S. R&D colossus, IBM, was also keenly interested and pushed
the state of the art forward in this early period and up to contemporary times. An early
example was "Shoebox", a machine that could do simple math calculations via voice
commands. The system, which was demonstrated on TV and at the 1962 Seattle
World's Fair, recognized ten digits and six control words.
Audrey, Shoebox and other early
systems were impressive achievements
for their time, but they couldn’t scale
technically or economically and never found commercial applications. Although there
were multiple limitations to all of these pioneering efforts, three stood out and wouldn’t
be solved for decades:
1. Limited vocabulary: restricted to a fixed, predefined vocabulary of at most ~100
   words
2. Word isolation: capable of recognizing only isolated words rather than
   continuous speech
3. Speaker dependence: limited to working with the specific speaker(s) on whom
   the system was trained
Speech recognition research flourished throughout the 1960s, but there were those who
were critical of its aims, methods and apparent lack of progress. The most prominent
was the eminent Bell Labs scientist J.R. Pierce, who issued a scathing critique in 1969
in a letter to the Journal of the Acoustical Society of America entitled "Whither Speech
Recognition?". Pierce questioned the whole enterprise of speech recognition, suggesting
that the core problems were unsolvable and that the research was unscientific and
irresponsibly conducted without serious thought as to its purpose and likelihood of
success. His letter provides a fitting, if disappointing, end to the 1960s.
ARPA Speech Understanding Research Program
In 1971 work on speech recognition was given a new impetus thanks to the Speech
Understanding Research (SUR) program instituted by the U.S. Advanced Research
Projects Agency (ARPA). The SUR project funded a $15 million, five year program to
develop a large vocabulary, continuous speech recognition system meeting a set of
detailed performance goals recommended by a study group chaired by leading AI
researcher and Carnegie Institute of Technology professor Allen Newell. As a side note,
a National Academy of Sciences committee headed by Pierce strongly opposed the
program.
The program ended in 1976 and four prototypes were demonstrated. Only one, the
Harpy system from Carnegie Mellon, managed to meet the original performance
objectives.[1] It was able to process connected speech from multiple speakers with 95%
accuracy based on a large (1,011-word) vocabulary.
Despite the success of Harpy, the SUR program was generally taken as a vindication of
Pierce’s critique and further evidence of the immaturity of the speech recognition field.
The government declined to extend the program.
Dr. Raj Reddy, a prominent speech recognition scientist and leader of the Carnegie
Mellon SUR team, offered this assessment of the field in a review article published
contemporaneously with the conclusion of the SUR program:
    We are still far from being able to handle relatively unrestricted dialogs from a
    large population of speakers in uncontrolled environments. Many more years of
    intensive research seems necessary to achieve such a goal.[2]
While his prediction was correct, today it is clear that the methods and insights created
by the SUR-funded systems were important advances that would prove to be
foundational for most contemporary speech recognition solutions.

[1] Harpy actually exceeded two of the target metrics, one dealing with error rates and
the other specifying a compute cycle budget.
[2] Reddy, Dabbala Rajagopal. "Speech Recognition by Machine: A Review."
Proceedings of the IEEE 64.4 (1976), p. 528.

Hidden Markov Models

Harpy was itself a partial synthesis of two other Carnegie Mellon speech recognition
systems: Hearsay-I and Dragon. The latter, designed by Dr. James Baker, pioneered
the use of Hidden Markov Models (HMM) in speech recognition. The key Dragon
innovation was to reframe the speech recognition task wholly as a mathematical,
probabilistic problem based on HMM:
    The most significant feature of the Dragon system, as compared to most other
    current speech recognition systems, is its almost total lack of speech-dependent
    heuristic knowledge. Dragon treats speech recognition as a mathematical
    computation problem rather than as an artificial intelligence problem.[3]
The HMM approach would eventually come to dominate modern speech recognition
systems until being supplanted by artificial neural network (ANN) AI technology.
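Dragon's probabilistic reframing can be illustrated with the Viterbi algorithm, the standard HMM decoding procedure for finding the most likely hidden state (sound unit) sequence given a series of acoustic observations. The two-state model and all probabilities below are invented purely for illustration; a real recognizer would use thousands of trained states and continuous acoustic features.

```python
# Toy Viterbi decoding over a 2-state HMM (states stand in for
# phoneme-like units). Probabilities are invented, not trained values.

states = ["s1", "s2"]
start  = {"s1": 0.6, "s2": 0.4}                    # P(first state)
trans  = {"s1": {"s1": 0.7, "s2": 0.3},            # P(next | current)
          "s2": {"s1": 0.4, "s2": 0.6}}
emit   = {"s1": {"lo": 0.8, "hi": 0.2},            # P(observation | state)
          "s2": {"lo": 0.1, "hi": 0.9}}

def viterbi(obs):
    """Return the most probable state path for the observation sequence."""
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]  # path scores
    back = []                                              # backpointers
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            col[s] = V[-1][prev] * trans[prev][s] * emit[s][o]
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):                             # trace back
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["lo", "lo", "hi"]))  # ['s1', 's1', 's2']
```

The computation is pure probability arithmetic with no speech-specific heuristics, which is exactly the shift Dragon introduced.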
Dr. Baker and his wife founded Dragon Systems in 1982 to commercialize the Dragon
technology. In that same year the futurist, inventor and serial entrepreneur Ray
Kurzweil started Kurzweil Applied Intelligence with the goal of creating a voice-activated
word processor. Meanwhile, IBM continued its research and unveiled the experimental
PC-based Tangora speech recognizer.
These efforts led to the first wave of consumer, mass market speech recognition
products for the PC market in the early 1990s. These products included DragonDictate
(1990), IBM Personal Dictation System (1993) and Kurzweil Voice (1994). By the end of
the decade large vocabulary, continuous speech recognition programs were available
from these and other vendors.
There were also efforts to develop business applications of speech recognition, most
prominently in the telecommunications industry. AT&T introduced several systems for
customer care and operator services.
Siri
Siri brought speech recognition to the mass market when Apple included it on the
iPhone. The story of Siri has its roots in a 2003 DARPA project named PAL (Personal
Assistant that Learns). PAL was funded with a five-year, $150 million budget and was
intended to develop technologies to help computers interact with humans in a more
powerful and natural way using reasoning, learning and other cognitive abilities. PAL
ultimately led to the creation of Siri and in this regard played a central role in
commercializing speech recognition technology.

[3] Lowerre, Bruce T. The HARPY Speech Recognition System. Carnegie Mellon
University, Department of Computer Science, 1976, p. 16.
SRI International, an independent, non-profit R&D organization, led a team under the
PAL program and organized its efforts in a project it called "CALO", for Cognitive
Assistant that Learns and Organizes. CALO spawned a number of commercial
applications, most famously Siri, which was spun out as an independent company in
2007. Apple acquired the company in 2010 and in 2011 introduced Siri as an integral
part of iOS on the iPhone 4s.
The original Siri speech engine was licensed from Nuance, a company specializing in
speech technology and formed in 2005 as a result of multiple corporate acquisitions
including the original Dragon Systems.
AI and Artificial Neural Networks
Today, roughly 40 years after Dr. Reddy frankly stated the limitations of speech
recognition technology, those limitations have been largely overcome. Modern voice
recognition systems can competently handle unrestricted dialogs from a large
population of speakers in uncontrolled environments. This competence stems from the
surprising success of an AI technology that sparked breakthroughs in a range of difficult
computer science applications such as handwriting recognition and image classification.
In the early 2000s researchers began to apply artificial neural networks, a type of AI
technology, to speech recognition. In a seminal 2009 paper, Geoffrey Hinton et al.
showed that a deep neural network, a multi-layered form of ANN, outperformed
conventional statistical techniques on a standard speech recognition benchmark test
set. Further work improved results even more, and by 2012 Google had publicly stated
that it had converted to neural network technology for its voice search on Android.
Recent research shows deep neural network speech recognition approaching human
level performance, although there is some debate in the scientific community about how
to best measure performance on speech recognition tasks.[4] What is not in doubt is
that neural network technology has strikingly improved speech recognition and is
catalyzing the adoption of entirely new types of voice-controlled devices and services.
In the case of voice assistants, all of the major vendors (Google, Amazon, Apple and
Microsoft) have adopted the technology to drive speech recognition.

[4] In an August 2017 blog entry, Microsoft unequivocally claims to have matched the
currently accepted human word error rate of 5.1% in its speech recognition system.
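The word error rate (WER) behind the 5.1% figure cited above is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognizer's transcript and a human reference transcript, divided by the length of the reference. A minimal sketch, with made-up example transcripts:

```python
# Word error rate (WER): (substitutions + deletions + insertions)
# divided by reference length, via word-level edit distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Two errors ("the"->"a", "lights"->"light") over a 5-word reference.
print(wer("turn on the kitchen lights", "turn on a kitchen light"))  # 0.4
```

Part of the measurement debate noted above is that a single number like this hides differences in test sets, vocabulary and acoustic conditions, which is why cross-vendor WER claims are hard to compare.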
Voice Assistant Market Landscape
Figure 1 summarizes the product offerings of the major voice assistant vendors. "Vision"
refers to the addition of computer vision features inside the voice assistant, a trend
emerging in 2017.

Figure 1: Voice Assistant Market Landscape

Apple Siri
  Initial release: October 2011
  Devices: iPhone, iPad, Apple Watch, Apple TV, CarPlay, macOS, HomePod, Smart Home
  Vision: No
  3rd party support: SiriKit
  Notes: Siri was an acquisition. Apple's HomePod smart speaker launch delayed to 2018.

Google Assistant
  Initial release: May 2016
  Devices: Android, Android Wear, Android TV, Android Auto, Google Home, Pixelbook, Smart Home, iPhone
  Vision: Yes (Lens)
  3rd party support: Actions on Google; Google Assistant SDK; 468 Actions
  Notes: Google Lens integration in Google Assistant rolling out to Pixel phones starting in late November.

Microsoft Cortana
  Initial release: April 2014
  Devices: iOS, Android, Windows, Xbox, Harman Kardon smart speaker
  Vision: No
  3rd party support: Skills Kit; 174 Skills
  Notes: Harman Kardon Invoke smart speaker launched Oct. 22, 2017.

Amazon Alexa
  Initial release: Nov. 2014
  Devices: Echo devices; selected smartphones from Motorola, Huawei and HTC; TV; Auto; Smart Home; Cloud Cam
  Vision: Yes
  3rd party support: Skills Kit; Alexa Voice Service; 17,650 Skills
  Notes: Alexa inside the Amazon App on iOS and Android. Vision/image recognition is part of the Amazon App.

Facebook M
  Initial release: Aug. 2015 (GA April 2017)
  Devices: Inside Messenger on Android and iOS
  Vision: No
  3rd party support: No
  Notes: Not voice activated. Contextual suggestions.

Samsung Bixby
  Initial release: July 2017
  Devices: Samsung Galaxy S8, wearables, TV, Auto, Smart Home
  Vision: Yes (Lens)
  3rd party support: Bixby SDK (private beta)
  Notes: S Voice, released in May 2012, was a precursor. Bixby built on technology from the Viv acquisition.

Baidu DuerOS
  Initial release: Sept. 2015
  Devices: Android and iOS smartphone apps, Raven H smart speaker
  Vision: No
  3rd party support: DuerOS Open Platform
  Notes: Smart speaker based on the Raven Tech acquisition.

Sources: Company information, VoiceBot https://www.voicebot.ai/

Siri

Background

Apple was first to market thanks to its acquisition of Siri in 2010. It hasn't capitalized on
this early mover advantage, and today Siri isn't demonstrably smarter or more capable
than competing voice assistants. But Siri enjoys strong, though not universally positive,
brand recognition thanks to Apple's marketing and its integration with iOS. It has the
benefit of a massive, captive installed base on iOS: at Apple's most recent developer
conference the company claimed that Siri had 375 million active monthly devices. The
number, though, is virtually meaningless since "active monthly device" wasn't defined,
nor was the time period for the metric.
We calculate an alternative view of Siri’s market presence by estimating the iPhone
installed base. Using industry statistics, we estimate that in 2016 there were roughly
600 million iOS devices globally, based on a 15% iPhone share of a global installed
base of 4 billion smartphones. All of these have ready access to Siri.
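The installed-base estimate above is a simple share-of-base calculation, using only the figures stated in the text:

```python
# Back-of-envelope iOS installed base, using the estimates in the text.
global_smartphones = 4_000_000_000   # global smartphone installed base, 2016
iphone_share = 0.15                  # estimated iPhone share of that base

ios_devices = global_smartphones * iphone_share
print(f"{ios_devices / 1e6:.0f} million iOS devices")  # 600 million iOS devices
```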
Siri Ecosystem
In 2016 Apple opened up Siri to 3rd party developers with the launch of SiriKit, its
framework for adding Siri functionality to iOS apps. SiriKit supports the iPhone, Apple
Watch and, with iOS 11.2, Apple's forthcoming smart speaker, HomePod. However, the
company restricts developers to a set of Apple-defined use cases and domains, such as
ride hailing, payments and a few others. In addition, Apple only permits app developers
to control specific actions within an existing iOS app, and the app integration occurs on
top of Siri in Apple-controlled devices and environments. In other words, Apple doesn't
permit developers to integrate Siri into their own hardware.
Apple’s stance with Siri is consistent with its DNA. Its strength is controlling the user
experience through tight integration of hardware, software and services. It wants
independent software vendors to enhance Siri, but only within the confines of Apple’s
business objectives and a user experience it controls. The result is that the ecosystem
around Siri is sparse at the moment, especially in comparison with Amazon and Google,
neither of whom place Apple-like restrictions on their ecosystem partners.
New Features
Although Siri has an immense captive installed base, given its perceived flaws it isn't a
foregone conclusion that Apple can exploit that base as the voice assistant evolves into
what may be the next major computing platform: essentially the vehicle that finally
ushers in AI on a mass market scale. Competition in the voice assistant space is
intense, and Apple has been moving aggressively to improve Siri with the latest
machine learning technologies via acquisitions and hiring. Some of this technology is
evident in new features introduced for Siri in iOS 11:
• Speech synthesis for Siri's voice using deep neural networks: Siri's voice is more
  natural and expressive.
• On-device learning for predictive suggestions: Siri tracks user actions to provide
  more personal and contextual suggestions, securely, privately and synced
  across devices.
• Beta of instant translation for five language pairs.
• Increased developer support in SiriKit.
Siri will also get a boost from custom silicon in the form of the iPhone X's A11 Bionic
neural engine. Although the processor initially functions to optimize machine learning
algorithms for face recognition in the $1,000 iPhone X, Apple will undoubtedly employ it
to accelerate the AI technology powering Siri in the future.
Smart Speaker: HomePod
Siri currently has no presence in the home, but that will change with the launch of Apple
HomePod. Originally slated for a December launch, HomePod was delayed: the
company issued a statement in November saying it needed more time, and the
timeframe is now early 2018.
Apple has chosen to position HomePod as a premium priced audio speaker rather than
an AI-first, intelligent home device. The value proposition tellingly centers on superior
sound quality and integration with Apple Music rather than the power and advanced
intelligence of Siri.
Apple’s smart speaker comes more than two years after Amazon created the category
with Echo and one year after Google launched Google Home. Being late to market is
nothing new for Apple and the company has historically relied on delivering a superior
user experience to win market share, even if competitors were there first. Whether that
will be sufficient in this case is open to question, especially considering the rapid pace
of product innovation by Amazon and Google. Both companies have already iterated on
the initial smart speaker, building out voice assistant powered product families for the
home in different form factors and at different price points.
Google Assistant
Background
Google has been offering voice search and voice assistant-like products (i.e., Google
Now, Allo, Now on Tap) since 2011. In May 2016 it finally put all its wood behind one
arrow with Google Assistant, which launched as part of Google Home, the
company's answer to the Amazon Echo. In October 2016 it became available on
Google's Pixel smartphones, and today the Assistant is the foundation for the company's
expanding line of "made by Google" branded consumer hardware. In 2017 the company
expanded the reach of Google Assistant to the broader ecosystem of Android device
makers as well as iOS smartphones.
Google regards voice assistants as the next major computing platform shift, powering
everything from smart speakers to intelligent vehicles. We can expect to see it double
down on investments in product development around Google Assistant to exploit its
industry leading machine learning technology and to compete against Alexa, Siri, and
other contenders.
Google Assistant Ecosystem
In 2017 Google successfully increased the Assistant footprint beyond Google branded
products to a wide variety of products from 3rd party manufacturers including:
• Android smartphones running Android 6.0 Marshmallow or Android 7.0 Nougat
• Android TV: Google Assistant embedded in NVIDIA Shield, along with all Android
TVs in the U.S. running Android 6.0 Marshmallow or Android 7.0 Nougat
• Android Wear 2.0 Watches: LG Watch Style and LG Watch Sport—both
designed in collaboration with Google.
• iPhone: on iOS 9.1+
• Bose QC 35 II headphones: via Bluetooth pairing with an Android or iOS smartphone
• Speakers: including products from JBL, Panasonic, Sony and more
Overall, the company says that the Assistant is available on 100 million Android
devices. That number may seem large, but it is only a small fraction of the Android
installed base and highlights the challenge Google faces with its semi-open, horizontal
business model.
With the exception of Google’s own branded hardware, the Assistant’s market
penetration is limited by the fragmentation of the Android ecosystem: Android
Marshmallow 6.0, Nougat 7.0 and Oreo (the latest version of Android) represent 50% of
Android phones (based on Playstore visits) according to one estimate. Although this
portion of the Android ecosystem is capable of running Google Assistant, not all of the
OEMs are aligned with Google’s business objectives. The biggest Android OEM,
Samsung, is frankly unaligned and is pushing its own competing voice assistant, Bixby,
on its newest smartphones.
This Android ecosystem fragmentation explains why Google has gotten serious about
hardware and launched its own line of branded “made by Google” devices. Google is
determined to intercept the AI platform shift in computing, and it cannot rely solely on an
ecosystem of ambivalent partners. It will build its own devices, integrating AI, hardware
and software. If there was any doubt about its seriousness, the acquisition of 2,000 HTC
hardware engineers should put it to rest.
But 3rd party developers are still key for Google Assistant to scale into a real platform as
computing shifts to an AI first technology foundation. Developers can build on Google
Assistant in two ways:
• Actions on Google—lets developers build apps for the Google Assistant
• Google Assistant SDK—lets developers integrate Google Assistant into their own
devices including functionality such as hotword detection, voice control, and
natural language understanding.
So far, industry reports suggest Google is a distant second to Amazon when it comes to
developer support. According to VoiceBot, there were 468 Actions for Google Assistant
compared to 17,650 Skills for Alexa as of July 2017.
New Features
In 2017 Google announced significant new features for Google Assistant end users and
developers including:
Google Assistant End User Features
• Google Lens: point the smartphone camera at an object to trigger relevant
actions and information. Integrated in the Google Assistant on Pixel phones.
• Improved speech synthesis for Google Assistant’s voice using DeepMind’s
WaveNet technology.
• Proactive notifications on Google Home.
• Routines: expands preset routines giving users more options and control to
trigger a series of actions using a single command.
• Multi-user support aka Voice Match: Google Assistant on Google Home can
recognize up to six individuals based on their voice and tailor responses
accordingly.
• Keyboard input: Google Assistant supports typed queries.
• Bluetooth audio streaming.
• Hands-free voice calling on Google Home to any landline or wireless number in
the U.S. and Canada.
• Voice broadcast: users can broadcast a message to all Google Assistant enabled
speakers.
• Visual Google Assistant responses from TVs with Chromecast.
• Shop with Walmart: partnership with Walmart for personalized shopping using
Google Assistant.
• Shopping on Google Assistant: Google Assistant users can shop from
participating Google Express retailers.
• Schedule Calendar appointments and create reminders.
• Expanded music and entertainment support.
• International expansion: Google Home in the U.K., Canada (English and French),
Australia, Germany, France and Japan. The Assistant on eligible Android phones
and iPhones is also available in Brazilian Portuguese, Japanese, and Korean, with
Italian, Spanish (in Mexico and Spain), and Singaporean English promised in the
near future.
Google Assistant Developer Tools
• Actions on Google launched on Android and iOS smartphones.
• Google Assistant SDK: preview launched in April; updated in May.
• Multilingual support: apps can be created in German, French, Japanese, Korean,
Spanish, Italian, and Portuguese, as well as in French and English for Canada.
• Apps launched in the UK and Australia.
• Pixelbook support: Apps will run on Pixelbooks.
• AIY Voice Kit: do-it-yourself kit for developers to build a standalone voice
recognition system using Google Assistant.
• Revamped App directory to improve discovery.
• App device handoff: start an interaction on a smart speaker and handoff to the
phone.
• Personalization: collect user preferences to personalize app interactions.
• Features promoting app re-engagement including updates and push notifications.
• Family friendly apps: developers can have apps certified as “family friendly”.
• Templates for app creation: coding not required.
• Transactions: support for transactional apps on Google Assistant on phones.
Smart Speaker—Google Home Products
Google, like Amazon, is developing a family of smart home speakers in different form
factors and at different price points. Google Home was the first, launching in November
2016. Nearly one year later, in October 2017, it added two others: Google Home Mini
and Google Home Max. The Mini ($49) is an entry level device and a more or less
complete imitation of Amazon’s Dot, which was introduced more than one year earlier.
Google Home Max ($399) is a preemptive strike against Apple’s upcoming HomePod.
Like the HomePod, the Max is a premium priced smart speaker with a value proposition
built around a superior audio experience. Google Home ($129) was the company’s
original entry in the smart speaker category. Although no product has been announced,
Google is rumored to be working on a Home device with a screen to compete with
Amazon’s Echo Show.
Google claims that Google Home works with more than 1,000 smart home devices from
more than 150 popular brands. Many of these devices can be controlled by Alexa as
well, so 3rd party device support isn’t a strong competitive differentiator.
Amazon Alexa
Background
In November 2014 Amazon launched the Echo with Alexa as its integrated voice
assistant. The device was the first of its kind and established the smart speaker product
category. Few in the tech world would have expected Amazon to succeed with a mass
market consumer device, especially after its debacle with the Fire smartphone. Yet
three years after its introduction, the Echo dominates one of the most hotly contested
and fast growing consumer electronic product segments. Meanwhile, Amazon has
continued to build on its first to market advantage with a relentless pace of product
innovation and aggressive pricing.
Today the company has the most extensive smart speaker line up in the industry with a
family of devices at different price points, in different form factors, and tailored for
different use cases. That is just as well, since Alexa isn’t widely distributed on
smartphones. Notable exceptions include the Huawei Mate 9 and the Motorola Moto X4,
both of which offer Alexa integration. Another Android vendor, HTC, introduced an
Alexa-enabled smartphone, the U11, in 2017, but Alexa isn’t pre-installed; the user has
to download it from the Google Play Store. Alexa is accessible inside the Amazon
shopping app as well. Otherwise, Amazon offers both iOS and Android Alexa apps, but
they don’t act as full-fledged voice assistants. They mainly control Echo devices, which
are presumed to be present.
Alexa Ecosystem
Independent developers can participate in the Alexa ecosystem in a number of ways:
• Alexa Skills Kit: The Alexa Skills Kit (ASK) is a collection of self-service APIs,
tools, documentation, and code samples for developers. Using ASK, developers
can add capabilities to Alexa by creating “skills”. Amazon claims that more than
27,000 skills have been created to date.
• AVS (Alexa Voice Service): Device manufacturers can add intelligent voice
control to any connected product that has a microphone and speaker. In 2017
Amazon launched an AVS SDK to simplify and accelerate the creation of voice
enabled ecosystem products.
• Alexa Smart Home: Smart home device manufacturers can add Alexa voice
control to their products using the Smart Home Skills API.
• Alexa Gadgets: Announced in 2017, Gadgets are a new category of connected
products and developer tools that turn a compatible Echo device into a hub for
interactive play. Developers can build skills for Gadgets using the Alexa Gadgets
Skills API. They also have the option of creating their own Gadgets using the
Alexa Gadgets SDK.
The developer community around Alexa is by far the largest of any voice assistant,
reflecting Amazon’s first mover advantage combined with the company’s rich set of
development tools and offerings.
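To give a flavor of what building a custom skill involves, here is a minimal sketch of the JSON envelope a custom Alexa skill returns when invoked. This reflects the general shape of the response format as we understand it (version, output speech, end-session flag); a production skill would typically use the ASK SDK rather than assemble the dict by hand:

```python
import json

def build_alexa_response(text: str, end_session: bool = True) -> dict:
    # Minimal response envelope in the general shape custom skills
    # return; a real skill built with the ASK SDK would not construct
    # this dict manually.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

print(json.dumps(build_alexa_response("Hello from a skill."), indent=2))
```

The same request/response pattern underlies all of the skill types above; the higher-level APIs differ mainly in what metadata and directives travel alongside the speech.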
New Features
Amazon introduced new Alexa end user and developer features in 2017 as well as new
Echo devices.
Alexa End User Features
• Alexa Routines: customers can trigger a series of actions using a single voice command of their choice. Google announced a similar feature in October.
• Alexa Groups: allows customers to combine discrete smart home appliances in a group.
• Device View/Control: customers can see and control smart home appliances in
the Alexa app.
• Custom Alexa Lists: allows users to create and organize lists of their own choice.
• Alexa Calling and Messaging: Customers can place and receive calls using Echo
devices or the Alexa app.
• Alexa/Cortana Interworking: Alexa and Cortana can “talk” to each other giving
customers access to the respective strengths of each voice assistant.
• Alexa in Amazon Music App: embeds Alexa inside Amazon music app on iOS
and Android.
Smart Home Developer Tools
• Alexa Smart Home Skill API: updated API for creating Alexa smart home enabled appliances such as thermostats, lights, cameras, etc.
o Lock Control and Query: allows developers to voice control devices with
locks.
o Thermostat Query: simplifies development of temperature controlled
devices.
o Tunable Lighting Control: lets developers voice control color changing
lights or tunable white lights.
o Smart Home Camera Support: allows developers to show live streams
from smart home cameras on Echo Show.
o Entertainment Controls: allows Alexa to control various cloud connected
devices such as TVs, AV receivers.
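Conceptually, a smart home skill receives directives namespaced by capability and dispatches on that namespace. The sketch below is illustrative only: the namespaces mirror the capabilities listed above (locks, thermostats, cameras), but the handler bodies are placeholders rather than real Smart Home Skill API calls:

```python
# Toy dispatcher illustrating namespace-based routing of smart home
# directives. The namespace strings follow the Alexa.<Capability>Controller
# pattern; treat the specific names and handler text as assumptions.

def route_directive(directive: dict) -> str:
    ns = directive.get("header", {}).get("namespace", "")
    routes = {
        "Alexa.LockController": "lock/unlock the device",
        "Alexa.ThermostatController": "set or report temperature",
        "Alexa.CameraStreamController": "return a live camera stream",
    }
    # Unknown namespaces would produce an error response in a real skill.
    return routes.get(ns, "report an unsupported-directive error")

print(route_directive({"header": {"namespace": "Alexa.LockController"}}))
```

The appeal of this design for device makers is that adding a new capability means handling one more namespace, not redesigning the skill.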
Alexa Voice Service (AVS) Developer Tools
• AVS Device SDK: accelerates and simplifies development of Alexa voice enabled products.
• AVS Notifications: allows Alexa to proactively deliver content.
Display and Video Developer Tools
• Alexa Skills Kit Updates: lets developers update skills to take advantage of the display and video interfaces on the Echo Show.
• Display Cards: adds visuals to an Alexa enabled product complementing Alexa
voice responses for music, weather, calendar and more.
• Video Skill API: supports development of Alexa skills for video devices, content
and services.
Other Developer Tools
• Cloud-Based Wake Word Verification: improves accuracy of wake word detection by adding cloud verification to initial device wake word detection.
• Alexa Gadgets: developer tools for creating gaming and entertainment experiences.
• List Events: notifies a skill when a user makes a change to an Alexa list.
• Alexa 7-Mic Far-Field Development Kit: hardware-based reference design
enabling device manufacturers to build voice enabled products using Amazon’s
far-field mic and voice processing technology.
Smart Speaker—Echo Products
Amazon is rapidly versioning Alexa-powered devices, building out a family of products.
In September, the company announced what it called the “next generation” of Echo
devices:
• Echo (2nd generation): A lower-priced refresh of the original Echo with enhanced
speakers and Dolby sound.
• Echo Plus: includes a built-in Zigbee hub for home automation.
• Echo Spot: compact Echo with display; shipping Dec. 19.
• Echo Buttons: first product in a new category of connected products called “Alexa
Gadgets”. These are dedicated hardware products for playing games through an
Echo device.
Products announced earlier in 2017 include:
• Echo Look, announced April 26, includes a hands-free, depth sensing camera
and “style assistant”.
• Echo Show, featuring a 7 inch screen, was announced in May and began
shipping June 28.
• Dash Wand with Alexa is a voice- or barcode-activated device for ordering
Amazon products, answering simple queries, and controlling home appliances. It
was launched in June.
These devices join the original Echo, Amazon Dot, and Amazon Tap family of Alexa-
powered devices. Not all of these will be successful, but the cadence of Amazon’s
product development and release cycle is so far unmatched by Google (and Apple, for
that matter).
The Echo smart speaker line up is summarized below.
Figure 2: Echo Product Line
Note: Amazon Tap ($129.99) is also Alexa controlled but isn’t part of the Echo brand. Echo devices are
positioned as smart home devices. Tap, though it has Alexa smarts, is a portable Bluetooth speaker and
so is kept separate from the Echo line.
The Others—Microsoft, Facebook, Samsung, Baidu
Microsoft, Samsung, and Facebook are second tier players at the moment, while Baidu
isn’t yet a major player outside of its home market in China.
Microsoft
Microsoft is handicapped by the lack of a captive mobile platform for Cortana. On the
other hand, it has deeply integrated Cortana into Windows 10, making it an integral part
of the desktop environment. The open question is whether the desktop computing user
experience is really enhanced by a voice assistant. Microsoft believes it is, and during
its Q4 2016 earnings call it reported that the Cortana search box had over 100 million
active monthly users with 8 billion questions asked to date. Another data point was
offered at the company’s 2017 Build conference: it reported that Cortana had more than
141 million unique users on a base of 500 million Windows 10 devices.
In the smart speaker segment, Microsoft has elected to work with partners rather than
build its own branded device. Its first product, the Harman Kardon Invoke ($199.95),
launched in October. It’s a me-too product, with an attempt at differentiation based on
Harman Kardon’s audio pedigree and Cortana’s integration with the Microsoft
ecosystem. Microsoft will need to quickly find a way to expand on this initial product,
either with Harman Kardon or with other partners, if it is to be successful against the
more evolved smart speaker offerings from Amazon and Google.
Echo Product Line
• Echo Connect ($34.99): accessory that converts an Echo into a voice controlled
speakerphone. No display; no integrated hub.
• Echo Dot ($49.99): entry level smart home device. No display; no integrated hub.
• Echo ($99.99): 2nd generation of the flagship device with enhanced audio. No
display; no integrated hub.
• Echo Spot ($129.99): compact Echo with a 2.5” screen. No integrated hub.
• Echo Plus ($149.99): Echo with built-in Zigbee hub for connecting smart home
devices. No display.
• Echo Look ($199.99): hands-free camera and “style assistant”. No display; no
integrated hub.
• Echo Show ($229.99): large screen Echo with a 7” display and dual 2.0” speakers
optimized for higher quality audio and video. No integrated hub.
Facebook
Strictly speaking, Facebook isn’t a player in the voice assistant space. It does have a
feature called “M” inside Messenger that acts like an assistant but the original version
was text driven and primarily powered by humans rather than AI. In April 2017 it
announced “Suggestions” from M, which relies entirely on AI to offer helpful actions
based on its interpretation of conversations inside Messenger.
Facebook has built up a world class AI team and there is every reason to suppose that
it will be able to launch a voice assistant in the near future matching or surpassing Siri,
Google Assistant and Alexa. The company hasn’t announced any smart speaker
products, but Bloomberg reports that Facebook is working on a video chat home device
as well as a smart speaker, possibly targeted for release at the F8 Developer
Conference in early 2018.
Samsung
Samsung is the number one smartphone manufacturer in the world, with a market share
of 22% as of Q3 2017. It has tried to leverage this market power by moving up the stack
to value added software and services in order to differentiate itself from the competition.
To date, these efforts have been unsuccessful, and there is little to suggest things will
be different with Samsung’s voice assistant efforts.
Samsung’s initial voice assistant, S Voice, launched in 2012 with technology licensed
from Nuance. In 2016 Samsung acquired Viv, a voice assistant start-up, to build the
successor to S Voice. The result is Bixby, which, after months of delays, finally rolled out
in July 2017 to Galaxy S8 and S8 Plus users in the U.S. An update, Bixby 2.0, was
introduced in October along with a private beta of a Bixby SDK. Bixby hasn’t been well
received and has been criticized for everything from its dedicated hardware on/off
button to inconsistent, unreliable performance. There is also the fact that Galaxy
smartphones already ship with Google Assistant. Do users really need another voice
assistant?
Baidu
Baidu, the Chinese internet search giant, is a serious voice assistant player in its home
market. It has deep AI expertise in house with a team of 1,300 researchers, and has
been adding to it with acquisitions like Raven Tech, Kitt.ai and XPerception. Baidu
unveiled a voice assistant named “Duer” at its developer conference in 2015. The voice
assistant, which is bundled with Baidu smartphone apps, was recently re-launched as a
full-fledged AI conversational development platform and re-branded as DuerOS.
Baidu claims the new DuerOS platform has more than 100 partners including device
manufacturers, chip makers, content providers and more. In November 2017 the
company announced a partnership with Chinese smartphone manufacturer, Xiaomi,
covering broad opportunities in AI and the Internet of Things. The partnership includes
close collaboration on DuerOS, and while neither company offered details, it is probable
that the Baidu voice assistant will be pre-installed or at least tightly integrated into the
Xiamoi smartphone. Also in November 2017 the company launched a smart speaker
and two robots using the Raven Tech technology. The smart speaker, dubbed “Raven
H”, is powered by the DuerOS.
Voice Assistant Market Adoption
The voice assistant market has two segments:
1. Embedded Segment: voice assistants can be embedded in smartphones, smart
watches, automobiles, home appliances and other devices. Of these, the
smartphone is clearly the most mature category and we will look to quantify
adoption in this category only. Adoption can be tracked in a number of ways
including voice assistant share of the installed base of smartphones, number of
voice assistant app downloads, and customer engagement and usage. We
provide some early usage/engagement metrics below.
2. Smart Speakers: devices such as the Amazon Echo and Google Home. This is a
consumer hardware category entirely built around the voice assistant and AI.
Adoption in this segment can be quantified in the usual way by tracking
shipments and installed base.
Embedded Segment: Smartphones
Our research shows that 72% of smartphone owners in the U.S. use a voice assistant to
supplement their primary search engine. There are significant differences in adoption
between iOS and Android users, perhaps reflecting the maturity, brand recognition and
tight integration of Siri in the iPhone: 84% of iOS consumers use a voice assistant to
supplement their primary search engine compared to 61% of Android users.
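As a quick consistency check on these figures (using the base sizes reported with the survey: iOS n=350, Android n=390), the overall rate should equal the sample-weighted average of the two platform rates:

```python
# Consistency check on the survey figures: the overall 72% should be
# the sample-size-weighted average of the iOS and Android rates.
ios_n, android_n = 350, 390          # base sizes from the survey
ios_rate, android_rate = 0.84, 0.61  # reported adoption rates

overall = (ios_rate * ios_n + android_rate * android_n) / (ios_n + android_n)
print(f"{overall:.1%}")  # prints 71.9%, which rounds to the reported 72%
```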
Figure 3: Respondents Using a Voice Assistant
Respondents overwhelmingly selected Google as their primary search engine. However,
a surprising 13% of iPhone users selected Siri, making it number two after Google.
Respondents using a voice assistant to supplement their primary smartphone search
engine: All users: Yes 72%, No 28%. iOS: Yes 84%, No 16%. Android: Yes 61%, No 39%.
Source: Fivesight Research Search Engine Survey, Q1 2017. Base: All users, n=740;
iOS, n=350; Android, n=390. Note: percentages rounded.
Figure 4: Primary Smartphone Search Engine
Similar results emerged from a UK study of smartphone users conducted in February
2017. Overall, 60% of respondents reported use of a voice assistant at varying levels of
frequency.
What is your primary search engine on your smartphone?
iOS: Google 78%, Siri 13%, Yahoo 4%, Browser 2%, Bing 1%, Other 3%.
Android: Google 90%, Siri 1%, Yahoo 1%, Browser 2%, Bing 2%, Other 4%.
Source: Fivesight Research Search Engine Survey, Q1 2017. Base: iOS, n=400;
Android, n=400. Note: percentages rounded.
Figure 5: Voice Assistant Usage
A report from eMarketer suggests a roughly similar usage profile for U.S. smartphone
owners. Their research predicts that 27.5% of U.S. smartphone owners will use a voice
assistant at least once a month in 2017.
Smart Speakers
Amazon began shipping Echoes in 2014, and two years later Google entered the
market. Neither company releases sales figures for smart speakers. VoiceLabs
estimates shipments have grown from 300 thousand units in 2014 to 6.5 million units in
2016. It predicts shipments will almost quadruple in 2017 to 24.5 million units. Assuming
all devices shipped since 2014 remain in service, this would imply an installed base of
33 million units in 2017.
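The installed-base arithmetic is easy to verify from the VoiceLabs shipment estimates:

```python
# VoiceLabs shipment estimates (millions of units); summing them
# reproduces the ~33M installed-base figure, assuming no devices
# are retired along the way.
shipments = {2014: 0.3, 2015: 1.7, 2016: 6.5, 2017: 24.5}
installed_base = sum(shipments.values())
print(installed_base)  # 33.0
```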
Figure 5 data, smartphone voice assistant usage: Regularly (at least once a week): 18%;
Occasionally (at least once a month): 19%; Only once or twice: 23%; Never, but would
consider: 28%; Never used and would not consider: 12%.
Source: Speakeasy Survey, February 2017. Base: UK smartphone users, n=1002.
Figure 6: Smart Speaker Shipments
Units shipped (millions): 2014: 0.3; 2015: 1.7; 2016: 6.5; 2017: 24.5 (forecast).
Source: VoiceLabs, http://voicelabs.co/2017/01/15/the-2017-voice-report/
A report from Edison Research illustrates the vast potential of this nascent market. Their
study indicates that only 7% of U.S. consumers own a smart speaker. While this may be
encouraging for Apple and others just entering the market, they will have to compete
against Amazon’s category-defining Echo product line. According to eMarketer Amazon
dominates the current market with a 71% share of smart speaker users. Google is a
distant second with a 24% share. Both will compete vigorously to consolidate their early
market leadership.