Echoes From Audrey
Voice Assistants: The Sound of Machine Intelligence
Fivesight Research LLC
PO Box 2341
Red Bank, NJ 07701
www.fivesightresearch.com
Fivesight White Paper: Echoes From Audrey December 2017
©2017 Fivesight Research LLC, www.fivesightresearch.com Copying Prohibited Page 1
Table of Contents

What is a Voice Assistant?
    Voice Assistant Solution Stack
Speech Recognition: A Brief History
    The Pre-Modern Era: 1952-1970
    ARPA Speech Understanding Research Program
    Hidden Markov Models
    Siri
    AI and Artificial Neural Networks
Voice Assistant Market Landscape
    Siri
        Background
        Siri Ecosystem
        New Features
        Smart Speaker: HomePod
    Google Assistant
        Background
        Google Assistant Ecosystem
        New Features
        Smart Speaker: Google Home Products
    Amazon Alexa
        Background
        Alexa Ecosystem
        New Features
        Smart Speaker: Echo Products
    The Others: Microsoft, Facebook, Samsung, Baidu
Voice Assistant Market Adoption
    Embedded Segment: Smartphones
    Smart Speakers
Table of Figures

Figure 1: Voice Assistant Market Landscape
Figure 2: Echo Product Line
Figure 3: Respondents Using a Voice Assistant
Figure 4: Primary Smartphone Search Engine
Figure 5: Voice Assistant Usage
Figure 6: Smart Speaker Shipments
About Fivesight Research
Fivesight Research is a boutique research firm specializing in search, voice assistants
and the emerging AI technologies breaking apart the classic “plain old search box”. We
track and analyze the landscape of search through dedicated, ongoing coverage of
Google and the markets in which it operates.
For questions or comments on this report or Fivesight’s broader research program
please contact Joe Buzzanga at j.buzzanga@fivesightresearch.com.
Introduction
In 1952 Bell Labs researchers created the first speech recognition system. They named
it Audrey, and it was capable of understanding 10 spoken digits (“0” to “9”), separated
by pauses and uttered by a designated speaker. Fast forward 59 years to 2011. That
was the year Apple launched Siri on the iPhone, boldly introducing a new type of
intelligent voice application to a mass market. Three years later Amazon shocked the
tech industry by preemptively unveiling Echo, the first voice activated home appliance.
Today, voice assistants have crossed the chasm into the mass market. They are
embedded in iOS and Android devices and power the smart speaker product category
pioneered by Amazon. They have crossed a technology chasm as well: the technology
finally works well enough to be useful. Voice assistants aren’t intelligent in any
meaningful sense, but the speech recognition works well and continues to improve.
Meanwhile, artificial intelligence (AI) research progresses at an accelerating rate with
innovations in algorithms, models and silicon promising to increase the voice assistant’s
cognitive endowment.
It has been a long journey but the voice assistant is just the beginning of an AI
revolution that promises—some might say threatens—to transform industries, markets,
and society as a whole. This white paper analyzes the current voice assistant market
and technology landscape while providing historical context around the evolution of the
core speech recognition technology.
What is a Voice Assistant?
Siri, Google Assistant, and Alexa are examples of products that fall under the somewhat
vaguely defined category of voice assistants. We propose a more formal definition
below, isolating the essential differentiating attributes of these emerging products:
• AI-powered: artificial intelligence technology forms the basis of a voice
assistant, allowing it to simulate aspects of human intelligence including voice
recognition and synthesis, language understanding, and world knowledge.
There is no prescription regarding the type of AI, but most modern systems are
hybrids incorporating machine learning, knowledge graphs and other
techniques.
• General purpose: a voice assistant aspires to provide a wide range of
services. It is general purpose and open-ended, designed to answer questions
and provide services across many domains and tasks. This distinguishes it
from a chatbot, which is typically tuned and tailored for one function (e.g.,
airline reservations or customer service).
• Software application: it is important to distinguish the voice assistant from any
specific hardware implementation or device. The voice assistant is a software
application that can be embodied in many different devices: smartphones,
wearables, smart speakers, automobiles, etc.
• Simulates intelligence: ideally, a voice assistant would pass an unrestricted
Turing test: a human interrogator could ask it anything at all and in each case
would not be able to tell whether the response came from human or machine.
This remains a distant goal. Today a voice assistant succeeds by simulating a
narrow set of capabilities that give the appearance of cognition. For example:
    o Conversational ability: voice recognition, speech synthesis and natural
      language processing (NLP).
    o Knowledge of facts about the world, broadly construed, including but not
      limited to people, places and things.
    o Recognition of usage patterns, allowing it to make informed "guesses"
      that anticipate the user's needs based on the apps they use, the places
      they visit, etc.
    o Personalization: tailoring answers and actions to the user.
    o Prediction: proactively suggesting answers and actions before the user
      asks.

A voice assistant is an AI-powered, general purpose software application that
simulates intelligence through conversational (vocal) interaction, factual
knowledge, predictive abilities and personalization.
Voice Assistant Solution Stack
How does the voice assistant perform its magic? It starts with data. The web is one
source, but not the only one. Personal data (contacts, calendar appointments, etc.),
location data, and other on-device sensor information are fundamental data sources as
well. This device-specific and personal profile data is combined with the web to create
the raw material for the AI technology, which can be cloud based, device based or a
hybrid of the two.
The AI layer exploits the data to build its "intelligence", with components such as world
knowledge (knowledge of facts), speech recognition and NLP. The core speech
recognition and voice control operate in two directions: speech to text (the assistant
understands spoken utterances) and text to speech (the assistant responds with a
human-like voice using speech synthesis).

APIs and SDKs extend the functionality of the voice assistant, making it a platform for
3rd party apps.

The HW/interface layer adapts the voice assistant to diverse device types.

Services and apps are the actual functions exposed to the user and may be developed
by the voice assistant vendor or ecosystem partners. For example: "play a song", "set a
calendar appointment", "turn on the lights", etc.
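The flow through the stack can be sketched end to end. The sketch below is schematic only: every function and name is an invented stand-in for the cloud or on-device AI services and ecosystem apps a real assistant would use, not any vendor's actual API.

```python
# Schematic voice-assistant pipeline: data -> AI layer -> services.
# Every component here is an illustrative stub, not a real vendor API.

def speech_to_text(audio: bytes) -> str:
    """AI layer, direction one: recognize the spoken utterance."""
    return "turn on the lights"                      # stub recognizer

def parse_intent(text: str) -> dict:
    """NLP: map the recognized text to a structured intent."""
    if "lights" in text:
        return {"intent": "smart_home.lights", "action": "on"}
    return {"intent": "unknown"}

def dispatch(intent: dict, services: dict) -> str:
    """Route the intent to a vendor or ecosystem-partner service."""
    handler = services.get(intent["intent"])
    return handler(intent) if handler else "Sorry, I can't help with that."

# A 3rd-party "app" registered against the platform's API/SDK layer.
services = {"smart_home.lights": lambda i: f"Okay, lights {i['action']}."}

reply = dispatch(parse_intent(speech_to_text(b"...")), services)
print(reply)  # Okay, lights on.
```

The point of the sketch is the separation of layers: the recognizer and NLP can improve independently of the services, and the services dictionary is where an ecosystem of 3rd party apps plugs in.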
Speech Recognition: A Brief History
The Pre-Modern Era: 1952—1970
Early researchers were faced with the formidable challenge of creating speech
recognition systems before the advent of sufficiently powerful and widely deployed
general purpose computing technology. The first documented speech recognizer,
excluding novelties and toys, was created by Bell Labs researchers in 1952. They called
it Audrey (Automatic Digit Recognizer) and it was capable of recognizing 10 spoken
digits (“0” to “9”), separated by pauses and uttered by a designated speaker. In other
words, it was speaker dependent, limited to a fixed, predefined vocabulary and able to
recognize only numbers spoken discretely with deliberate boundaries formed by distinct
pauses.
Bell Labs wasn’t the only industrial research institution experimenting with speech
recognition. The other U.S. R&D colossus, IBM, was also keenly interested and pushed
the state of the art forward in this early period and up to contemporary times. An early
example was "Shoebox", a machine that could do simple math calculations via voice
commands. The system, which was demonstrated on TV and at the 1962 Seattle
World's Fair, recognized ten digits and six control words.
Audrey, Shoebox and other early
systems were impressive achievements
for their time, but they couldn’t scale
technically or economically and never found commercial applications. Although there
were multiple limitations to all of these pioneering efforts, three stood out and wouldn’t
be solved for decades:
1. Limited vocabulary: restricted to a fixed, predefined vocabulary of at most ~100
   words
2. Word isolation: capable of recognizing only isolated words rather than
   continuous speech
3. Speaker dependence: limited to working with the specific speaker(s) on whom
   the system was trained
Speech recognition research flourished throughout the 1960s, but there were those who
were critical of its aims, methods and apparent lack of progress. The most prominent
was the eminent Bell Labs scientist J.R. Pierce, who issued a scathing critique in 1969
in a letter to the Journal of the Acoustical Society of America entitled "Whither Speech
Recognition?". Pierce questioned the whole enterprise of speech recognition, suggesting
that the core problems were unsolvable and that the research was unscientific and
irresponsibly conducted without serious thought as to its purpose and likelihood of
success. His letter provides a fitting, if disappointing, end to the 1960s.
ARPA Speech Understanding Research Program
In 1971 work on speech recognition was given a new impetus thanks to the Speech
Understanding Research (SUR) program instituted by the U.S. Advanced Research
Projects Agency (ARPA). The SUR project funded a $15 million, five year program to
develop a large vocabulary, continuous speech recognition system meeting a set of
detailed performance goals recommended by a study group chaired by leading AI
researcher and Carnegie Institute of Technology professor Allen Newell. As a side note,
a National Academy of Sciences committee headed by Pierce strongly opposed the
program.
The program ended in 1976 and four prototypes were demonstrated. Only one, the
Harpy system from Carnegie Mellon, managed to meet the original performance
objectives.[1] It was able to process connected speech from multiple speakers with 95%
accuracy based on a large (1,011-word) vocabulary.
Despite the success of Harpy, the SUR program was generally taken as a vindication of
Pierce’s critique and further evidence of the immaturity of the speech recognition field.
The government declined to extend the program.
Dr. Raj Reddy, a prominent speech recognition scientist and leader of the Carnegie
Mellon SUR team, offered this assessment of the field in a review article published
contemporaneously with the conclusion of the SUR program:
    We are still far from being able to handle relatively unrestricted dialogs from a
    large population of speakers in uncontrolled environments. Many more years of
    intensive research seems necessary to achieve such a goal.[2]
While his prediction was correct, today it is clear that the methods and insights created
by the SUR-funded systems were important advances that would prove to be
foundational for most contemporary speech recognition solutions.

[1] Harpy actually exceeded two of the target metrics, one dealing with error rates and
the other specifying a compute cycle budget.
[2] Reddy, Dabbala Rajagopal. "Speech Recognition by Machine: A Review."
Proceedings of the IEEE 64.4 (1976), p. 528.

Hidden Markov Models

Harpy was itself a partial synthesis of two other Carnegie Mellon speech recognition
systems: Hearsay-I and Dragon. The latter, designed by Dr. James Baker, pioneered
the use of Hidden Markov Models (HMM) in speech recognition. The key Dragon
innovation was to reframe the speech recognition task wholly as a mathematical,
probabilistic problem based on HMM:
    The most significant feature of the Dragon system, as compared to most other
    current speech recognition systems, is its almost total lack of speech-dependent
    heuristic knowledge. Dragon treats speech recognition as a mathematical
    computation problem rather than as an artificial intelligence problem.[3]
The HMM approach would eventually come to dominate modern speech recognition
systems until being supplanted by artificial neural network (ANN) AI technology.
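Dragon's probabilistic reframing can be illustrated with the Viterbi algorithm, the standard HMM decoding procedure for finding the most likely hidden state (sound unit) sequence given a series of acoustic observations. The two-state model and all probabilities below are invented purely for illustration; a real recognizer would use thousands of trained states and continuous acoustic features.

```python
# Toy Viterbi decoding over a 2-state HMM (states stand in for
# phoneme-like units). Probabilities are invented, not trained values.

states = ["s1", "s2"]
start  = {"s1": 0.6, "s2": 0.4}                    # P(first state)
trans  = {"s1": {"s1": 0.7, "s2": 0.3},            # P(next | current)
          "s2": {"s1": 0.4, "s2": 0.6}}
emit   = {"s1": {"lo": 0.8, "hi": 0.2},            # P(observation | state)
          "s2": {"lo": 0.1, "hi": 0.9}}

def viterbi(obs):
    """Return the most probable state path for the observation sequence."""
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]  # path scores
    back = []                                              # backpointers
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            col[s] = V[-1][prev] * trans[prev][s] * emit[s][o]
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):                             # trace back
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["lo", "lo", "hi"]))  # ['s1', 's1', 's2']
```

The computation is pure probability arithmetic with no speech-specific heuristics, which is exactly the shift Dragon introduced.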
Dr. Baker and his wife founded Dragon Systems in 1982 to commercialize the Dragon
technology. In that same year the futurist, inventor and serial entrepreneur Ray
Kurzweil started Kurzweil Applied Intelligence with the goal of creating a voice-activated
word processor. Meanwhile, IBM continued its research and unveiled the experimental
PC-based Tangora speech recognizer.
These efforts led to the first wave of consumer, mass market speech recognition
products for the PC market in the early 1990s. These products included DragonDictate
(1990), IBM Personal Dictation System (1993) and Kurzweil Voice (1994). By the end of
the decade large vocabulary, continuous speech recognition programs were available
from these and other vendors.
There were also efforts to develop business applications of speech recognition, most
prominently in the telecommunications industry. AT&T introduced several systems for
customer care and operator services.
Siri
Siri brought speech recognition to the mass market when Apple included it on the
iPhone. The story of Siri has its roots in a 2003 DARPA project named PAL (Personal
Assistant that Learns). PAL was funded with a five-year, $150 million budget and was
intended to develop technologies to help computers interact with humans in a more
powerful and natural way using reasoning, learning and other cognitive abilities. PAL
ultimately led to the creation of Siri and in this regard played a central role in
commercializing speech recognition technology.

[3] Lowerre, Bruce T. The HARPY Speech Recognition System. Carnegie Mellon
University, Department of Computer Science, 1976, p. 16.
SRI International, an independent, non-profit R&D organization, led a team under the
PAL program and organized its efforts in a project it called "CALO", for Cognitive
Assistant that Learns and Organizes. CALO spawned a number of commercial
applications, most famously Siri, which was spun out as an independent company in
2007. Apple acquired the company in 2010 and in 2011 introduced Siri as an integral
part of iOS on the iPhone 4s.
The original Siri speech engine was licensed from Nuance, a company specializing in
speech technology and formed in 2005 as a result of multiple corporate acquisitions
including the original Dragon Systems.
AI and Artificial Neural Networks
Today, roughly 40 years after Dr. Reddy frankly stated the limitations of speech
recognition technology, those limitations have been largely overcome. Modern voice
recognition systems can competently handle unrestricted dialogs from a large
population of speakers in uncontrolled environments. This competence stems from the
surprising success of an AI technology that sparked breakthroughs in a range of difficult
computer science applications such as handwriting recognition and image classification.
In the early 2000s researchers began to apply artificial neural networks, a type of AI
technology, to speech recognition. In a seminal 2009 paper, Geoffrey Hinton et al.
showed that a deep neural network, a multi-layered form of ANN, outperformed
conventional statistical techniques on a standard speech recognition benchmark test
set. Further work improved results even more, and by 2012 Google had publicly stated
that it had converted to neural network technology for its voice search on Android.
Recent research shows deep neural network speech recognition approaching human
level performance, although there is some debate in the scientific community about how
to best measure performance on speech recognition tasks.[4] What is not in doubt is
that neural network technology has strikingly improved speech recognition and is
catalyzing the adoption of entirely new types of voice-controlled devices and services.
In the case of voice assistants, all of the major vendors (Google, Amazon, Apple and
Microsoft) have adopted the technology to drive speech recognition.

[4] In an August 2017 blog entry, Microsoft unequivocally claims to have matched the
currently accepted human word error rate of 5.1% in its speech recognition system.
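The word error rate (WER) behind the 5.1% figure cited above is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognizer's transcript and a human reference transcript, divided by the length of the reference. A minimal sketch, with made-up example transcripts:

```python
# Word error rate (WER): (substitutions + deletions + insertions)
# divided by reference length, via word-level edit distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Two errors ("the"->"a", "lights"->"light") over a 5-word reference.
print(wer("turn on the kitchen lights", "turn on a kitchen light"))  # 0.4
```

Part of the measurement debate noted above is that a single number like this hides differences in test sets, vocabulary and acoustic conditions, which is why cross-vendor WER claims are hard to compare.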
Voice Assistant Market Landscape
Figure 1 summarizes the product offerings of the major voice assistant vendors. "Vision"
refers to the addition of computer vision features inside the voice assistant, a trend
emerging in 2017.

Figure 1: Voice Assistant Market Landscape

Apple Siri
  Initial release: October 2011
  Devices: iPhone, iPad, Apple Watch, Apple TV, CarPlay, macOS, HomePod, Smart Home
  Vision: No
  3rd party support: SiriKit
  Notes: Siri was an acquisition. Apple's HomePod smart speaker launch delayed to 2018.

Google Assistant
  Initial release: May 2016
  Devices: Android, Android Wear, Android TV, Android Auto, Google Home, Pixelbook, Smart Home, iPhone
  Vision: Yes (Lens)
  3rd party support: Actions on Google; Google Assistant SDK; 468 Actions
  Notes: Google Lens integration in Google Assistant rolling out to Pixel phones starting in late November.

Microsoft Cortana
  Initial release: April 2014
  Devices: iOS, Android, Windows, Xbox, Harman Kardon smart speaker
  Vision: No
  3rd party support: Skills Kit; 174 Skills
  Notes: Harman Kardon Invoke smart speaker launched Oct. 22, 2017.

Amazon Alexa
  Initial release: Nov. 2014
  Devices: Echo devices; selected smartphones from Motorola, Huawei and HTC; TV; Auto; Smart Home; Cloud Cam
  Vision: Yes
  3rd party support: Skills Kit; Alexa Voice Service; 17,650 Skills
  Notes: Alexa inside the Amazon App on iOS and Android. Vision/image recognition is part of the Amazon App.

Facebook M
  Initial release: Aug. 2015 (GA April 2017)
  Devices: Inside Messenger on Android and iOS
  Vision: No
  3rd party support: No
  Notes: Not voice activated. Contextual suggestions.

Samsung Bixby
  Initial release: July 2017
  Devices: Samsung Galaxy S8, wearables, TV, Auto, Smart Home
  Vision: Yes (Lens)
  3rd party support: Bixby SDK (private beta)
  Notes: S Voice, released in May 2012, was a precursor. Bixby built on technology from the Viv acquisition.

Baidu DuerOS
  Initial release: Sept. 2015
  Devices: Android and iOS smartphone apps, Raven H smart speaker
  Vision: No
  3rd party support: DuerOS Open Platform
  Notes: Smart speaker based on the Raven Tech acquisition.

Sources: Company information, VoiceBot https://www.voicebot.ai/

Siri

Background

Apple was first to market thanks to its acquisition of Siri in 2010. It hasn't capitalized on
this early mover advantage, and today Siri isn't demonstrably smarter or more capable
than competing voice assistants. But Siri enjoys strong, though not universally positive,
brand recognition thanks to Apple's marketing and its integration with iOS. It has the
benefit of a massive, captive installed base on iOS: at Apple's most recent developer
conference the company claimed that Siri had 375 million active monthly devices. The
number, though, is virtually meaningless since "active monthly device" wasn't defined,
nor was the time period for the metric.
We calculate an alternative view of Siri’s market presence by estimating the iPhone
installed base. Using industry statistics, we estimate that in 2016 there were roughly
600 million iOS devices globally, based on a 15% iPhone share of a global installed
base of 4 billion smartphones. All of these have ready access to Siri.
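The installed-base estimate above is a simple share-of-base calculation, using only the figures stated in the text:

```python
# Back-of-envelope iOS installed base, using the estimates in the text.
global_smartphones = 4_000_000_000   # global smartphone installed base, 2016
iphone_share = 0.15                  # estimated iPhone share of that base

ios_devices = global_smartphones * iphone_share
print(f"{ios_devices / 1e6:.0f} million iOS devices")  # 600 million iOS devices
```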
Siri Ecosystem
In 2016 Apple opened up Siri to 3rd party developers with the launch of SiriKit, its
framework for adding Siri functionality to iOS apps. SiriKit supports the iPhone, Apple
Watch and, with iOS 11.2, Apple's forthcoming smart speaker, HomePod. However, the
company restricts developers to a set of Apple-defined use cases and domains, such as
ride hailing, payments and a few others. In addition, Apple only permits app developers
to control specific actions within an existing iOS app, and the app integration occurs on
top of Siri in Apple-controlled devices and environments. In other words, Apple doesn't
permit developers to integrate Siri into their own hardware.
Apple’s stance with Siri is consistent with its DNA. Its strength is controlling the user
experience through tight integration of hardware, software and services. It wants
independent software vendors to enhance Siri, but only within the confines of Apple’s
business objectives and a user experience it controls. The result is that the ecosystem
around Siri is sparse at the moment, especially in comparison with Amazon and Google,
neither of whom place Apple-like restrictions on their ecosystem partners.
New Features
Although Siri has an immense captive installed base, given its perceived flaws it isn't a
foregone conclusion that Apple can exploit that base as the voice assistant evolves into
what may be the next major computing platform: essentially the vehicle that finally
ushers in AI on a mass market scale. Competition in the voice assistant space is
intense, and Apple has been moving aggressively to improve Siri with the latest
machine learning technologies via acquisitions and hiring. Some of this technology is
evident in new features introduced for Siri in iOS 11:
• Speech synthesis for Siri's voice using deep neural networks: Siri's voice is more
  natural and expressive.
• On-device learning for predictive suggestions: Siri tracks user actions to provide
  more personal and contextual suggestions, securely, privately and synced
  across devices.
• Beta of instant translation for five language pairs.
• Increased developer support in SiriKit.
Siri will also get a boost from custom silicon in the form of the iPhone X's A11 Bionic
neural engine. Although the processor initially functions to optimize machine learning
algorithms for face recognition in the $1,000 iPhone X, Apple will undoubtedly employ it
to accelerate the AI technology powering Siri in the future.
Smart Speaker: HomePod
Siri currently has no presence in the home, but that will change with the launch of Apple
HomePod. Originally slated for a December launch, HomePod was delayed: the
company issued a statement in November saying it needed more time, and the
timeframe is now early 2018.
Apple has chosen to position HomePod as a premium priced audio speaker rather than
an AI-first, intelligent home device. The value proposition tellingly centers on superior
sound quality and integration with Apple Music rather than the power and advanced
intelligence of Siri.
Apple’s smart speaker comes more than two years after Amazon created the category
with Echo and one year after Google launched Google Home. Being late to market is
nothing new for Apple and the company has historically relied on delivering a superior
user experience to win market share, even if competitors were there first. Whether that
will be sufficient in this case is open to question, especially considering the rapid pace
of product innovation by Amazon and Google. Both companies have already iterated on
the initial smart speaker, building out voice assistant powered product families for the
home in different form factors and at different price points.
Google Assistant
Background
Google has been offering voice search and voice assistant-like products (i.e., Google
Now, Allo, Now on Tap) since 2011. In May 2016 it finally put all its wood behind one
arrow with Google Assistant, which launched as part of Google Home, the
company's answer to the Amazon Echo. In October 2016 it became available on
Google's Pixel smartphones, and today the Assistant is the foundation for the company's
expanding line of "made by Google" branded consumer hardware. In 2017 the company
expanded the reach of Google Assistant to the broader ecosystem of Android device
makers as well as iOS smartphones.
Google regards voice assistants as the next major computing platform shift, powering
everything from smart speakers to intelligent vehicles. We can expect to see it double
down on investments in product development around Google Assistant to exploit its
industry leading machine learning technology and to compete against Alexa, Siri, and
other contenders.
Google Assistant Ecosystem
In 2017 Google successfully increased the Assistant footprint beyond Google branded
products to a wide variety of products from 3rd party manufacturers including:
• Android smartphones running Android 6.0 Marshmallow or Android 7.0 Nougat
• Android TV: Google Assistant embedded in NVIDIA Shield, along with all Android
TVs in the U.S. running Android 6.0 Marshmallow or Android 7.0 Nougat
• Android Wear 2.0 Watches: LG Watch Style and LG Watch Sport—both
designed in collaboration with Google.
• iPhone: on iOS 9.1+
• Bose QC 35 II headphones: via Bluetooth pairing with an Android or iOS smartphone
• Speakers: including products from JBL, Panasonic, Sony and more
Overall, the company says that the Assistant is available on 100 million Android
devices. That number may seem large, but it is only a small fraction of the Android
installed base and highlights the challenge Google faces with its semi-open, horizontal
business model.
With the exception of Google’s own branded hardware, the Assistant’s market
penetration is limited by the fragmentation of the Android ecosystem: Android
Marshmallow 6.0, Nougat 7.0 and Oreo (the latest version of Android) represent 50% of
Android phones (based on Playstore visits) according to one estimate. Although this
portion of the Android ecosystem is capable of running Google Assistant, not all of the
OEMs are aligned with Google’s business objectives. The biggest Android OEM,
Samsung, is frankly unaligned and is pushing its own competing voice assistant, Bixby,
on its newest smartphones.
This Android ecosystem fragmentation explains why Google has gotten serious about
hardware and launched its own line of branded “made by Google” devices. Google is
determined to intercept the AI platform shift in computing, and it cannot rely solely on an
ecosystem of ambivalent partners. It will build its own devices, integrating AI, hardware
and software. If there was any doubt about its seriousness, the acquisition of 2,000 HTC
hardware engineers should put it to rest.
But 3rd party developers are still key for Google Assistant to scale into a real platform as
computing shifts to an AI first technology foundation. Developers can build on Google
Assistant in two ways:
• Actions on Google—lets developers build apps for the Google Assistant
• Google Assistant SDK—lets developers integrate Google Assistant into their own
devices including functionality such as hotword detection, voice control, and
natural language understanding.
So far, industry reports suggest Google is a distant second to Amazon when it comes to
developer support. According to VoiceBot, there were 468 Actions for Google Assistant
compared to 17,650 Skills for Alexa as of July 2017.
New Features
In 2017 Google announced significant new features for Google Assistant end users and
developers including:
Google Assistant End User Features
• Google Lens: point the smartphone camera at an object to trigger relevant
actions and information. Integrated in the Google Assistant on Pixel phones.
• Improved speech synthesis for Google Assistant’s voice using DeepMind’s
WaveNet technology.
• Proactive notifications on Google Home.
• Routines: expands preset routines giving users more options and control to
trigger a series of actions using a single command.
• Multi-user support aka Voice Match: Google Assistant on Google Home can
recognize up to six individuals based on their voice and tailor responses
accordingly.
• Keyboard input: Google Assistant supports typed queries.
• Bluetooth audio streaming.
• Hands-free voice calling on Google Home to any landline or wireless number in
the U.S. and Canada.
• Voice broadcast: users can broadcast a message to all Google Assistant enabled
speakers.
• Visual Google Assistant responses from TVs with Chromecast.
• Shop with Walmart: partnership with Walmart for personalized shopping using
Google Assistant.
• Shopping on Google Assistant: Google Assistant users can shop from
participating Google Express retailers.
• Schedule Calendar appointments and create reminders.
• Expanded music and entertainment support.
• International expansion: Google Home in the U.K., Canada (English and French),
Australia, Germany, France and Japan. The Assistant on eligible Android phones
and iPhones is also available in Brazilian Portuguese, Japanese, and Korean, with
Italian, Spanish (in Mexico and Spain), and Singaporean English promised in the
near future.
Google Assistant Developer Tools
• Actions on Google launched on Android and iOS smartphones.
• Google Assistant SDK: preview launched in April; updated in May.
• Multilingual support: apps can be created in German, French, Japanese, Korean,
Spanish, Italian, and Portuguese, as well as in French and English for Canada.
• Apps launched in the UK and Australia.
• Pixelbook support: Apps will run on Pixelbooks.
• AIY Voice Kit: do-it-yourself kit for developers to build a standalone voice
recognition system using Google Assistant.
• Revamped App directory to improve discovery.
• App device handoff: start an interaction on a smart speaker and handoff to the
phone.
• Personalization: collect user preferences to personalize app interactions.
• Features promoting app re-engagement including updates and push notifications.
• Family friendly apps: developers can have apps certified as “family friendly”.
• Templates for app creation: coding not required.
• Transactions: support for transactional apps on Google Assistant on phones.
Smart Speaker—Google Home Products
Google, like Amazon, is developing a family of smart home speakers in different form
factors and at different price points. Google Home was the first, launching in November
2016. Nearly one year later, in October 2017, it added two others: Google Home Mini
and Google Home Max. The Mini ($49) is an entry level device and a more or less
complete imitation of Amazon’s Dot, which was introduced more than one year earlier.
Google Home Max ($399) is a preemptive strike against Apple’s upcoming HomePod.
Like the HomePod, the Max is a premium priced smart speaker with a value proposition
built around a superior audio experience. Google Home ($129) was the company’s
original entry in the smart speaker category. Although no product has been announced,
Google is rumored to be working on a Home device with a screen to compete with
Amazon’s Echo Show.
Google claims that Google Home works with more than 1,000 smart home devices from
more than 150 popular brands. Many of these devices can be controlled by Alexa as
well, so 3rd party device support isn’t a strong competitive differentiator.
Amazon Alexa
Background
In November 2014 Amazon launched the Echo with Alexa as its integrated voice
assistant. The device was the first of its kind and established the smart speaker product
category. Few in the tech world would have expected Amazon to succeed with a mass
market consumer device, especially after its debacle with the Fire smartphone. Yet
three years after its introduction, the Echo dominates one of the most hotly contested
and fast growing consumer electronic product segments. Meanwhile, Amazon has
continued to build on its first to market advantage with a relentless pace of product
innovation and aggressive pricing.
Today the company has the most extensive smart speaker line up in the industry with a
family of devices at different price points, in different form factors, and tailored for
different use cases. That is just as well, since Alexa isn’t widely distributed on
smartphones. Notable exceptions include the Huawei Mate 9 and the Motorola Moto X4,
both of which offer Alexa integration. Another Android vendor, HTC, introduced an
Alexa-enabled smartphone, the U11, in 2017, but Alexa isn’t pre-installed; the user has
to download it from the Google Play Store. Alexa is accessible inside the Amazon
shopping app as well. Otherwise, Amazon offers both iOS and Android Alexa apps, but
they don’t act as full-fledged voice assistants. They mainly control Echo devices, which
are presumed to be present.
Alexa Ecosystem
Independent developers can participate in the Alexa ecosystem in a number of ways:
• Alexa Skills Kit: The Alexa Skills Kit (ASK) is a collection of self-service APIs,
tools, documentation, and code samples for developers. Using ASK, developers
can add capabilities to Alexa by creating “skills”. Amazon claims that more than
27,000 skills have been created to date.
• AVS (Alexa Voice Service): Device manufacturers can add intelligent voice
control to any connected product that has a microphone and speaker. In 2017
Amazon launched an AVS SDK to simplify and accelerate the creation of voice
enabled ecosystem products.
• Alexa Smart Home: Smart home device manufacturers can add Alexa voice
control to their products using the Smart Home Skills API.
• Alexa Gadgets: Announced in 2017, Gadgets are a new category of connected
products and developer tools that turn a compatible Echo device into a hub for
interactive play. Developers can build skills for Gadgets using the Alexa Gadgets
Skills API. They also have the option of creating their own Gadgets using the
Alexa Gadgets SDK.
The developer community around Alexa is by far the largest of any voice assistant,
reflecting Amazon’s first mover advantage combined with the company’s rich set of
development tools and offerings.
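To give a flavor of what building a custom skill involves, here is a minimal sketch of the JSON envelope a custom Alexa skill returns when invoked. This reflects the general shape of the response format as we understand it (version, output speech, end-session flag); a production skill would typically use the ASK SDK rather than assemble the dict by hand:

```python
import json

def build_alexa_response(text: str, end_session: bool = True) -> dict:
    # Minimal response envelope in the general shape custom skills
    # return; a real skill built with the ASK SDK would not construct
    # this dict manually.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

print(json.dumps(build_alexa_response("Hello from a skill."), indent=2))
```

The same request/response pattern underlies all of the skill types above; the higher-level APIs differ mainly in what metadata and directives travel alongside the speech.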
New Features
Amazon introduced new Alexa end user and developer features in 2017 as well as new
Echo devices.
Alexa End User Features
• Alexa Routines: customers can trigger a series of actions using a single voice command of their choice. Google announced a similar feature in October.
• Alexa Groups: allows customers to combine discrete smart home appliances in a group.
• Device View/Control: customers can see and control smart home appliances in
the Alexa app.
• Custom Alexa Lists: allows users to create and organize lists of their own choice.
• Alexa Calling and Messaging: Customers can place and receive calls using Echo
devices or the Alexa app.
• Alexa/Cortana Interworking: Alexa and Cortana can “talk” to each other giving
customers access to the respective strengths of each voice assistant.
• Alexa in Amazon Music App: embeds Alexa inside Amazon music app on iOS
and Android.
Smart Home Developer Tools
• Alexa Smart Home Skill API: updated API for creating Alexa smart home enabled appliances such as thermostats, lights, cameras, etc.
o Lock Control and Query: allows developers to voice control devices with
locks.
o Thermostat Query: simplifies development of temperature controlled
devices.
o Tunable Lighting Control: lets developers voice control color changing
lights or tunable white lights.
o Smart Home Camera Support: allows developers to show live streams
from smart home cameras on Echo Show.
o Entertainment Controls: allows Alexa to control various cloud connected
devices such as TVs, AV receivers.
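Conceptually, a smart home skill receives directives namespaced by capability and dispatches on that namespace. The sketch below is illustrative only: the namespaces mirror the capabilities listed above (locks, thermostats, cameras), but the handler bodies are placeholders rather than real Smart Home Skill API calls:

```python
# Toy dispatcher illustrating namespace-based routing of smart home
# directives. The namespace strings follow the Alexa.<Capability>Controller
# pattern; treat the specific names and handler text as assumptions.

def route_directive(directive: dict) -> str:
    ns = directive.get("header", {}).get("namespace", "")
    routes = {
        "Alexa.LockController": "lock/unlock the device",
        "Alexa.ThermostatController": "set or report temperature",
        "Alexa.CameraStreamController": "return a live camera stream",
    }
    # Unknown namespaces would produce an error response in a real skill.
    return routes.get(ns, "report an unsupported-directive error")

print(route_directive({"header": {"namespace": "Alexa.LockController"}}))
```

The appeal of this design for device makers is that adding a new capability means handling one more namespace, not redesigning the skill.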
Alexa Voice Service (AVS) Developer Tools
• AVS Device SDK: accelerates and simplifies development of Alexa voice enabled products.
• AVS Notifications: allows Alexa to proactively deliver content.
Display and Video Developer Tools
• Alexa Skills Kit Updates: lets developers update skills to take advantage of the display and video interfaces on the Echo Show.
• Display Cards: adds visuals to an Alexa enabled product complementing Alexa
voice responses for music, weather, calendar and more.
• Video Skill API: supports development of Alexa skills for video devices, content
and services.
Other Developer Tools
• Cloud-Based Wake Word Verification: improves accuracy of wake word detection by adding cloud verification to initial device wake word detection.
• Alexa Gadgets: developer tools for creating gaming and entertainment experiences.
• List Events: notifies a skill when a user makes a change to an Alexa list.
• Alexa 7-Mic Far-Field Development Kit: hardware-based reference design
enabling device manufacturers to build voice enabled products using Amazon’s
far-field mic and voice processing technology.
Smart Speaker—Echo Products
Amazon is rapidly versioning Alexa-powered devices, building out a family of products.
In September, the company announced what it called the “next generation” of Echo
devices:
• Echo (2nd generation): A lower-priced refresh of the original Echo with enhanced
speakers and Dolby sound.
• Echo Plus: includes a built-in Zigbee hub for home automation.
• Echo Spot: compact Echo with display; shipping Dec. 19.
• Echo Buttons: first product in a new category of connected products called “Alexa
Gadgets”. These are dedicated hardware products for playing games through an
Echo device.
Products announced earlier in 2017 include:
• Echo Look, announced April 26, includes a hands-free, depth sensing camera
and “style assistant”.
• Echo Show, featuring a 7 inch screen, was announced in May and began
shipping June 28.
• Dash Wand with Alexa is a voice- or barcode-activated device for ordering
Amazon products, answering simple queries, and controlling home appliances. It
was launched in June.
These devices join the original Echo, Amazon Dot, and Amazon Tap family of Alexa-
powered devices. Not all of these will be successful, but the cadence of Amazon’s
product development and release cycle is so far unmatched by Google (and Apple, for
that matter).
The Echo smart speaker line up is summarized below.
Figure 2: Echo Product Line
Note: Amazon Tap ($129.99) is also Alexa controlled but isn’t part of the Echo brand. Echo devices are
positioned as smart home devices. Tap, though it has Alexa smarts, is a portable Bluetooth speaker and
so is kept separate from the Echo line.
The Others—Microsoft, Facebook, Samsung, Baidu
Microsoft, Samsung, and Facebook are second tier players at the moment, while Baidu
isn’t yet a major player outside of its home market in China.
Microsoft
Microsoft is handicapped by the lack of a captive mobile platform for Cortana. On the
other hand, it has deeply integrated Cortana into Windows 10, making it an integral part
of the desktop environment. The open question is whether the desktop computing user
experience is really enhanced by a voice assistant. Microsoft believes it is, and during
its Q4 2016 earnings call it reported that the Cortana search box had over 100 million
active monthly users with 8 billion questions asked to date. Another data point was
offered at the company’s 2017 Build conference: it reported that Cortana had more than
141 million unique users on a base of 500 million Windows 10 devices.
In the smart speaker segment, Microsoft has elected to work with partners rather than
build its own branded device. Its first product, the Harman Kardon Invoke ($199.95),
launched in October. It’s a me-too product, with an attempt at differentiation based on
Harman Kardon’s audio pedigree and Cortana’s integration with the Microsoft
ecosystem. Microsoft will need to quickly find a way to expand on this initial product,
either with Harman Kardon or with other partners, if it is to be successful against the
more evolved smart speaker offerings from Amazon and Google.
Echo Product Line
• Echo Connect ($34.99): accessory that converts an Echo into a voice controlled
speakerphone. No display; no integrated hub.
• Echo Dot ($49.99): entry level smart home device. No display; no integrated hub.
• Echo ($99.99): 2nd generation of the flagship device with enhanced audio. No
display; no integrated hub.
• Echo Spot ($129.99): compact Echo with a 2.5” screen. No integrated hub.
• Echo Plus ($149.99): Echo with built-in Zigbee hub for connecting smart home
devices. No display.
• Echo Look ($199.99): hands-free camera and “style assistant”. No display; no
integrated hub.
• Echo Show ($229.99): large screen Echo with a 7” display and dual 2.0” speakers
optimized for higher quality audio and video. No integrated hub.
Facebook
Strictly speaking, Facebook isn’t a player in the voice assistant space. It does have a
feature called “M” inside Messenger that acts like an assistant but the original version
was text driven and primarily powered by humans rather than AI. In April 2017 it
announced “Suggestions” from M, which relies entirely on AI to offer helpful actions
based on its interpretation of conversations inside Messenger.
Facebook has built up a world class AI team and there is every reason to suppose that
it will be able to launch a voice assistant in the near future matching or surpassing Siri,
Google Assistant and Alexa. The company hasn’t announced any smart speaker
products, but Bloomberg reports that Facebook is working on a video chat home device
as well as a smart speaker, possibly targeted for release at the F8 Developer
Conference in early 2018.
Samsung
Samsung is the number one smartphone manufacturer in the world, with a market share
of 22% as of Q3 2017. It has tried to leverage this market power by moving up the stack
to value added software and services in order to differentiate itself from the competition.
To date, these efforts have been unsuccessful, and there is little to suggest things will
be different with Samsung’s voice assistant efforts.
Samsung’s initial voice assistant, S Voice, launched in 2012 with technology licensed
from Nuance. In 2016 Samsung acquired Viv, a voice assistant start-up, to build the
successor to S Voice. The result is Bixby, which, after months of delays, finally rolled out
in July 2017 to Galaxy S8 and S8 Plus users in the U.S. An update, Bixby 2.0, was
introduced in October along with a private beta of a Bixby SDK. Bixby hasn’t been well
received and has been criticized for everything from its dedicated hardware on/off
button to inconsistent, unreliable performance. There is also the fact that Galaxy
smartphones already ship with Google Assistant. Do users really need another voice
assistant?
Baidu
Baidu, the Chinese internet search giant, is a serious voice assistant player in its home
market. It has deep AI expertise in house with a team of 1,300 researchers, and has
been adding to it with acquisitions like Raven Tech, Kitt.ai and XPerception. Baidu
unveiled a voice assistant named “Duer” at its developer conference in 2015. The voice
assistant, which is bundled with Baidu smartphone apps, was recently re-launched as a
full-fledged AI conversational development platform and re-branded as DuerOS.
Baidu claims the new DuerOS platform has more than 100 partners including device
manufacturers, chip makers, content providers and more. In November 2017 the
company announced a partnership with Chinese smartphone manufacturer, Xiaomi,
covering broad opportunities in AI and the Internet of Things. The partnership includes
close collaboration on DuerOS, and while neither company offered details, it is probable
that the Baidu voice assistant will be pre-installed or at least tightly integrated into the
Xiamoi smartphone. Also in November 2017 the company launched a smart speaker
and two robots using the Raven Tech technology. The smart speaker, dubbed “Raven
H”, is powered by the DuerOS.
Voice Assistant Market Adoption
The voice assistant market has two segments:
1. Embedded Segment: voice assistants can be embedded in smartphones, smart
watches, automobiles, home appliances and other devices. Of these, the
smartphone is clearly the most mature category and we will look to quantify
adoption in this category only. Adoption can be tracked in a number of ways
including voice assistant share of the installed base of smartphones, number of
voice assistant app downloads, and customer engagement and usage. We
provide some early usage/engagement metrics below.
2. Smart Speakers: devices such as the Amazon Echo and Google Home. This is a
consumer hardware category entirely built around the voice assistant and AI.
Adoption in this segment can be quantified in the usual way by tracking
shipments and installed base.
Embedded Segment: Smartphones
Our research shows that 72% of smartphone owners in the U.S. use a voice assistant to
supplement their primary search engine. There are significant differences in adoption
between iOS and Android users, perhaps reflecting the maturity, brand recognition and
tight integration of Siri in the iPhone: 84% of iOS consumers use a voice assistant to
supplement their primary search engine compared to 61% of Android users.
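As a quick consistency check on these figures (using the base sizes reported with the survey: iOS n=350, Android n=390), the overall rate should equal the sample-weighted average of the two platform rates:

```python
# Consistency check on the survey figures: the overall 72% should be
# the sample-size-weighted average of the iOS and Android rates.
ios_n, android_n = 350, 390          # base sizes from the survey
ios_rate, android_rate = 0.84, 0.61  # reported adoption rates

overall = (ios_rate * ios_n + android_rate * android_n) / (ios_n + android_n)
print(f"{overall:.1%}")  # prints 71.9%, which rounds to the reported 72%
```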
Figure 3: Respondents Using a Voice Assistant
Respondents overwhelmingly selected Google as their primary search engine. However,
a surprising 13% of iPhone users selected Siri, making it number two after Google.
Respondents using a voice assistant to supplement their primary smartphone search
engine: All users: Yes 72%, No 28%. iOS: Yes 84%, No 16%. Android: Yes 61%, No 39%.
Source: Fivesight Research Search Engine Survey, Q1 2017. Base: All users, n=740;
iOS, n=350; Android, n=390. Note: percentages rounded.
Figure 4: Primary Smartphone Search Engine
Similar results emerged from a UK study of smartphone users conducted in February
2017. Overall, 60% of respondents reported use of a voice assistant at varying levels of
frequency.
What is your primary search engine on your smartphone?
iOS: Google 78%, Siri 13%, Yahoo 4%, Browser 2%, Bing 1%, Other 3%.
Android: Google 90%, Siri 1%, Yahoo 1%, Browser 2%, Bing 2%, Other 4%.
Source: Fivesight Research Search Engine Survey, Q1 2017. Base: iOS, n=400;
Android, n=400. Note: percentages rounded.
Figure 5: Voice Assistant Usage
A report from eMarketer suggests a roughly similar usage profile for U.S. smartphone
owners. Their research predicts that 27.5% of U.S. smartphone owners will use a voice
assistant at least once a month in 2017.
Smart Speakers
Amazon began shipping Echoes in 2014, and two years later Google entered the
market. Neither company releases sales figures for smart speakers. VoiceLabs
estimates shipments have grown from 300 thousand units in 2014 to 6.5 million units in
2016. It predicts shipments will almost quadruple in 2017 to 24.5 million units. Assuming
all devices shipped since 2014 remain in service, this would imply an installed base of
33 million units in 2017.
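The installed-base arithmetic is easy to verify from the VoiceLabs shipment estimates:

```python
# VoiceLabs shipment estimates (millions of units); summing them
# reproduces the ~33M installed-base figure, assuming no devices
# are retired along the way.
shipments = {2014: 0.3, 2015: 1.7, 2016: 6.5, 2017: 24.5}
installed_base = sum(shipments.values())
print(installed_base)  # 33.0
```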
Figure 5 data, smartphone voice assistant usage: Regularly (at least once a week): 18%;
Occasionally (at least once a month): 19%; Only once or twice: 23%; Never, but would
consider: 28%; Never used and would not consider: 12%.
Source: Speakeasy Survey, February 2017. Base: UK smartphone users, n=1002.
Figure 6: Smart Speaker Shipments
Units shipped (millions): 2014: 0.3; 2015: 1.7; 2016: 6.5; 2017: 24.5 (forecast).
Source: VoiceLabs, http://voicelabs.co/2017/01/15/the-2017-voice-report/
A report from Edison Research illustrates the vast potential of this nascent market. Their
study indicates that only 7% of U.S. consumers own a smart speaker. While this may be
encouraging for Apple and others just entering the market, they will have to compete
against Amazon’s category-defining Echo product line. According to eMarketer Amazon
dominates the current market with a 71% share of smart speaker users. Google is a
distant second with a 24% share. Both will compete vigorously to consolidate their early
market leadership.