Survey on Deep Neural Networks in Speech and Vision Systems

M. Alam, M. D. Samad1, L. Vidyaratne, A. Glandon and K. M. Iftekharuddin*
Vision Lab, Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23529 (email: malam001, lvidy001, aglan001, [email protected]; *corresponding author)
1 Department of Computer Science, Tennessee State University, Nashville, TN 37209 (email: [email protected])



Abstract—This survey presents a review of state-of-the-art deep neural network architectures, algorithms, and systems in vision and speech applications. Recent advances in deep artificial neural network algorithms and architectures have spurred rapid innovation and development of intelligent vision and speech systems. With the availability of vast amounts of sensor data and of cloud computing for processing and training deep neural networks, and with increasing sophistication in mobile and embedded technology, next-generation intelligent systems are poised to revolutionize personal and commercial computing. This survey begins with the background and evolution of some of the most successful deep learning models for intelligent vision and speech systems to date. An overview of large-scale industrial research and development efforts is provided to emphasize future trends and prospects of intelligent vision and speech systems. Robust and efficient intelligent systems demand low latency and high fidelity on resource-constrained hardware platforms such as mobile devices, robots, and automobiles. Therefore, this survey also provides a summary of key challenges and recent successes in running deep neural networks on hardware-restricted platforms, i.e., within limited memory, battery life, and processing capabilities. Finally, emerging applications of vision and speech across disciplines such as affective computing, intelligent transportation, and precision medicine are discussed. To our knowledge, this paper provides one of the most comprehensive surveys on the latest developments in intelligent vision and speech applications from the perspectives of both software and hardware systems. Many of these emerging technologies using deep neural networks show tremendous promise to revolutionize research and development for future vision and speech systems.

Index Terms—Vision and speech processing, computational intelligence, deep learning, computer vision, natural language processing, hardware constraints, embedded systems, convolutional neural networks, deep auto-encoders, recurrent neural networks.

1. INTRODUCTION

The twenty-first century has seen rapid growth in computing power and a massive accumulation of human-centric data at an unprecedented scale. These advancements have rejuvenated the field of neural networks for building sophisticated intelligent systems (IS) that have now become an indispensable part of everyday life. In the past, neural networks saw limited success, and the scope of IS was largely confined to industrial control and robotics. However, recent advances in IS are permeating almost every aspect of our lives with the introduction of intelligent transportation [1-10], intelligent diagnosis and health monitoring for precision medicine [11-14], robotics and automation in home appliances [15], virtual online assistance [16], e-marketing [17], and weather forecasting and natural disaster monitoring [18], among others. The widespread success of IS technology has redefined and augmented humans' ability to communicate and comprehend the world through the emergence of ‘smart’ physical systems. A ‘smart’ physical system is designed to interpret and act on complex multimodal human senses such as vision, touch, speech, smell, gestures, or hearing. Among these senses, a large variety of smart physical systems target the two primary senses used for human communication: vision and speech.

The advancement of speech and vision processing has enabled tremendous research and development in the areas of human-computer interaction [19], biometric applications [20, 21], security and surveillance [22], and most recently computational behavioral analysis [23-27]. While traditional machine learning and evolutionary computation have enriched IS to solve complex pattern recognition problems over many decades, these techniques are limited in their ability to process natural data or images in raw formats. A number of computational steps are needed to extract representative features from raw data or images prior to applying machine learning models. This intermediate representation of raw data, known as ‘hand-engineered’ features, requires domain expertise and human interpretation of physical patterns such as texture, shape, and geometry. There are three major problems with ‘hand-engineered’ features that impede major progress in IS. First, the choice of ‘hand-engineered’ features is application dependent and requires human interpretation and evaluation. Second, ‘hand-engineered’ features are extracted from each sample in a standalone manner, without knowledge of the inevitable noise and variations in the data. Third, ‘hand-engineered’ features may perform excellently on some inputs but completely fail to extract quality features on others. This can lead to high variability in vision and speech recognition performance.



A solution to the limitations of ‘hand-engineered’ features has emerged through mimicking the functions of biological neurons in artificial neural networks (ANNs). The potential of ANNs is now being exploited with access to large training datasets, efficient learning algorithms, and powerful computational resources. These new techniques, developed over the last decade and referred to as deep learning [28, 29], are impacting application domains such as computer vision, speech analysis, biomedical image processing, and online market analysis. The rapid success of deep learning over traditional machine learning may be attributed to three aspects. First, it offers end-to-end trainable architectures that integrate feature extraction, dimensionality reduction, and final classification. These steps are standalone sub-systems in conventional machine learning pipelines, which results in suboptimal pattern recognition performance. Second, useful intermediate features can be optimally learned from both input examples and classification targets, rather than relying on one generic feature extractor for all applications. Third, deep learning methods are flexible enough to capture underlying nonlinear relationships between inputs and output targets at a level far beyond the capacity of ‘hand-engineered’ features.

The remainder of this article is organized as follows. Section 2 discusses deep learning architectures that have been recently introduced to solve contemporary challenges in the vision and speech domains. Section 3 provides a comprehensive discussion of real-world and commercial application cases for the technology. Section 4 discusses state-of-the-art results in implementing these sophisticated algorithms in limited-resource hardware environments; this section also highlights the prospects of ‘smart’ applications on mobile devices. Section 5 discusses several successful applications of neural networks in state-of-the-art IS. Section 6 elaborates on potential developments and challenges in future IS. Finally, Section 7 concludes with a summary of the key observations in this article.

2. DESIGN AND ARCHITECTURE OF NEURAL NETWORKS FOR DEEP LEARNING

An ANN consists of multiple levels of nonlinear modules arranged hierarchically in layers. This design is inspired by the hierarchical information processing observed in the primate visual system [30, 31]. Such hierarchical arrangements enable deep models to learn meaningful features at different levels of abstraction. Several successful hierarchical ANNs, known as deep neural networks (DNNs), have been proposed in the literature [32]. These include convolutional neural networks [33], deep belief networks [1], recurrent neural networks [34], and stacked auto-encoders [35]. These models extract both simple and complex features similar to those witnessed in the hierarchical regions of the primate visual system. Consequently, the models show excellent performance on several computer vision tasks, especially complex object recognition [36]. Cichy et al. [30] show that DNN models mimic biological brain function. The results of their object recognition experiment suggest a close relationship between the stages of processing in a DNN and the processing scheme observed in the human brain. In the next few sections, we overview the most popular DNN models and their use in vision and speech tasks such as object recognition, speech recognition, and natural language processing.

2.1 Convolutional neural networks

One of the first hierarchical models, known as the convolutional neural network (CNN/ConvNet) [33, 37], learns hierarchical image patterns at multiple layers using the 2D convolution operation. CNNs are designed to process multidimensional data structured in the form of multiple arrays or tensors; for example, a color image has three color channels represented by three 2D arrays. Typically, CNNs process input data using three basic ideas: local connectivity, shared weights, and pooling, arranged in a series of connected layers. A CNN architecture is shown in Fig. 1. The first few layers are convolutional and pooling layers. The convolution operation processes parts of the input data in small localities to take advantage of local data dependency within a signal. The convolutional layers gradually yield more abstract representations of the data in deeper layers of the network. Another aspect of the convolution operation is that the same filtering is repeated over the data, which exploits redundant patterns in the data.

Fig. 1. Generic architecture of CNN.

While the convolutional layers detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge local features into a more global representation at a higher level of abstraction. This helps a network become robust to small shifts and distortions in the data. The final layers of a CNN architecture are typically fully connected neural networks that perform classification using the highly abstracted features from the previous layers. All of the weights in the CNN architecture are trained by the standard backpropagation algorithm with gradient descent optimization.
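To make this concrete, below is a minimal sketch of the generic architecture in Fig. 1, written in PyTorch; the layer sizes, the 32 x 32 input resolution, and the SimpleCNN name are illustrative assumptions rather than any specific published model.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN: convolution + pooling layers followed by a
    fully connected classifier, trained end-to-end with backprop."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # local connectivity, shared weights
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling: robustness to small shifts
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layers: more abstract features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 input images

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# One gradient-descent step on a dummy batch
model = SimpleCNN()
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
```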

2.2 Deep generative models and auto-encoders

The hierarchical model of the CNN is designed to efficiently handle images and videos by learning meaningful features from raw data during training. However, a major breakthrough for hierarchical models was the introduction of the ‘greedy layer-wise’ training algorithm for deep belief networks (DBNs) proposed by Hinton et al. [28]. A DBN is built in a layer-by-layer fashion by training each learning module, known as a restricted Boltzmann machine (RBM) [38]. An RBM is composed of a visible and a hidden layer. The visible layer represents raw data in a less abstract form, and the hidden layer is trained to represent more abstract features by capturing correlations in the visible layer data [11]. Figure 2 (a) shows a standard architecture of a DBN. DBNs are considered hybrid networks that do not support direct end-to-end learning. Consequently, a more efficient architecture, known as the deep Boltzmann machine (DBM) [39], has been introduced. Similar to DBNs, DBMs are structured by stacking layers of RBMs. However, unlike DBNs, the inference procedure of DBMs is bidirectional, allowing them to learn in the presence of more ambiguous and challenging datasets.

The introduction of DBMs has led to the development of the stacked auto-encoder (SAE) [35, 40], which is also formed by stacking multiple layers. Unlike DBNs, SAEs utilize the auto-encoder (AE) [41] as the basic learning module. An AE is trained to reproduce its input at its output; in doing so, the hidden layer learns an abstract representation of the input in a compressed form. Figure 2 (b) shows the architecture of an SAE. A greedy layer-wise training algorithm is used to train any of the DBN, DBM, or SAE networks, where the parameters of each layer are trained individually while keeping the parameters of the other layers fixed. After layer-wise training of all layers, also known as pre-training, the layers are stacked together. The entire network with all the stacked layers is then fine-tuned against the target output units to adjust all the parameters, as illustrated in Fig. 2.

Fig. 2. A typical architecture including the layer-wise pre-training and fine-tuning procedure of (a) a deep belief network (DBN); (b) a stacked auto-encoder (SAE).
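The greedy layer-wise procedure can be sketched compactly; the following PyTorch code is a minimal illustration (the pretrain_layer helper, layer dimensions, and epoch count are all assumptions for the example, not taken from [28] or [35]).

```python
import torch
import torch.nn as nn

def pretrain_layer(data, in_dim, hid_dim, epochs=5):
    """Train one auto-encoder module to reconstruct its input;
    return the trained encoder layer."""
    enc, dec = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, in_dim)
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=0.1)
    for _ in range(epochs):
        recon = dec(torch.sigmoid(enc(data)))
        loss = nn.functional.mse_loss(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()
    return enc

# Greedy layer-wise pre-training: each layer is trained on the
# (fixed) activations of the previously trained layers.
x = torch.randn(256, 784)            # e.g., flattened images
dims = [784, 256, 64]
encoders, h = [], x
for d_in, d_out in zip(dims[:-1], dims[1:]):
    enc = pretrain_layer(h, d_in, d_out)
    encoders.append(enc)
    h = torch.sigmoid(enc(h)).detach()   # keep earlier layers fixed

# Stack the pre-trained layers; the whole stack would then be
# fine-tuned end-to-end against the target labels.
stack = nn.Sequential(*[nn.Sequential(e, nn.Sigmoid()) for e in encoders],
                      nn.Linear(dims[-1], 10))
```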

Recently, variational auto-encoders (VAEs) and generative adversarial networks (GANs) have been introduced as generative models for learning representations of data. The VAE is a probabilistic graphical model that learns a latent variable using the variational inference principle [42]. VAEs have applications in image generation [43] and motion prediction [44, 45]. The GAN is based on a generator network challenging a discriminator network [45]. GANs have similar applications, including image generation [46] and super-resolution [47]. Despite the popularity and success of GANs, they are frequently plagued by instability in training [48] and are subject to underfitting and overfitting [49]. These pitfalls may be the subject of future research in this new domain.
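As a minimal illustration of the VAE idea, the sketch below shows the reparameterization trick and the variational objective (reconstruction plus KL divergence); TinyVAE and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE: the encoder outputs the mean and log-variance of a
    latent Gaussian; the reparameterization trick keeps sampling differentiable."""
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # -> [mu, log_var]
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterize
        return self.dec(z), mu, log_var

def vae_loss(x, recon, mu, log_var):
    # Reconstruction term + KL divergence to the unit Gaussian prior
    recon_term = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_term + kl
```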

The three most popular deep learning families, namely CNNs, AEs/SAEs, and RBMs/DBNs/DBMs, have experienced rapid growth in research publications over the last decade for intelligent vision applications. In Section 3, we discuss how these techniques have contributed to various vision and speech related applications.

2.3 Recurrent neural networks

Another variant of neural network, known as the recurrent neural network (RNN), captures useful temporal patterns in sequential data such as speech to augment recognition performance. An RNN architecture includes hidden layers that retain a memory of past elements of an input sequence. Despite their effectiveness in modeling sequential data, RNNs are difficult to train with the traditional backpropagation technique when related elements of a sequence are separated by large gaps [50]. Long short-term memory (LSTM) networks alleviate this shortcoming with special hidden units known as “gates” that effectively control how much information is remembered or forgotten during backpropagation [51]. Bidirectional RNNs consider context from the past as well as the future to improve performance on sequential data. This, however, can hinder real-time operation, as the entire sequence must be available for processing.
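The following short sketch, using PyTorch's built-in LSTM, illustrates gated recurrence over a sequence of feature frames and the bidirectional option discussed above; the input sizes are illustrative.

```python
import torch
import torch.nn as nn

# Sequence classifier over, e.g., acoustic feature frames.
# bidirectional=True uses past and future context, but then the whole
# sequence must be available, which hinders real-time (streaming) use.
lstm = nn.LSTM(input_size=40, hidden_size=128, num_layers=2,
               batch_first=True, bidirectional=True)
head = nn.Linear(2 * 128, 10)        # 2x for the two directions

frames = torch.randn(8, 100, 40)     # (batch, time, features)
outputs, _ = lstm(frames)            # gates inside the LSTM decide what
logits = head(outputs[:, -1, :])     #   to remember or forget over time
```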

2.4 Attention in Neural Networks

The process of attention is an important property of human perception that greatly improves the efficacy of biological vision. The attention process allows humans to selectively focus on particular sections of the visual space to obtain relevant information, avoiding the need to process the entire scene at once. Consequently, attention provides several advantages in vision processing [52]: a drastic reduction in computational complexity due to the reduced processing space; improved performance, as the objects of importance can always be centered in the processing space; noise reduction, or filtering, by avoiding the processing of irrelevant information in the visual scene; and selective fixations over time that allow building a contextual representation of the scene without ‘clutter’. Hence, the adoption of such a methodology for neural network-based vision and speech processing is highly desirable.
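A minimal sketch of a generic soft-attention computation is shown below; this scaled dot-product form is one common way to realize the selective weighting just described, not the specific mechanism of any study cited here.

```python
import torch

def soft_attention(query, features):
    """Weight feature vectors (e.g., image regions or time steps) by their
    relevance to a query, so later layers focus on a small part of the input."""
    # query: (d,), features: (n, d)
    scores = features @ query / features.shape[-1] ** 0.5  # similarity scores
    weights = torch.softmax(scores, dim=0)                 # attention distribution
    return weights @ features, weights                     # focused summary, weights

features = torch.randn(49, 64)     # e.g., a 7x7 grid of CNN feature vectors
query = torch.randn(64)
context, attn = soft_attention(query, features)
```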

Early studies introduce attention by means of saliency maps: mappings of points that may contain important information in an image [53, 54]. Introducing attention to deep learning models has been attempted more recently. A seminal study by Larochelle et al. [55] models attention in a third-order Boltzmann machine that is able to accumulate information about an overall shape in an image over several fixations. The model is only able to see a small area of the input image, and it learns by gathering information through a sequence of fixations over parts of the image. To learn the sequence of fixations and the overall classification task, the authors introduce a hybrid cost for the Boltzmann machine. This model shows performance similar to deep learning variants that use the whole input image for classification. Another study [56] proposes a two-step system: first, aggressively downsample and process the whole input image to identify candidate locations that may contain important information; next, visit each location in high resolution and aggregate the information collected at each location to make the final decision. Similarly, Denil et al. [57] propose a two-pathway model for object tracking, where one pathway focuses on object recognition and the other regulates the attention process.

However, ‘learning where and when to attend’ is difficult, as it is highly dependent on the input and the task. It is also ill-defined in the sense that a particular sequence of fixations cannot be explicitly dictated as ground truth. For this reason, most recent studies on deep learning with attention have employed reinforcement learning (RL) to regulate the attention aspect of the model. Accordingly, a seminal study by Mnih et al. [52] builds a reinforcement learning policy on a two-path recurrent deep learning model to simultaneously learn the attention process and the recognition task. Based on similar principles, Gregor et al. [58] propose a recurrent architecture for image generation; the proposed architecture uses a selective attention process to trace out lines and generate digits similarly to a human. Another study [59] utilizes the selective attention process for image captioning; here, the RL-based attention process learns the sequence of glimpses through the input image that best describes the scene representation. Conversely, Mansimov et al. [60] leverage RL-based selective attention on an image caption to generate new images described by the caption; here, the attention mechanism learns to focus on each word, in a sequential manner, that is most relevant for image generation. Despite impressive performance in learning selective attention using RL, deep RL still involves the additional burden of developing suitable policy functions that are extremely task-specific and hence not generalizable. RL with deep learning also frequently suffers from instability in training.

A different set of studies designs neural network systems analogous to the Turing machine architecture, which suggests the use of an attention process for interacting with the external memory of the overall system. Here, the process of attention is implemented using a neural controller and a memory matrix [61]. The attentional focusing allows the selectivity of access necessary for memory control [61]. The neural Turing machine work is further explored in [62], which considers attention-based global and local focus on an input sequence for machine translation. In [63], an attention mechanism is combined with a bidirectional LSTM network for speech recognition. In [64], the authors, inspired by LSTM for NLP, add a trust gate to augment LSTM for applications in human skeleton-based action recognition. To highlight the growing interest in these deep learning models, Figure 3 summarizes the results of a search for articles with different model names found in article abstracts as of 2018.

Figure 3. Search for articles showing increasing prominence of deep learning techniques.
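A rough sketch of such content-based memory addressing is shown below, assuming a cosine-similarity read in the spirit of the neural Turing machine [61]; the content_address helper and the sharpness parameter beta are illustrative simplifications.

```python
import torch

def content_address(key, memory, beta=5.0):
    """NTM-style content-based read: attend over memory rows by cosine
    similarity to a key emitted by the neural controller."""
    # key: (d,), memory: (rows, d)
    sim = torch.nn.functional.cosine_similarity(memory, key.unsqueeze(0), dim=1)
    weights = torch.softmax(beta * sim, dim=0)   # sharper focus for larger beta
    read = weights @ memory                      # weighted read vector
    return read, weights

memory = torch.randn(128, 32)   # external memory matrix
key = torch.randn(32)           # controller-produced key
read_vec, w = content_address(key, memory)
```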

3. DEEP LEARNING IN VISION AND SPEECH PROCESSING

This section discusses the impact of neural networks that are driving the state-of-the-art intelligent vision and speech systems.

3.1 Deep learning in computer vision

LeCun et al. [33] first introduced a CNN model to recognize ten hand-written digits using image examples from the MNIST database [65]. The proposed CNN model showed significant performance improvement on the hand-written digit recognition task compared to earlier state-of-the-art machine learning techniques. Since then, CNNs have gone through several evolutions, and current versions are tremendously successful in solving more complex and challenging image recognition tasks [21, 36, 66, 67]. For example, Krizhevsky et al. [36] utilize a deep CNN architecture named ‘AlexNet’ to solve the ImageNet classification challenge [68], classifying 1000 objects from high-resolution natural images. Their proposed CNN architecture considerably outperformed previous state-of-the-art methods in the earliest attempt at the ImageNet classification challenge. Following the initial success of AlexNet, image recognition performance gradually improved, as reported in several publications such as GoogLeNet [67], VGGNet [69], ZFNet [70], and ResNet [71]. More recently, He et al. [72] have extended AlexNet to demonstrate that a carefully trained deep CNN model is able to surpass the human-level recognition performance reported in [68] on the ImageNet dataset.

LeNet [73], AlexNet [36], and GoogLeNet [67] are three of the first CNN architectures to show significantly improved image classification performance compared to conventional hand-engineered computer vision models. However, a limitation of these models is the vanishing gradient problem that arises when increasing the number of layers to achieve more depth. Consequently, more sophisticated CNN architectures such as ResNet [71] have been proposed by incorporating “residual blocks” in the architecture. The idea behind the residual block is to merge a previous layer into a later layer through a skip connection, forcing the network to learn only residuals. This allows the model to achieve very deep structures without suffering from the vanishing gradient problem. Table I shows the image classification error rates of the neural networks described above. NIST represents a simpler problem; for ImageNet, the error decreased over time; and scene labeling represents the most challenging of the problems.
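A minimal PyTorch sketch of a basic residual block illustrates the idea; the channel count and layer composition are illustrative, simplified from the blocks in [71].

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the skip connection adds the input to the
    convolutional output, so the layers only learn a residual F(x),
    which eases gradient flow in very deep networks."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(channels), nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # identity shortcut: out = F(x) + x

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))
```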

Apart from image classification, CNNs have also demonstrated state-of-the-art performance in other computer vision applications, such as scene labeling, action recognition, and human pose estimation. Scene labeling involves assigning target classes to multiple portions of an image based on the local content. Farabet et al. [66] proposed a scene labeling method using a multiscale CNN that yields record accuracies on several scene labeling datasets with up to 170 classes. Several CNN-based models have been proposed in the literature to perform human action recognition. An architectural feature called temporal pyramid pooling is used in [74] to capture details from every frame in a video and is shown to perform action classification well with a small training set. Another architecture, called the two-stream CNN, analyzes both spatial and temporal context independently and gives competitive results on standard video action benchmarks [69]. CNN architectures that find pose features in an intermediate layer have also been used for action recognition. One of the more successful architectures for action recognition is R*CNN [75], which uses context from the scene together with human figure data to recognize actions.

Moreover, CNNs are used in human pose estimation; for example, DeepPose [20] is the first CNN application to pose estimation and outperformed earlier methods [76, 77]. DeepPose is a cascaded CNN-based pose estimation framework: the cascading allows the model to learn an initial pose estimate from the full image, followed by CNN-based regressors that refine the joint predictions using higher-resolution sub-images. Tompson et al. [21] propose a ‘Spatial Model’ that incorporates a CNN architecture with a Markov random field (MRF) and offers improved results for human pose estimation. New sensing techniques also allow efficient processing of 3D volumetric data using 3D convolutional networks; for example, in [78], human hand joint locations are estimated in real time using a volumetric representation of the input data and a 3D convolutional network.

TABLE I
SUMMARY OF THE SIGNIFICANT STATE-OF-THE-ART CNN IMAGE CLASSIFICATION RESULTS
(*ACTUAL CLASS ERROR WITHIN TOP 5 PREDICTIONS, **PIXEL CLASS ERROR)

Architecture | Dataset | Error rate
LeNet [73] - AT&T Bell Labs, 1995 | NIST (handwritten digits) | 0.70%
AlexNet [36] - University of Toronto, 2012 | ImageNet (natural images) | 17.0%*
GoogLeNet [67] - Google, 2014 | ImageNet (natural images) | 6.67%*
ResNet [71] - Microsoft, 2015 | ImageNet (natural images) | 3.57%*
Multiscale CNN [66] - Farabet et al., 2013 | SIFT/Barcelona (scene labeling) | 32.20%**


Other deep learning techniques, such as DBNs and SAEs, have also achieved state-of-the-art performance in various vision-related applications such as face verification [79], phone recognition [80], and emotion recognition from image and speech [81, 82]. Moreover, several studies [79, 83] have combined the advantages of different deep learning models to further boost performance on these recognition tasks. For example, Lee et al. [83] have shown that combining the convolution and weight-sharing features of CNNs with the generative architecture of DBNs offers better classification performance on benchmark datasets such as MNIST and Caltech 101. This hybrid of the CNN and DBN models, known as the convolutional deep belief network (CDBN), enables scaling to problems with large images without requiring an increase in the number of network parameters. Table II summarizes variants of CNN, highlighting their contributions, advantages, and limitations. A common theme in the limitations noted by the authors is that these architectures can perform at human level, or even better, for simpler tasks. In [72], the authors note that images requiring context to explain are misclassified more often in image classification. In [74], similar actions are challenging for the machine algorithm to distinguish in action recognition. In [21], the model only works well for a constrained set of human poses. When the classification problems become very difficult, such as arbitrary-view or context-dependent tasks, the architectures still have room to improve.

3.2 Deep learning in speech recognition

In addition to offering excellent performance in image recognition [21, 36, 66, 67], deep learning models have also shown state-of-the-art performance in speech recognition [84-86]. A significant milestone was achieved in acoustic modeling research with the aid of DBNs at multiple institutions [85]. Following the work in [28], DBNs are trained in a layer-wise fashion followed by end-to-end fine-tuning for speech applications, as shown in Fig. 2 above. This DBN architecture and training process has been extensively tested on a number of large-vocabulary speech recognition datasets, including TIMIT, Bing-Voice-Search speech, Switchboard speech, Google Voice Input speech, YouTube speech, and the English-Broadcast-News speech dataset. DBNs significantly outperform the previous state of the art in speech recognition, the highly tuned Gaussian mixture model (GMM)-HMM. SAEs are likewise shown to outperform GMM-HMM systems on Cantonese and other speech recognition tasks [40].

RNNs have succeeded in improving speech recognition performance because of their ability to learn sequential patterns, as seen in speech, language, and time-series data. As noted in Section 2.3, RNNs are difficult to train with the traditional backpropagation technique when related portions of a sequence are separated by large gaps [50]. The problem is addressed by long short-term memory (LSTM) networks, which use special hidden units known as “gates” to retain memory over longer portions of a sequence [51]. Sak et al. [87] first studied the LSTM architecture for speech recognition over a large vocabulary set; their double-layer deep LSTM is found to be superior to a baseline DBN model. LSTMs have also been successful in an end-to-end speech learning method, known as Deep Speech 2 (DS2), for two largely different languages: English and Mandarin Chinese [100]. Other speech recognition studies using LSTM networks have shown significant performance improvements compared to previous state-of-the-art DBN-based models. Furthermore, Chien et al. [88] performed extensive experiments with various LSTM architectures for speech recognition and compared their performance with state-of-the-art models.

TABLE II
COMPARISON OF CONVOLUTIONAL NEURAL NETWORK MODEL CONTRIBUTIONS

Architecture | Application | Contribution | Limitations
He et al. [72], AlexNet variant | Image classification | First human-level image classification performance (including fine-grained tasks, e.g., differentiating 100 dog breeds); ReLU generalization and training | Misclassification of images requiring context (therefore not fully human-level image understanding)
Farabet et al. [66], Multiscale CNN | Scene labeling | Weight sharing at multiple scales to capture context without increasing the number of trainable parameters; global application of a graphical model to get consistent labels over the image | Does not apply unsupervised pretraining
Wang et al. [74], Temporal pyramid pooling CNN | Action recognition | Temporal pooling for action classification in videos of arbitrary length reduces the chance of overlooking important frames in the decision | Challenging similar actions often misclassified
Tompson et al. [21], Joint CNN / graphical model | Human pose estimation | Combining MRF with CNN allows prior beliefs about joint configurations to impact CNN body part detection | Works well for a constrained set of human poses; the general space of human poses remains a challenge

TABLE III
SUMMARY OF THE SIGNIFICANT STATE-OF-THE-ART DNN SPEECH RECOGNITION MODELS
(*PERPLEXITY: SIZE OF MODEL NEEDED FOR OPTIMAL NEXT-WORD PREDICTION WITH 10K CLASSES, **WORD ERROR RATE)

Architecture | Dataset | Error rate
RNN [84] - FIT, Czech Republic / Johns Hopkins University, 2011 | Penn Corpus (natural language modeling) | 123*
Autoencoder/DBN [85] - Collaboration, 2012 | English Broadcast News Speech Corpora (spoken word recognition) | 15.5%**
LSTM [87] - Google, 2014 | Google Voice Search Task (spoken word recognition) | 10.7%**
Deep LSTM [88] - National Chiao Tung University, 2016 | CHiME Challenge (spoken word recognition) | 8.1%**

To summarize key results from DBNs and RNNs (including LSTMs), Table III shows the different problems and error rates achieved by state-of-the-art speech recognition models.

Another memory network based on the RNN is proposed by Weston et al. [89] to recognize speech content. This memory network stores pieces of information so that it can retrieve the answer related to an inquiry, making it distinct from standard RNNs and LSTMs. RNN-based models have reached far beyond speech recognition to support natural language processing (NLP). NLP aims to interpret language and semantics from speech or text to perform a variety of intelligent tasks, such as responding to human speech in smart assistants (Siri, Alexa, and Cortana), analyzing sentiment to identify positive or negative attitudes toward a situation, processing events or news, and translating language in both speech and text.

Although RNNs/LSTMs are standard in sentiment analysis, the authors in [90] have proposed a novel nonlinear architecture of multiple LSTMs to capture sentiment from phrases that constitute different orderings of words in natural language. Researchers from Google machine learning [91] have developed a machine-based language translation system that runs Google's popular online translation service. Although this system has been able to reduce average error by 60% compared to the previous system, it suffers from a few limitations. A more efficient approach is used by the neural machine translator (NMT) [91], where an entire sentence is input at one time to capture better context and meaning, instead of inputting sentences piece by piece as in traditional methods. More recently, a hybrid approach, combining sequential language patterns from LSTMs and hierarchical learning of images from CNNs, has emerged to describe image content and context using natural language descriptions. Karpathy et al. [92] introduced this hybrid approach for image captioning, incorporating both visual data and language descriptions to achieve optimal captioning performance across several datasets. Table IV summarizes variants of RNN, their pros and cons, and their contributions to state-of-the-art speech recognition systems.

For both CNNs and RNNs, the architecture is inherently driven by the problem domain. For example: a multiscale CNN gathers context for labeling across a scene [66]; temporal pooling captures actions across time [74]; MRF graphical modeling on top of a CNN forms a prior belief about body poses [21]; a long-term memory component supports context retrieval in stories [89]; and a CNN fused with an RNN interprets images using language [92]. Hand-engineered features have been replaced, but note that the effort has shifted from feature design to architecture design, which offers more flexibility to learn from large datasets. Similar to the vision tasks, a common theme the authors note for RNN models in speech recognition tasks is that these architectures can perform at human level, or even better, for simpler tasks. In [89], the authors note that the questions and input stories are rather simple for the models to handle. In [91], the authors note that especially difficult translation problems are not tested. As problem tasks become more complex or highly abstract, more sophisticated intelligent systems are required to reach human-level performance.

3.3 Deep learning in commercial vision and speech applications

In recent years, giant companies such as Google, Facebook, Apple, Microsoft, IBM, and others have adopted deep learning as one of their core areas of research in artificial intelligence (AI). Google Brain [93] focuses on engineering deep learning methods, such as tweaking CNN-based architectures, to obtain competitive recognition performance in various challenging vision applications using large numbers of cluster machines and high-end GPU-based computers. Facebook conducts extensive deep learning research in its Facebook AI Research (FAIR) [94] lab for image recognition and natural language understanding. Many users around the globe are already taking advantage of this recognition system in the Facebook application. Their next milestone is to integrate deep learning-based NLP approaches into the Facebook system to achieve near human-level performance in understanding language. Recently, Facebook launched a beta AI assistant system called ‘M’ [95]. ‘M’ utilizes NLP to support more complex tasks such as purchasing items, arranging delivery of gifts, booking restaurant reservations, and making travel arrangements or appointments. Microsoft has developed the Cognitive Toolkit [96] to show efficient ways of learning deep models across distributed computers. Microsoft has also implemented an automatic speech recognition system achieving human-level conversational speech recognition [97] and, more recently, introduced a deep learning-based speech-invoked assistant called Cortana [98]. Baidu has applied deep learning on massive GPU systems with InfiniBand networks [99]. Its speech recognition system, named Deep Speech 2 (DS2) [100], has shown remarkably improved performance over its competitors. Baidu is also one of the pioneering research groups to introduce deep learning-based self-driving cars, together with BMW. Nvidia has invested in developing state-of-the-art GPUs to support more efficient and real-time implementation of complex deep learning models [101].

TABLE IV
COMPARISON OF RECURRENT NEURAL NETWORK MODEL CONTRIBUTIONS

Architecture | Application | Contribution | Limitations
Amodei et al. [100], Gated recurrent unit network | English or Chinese speech recognition | Optimized speech recognition using gated recurrent units for speed of processing, achieving near human-level results | Deployment requires a GPU server
Weston et al. [89], Memory network | Answering questions about simple text stories | Integration of a long-term memory (readable and writable) component within the neural network architecture | Questions and input stories are still rather simple
Wu et al. [91], Deep LSTM | Language translation (e.g., English-to-French) | Multi-layer LSTM with attention mechanism | Especially difficult translation cases and multi-sentence input yet to be tested
Karpathy et al. [92], CNN/RNN fusion | Labeling images and image regions | Use of CNN and RNN together to generate natural language descriptions of images | Fixed image size; requires training CNN and RNN models separately

Nvidia's high-end GPUs have led to one of the most powerful end-to-end solutions for self-driving cars. IBM has recently introduced its cognitive system known as Watson [102]. This system incorporates computer vision and speech recognition in a human-friendly interface with an NLP backend. While traditional computer models have relied on rigid mathematical principles, utilizing software built upon rules and logic, Watson instead relies on what IBM calls “cognitive computing”. The Watson-based cognitive computing system has already proven useful across a range of applications such as healthcare, marketing, sales, customer service, operations, HR, and finance. Other major tech companies actively involved in deep learning research include Apple [103], Amazon [104], Uber [105], and Intel [106]. Figure 4 summarizes publication statistics over the past 10 years from an abstract search for ‘deep learning’, ‘computer vision’, ‘speech recognition’, and ‘natural language processing’ methods applied to computer vision and speech processing.

Figure 4. Deep learning applications in the literature over the last decade.

Although deep learning has revolutionized today's intelligent systems with the aid of large computational resources, applying it in more personalized settings, such as embedded and mobile hardware systems, is another challenge that has led to an active area of research. This challenge is due to the extensive requirement of high-powered, dedicated hardware for executing the most robust and sophisticated deep learning algorithms. Consequently, there is a growing need to develop more efficient, yet robust, deep models for resource-restricted hardware environments. The next sections summarize some recent advances in developing highly efficient deep models that are compatible with mobile hardware systems.

4. VISION AND SPEECH ON RESOURCE-RESTRICTED HARDWARE PLATFORMS

The success of future vision and speech systems depends on accessibility and adaptability to a variety of platforms, which ultimately drive the prospects for commercialization. While some platforms are intended for public and personal usage, there are other commercial, industrial, and online-based platforms, all of which require seamless integration of IS. However, state-of-the-art deep learning models have difficulty adapting to embedded hardware due to their large memory footprints, high computational complexity, and high power consumption. This has led to research on improving system performance with compact architectures to enable deployment on resource-restricted platforms. The following sections highlight some of the major research efforts in integrating sophisticated algorithms into resource-restricted user platforms.

4.1 Speech recognition on mobile platforms

Handheld devices such as smartphones and tablets are ubiquitous in modern life. Hence, a large effort in developing intelligent systems is dedicated to mobile platforms, with a view to reaching billions of mobile users around the world. Speech recognition has been a pioneering application in developing smart mobile assistants. The voice input of a mobile user is first interpreted using a speech recognition algorithm, and the answer is then retrieved by an online search. The retrieved information is then spoken back by the virtual mobile assistant. Major technology companies such as Google [107] have enabled voice-based content search on Android devices, and a similar voice-based virtual assistant, known as Siri, is available on Apple's iOS devices. These intelligent applications provide mobile users with a fast and convenient hands-free feature to retrieve information.

However, mobile devices, like other embedded systems, have computational limitations and issues related to power consumption and battery life. Therefore, mobile devices usually send input requests to a remote server, which processes them and sends the information back to the device. This introduces additional latency that depends on the wireless network quality while connecting to the server.

As an example, keyword spotting (KWS) [108] detects a set of previously defined keywords from speech data to enable hands-free features on mobile devices. The authors in [108] have proposed a low-latency keyword detection method for mobile users using a deep learning-based technique, which they call deep KWS. The deep KWS method has not only proven suitable for low-powered embedded systems, but has also outperformed baseline hidden Markov models on both noisy and noise-free audio data. Deep KWS uses a fully connected DNN with transfer learning [108] based on speech recognition; the network is further optimized for KWS with end-to-end fine-tuning using stochastic gradient descent. Sainath et al. [109] have introduced a similar small-footprint KWS system based on CNNs. The authors also point out that the proposed CNN uses fewer parameters than a standard DNN model, which makes the proposed system more attractive for platforms with resource constraints. Chen et al. [110], in another study, propose the use of LSTM for the KWS task. The inherent recurrent connections in the LSTM can make the KWS task suitable for resource-restricted platforms by improving computational efficiency; to support this, the authors further show that the proposed LSTM outperforms a typical DNN-based KWS method. A typical framework for a deep learning-based KWS system is shown in Fig. 5.
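As a toy illustration of the detection stage in such a pipeline, the sketch below smooths per-frame keyword posteriors produced by a DNN and fires when the smoothed score crosses a threshold; the window length, threshold, and kws_confidence helper are assumptions for the example, not the exact method of [108].

```python
import numpy as np

def kws_confidence(posteriors, w_smooth=30, threshold=0.8):
    """Toy post-processing for a deep KWS system: smooth per-frame keyword
    posteriors over a sliding window, then fire when the smoothed score
    crosses a threshold."""
    # posteriors: (frames,) model-estimated probability of the keyword per frame
    kernel = np.ones(w_smooth) / w_smooth
    smoothed = np.convolve(posteriors, kernel, mode="same")  # moving average
    hits = np.where(smoothed > threshold)[0]                 # frames that trigger
    return smoothed, hits

frames = np.random.rand(200)           # stand-in for DNN keyword posteriors
smoothed, hits = kws_confidence(frames)
```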

Similar to KWS systems, automatic speech recognition (ASR) [111] has become increasingly popular on mobile devices, as it alleviates the need for tedious typing on small screens. Google provides ASR-based search services [107] on the Android, iOS, and Chrome platforms, and Apple iOS devices are equipped with the conversational assistant Siri. Mobile users can also compose texts or email by speech on both Android and iOS devices [112]. However, the ASR service is contingent on the availability of a cellular network, since the recognition task is performed on a remote server. This is a limitation, since mobile network strength can be low, intermittent, or even absent in places. Therefore, developing an accurate real-time speech recognition system embedded in standalone modern mobile devices is still an active area of research.

Consequently, embedded speech recognition systems using DNNs have gained attention. Lei et al. [111] have achieved substantial improvements in ASR performance over traditional Gaussian mixture model (GMM) acoustic models, even at a much smaller footprint and memory requirement. The authors show that a DNN model with 1.48 million parameters outperforms the generic GMM-based model while using only 17% of the memory required by the GMM. Furthermore, the authors use a language model compression scheme, LOUDS [113], to gain a further 60% improvement in the memory footprint of the proposed method. Wang et al. [114] propose another compressed DNN-based speech recognition system that is suitable for use on resource-restricted platforms. The authors train a standard fully connected DNN model for speech recognition, compress the network using a singular value decomposition (SVD) method, and then use split vector quantization to enhance computational efficiency. They achieved a 75% to 80% reduction in memory footprint, lowering the memory requirement to a mere 3.2 MB, along with a 10% to 50% reduction in computational cost at a performance comparable to the uncompressed version. In [115], the authors show that a low-rank representation of the weight matrices can increase representational power per parameter; they also combine this low-rank technique with ensembles of DNNs to improve performance on the KWS task. Table V summarizes small-footprint speech recognition and KWS systems, which are promising for application on resource-restricted platforms.
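The SVD-based compression step can be sketched in a few lines of numpy; the svd_compress helper, the matrix size, and the chosen rank below are illustrative assumptions (and the quantization stage of [114] is omitted).

```python
import numpy as np

def svd_compress(W, rank):
    """Replace a dense weight matrix W (m x n) with two low-rank factors,
    reducing storage/compute from m*n to rank*(m + n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]        # (m, rank)
    B = Vt[:rank, :]                  # (rank, n)
    return A, B

W = np.random.randn(2048, 2048)
A, B = svd_compress(W, rank=128)
x = np.random.randn(2048)
y_full, y_low = W @ x, A @ (B @ x)    # low-rank forward pass: two thin matmuls
ratio = (A.size + B.size) / W.size    # ~0.125: an 8x parameter reduction
```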

4.2 Computer vision on mobile platforms

Real-time recognition of objects or humans is an extremely desirable feature on handheld devices for convenient authentication, identification, and navigational assistance; combined with speech recognition, it can even serve as a mobile teaching assistant. Though deep learning has advanced speech recognition on mobile platforms, image recognition systems remain challenging to deploy on mobile platforms due to resource constraints.

Fig. 5. Generalized framework of a keyword spotting (KWS) system that utilizes deep learning.

TABLE V
KWS ARCHITECTURES WITH REDUCED COMPUTATIONAL AND MEMORY FOOTPRINT
(*RELATIVE IMPROVEMENT OVER COMPARISON NETWORK FROM ROC CURVE, **WER (WORD ERROR RATE), ***RELATIVE FER (FRAME ERROR RATE) OVER COMPARISON NETWORK)

Compression technique | Memory reduction | Error rate (varied datasets)
DNN improvement over HMM, 2014 [108] | 2.1M parameters | 45.5% improvement*
CNN improvement over DNN, 2015 [109] | 65.5K parameters | 41.1% improvement*
Fixed-length vector LSTM, 2015 [110] | 152K parameters | 86% improvement*
Split vector quantization, 2015 [114] | 59.1 MB to 3.2 MB | 15.8%**
Low-rank matrices / ensemble training, 2016 [115] | 400 nodes per layer to 100 nodes per layer | -0.174***

In one study, Sarkar et al. [116] use a deep CNN for a face recognition application on a mobile platform for the purpose of user authentication. The authors first identify the disparities in hardware and software between mobile devices and typical workstations in the context of deep learning, such as the unavailability of powerful GPUs and of CUDA (an application programming interface by NVIDIA that enables general-purpose processing on GPUs). The study subsequently proposes a pipeline that leverages AlexNet [36] through transfer learning [117] for feature extraction and then uses a pool of SVMs for scale-invariant classification. The algorithm is evaluated in terms of runtime and face recognition accuracy on several mobile platforms equipped with various Qualcomm Snapdragon CPUs and Adreno GPUs, using two standard datasets, UMD-AA [118] and MOBIO [119]. The algorithm achieved 96% and 88% accuracy on the MOBIO and UMD-AA datasets, respectively, with a minimum runtime of 5.7 seconds on the Nexus 6 phone (Qualcomm Snapdragon 805 CPU with Adreno 420 GPU).

Lane et al. [120] have also performed an initial study using two popular deep learning models, CNNs and fully connected deep feed-forward networks, to analyze audio and image data on three hardware platforms commonly used in wearable and mobile devices: the Qualcomm Snapdragon 800, Intel Edison, and Nvidia Tegra K1. The study includes extensive analyses of energy consumption, processing time, and memory footprint on these devices when running several state-of-the-art deep models for speech and image recognition applications, such as deep KWS, DeepEar, ImageNet [36], and SVHN [121] (street-view house number recognition). The study identifies a critical need for optimizing these sophisticated deep models in terms of computational complexity and memory usage for effective deployment on regular mobile platforms.

In another study, Lane et al. [122] discuss the feasibility of incorporating deep learning algorithms into mobile sensing for a number of signal and image processing applications. They highlight the limitation that deep models for mobile applications are still implemented on cloud-based systems, rather than on standalone mobile devices, due to large computational overhead. However, the authors point out that mobile architectures have been advancing in recent years and may soon be able to accommodate complex deep learning methods on-device. The authors subsequently implement a DNN architecture on the Hexagon DSP of a Qualcomm Snapdragon SoC (a system-on-chip widely used in mobile phones) and compare its performance with classical machine learning algorithms such as decision trees, SVMs, and GMMs for activity recognition, emotion recognition, and speaker identification. They report increased robustness in performance, with acceptable levels of resource use, for the proposed DNN implementation on mobile hardware.

4.3 Compact, efficient, low power deep learning for lightweight speech and vision processing

As discussed in Sections 4.1 and 4.2, hardware constraints pose a major challenge in deploying the most robust deep models on mobile hardware platforms. This has led to a recent research trend that aims to develop compressed but efficient versions of deep models for speech and vision processing. One of the seminal works in this area is the development of the software platform ‘DeepX’ by Lane et al. [123]. ‘DeepX’ is based on two resource control algorithms: first, it decomposes large deep architectures into smaller blocks of sub-architectures and then assigns each block to the most efficient local processing unit (CPUs, GPUs, LPUs). Furthermore, the proposed software platform is capable of dynamic decomposition and resource allocation using a resource prediction model [123]. Deploying on two popular mobile platforms, the Qualcomm Snapdragon 800 and Nvidia Tegra K1, the authors report impressive improvements in resource use by DeepX for four state-of-the-art deep architectures, AlexNet [36], SpeakerID [124], SVHN [125], and AudioScene [126], in object, face, character, and speaker recognition tasks, respectively [123].

Sindhwani et al. [127], on the other hand, propose a memory-efficient method that uses the mathematical framework of structured matrices to represent large dense matrices such as neural network parameters (weight matrices). Structured matrices, such as Toeplitz, Vandermonde, and Cauchy matrices [128], exploit parameter sharing mechanisms to represent an m x n matrix with far fewer than mn parameters [127]. The authors also show that structured matrices yield substantial computational savings, especially in the matrix multiplication operations that dominate deep architectures, where the time complexity is reduced from O(mn) to O(m log n) [127]. This makes both the forward computations and the backpropagation used in training faster and more efficient. The authors test the proposed framework on a deep KWS architecture for mobile speech recognition and compare it with other small-footprint KWS models [109]. The results show that Toeplitz-based compression gives the best model computation time, 80 times faster than the baseline, at the cost of only 0.4% performance degradation. The authors also report that the compressed model achieves a 3.6-times reduction in memory footprint compared to the small-footprint model proposed in [109].
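For the Toeplitz case, the speedup comes from the classical circulant-embedding trick: a Toeplitz matrix-vector product can be computed with FFTs in O(n log n) time while storing only O(n) parameters. A minimal NumPy sketch (the function and test harness are ours, for illustration):

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, x):
    """Multiply an n x n Toeplitz matrix T by x in O(n log n): embed T in a
    2n x 2n circulant matrix, whose action is diagonalized by the FFT.
    first_col is T's first column, first_row its first row
    (first_row[0] must equal first_col[0])."""
    n = len(x)
    c = np.concatenate([first_col, [0.0], first_row[1:][::-1]])  # circulant col
    x_padded = np.concatenate([x, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x_padded)).real
    return y[:n]

# Check against a dense O(n^2) product on a small example.
n = 6
col = np.random.randn(n)
row = np.concatenate([col[:1], np.random.randn(n - 1)])
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)]
              for i in range(n)])
x = np.random.randn(n)
assert np.allclose(T @ x, toeplitz_matvec(col, row, x))
```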

Han et al. [129] propose a three-stage compression scheme for neural networks, known as 'deep compression', to reduce memory footprint. The first stage, pruning [130, 131], removes weak connections in a DNN to obtain a sparse network. The second stage applies trained quantization and weight sharing to the pruned network. The third stage uses Huffman coding [132] for lossless compression of the network. The authors report reduced energy consumption and a significant computing speedup in a comparison across various workstation and mobile hardware platforms.
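The first two stages can be illustrated on a single weight matrix. The following rough NumPy sketch, which is not the authors' implementation, applies magnitude pruning followed by one-dimensional k-means weight sharing; the Huffman coding stage is omitted:

```python
import numpy as np

def prune_and_share(W, sparsity=0.9, n_clusters=16, n_iters=10):
    """Stage 1: magnitude pruning - zero out the weakest `sparsity` fraction
    of weights. Stage 2: weight sharing - cluster surviving weights with 1-D
    k-means so each weight stores only a small codebook index. Stage 3
    (Huffman coding of the indices) is omitted here."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) > threshold                 # sparse connectivity pattern
    w = W[mask]
    codebook = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(n_iters):                     # plain 1-D k-means
        idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                codebook[k] = w[idx == k].mean()
    W_compressed = np.zeros_like(W)
    W_compressed[mask] = codebook[idx]           # reconstruct shared weights
    return W_compressed, mask, codebook

W = np.random.randn(256, 256).astype(np.float32)
W_c, mask, codebook = prune_and_share(W)         # ~10% nonzeros, 16 values
```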

TABLE VI

COMPRESSED ARCHITECTURE ENERGY AND POWER RUNNING ALEXNET ON A TEGRA GPU

Compression technique                  | Execution time                 | Energy consumption | Implied power consumption
Benchmark study, 2015 [122]            | 49.1 ms                        | 232.2 mJ           | 4.7 W (all layers)
DeepX software accelerator, 2016 [123] | 866.7 ms (average of 3 trials) | 234.1 mJ           | 2.7 W (all layers)
Deep compression, 2016 [129]           | 4003.8 ms                      | 5.0 mJ             | 0.0012 W (one layer)


An architecture called ShuffleNet [133] uses two architectural features: group convolution, introduced in [36], combined with a channel shuffle operation in a novel way to improve the efficiency of convolutional networks. The group convolution speeds up image processing and offers comparable performance with reduced model complexity. Table VI summarizes results from different studies on compressed-network energy consumption when executing AlexNet on a Tegra GPU. Figure 6 summarizes publication statistics over the past five years on small-footprint deep learning methods for computer vision, speech processing, and natural language processing on resource-restricted hardware platforms.
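The channel shuffle step itself is a cheap, parameter-free permutation of channels. A small NumPy sketch of the operation described in [133] (the function and toy example are our own illustration):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Channel shuffle on an (N, C, H, W) activation: view channels as a
    (groups, C // groups) grid, transpose it, and flatten back, so the next
    group convolution sees channels from every previous group."""
    n, c, h, w = x.shape
    assert c % groups == 0
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))

x = np.arange(6, dtype=np.float32).reshape(1, 6, 1, 1)
y = channel_shuffle(x, groups=3)   # channel order becomes [0, 2, 4, 1, 3, 5]
```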

5 EMERGING APPLICATIONS OF INTELLIGENT VISION AND SPEECH SYSTEMS

We identify three fields of research that are undergoing a paradigm shift through recent advances in vision- and speech-related frameworks. First, the quantification of human behavior and expressions from visual images and speech offers great potential in cybernetics, security and surveillance, forensics, quantitative behavioral science, and psychology research [134]. Second, the field of transportation research is rapidly incorporating intelligent vision systems for smart traffic management and self-driving technology. Third, neural networks in medical image analysis show tremendous promise for 'precision medicine'. This represents a vast opportunity to automate clinical measurements, optimize patient outcome predictions, and assist physicians in clinical practice.

5.1 Intelligence in behavioral science

The field of behavioral science widely uses human annotations and qualitative screening protocols to study complex patterns in

human behavior. These traditional methods are prone to error due to high variability in human rating and the qualitative nature of

behavioral information processing. Many computer vision studies on human behavior, e.g., facial expression analyses [135], can

move across disciplines to revolutionize human behavioral studies with automation and precision.

In behavioral studies, facial expressions and speech are two of the most common means of detecting human emotional states.

Yang et al. use quantitative analysis of vocal idiosyncrasy for screening depression severity [23]. Children with

neurodevelopmental disorders such as autism are known to have distinctive characteristics in speech and voice [24]. Hence,

computational methods for detecting differential speech features and discriminative models [25] can help in the development of

future applications that recognize emotion from the voices of children with autism. Recently, deep learning frameworks have been employed to recognize emotion from speech data, promising more efficient and sophisticated applications in the future [26, 27,

136].

On the other hand, visual images from videos are used to recognize human behavioral contents [137] such as facial expressions,

head motion, human pose, and gestures to support a variety of applications for security, surveillance, and forensics [138-140] and

human-computer interactions [19]. The vision-based recognition of facial action units defined by the facial action coding system (FACS) [141] has enabled finer-grained analysis of emotional and physiological patterns beyond prototypical facial expressions such as happiness, fear, or anger. Several commercial applications for real-time and robust facial expression and action-unit-level analysis have recently appeared in the market from companies such as Noldus, Affectiva, and Emotient. With millions of facial

images available for training, state-of-the-art deep learning methods have enabled unprecedented accuracies in these commercially

available facial expression recognition applications. These applications are designed to serve a wide range of research studies

including classroom engagement analysis [142], consumer preference study in marketing [143], behavioral economics [144],

atypical facial expression analysis in neurological disorders [145], and other work in the fields of behavioral science and

psychology. The sophistication in face and facial expression analyses may unravel useful markers in diagnosing or differentiating

individuals with behavioral or affective dysfunction such as those with autism spectrum disorder. Intelligent systems for human

sentiment and expression recognition will play lead roles in developing interactive human-computer systems and smart virtual

assistants in the near future.

Fig. 6. Publications on small-footprint implementations of deep learning in computer vision and speech processing.


5.2 Intelligence in transportation

Intelligent transportation systems (ITS) cover a broad range of research interests including monitoring driver inattention [1],

providing video-based lane tracking and smart assistance to driving [2], monitoring traffic for surveillance and traffic flow

management [3], and more recently the tremendous interest in developing self-driving cars [4]. Bojarski et al. have recently used

deep learning frameworks such as CNNs to obtain steering commands from raw images captured by a front-facing camera [5]. The system is designed to operate on highways, on roads without lane markings, and in places with minimal visual guidance. Lane-change detection [2,

6] and pedestrian detection [7] have been studied in computer vision and are now being added as safety features in personal vehicles. Similarly, computer vision-assisted prediction of traffic characteristics, automatic parking, and congestion

detection may significantly ease our efforts in traffic management and safety. Sophisticated deep learning methods, such as LSTMs, are being used to predict short-term traffic [6], and other deep learning frameworks are being used to predict traffic speed and flow [8] and driving behavior [9]. In [10], the authors suggest several aspects of transportation that will be impacted by intelligent systems. For multimodal data collection from roadside sensors, RBMs may be useful, as they have been shown to handle multimodal data processing. For systems onboard vehicles, CNNs can be combined with LSTMs to take actions in real time to avoid accidents and improve vehicle efficiency (see the sketch below). In line with these research efforts, several car manufacturers are competing actively to develop next-generation self-driving vehicles with the aid of recent developments in neural network-based deep learning techniques.
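As a concrete illustration of the CNN+LSTM pairing suggested for onboard systems, the following is a hypothetical PyTorch sketch, not any published model: a small CNN encodes each camera frame, an LSTM integrates the frame sequence, and a linear head emits a single control value such as a steering command. All layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DrivingCNNLSTM(nn.Module):
    """Hypothetical onboard model: a small CNN encodes each camera frame,
    an LSTM integrates the frame sequence, and a linear head emits one
    control value (e.g., a steering command) for the latest frame."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())      # -> 32-d per frame
        self.lstm = nn.LSTM(32, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, frames):                          # (B, T, 3, H, W)
        b, t = frames.shape[:2]
        features = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        outputs, _ = self.lstm(features)                # (B, T, hidden)
        return self.head(outputs[:, -1])                # (B, 1)

model = DrivingCNNLSTM()
steering = model(torch.randn(2, 8, 3, 66, 200))         # two 8-frame clips
```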

5.3 Intelligence in medicine

Despite tremendous development in medical imaging techniques, the field of medicine heavily depends on manual annotations and visual assessment of a patient's anatomy and physiology from medical images. Clinically trained human eyes sometimes miss important and subtle markers in medical images, resulting in misdiagnosis. Misdiagnosis, or even failure to diagnose early, can lead to fatal consequences: medical error is reported as the third leading cause of death in the United States [146]. Sophisticated deep learning models, together with the availability of massive multi-institutional imaging databases, may ultimately drive the future of precision medicine. Deep learning methods have been successful in medical image segmentation [11], shape and functional measurements of organs [12], disease diagnosis [13], biomarker detection [14], patient survival prediction from images [147], and many more. In addition to academic research, many commercial companies, including pioneers in medical imaging such as Philips, Siemens, and IBM, are investing in large initiatives toward incorporating deep learning methods in intelligent medical image analysis. However, a key challenge remains: the requirement for large volumes of ground-truth medical imaging data annotated by clinical experts. With commercial initiatives and clinical, multi-institutional collaborations, deep learning-based applications may soon be available in clinical practice.

6 LIMITATIONS OF DEEP COMPUTATIONAL MODELS

Despite the unprecedented successes of neural networks in recent years, we identify a few specific areas that may greatly impact the future progress of deep learning in IS. The first is developing robust learning algorithms for deep models that require a minimal number of training samples.

6.1 Effect of sample size

The current deep learning models require a huge number of training examples to achieve state-of-the-art performance. However, many application domains, such as certain medical imaging and behavioral analysis studies, lack such massive volumes of training examples. Moreover, prospective acquisition of data may also be expensive in terms of both human and computing resources. The superior performance of deep models comes at the cost of network complexity, which is often hard to optimize and prone to overfitting without a large number of samples to train the hundreds of thousands of parameters. Many research studies tend to present overoptimistic performance with deep models without proper validation or proof of generalization across datasets. Solutions such as data augmentation [148, 149] (illustrated in the sketch below), transfer learning [150], and the introduction of Bayesian concepts [151, 152] have laid the groundwork for learning from small data, which we expect to progress over time.
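Of these small-data remedies, data augmentation is the simplest to illustrate. The following is a minimal, hypothetical NumPy sketch of label-preserving image augmentation in the spirit of [148, 149]; the function name and parameters are our own:

```python
import numpy as np

def augment(image, pad=4, rng=np.random.default_rng(0)):
    """Label-preserving augmentation for one (H, W, C) image: a random
    horizontal flip followed by a random shifted crop via reflect padding.
    Each call manufactures a slightly different training sample."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                       # horizontal flip
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 2 * pad + 1, size=2)
    h, w, _ = image.shape
    return padded[top:top + h, left:left + w, :]        # random shifted crop

batch = [augment(np.random.rand(32, 32, 3)) for _ in range(8)]
```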

A second potential future direction in deep learning research involves improving architectures to efficiently handle high-dimensional imaging data. In medical imaging, cardiovascular imaging involves time-sampled 3D images of the heart, i.e., 4D data. Videos of 3D models and 3D point cloud data likewise involve processing large volumes of data. Current deep CNN models are primarily designed to handle 2D images. Deep models are often extended to handle 3D volumes by either converting the information to 2D sequences or applying dimensionality reduction techniques in a preprocessing stage. This, in turn, results in a loss of important information in the volume data that may be vital for the analysis. Therefore, a carefully designed deep learning architecture capable of efficiently handling raw 3D data, similar to its 2D counterparts, is highly desirable.

Finally, an emerging deep learning research area involves achieving high efficiency for data-intensive applications. However, such applications require careful selection of models and model parameters to ensure robustness.

6.2 Computational burden on mobile platforms

The computational burden of deep models is one of the major constraints to overcome in making deep models as ubiquitous as the internet of things, or in embedding them in wearable or mobile devices without connectivity to a remote server. Current state-of-the-art deep learning models use enormous amounts of hardware resources, which prohibits deploying them in most practical environments. As discussed in Sections 4.1-4.3, we believe that improvements in efficiency and memory footprint may enable seamless utilization of mobile and wearable devices. An emerging deep learning research area involves achieving real-time learning in memory-constrained applications. Such real-time operation will require careful selection of learning models, model parameterization, and sophisticated hardware-software co-design, among other considerations.

6.3 Interpretability of models

The complexity of network architectures has been a critical factor limiting useful interpretation of model outcomes. In most applications, deep models are used as 'black boxes' and optimized using heuristic methods for different tasks. For example, dropout has been introduced to combat model overfitting [151, 153]; it essentially deactivates a number of neurons at random, without learning which neurons and weights are truly important for optimizing network performance (see the sketch below). More importantly, the importance of input features and the inner working principles of deep models are not well understood. Though there has been some progress in understanding the theoretical underpinnings of these networks [154], more work needs to be done.
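As a concrete illustration, inverted dropout can be written in a few lines; this is a generic sketch of the technique described in [151, 153], not any particular implementation:

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: during training, zero each activation with
    probability p and rescale the survivors so the expected value is
    unchanged; at test time the layer is the identity. Which neurons are
    dropped is purely random; nothing is learned about their importance."""
    if not training:
        return activations
    keep = rng.random(activations.shape) >= p
    return activations * keep / (1.0 - p)
```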

6.4 Pitfalls of over-optimism

In a few applications, such as the game of Go, deep models have surpassed human performance [155], which has led to the notion that intelligent systems may replace human experts in the future. However, vision-based intelligent algorithms should not be solely relied upon for critical decision-making, such as clinical diagnosis, without the supervision of a radiologist, especially where human lives are at stake. While deep neural networks can perform many routine, repetitive, and predictive tasks better than human senses (such as vision) can offer, intelligent machines are unable to master many inherently human traits, such as empathy. Therefore, neural network-based IS may be better viewed as complementary tools to optimize human performance and decision-making.

7 SUMMARY OF SURVEY

This paper systematically reviews the most recent progress in innovating sophisticated intelligent algorithms in vision and speech, their applications, and their limitations when implemented on the most popular mobile and embedded devices. The rapid evolution and success of deep learning algorithms have pioneered many new applications and commercial initiatives pertaining to intelligent vision and speech systems, which in turn are improving our daily lives. Despite the tremendous success and performance gains of deep learning algorithms, there remain substantial challenges in implementing standalone vision and speech applications on mobile and resource-constrained devices. Future research efforts will reach billions of mobile phone users with the most sophisticated deep learning-based intelligent systems. From sentiment and emotion recognition to self-driving intelligent transportation systems, there is a long list of vision and speech applications that will gradually automate and assist human visual and auditory perception at greater scale and precision. With an overview of emerging applications across disciplines such as behavioral science, psychology, transportation, and medicine, this paper serves as a foundation for researchers, practitioners, application developers, and users.

The key observations of this survey are summarized below. First, we provide an overview of different state-of-the-art DNN algorithms and architectures in vision and speech applications. Several variants of CNN models [36, 66-72] have been proposed to address critical challenges in vision-related recognition. Currently, the CNN is one of the most successful and dynamic areas of research and dominates state-of-the-art vision systems in both industry and academia. In addition, we briefly survey several other pioneering DNN architectures, such as DBNs, DBMs, and SAEs, in vision and speech recognition applications. RNN models lead current speech recognition systems, especially in the emerging applications of NLP. Several revolutionary variants of the RNN, such as the non-linear structure of the LSTM [34, 88] and the hybrid CNN-LSTM architecture [156], have made substantial improvements in the fields of intelligent speech recognition and automatic image captioning.

Second, we address several challenges for state-of-the-art neural networks in adapting to compact and mobile platforms. Despite tremendous success in performance, state-of-the-art intelligent algorithms entail heavy computation, memory usage, and power consumption. Studies on embedded intelligent systems, such as speech recognition and keyword spotting, have focused on adapting the most robust deep language models to the resource-restricted hardware available in mobile devices. Several studies [108-111, 114] have customized DNN, CNN, and recurrent LSTM architectures with compression and quantization schemes to achieve considerable reductions in memory and computational requirements. Similarly, recent studies on embedded computer vision models propose lightweight, efficient deep architectures [116, 120, 122] that are capable of real-time performance on existing mobile CPU and GPU hardware. We further identify several studies on computational algorithms and software systems [126, 127, 129] that greatly augment the efficiency of contemporary deep models regardless of the recognition task. In addition, we identify the need for further research in developing robust learning algorithms for deep models that can be effectively trained with a minimal number of training samples. More computationally efficient architectures are also expected to emerge that fully incorporate complex 3D/4D imaging data in training deep models. Moreover, fundamental research in hardware-software co-design is needed to support real-time learning in today's memory-constrained cyber and physical systems.

Third, we identify three areas that are undergoing a paradigm shift largely driven by vision- and speech-based intelligent systems. Vision- and speech-based recognition of human emotion and behavior is revolutionizing a range of disciplines, from behavioral science and psychology to customer research and human-computer interactions. Intelligent applications for driver assistance and self-driving cars can greatly benefit from vision-based computational systems for future traffic management and services. Deep neural networks in vision-based intelligent systems are rapidly transforming clinical research with the promise of futuristic precision diagnostic tools. Finally, we highlight three limitations of deep models: the pitfalls of using small datasets, hardware constraints on mobile devices, and the danger of over-optimism about replacing human experts with intelligent machines.

We hope this comprehensive survey on deep neural networks for vision and speech processing will serve as a key technical resource for future innovations and evolution in autonomous systems.

ACKNOWLEDGMENT

The authors would like to acknowledge partial funding of this work by the National Science Foundation (NSF) through Award ECCS 1310353 and the National Institutes of Health (NIH) through NIBIB/NIH grant R01 EB020683. The views and findings reported in this work are solely those of the authors and do not represent those of the NSF or NIH.

REFERENCES

[1] Y. Dong, Z. Hu, K. Uchimura, and N. Murayama, "Driver inattention monitoring system for intelligent vehicles: A review," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp. 596-614, 2011.

[2] J. C. McCall and M. M. Trivedi, "Video-based lane estimation and tracking for driver assistance: Survey, system, and evaluation," IEEE Transactions on Intelligent Transportation Systems, vol. 7, no. 1, pp. 20-37, 2006.

[3] N. Buch, S. A. Velastin, and J. Orwell, "A Review of Computer Vision Techniques for the Analysis of Urban Traffic,"

IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 3, pp. 920-939, 2011.

[4] E. Ohn-Bar and M. M. Trivedi, "Looking at Humans in the Age of Self-Driving and Highly Automated Vehicles," IEEE

Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 90-104, 2016.

[5] M. Bojarski et al., "End to End Learning for Self-Driving Cars," arXiv preprint arXiv:1604.07316, pp. 1-9, 2016.

[6] H. Woo et al., "Lane-Change Detection Based on Vehicle-Trajectory Prediction," IEEE Robotics and Automation Letters,

vol. 2, no. 2, pp. 1109-1116, 2017.

[7] W. Ouyang, X. Zeng, and X. Wang, "Single-pedestrian detection aided by two-pedestrian detection," IEEE Transactions

on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1875-1889, 2015.

[8] W. Huang, G. Song, H. Hong, and K. Xie, "Deep Architecture for Traffic Flow Prediction: Deep Belief Networks With

Multitask Learning," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 5, pp. 2191-2201, 2014.

[9] X. Wang, R. Jiang, L. Li, Y. Lin, X. Zheng, and F.-Y. Wang, "Capturing Car-Following Behaviors by Deep Learning,"

IEEE Transactions on Intelligent Transportation Systems, pp. 1-11, 2017.

[10] A. Ferdowsi, U. Challita, and W. Saad, "Deep Learning for Reliable Mobile Edge Analytics in Intelligent Transportation

Systems: An Overview," ieee vehicular technology magazine, vol. 14, no. 1, pp. 62-70, 2019.

[11] M. Havaei et al., "Brain tumor segmentation with Deep Neural Networks," Medical Image Analysis, vol. 35, pp. 18-31,

2017.

[12] M. Nelson, A. Mandar, M. ChaRandle Jordan, and M. Atif Qasim, "An End-to-End Computer Vision Pipeline for Automated Cardiac Function Assessment by Echocardiography," arXiv preprint arXiv:1706.07342, pp. 1-14, 2017.

[13] S. Liu et al., "Multimodal Neuroimaging Feature Learning for Multiclass Diagnosis of Alzheimer's Disease," IEEE

Transactions on Biomedical Engineering, vol. 62, no. 4, pp. 1132-1140, 2015.

[14] E. Putin et al., "Deep biomarkers of human aging: Application of deep neural networks to biomarker development,"

Aging, vol. 8, no. 5, pp. 1021-1033, 2016.

[15] M. R. Alam, M. B. I. Reaz, and M. A. M. Ali, "A review of smart homes—Past, present, and future," IEEE Transactions

on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1190-1203, 2012.

[16] R. S. Cooper, J. F. McElroy, W. Rolandi, D. Sanders, R. M. Ulmer, and E. Peebles, "Personal virtual assistant," ed:

Google Patents, 2011.

[17] E. W. Ngai, L. Xiu, and D. C. Chau, "Application of data mining techniques in customer relationship management: A

literature review and classification," Expert systems with applications, vol. 36, no. 2, pp. 2592-2602, 2009.

[18] S. Goswami, S. Chakraborty, S. Ghosh, A. Chakrabarti, and B. Chakraborty, "A review on application of data mining

techniques to combat natural disasters," Ain Shams Engineering Journal, pp. 1-14, 2016.

[19] S. S. Rautaray and A. Agrawal, "Vision based hand gesture recognition for human computer interaction: a survey,"

Artificial Intelligence Review, vol. 43, no. 1, pp. 1-54, 2015.

[20] A. Toshev and C. Szegedy, "Deeppose: Human pose estimation via deep neural networks," in Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, 2014, pp. 1653-1660.

[21] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for

human pose estimation," in Advances in neural information processing systems, 2014, pp. 1799-1807.

[22] S. Srivastava, A. Bisht, and N. Narayan, "Safety and security in smart cities using artificial intelligence—A review," in

Cloud Computing, Data Science & Engineering-Confluence, 2017 7th International Conference on, 2017, pp. 130-133:

IEEE.

[23] Y. Yang, C. Fairbairn, and J. F. Cohn, "Detecting depression severity from vocal prosody," IEEE Transactions on

Affective Computing, vol. 4, no. 2, pp. 142-150, 2013.


[24] L. D. Shriberg, R. Paul, J. L. McSweeny, A. Klin, D. J. Cohen, and F. R. Volkmar, "Speech and prosody characteristics of

adolescents and adults with high-functioning autism and Asperger syndrome," Journal of Speech, Language, and Hearing

Research, vol. 44, no. 5, pp. 1097-1115, 2001.

[25] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and

databases," Pattern Recognition, vol. 44, no. 3, pp. 572-587, 2011.

[26] H. M. Fayek, M. Lech, and L. Cavedon, "Evaluating deep learning architectures for Speech Emotion Recognition," Neural

Networks, vol. 92, pp. 60-68, 2017.

[27] Y. Kim, H. Lee, and E. M. Provost, "Deep learning for robust feature generation in audiovisual emotion recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 2013, pp. 3687-3691: IEEE.

[28] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural computation, vol. 18,

no. 7, pp. 1527-1554, 2006.

[29] G. E. Hinton, "Learning multiple layers of representation," Trends in cognitive sciences, vol. 11, no. 10, pp. 428-434,

2007.

[30] R. M. Cichy, A. Khosla, D. Pantazis, A. Torralba, and A. Oliva, "Comparison of deep neural networks to spatio-temporal

cortical dynamics of human visual object recognition reveals hierarchical correspondence," Scientific reports, vol. 6, pp.

1-13, 2016, Art. no. 27755.

[31] N. Kruger et al., "Deep hierarchies in the primate visual cortex: What can we learn for computer vision?," IEEE

transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1847-1871, 2013.

[32] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural networks, vol. 61, pp. 85-117, 2015.

[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings

of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.

[34] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[35] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful

representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no.

Dec, pp. 3371-3408, 2010.

[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in

Advances in neural information processing systems, 2012, pp. 1097-1105.

[37] M. Alam, L. Vidyaratne, and K. M. Iftekharuddin, "Novel hierarchical Cellular Simultaneous Recurrent neural Network

for object detection," in Neural Networks (IJCNN), 2015 International Joint Conference on, 2015, pp. 1-7.

[38] R. Salakhutdinov, A. Mnih, and G. Hinton, "Restricted Boltzmann machines for collaborative filtering," in Proceedings of

the 24th international conference on Machine learning, 2007, pp. 791-798: ACM.

[39] R. Salakhutdinov and G. Hinton, "Deep boltzmann machines," in Artificial Intelligence and Statistics, 2009, pp. 448-455.

[40] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in

Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 2013, pp. 3377-3381: IEEE.

[41] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising

autoencoders," in Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096-1103: ACM.

[42] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.

[43] X. Yan, J. Yang, K. Sohn, and H. Lee, "Attribute2image: Conditional image generation from visual attributes," in

European Conference on Computer Vision, 2016, pp. 776-791: Springer.

[44] J. Walker, C. Doersch, A. Gupta, and M. Hebert, "An uncertain future: Forecasting from static images using variational

autoencoders," in European Conference on Computer Vision, 2016, pp. 835-851: Springer.

[45] I. Goodfellow et al., "Generative adversarial nets," in Advances in neural information processing systems, 2014, pp. 2672-

2680.

[46] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," arXiv

preprint arXiv:1605.05396, 2016.

[47] C. Ledig et al., "Photo-realistic single image super-resolution using a generative adversarial network," arXiv preprint arXiv:1609.04802, 2017.

[48] M. Arjovsky and L. Bottou, "Towards principled methods for training generative adversarial networks," arXiv preprint

arXiv:1701.04862, 2017.

[49] I. Goodfellow, "NIPS 2016 tutorial: Generative adversarial networks," arXiv preprint arXiv:1701.00160, 2016.

[50] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE

transactions on neural networks, vol. 5, no. 2, pp. 157-166, 1994.

[51] Z. C. Lipton, J. Berkowitz, and C. Elkan, "A critical review of recurrent neural networks for sequence learning," arXiv

preprint arXiv:1506.00019, pp. 1-38, 2015.

[52] V. Mnih, N. Heess, and A. Graves, "Recurrent models of visual attention," in Advances in neural information processing

systems, 2014, pp. 2204-2212.

[53] C. Koch and S. Ullman, "Shifts in selective visual attention: towards the underlying neural circuitry," in Matters of

intelligence: Springer, 1987, pp. 115-141.

[54] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions

on pattern analysis and machine intelligence, vol. 20, no. 11, pp. 1254-1259, 1998.


[55] H. Larochelle and G. E. Hinton, "Learning to combine foveal glimpses with a third-order Boltzmann machine," in

Advances in neural information processing systems, 2010, pp. 1243-1251.

[56] M. A. Ranzato, "On learning where to look," arXiv preprint arXiv:1405.5488, 2014.

[57] M. Denil, L. Bazzani, H. Larochelle, and N. de Freitas, "Learning where to attend with deep architectures for image

tracking," Neural computation, vol. 24, no. 8, pp. 2151-2184, 2012.

[58] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, "Draw: A recurrent neural network for image

generation," arXiv preprint arXiv:1502.04623, 2015.

[59] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, "Aligning Where to See and What to Tell: Image Captioning with Region-

Based Attention and Scene-Specific Contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39,

no. 12, pp. 2321-2334, 2017.

[60] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov, "Generating images from captions with attention," arXiv

preprint arXiv:1511.02793, 2015.

[61] A. Graves, G. Wayne, and I. Danihelka, "Neural turing machines," arXiv preprint arXiv:1410.5401, pp. 1-26, 2014.

[62] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv

preprint arXiv:1508.04025, pp. 1-11, 2015.

[63] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational

speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on,

2016, pp. 4960-4964: IEEE.

[64] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang, "Skeleton-based action recognition using spatio-temporal LSTM

network with trust gates," IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 12, pp. 3007-3021,

2018.

[65] Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," ed, 1998.

[66] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE transactions

on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1915-1929, 2013.

[67] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and

Pattern Recognition, 2015, pp. 1-9.

[68] O. Russakovsky et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol.

115, no. 3, pp. 211-252, 2015.

[69] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint

arXiv:1409.1556, pp. 1-14, 2014.

[70] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European conference on

computer vision, 2014, pp. 818-833: Springer.

[71] Z. Wu, C. Shen, and A. v. d. Hengel, "Wider or deeper: Revisiting the resnet model for visual recognition," arXiv preprint

arXiv:1611.10080, pp. 1-19, 2016.

[72] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet

classification," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026-1034.

[73] Y. LeCun et al., "Learning algorithms for classification: A comparison on handwritten digit recognition," Neural

networks: the statistical mechanics perspective, vol. 261, p. 276, 1995.

[74] P. Wang, Y. Cao, C. Shen, L. Liu, and H. T. Shen, "Temporal pyramid pooling based convolutional neural networks for

action recognition," IEEE Trans. Circuits and Systems for Video Technology, vol. 27, no. 12, pp. 2613-2622, 2017.

[75] G. Gkioxari, R. Girshick, and J. Malik, "Contextual action recognition with r* cnn," in Proceedings of the IEEE

international conference on computer vision, 2015, pp. 1080-1088.

[76] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," International journal of computer

vision, vol. 61, no. 1, pp. 55-79, 2005.

[77] M. A. Fischler and R. A. Elschlager, "The representation and matching of pictorial structures," IEEE Transactions on

computers, vol. 100, no. 1, pp. 67-92, 1973.

[78] L. Ge, H. Liang, J. Yuan, and D. Thalmann, "Real-time 3D hand pose estimation with 3D convolutional neural networks,"

IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 4, pp. 956-970, 2019.

[79] G. B. Huang, H. Lee, and E. Learned-Miller, "Learning hierarchical representations for face verification with

convolutional deep belief networks," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on,

2012, pp. 2518-2525: IEEE.

[80] A.-r. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Nips workshop on deep

learning for speech recognition and related applications, 2009, vol. 1, no. 9, p. 39.

[81] G. Wen, H. Li, J. Huang, D. Li, and E. Xun, "Random Deep Belief Networks for Recognizing Emotions from Speech

Signals," Computational intelligence and neuroscience, vol. 2017, pp. 1-9, 2017.

[82] C. Huang, W. Gong, W. Fu, and D. Feng, "A research of speech emotion recognition based on deep belief network and

SVM," Mathematical Problems in Engineering, vol. 2014, pp. 1-7, 2014.

[83] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning

of hierarchical representations," in Proceedings of the 26th annual international conference on machine learning, 2009,

pp. 609-616: ACM.


[84] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černocký, "Strategies for training large scale neural network language

models," in Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, 2011, pp. 196-201:

IEEE.

[85] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research

groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.

[86] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in

Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 2013, pp. 8614-8618: IEEE.

[87] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale

acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014, pp.

338-342.

[88] J.-T. Chien and A. Misbullah, "Deep long short-term memory networks for speech recognition," in Chinese Spoken

Language Processing (ISCSLP), 2016 10th International Symposium on, 2016, pp. 1-5: IEEE.

[89] J. Weston, S. Chopra, and A. Bordes, "Memory networks," arXiv preprint arXiv:1410.3916, pp. 1-15, 2014.

[90] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term

memory networks," arXiv preprint arXiv:1503.00075, pp. 1-11, 2015.

[91] Y. Wu et al., "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine

Translation," arXiv preprint arXiv:1609.08144, pp. 1-23, 2016.

[92] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the

IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128-3137.

[93] Google. (2019). Google Brain Team's Mission. Available: https://ai.google/research/teams/brain/

[94] Facebook. (2019). Facebook AI Research (FAIR). Available: https://research.fb.com/category/facebook-ai-research-fair/

[95] T. Simonite, "Facebook’s Perfect, Impossible Chatbot," MIT Technology Review, Available:

https://www.technologyreview.com/s/604117/facebooks-perfect-impossible-chatbot/

[96] Microsoft. (2019). Cognitive Toolkit. Available: https://docs.microsoft.com/en-us/cognitive-toolkit/

[97] W. Xiong et al., "Achieving human parity in conversational speech recognition," arXiv preprint arXiv:1610.05256, pp. 1-

13, 2016.

[98] Microsoft. (2019). Cortana. Available: https://www.microsoft.com/en-us/cortana

[99] InfiniBand Trade Association, "Specification FAQ," Available: http://www.infinibandta.org/content/pages.php?pg=technology_faq

[100] D. Amodei et al., "Deep speech 2: End-to-end speech recognition in english and mandarin," in International Conference

on Machine Learning, 2016, pp. 173-182.

[101] NVIDIA. (2019). Deep Learning AI. Available: https://www.nvidia.com/en-us/deep-learning-ai/

[102] IBM. (2019). Watson. Available: https://www.ibm.com/watson/

[103] Apple Inc. (2019). Apple Machine Learning Journal. Available: https://machinelearning.apple.com/

[104] Amazon Web Services. (2019). Amazon Machine Learning. Available: https://aws.amazon.com/sagemaker

[105] Uber Engineering, "Engineering More Reliable Transportation with Machine Learning and AI at Uber," Available:

https://eng.uber.com/machine-learning/

[106] Intel, "Machine Learning Offers a Path to Deeper Insight," Available:

https://www.intel.com/content/www/us/en/analytics/machine-learning/overview.html

[107] J. Schalkwyk et al., "“Your Word is my Command”: Google Search by Voice: A Case Study," in Advances in Speech

Recognition: Springer, 2010, pp. 61-90.

[108] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in Acoustics, Speech

and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014, pp. 4087-4091: IEEE.

[109] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Sixteenth Annual

Conference of the International Speech Communication Association, 2015, pp. 1478-1482.

[110] G. Chen, C. Parada, and T. N. Sainath, "Query-by-example keyword spotting using long short-term memory networks," in

Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, 2015, pp. 5236-5240: IEEE.

[111] X. Lei, A. W. Senior, A. Gruenstein, and J. Sorensen, "Accurate and compact large vocabulary speech recognition on

mobile devices," in Interspeech, 2013, vol. 1, pp. 662-665.

[112] B. Ballinger, C. Allauzen, A. Gruenstein, and J. Schalkwyk, "On-demand language model interpolation for mobile speech

input," in Interspeech, 2010, pp. 1812-1815.

[113] J. Sorensen and C. Allauzen, "Unary data structures for language models," in Twelfth Annual Conference of the

International Speech Communication Association, 2011, pp. 1425-1428.

[114] Y. Wang, J. Li, and Y. Gong, "Small-footprint high-performance deep neural network-based speech recognition using

split-VQ," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, 2015, pp.

4984-4988: IEEE.

[115] G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni, "Model Compression Applied to Small-

Footprint Keyword Spotting," in INTERSPEECH, 2016, pp. 1878-1882.

[116] S. Sarkar, V. M. Patel, and R. Chellappa, "Deep feature-based face detection on mobile devices," in Identity, Security and

Behavior Analysis (ISBA), 2016 IEEE International Conference on, 2016, pp. 1-8: IEEE.


[117] Y. Bengio et al., "Deep learners benefit more from out-of-distribution examples," in Proceedings of the Fourteenth

International Conference on Artificial Intelligence and Statistics, 2011, pp. 164-172.

[118] M. E. Fathy, V. M. Patel, and R. Chellappa, "Face-based active authentication on mobile devices," in Acoustics, Speech

and Signal Processing (ICASSP), 2015 IEEE International Conference on, 2015, pp. 1687-1691: IEEE.

[119] C. McCool and S. Marcel, "Mobio database for the ICPR 2010 face and speech competition," Idiap Research Institute, 2009.

[120] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, and F. Kawsar, "An early resource characterization of deep

learning on wearables, smartphones and internet-of-things devices," in Proceedings of the 2015 International Workshop

on Internet of Things towards Applications, 2015, pp. 7-12: ACM.

[121] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, "Multi-digit number recognition from street view imagery

using deep convolutional neural networks," arXiv preprint arXiv:1312.6082, pp. 1-13, 2013.

[122] N. D. Lane and P. Georgiev, "Can deep learning revolutionize mobile sensing?," in Proceedings of the 16th International

Workshop on Mobile Computing Systems and Applications, 2015, pp. 117-122: ACM.

[123] N. D. Lane et al., "Deepx: A software accelerator for low-power deep learning inference on mobile devices," in

Information Processing in Sensor Networks (IPSN), 2016 15th ACM/IEEE International Conference on, 2016, pp. 1-12:

IEEE.

[124] N. Evans, Z. Wu, J. Yamagishi, and T. Kinnunen, "Automatic Speaker Verification Spoofing and Countermeasures

Challenge (ASVspoof 2015) Database," 2015.

[125] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised

feature learning," in NIPS workshop on deep learning and unsupervised feature learning, 2011, vol. 2011, no. 2, p. 5.

[126] A. Rakotomamonjy and G. Gasso, "Histogram of gradients of time-frequency representations for audio scene detection,"

arXiv preprint, pp. 1-15, 2014.

[127] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," in Advances in Neural

Information Processing Systems, 2015, pp. 3088-3096.

[128] V. Pan, Structured matrices and polynomials: unified superfast algorithms. Springer Science & Business Media, 2012.

[129] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained

quantization and huffman coding," arXiv preprint arXiv:1510.00149, pp. 1-14, 2015.

[130] S. Wang and J. Jiang, "Learning natural language inference with LSTM," arXiv preprint arXiv:1512.08849, pp. 1-10,

2015.

[131] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in neural

information processing systems, 1993, pp. 164-171.

[132] J. Van Leeuwen, "On the Construction of Huffman Trees," in ICALP, 1976, pp. 382-410.

[133] X. Zhang, X. Zhou, M. Lin, and J. Sun, "Shufflenet: An extremely efficient convolutional neural network for mobile

devices," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848-6856.

[134] R. A. Calvo and S. D'Mello, "Affect detection: An interdisciplinary review of models, methods, and their applications,"

IEEE Transactions on Affective Computing, vol. 1, no. 1, pp. 18-37, 2010.

[135] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, "Recognizing facial expression: machine

learning and application to spontaneous behavior," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE

Computer Society Conference on, 2005, vol. 2, pp. 568-573: IEEE.

[136] E. M. Albornoz, M. Sánchez-Gutiérrez, F. Martinez-Licona, H. L. Rufiner, and J. Goddard, "Spoken Emotion Recognition

Using Deep Learning," Springer, Cham, 2014, pp. 104-111.

[137] S. Wang and Q. Ji, "Video affective content analysis: a survey of state of the art methods," IEEE Transactions on Affective

Computing, vol. 6, no. 4, pp. 1-1, 2015.

[138] M. G. Ball, B. Qela, and S. Wesolkowski, "A review of the use of computational intelligence in the design of military

surveillance networks," in Recent Advances in Computational Intelligence in Defense and Security: Springer, 2016, pp.

663-693.

[139] R. Olmos, S. Tabik, and F. Herrera, "Automatic handgun detection alarm in videos using deep learning," Neurocomputing,

vol. 275, pp. 66-72, 2018.

[140] X. Li et al., "Towards reading hidden emotions: A comparative study of spontaneous micro-expression spotting and

recognition methods," IEEE Transactions on Affective Computing, 2017.

[141] P. Ekman and W. V. Friesen, "The Facial Action Coding System," Consulting Psychologists Press, 1978.

[142] J. Whitehill, Z. Serpell, Y. C. Lin, A. Foster, and J. R. Movellan, "The faces of engagement: Automatic recognition of

student engagement from facial expressions," IEEE Transactions on Affective Computing, vol. 5, no. 1, pp. 86-98, 2014.

[143] K. A. Leitch, S. E. Duncan, S. O'Keefe, R. Rudd, and D. L. Gallagher, "Characterizing consumer emotional response to

sweeteners using an emotion terminology questionnaire and facial expression analysis," Food Research International, vol.

76, pp. 283-292, 2015.

[144] C. F. Camerer, "Artificial intelligence and behavioral economics," in Economics of Artificial Intelligence: University of

Chicago Press, 2017.

[145] M. D. Samad, N. Diawara, J. L. Bobzien, J. W. Harrington, M. A. Witherow, and K. M. Iftekharuddin, "A Feasibility

Study of Autism Behavioral Markers in Spontaneous Facial, Visual, and Hand Movement Response Data," IEEE

Transactions on Neural Systems and Rehabilitation Engineering, vol. 26, no. 2, pp. 353-361, 2018.


[146] M. Daniel and M. A. Makary, "Medical error—the third leading cause of death in the US," BMJ, vol. 353, i2139, 2016.

[147] A. Ulloa et al., "A deep neural network predicts survival after heart imaging better than cardiologists," arXiv preprint

arXiv:1811.10553, 2018.

[148] C. C. Charalambous and A. A. Bharath, "A data augmentation methodology for training machine/deep learning gait

recognition algorithms," arXiv preprint arXiv:1610.07570, pp. 1-12, 2016.

[149] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell, "Understanding data augmentation for classification: when to

warp?," arXiv preprint arXiv:1609.08764, pp. 1-6, 2016.

[150] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, and G. Zhang, "Transfer learning using computational intelligence: a survey,"

Knowledge-Based Systems, vol. 80, pp. 14-23, 2015.

[151] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in

international conference on machine learning, 2016, pp. 1050-1059.

[152] H. Wang and D.-Y. Yeung, "Towards bayesian deep learning: A survey," arXiv preprint arXiv:1604.01662, pp. 1-17,

2016.

[153] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[154] J. Zou, T. Rui, Y. Zhou, C. Yang, and S. Zhang, "Convolutional neural network simplification via feature map pruning,"

Computers & Electrical Engineering, pp. 1-9, 2018.

[155] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.

[156] J. Johnson, A. Karpathy, and L. Fei-Fei, "Densecap: Fully convolutional localization networks for dense captioning," in

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4565-4574.