& 4 a e ÇÚÒÇÈ ÏÚËhÉÕÓ -...

Adaptive dropout for training deep neural networks

by

Jimmy Lei Ba

A thesis submitted in conformity with the requirementsfor the degree of Master of Applied Science

Graduate Department of Electrical & Computer EngineeringUniversity of Toronto

c© Copyright 2014 by Jimmy Lei Ba

متلب سایت

MatlabSite.com

MatlabSite.com متلب سایت

Abstract

Adaptive dropout for training deep neural networks

Jimmy Lei Ba

Master of Applied Science

Graduate Department of Electrical & Computer Engineering

University of Toronto

2014

Recently, it was shown that deep neural networks perform very well if the activities of hidden units

are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a

method called “standout” in which a binary belief network is overlaid on a neural network and is used to

regularize of its hidden units by selectively setting activities to zero. This “adaptive dropout network”

can be trained jointly with the neural network by approximately computing local expectations of binary

dropout variables and computing derivatives using back-propagation. Interestingly, experiments suggest

that a good dropout network regularizes activities according to magnitude. When evaluated on the

MNIST and NORB datasets, we found that our method achieves lower classification error rates than other

feature learning methods, including standard dropout, and RBM. We also present the discriminative

learning results using our method on the MNIST, NORB and CIFAR-10 datasets.

ii

متلب سایت

MatlabSite.com


Acknowledgements

I would like to give thanks to my great supervisor Brendan Frey for all the support and encouragement.

iii

متلب سایت

MatlabSite.com


Contents

1 Introduction 1

2 Background 3

2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1.1 Auto-encoder Model and Hidden Representation . . . . . . . . . . . . . . . . . . . 4

2.2 Learning with Error Back-propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.3 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.4 Deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.4.1 Unsupervised feature learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.4.2 Discriminative learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Learning with Adaptive Dropout Network 7

3.1 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.2 Inferring dropout probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.3 Standout: Binary belief networks for dropout prediction . . . . . . . . . . . . . . . . . . . 8

3.4 Making predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.5 Approximation to Bayesian Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.6 Stochastic Adaptive Mixture of Local Experts . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.7 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.8 Exploratory experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.9 Choice of standout function and hyper-parameters α and β . . . . . . . . . . . . . . . . . 12

3.10 Analysis of a two-layer network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.11 Optimization and momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4 Experimental Results 17

4.1 Nonlinearity for feedforward network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.2 Computation time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.3 Handwritten Digits Recognition: MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.4 3D object Recognition: NORB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.5 Tiny Images Object Recogntion: CIFAR-10 . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5 Conclusion 25

Bibliography 26

iv

متلب سایت

MatlabSite.com


Chapter 1

Introduction

For decades, deep networks (> 1 hidden layer) with broad hidden layers and full connectivity could

not be trained to produce useful results, because of overfitting, slow convergence and other issues. One

approach that has proven to be successful for unsupervised learning of both probabilistic generative

models and auto-encoders is to train a deep network layer by layer in a greedy fashion [10]. Each layer

of connections is learnt using contrastive divergence in a restricted Boltzmann machine (RBM) [9] or

backpropagation through a one-layer auto-encoder [2]. The hidden activities are used to train the next

layer. When the parameters of a deep network are initialized in this way, further fine tuning can be used

to improve the model, e.g., for classification [3]. The unsupervised feature learning algorithm in the

pre-training stage is a crucial component for achieving competitive overall performance on classification

tasks. For instance, Coates et al. [5] have shown that improved results can be obtained by mixing and

matching different unsupervised learning algorithms.

Recently, a technique called dropout was shown to significantly improve the performance of deep

neural networks on various tasks [11], including state of the art computer vision problem [13]. Dropout

consists of randomly setting hidden unit activities to zero with a probability of 0.5, during training. Each

training example can thus be viewed as providing learning gradients for a different, randomly sampled

architecture, so that the final neural network can be viewed as efficiently representing a huge ensemble

of neural networks, with good generalization capability. Experimental results on several tasks show that

dropout frequently and significantly improves the classification performance of deep architecture.

The method of introducing noise while training a feedforward network has been studied previously.

Bishop et al. [4] conclude that adding a small amount of Gaussian noise to the input is equivalent

to weight decay. Sietsma et al. [30] studied the effects of adding noise to different components of a

network, but was not able to obtain improved learning performance. More recently, Vincent et al. [34]

studied applying noise to the input of an auto-encoder for feature learning. In contrast to this previous

work, dropout is directly aimed at improving generalization performance of deep and broad architectures

by noisily corrupting hidden units. Indeed, dropout helps fully connected neural networks generalize

better. Unfortunately, training discriminative deep neural networks with no prior knowledge of spatial

structure of data yields little benefit from dropout when input data has high variations in the viewpoint

and lighting (section 2.4.1). Alternatively, unsupervised learning of deep networks, which discovers the

underlying structure, can help such discriminative tasks.

In this work, we describe a generalization of dropout, where the dropout probability for each hidden

1

متلب سایت

MatlabSite.com


Chapter 1. Introduction 2

variable is computed using a binary belief network that shares parameters with the deep network. Our

method works well both for unsupervised and supervised learning of deep networks. We present results

on the MNIST, NORB and CIFAR-10 datasets showing that our “standout” technique can learn better

feature detectors for handwritten digits and object recognition tasks.

متلب سایت

MatlabSite.com


Chapter 2

Background

This section gives a survey review on the previous methods that are fundamental for the rest of the

thesis.

2.1 Neural Networks

The idea of constructing a neural structure consisting of many neurons that model pairs of input and

output relations has captured interest across many fields. As pointed out in a book by Mackay [18],

neural network models are biologically plausible and can help us understand how the brain works. When

approaching this idea from an engineering point of view, one can then build a learning machine to perform

pattern recognition tasks.

Systems formed by a net of neurons were first formally studied in McCulloch and Pitts [20]. The

authors suggest a method of analyzing such systems through the use of Boolean algebra. Their idea of

realizing certain logical expressions using neural networks led to the consideration of certain structures

which they called random nets. The discovery of “back-propagation” in the context of neural networks

by Rumelhart et al. [28] drastically improves the learning efficiency of such models enabling them to

be used in practice. The vest modern state or the art neural network models are still learned using the

“back-propagation” algorithm.

In a neural network, its architecture is determined by the topological relationship of its neurons. The

parameters in the neural network model are the “weights” of the connections between different neurons.

We can define the activation of each neuron as a function of the weighted sum from its incoming signal:

aj = g(∑

i wiai), wi is the weight on the connection from neuron i to neuron j; where g(·) is the activationfunction. Typically, the activation function takes the form of a logistic function (eg. g(x) = 1

1+e−x ).

The modern neural network topology is often constructed with interconnected neurons that are formed

into layers. The first layer is usually a set of neurons directly forwarding a vector of input features. The

neurons in the last layer compute a vector of output prediction. There is commonly one or more hidden

layers in between. The weights are initially set with random values. They are adjusted by learning

algorithms to improve the performance of the neural network.

A loss function L(w) is associated with a neural network as an evaluation function for its performance.

Commonly, we model the deviation between the prediction of a neural network and a target value under

the squared error loss function, L(w) =∑

t ||p(xt)− yt||2. Here, p(xt) is the model prediction given the

3

متلب سایت

MatlabSite.com


Chapter 2. Background 4

Figure 2.1: An auto-encoder network.

tth input xt; and yt is its corresponding target value. For classification problems, the cross-entropy loss

should be used since it is more robust against outliers in a given training data than squared error loss.

2.1.1 Auto-encoder Model and Hidden Representation

Auto-encoder is a special type of neural network that mirrors its input to the output. The data is

provided as both the input and the target of the network. The squared error between the input and

output is used as the loss function. It is also regarded as a reconstruction error, since the prediction is

the reconstruction of its input in an auto-encoder network.

It is useful to decompose an auto-encoder into its recognition and generation stages. An auto-encoder

transforms the input vector x into the hidden layer representation h using its recognition weights. This

hidden representation has many properties related to other models. It is recognized early that PCA

could be implemented using a linear auto-encoder [1]. It is also shown that factor analysis and Gaussian

mixture models can be represented using auto-encoders under certain loss functions by Roweis and

Ghahramani [27]. Learning a hidden representation has an utterly important role in machine learning

after the discovery of the greedy layer-by-layer pre-training technique [10]. Much of the recent machine

learning research has been devoted to learning useful hidden representations, especially the recognition

algorithm. Given a set of input data, it is suggested that learning a meaningful hidden representation

disregarding the label first could drastically improve the performance of the later classification tasks.

This can be understood that the similarity between data points for under the classification labels may

not directly translate into the similarity of the input data. For example in computer vision, objects

belonging to the same class, such as cars, could be very far away in the pixel space under the Euclidean

metric due to the variations in pose and shape of the car. However, if we can transform the objects

into a vector of hidden representation that each elements of the vector detects particular features such

as wheels, doors and windshields, then the cars will be grouped together in such hidden representation

space as they all have consist of wheels and doors.

متلب سایت

MatlabSite.com



2.2 Learning with Error Back-propagation

For many years, it has always been a challenge to train neural networks that have one or more hidden

layers. Multi-layer neural networks have many desired properties over the Perceptron models. The first

and foremost is that multi-layer neural network are universal function approximators. The addition of

the hidden layers in multi-layer neural networks overcomes the limitations of single layer networks that

are only capable of learning linearly separable patterns as it has been pointed out in [21]. It was first

proved in [7] that every continuous function, mapping intervals of real numbers to another interval of

output numbers can be approximated with arbitrarily precision by using sigmoid activation function

in a multi-layer neural network. However, the universal approximation theorem does not answer how

to adjust the weights in neural networks to improve a given loss function. It is until discovering error

back-propagation [28] that learning by differentiating a loss function with respect to weights provided

an efficient algorithm to train the neural networks. It is worth pointing out that similar ideas were also

independently published by LeCun[16] around the same time.

The central idea of error back-propagation is to compute the partial derivatives of the weights in a

neural network by applying the chain-rule repeatedly. In a layered neural network, the partial derivatives

in the former layers are accumulated products that relate to the partial derivatives of the latter layers.

In order to compute the partial derivatives of the weights in the nth layer, the partial derivatives from

the (n + 1)th layer have to be computed first. The learning rule computes a backward pass of the

partial derivatives along the network from its output. Concretely, given a loss function L, we would

like to compute the gradient with respect to the weights ∂L∂w . The algorithm starts by computing ∂L

∂hj

for the neurons at the output of a network and propagates gradient backwards through the network.

We can then apply the chain rule to compute ∂L∂wi

= ∂L∂hj

∂hj

∂wiusing the neuron hj to which weight wi

is connecting. We can repeat this procedure on all the weights in the network. Knowing the gradient

information enables us to adjust the weights using any gradient-based optimization methods to improve

the performance of neural networks.

2.3 Stochastic Gradient Descent

In recent years, machine learning research has been increasingly focused on realistic large scale learning

problems in computer vision, natural language processing and audio speech processing. When a dataset

has billions of training cases, the batch-based gradient methods become very inefficient. It may take the

gradient decent algorithm a long time just to compute a small update step for the parameters, while

stochastic gradient descent algorithms are already making good progress on the prediction performance.

The class of Robbins-Monro algorithms was first proposed in [26] and was later improved using the

momentum technique. Stochastic gradient descent and its related class of algorithms process a mini

batch of the data at every iteration. The model parameters are updated by taking small noisy gradient

steps that are estimated from the mini batch in a loss function. Often times, these algorithms are run

by randomly sampling the data bathes from the entire training data with replacement.

متلب سایت

MatlabSite.com



2.4 Deep learning

2.4.1 Unsupervised feature learning

Having good low-level features is crucial for obtaining competitive performance in classification and

other high-level machine learning tasks. Learning algorithms that can take advantage of unlabeled data

is appealing due to an increasing amount of unlabeled data. Furthermore, on more challenging datasets,

such as NORB, a fully connected discriminative neural network trained from scratch tends to perform

poorly, even with the help of dropout.

Unsupervised feature learning usually follows the following procedures: we first extract the features

using one of the unsupervised learning algorithms, such as restricted Boltzmann machine (RBM) [10],

denoising auto-encoder [34] or sparse coding [5]. The usefulness of the extracted features is then evaluated

by training a linear classifier to predict the object class from the extracted features. This process is similar

to that employed in other feature learning research [6][25].

2.4.2 Discriminative learning

In deep learning, a common practice is to use the encoder weights learned by an unsupervised learn-

ing method to initialize the early layers of a multilayer discriminative model. The back-propagation

algorithm is then used to learn the weights for the last hidden layer and fine tune the weights in the

layers before. This procedure is often referred to as “discriminative fine-tuning”. It has been shown

that fine-tuning by back-propagation can greatly improve the classification results obtained using linear

classifiers only [22]. Many of the recent advances in object classification tasks are due to discriminative

fine-tuning the weights learned by unsupervised learning algorithms [15].

متلب سایت

MatlabSite.com


Chapter 3

Learning with Adaptive Dropout

Network

A technique called dropout was recently shown to significantly improve the performance of deep neural

networks on various tasks [11], including state of the art computer vision problem [13]. Dropout consists

of randomly setting hidden unit activities to zero with a probability of 0.5, during training. Each

training example can thus be viewed as providing learning gradients for a different, randomly sampled

architecture, so that the final neural network can be viewed as representing a huge ensemble of neural

networks, with good generalization capability. Experimental results on several tasks show that dropout

frequently and significantly improves the classification performance of deep architectures.

3.1 Dropout

In a deep and broad feedforward network, a large number of overly-complicated relationships are learnt at

different rates and with different numbers of over-fit parameters, making overfitting hard to avoid. The

dropout technique [11] regularizes learning by temporarily dropping out hidden units with probability

0.5 when each training case is presented. This is equivalent to randomly setting hidden activities to

zero. This method forces the network to make predictions based on ensembles of hidden unit activities,

since any single hidden unit will not contribute to the prediction 50% of the time. As a result, dropout

is an effective way to prevent overfitting and uselessly complex co-adaptations of units.

In our notation, the activity of unit i in layer l is a(l)i and the activation function of the hidden units

is g(·). We can formally write the feedforward computation for models using dropout as:

a(l)j = g(

∑i

w(l)j,ia

(l−1)i ) ·m(l)

j , (3.1)

m(l)j ∼ Bernoulli(P ), (3.2)

where w(l)j,i is the weight on the connection between unit i in layer l − 1 and unit j in layer l. m

(l)j is a

binary random masking variable that is used to omit hidden activities from the network with probability

P , where P = 0.5 was used in [11].

7

متلب سایت

MatlabSite.com


Chapter 3. Learning with Adaptive Dropout Network 8

3.2 Inferring dropout probability

The original dropout technique uses a constant probability for omitting a unit, so naturally, a question

we considered is whether or not it may help to let this probability be different for different hidden

variables. In particular, there may be hidden variables that can individually make confident predictions

for the presence or absence of an important feature or a combination of features. Standard dropout

will ignore this confidence and drop the unit out 50% of the time. Viewed another way, suppose after

standard dropout is applied, it is found that several hidden units are highly correlated in the pre-dropout

activities. They could be combined into a single hidden unit with a lower dropout probability, freeing

up hidden units for other purposes.

During the learning phase, mask random variables m are sampled for each training data point

according to P (m) and minimize the negative log likelihood given the sample. We can view the learning

objective as minimizing an expected negative log likelihood:

minw

EP (m)[L(w|m)] =

∫m

P (m)L(w|m)dm, (3.3)

where, L(w|m) = −∑t

logP (y(t)|x(t), w,m)

We can bound the above equation (3.3) by applying Jensen’s inequality.

E[L(w)] ≥ −∑t

log

∫m

P (m)P (y(t)|x(t), w,m)dm

= −∑t

logP (y(t)|x(t), w) (3.4)

From equation (3.4), we see that we are minimizing a stochastic upper bound on a marginal negative

likelihood in which the mask random variables are integrated out. We can further minimize the gap

between the upper bound equation (3.3) and (3.4) by adjusting P (m) to be as close as to the posterior

distribution P (m|y, x). After the model is trained, we evaluate the test data points using the predictive

distribution:

P (y∗|x∗, w) =∫m

P (m)P (y∗|x∗, w,m)dm (3.5)

Notice that we could adjust the dropout probability according to be the posterior distribution of the

random masks, while learning the model parameters w. The corresponding cost function is the marginal

likelihood in equation (3.4) by summing over the possible mask configuration. However, the true posterior

is generally intractable for all of the 2N models. We need to approximate P (m|y, x) with a q(m)

distribution.

3.3 Standout: Binary belief networks for dropout prediction

To ensure that confident hidden units stand out in contrast to unconfident hidden units, which should be

dropped out more frequently, we consider using another network that decides the probability of a hidden

متلب سایت

MatlabSite.com



unit being kept (not dropped out). Namely, we can have a second network inferring the probability of

the binary random masks. We call such network “the standout network”. In fact, we can parameterize

the standout network by a set of parameters π:

P (mj = 1|{ai : i < j}) = f(∑i:i<j

πj,iai), (3.6)

where πj,i is the weight from unit i to unit j in the standout network or the adaptive dropout network;

f(·) is a sigmoidal function, f : R → [0, 1]. We use the logistic function, f(z) = 1/(1 + exp(−z)). A

unit that is highly responsive to a particular training case will have a higher probability of being kept,

whereas units that have low and moderate activities are likely be omitted through the sampling process.

This will drive units to lock onto the patterns that repeatedly appear.

The binary unitsm in the standout network can be viewed as a binary belief network that overlays the

deep network, except for α and β, which adjust the scale of the deep network parameters so that suitable

predictions can be made for the standout probabilities. Such standout network can stochastically adjust

deep network architecture given the input vector. We can also restrict the standout network that has

only one bias parameter for each unit which does not depend on the input vector, so that the network

will learn a marginal dropout probability for all hidden units.

During the sampling process, the random masking variables m alter each training example presented

to the network during the learning phase, as described below. Hidden activities that are extreme will have

less variation in the sampling process during learning. However, moderate activities will be randomly set

to zero quite frequently, leading to higher regularization, or if viewed another way, an effective increase

in the dataset through the introduction of random perturbations.

We will use the logistic function as the standout function, unless otherwise stated. β and α are

hyperparameters that adjust the bias of the standout probability and its sensitivity to the hidden activity.

To simplify the discussion below, we assume α = 1 and β = 0, but we will return to discussing these

hyper-parameters later.

3.4 Making predictions

To process an input, the stochastic feedforward process is replaced by taking the expectation of equation

(3.1):

E[a(l)j ] = g(

∑i

w(l)j,ia

(l−1)i ) · p(M (l)

j = 1), (3.7)

where p(M(l)j = 1) is computed using equation (3.6). Given that an activity is close to zero, the final

expected activation will be even closer to zero. When an activity is close to one, the expected activation

will not be penalized as much. As a result, this deterministic process pushes unit activities to zero.

3.5 Approximation to Bayesian Prediction

The standout network we introduced here induces a probability distribution over all the possible network

architectures given the weights. In a Bayesian framework, when we make prediction, we want to integrate

متلب سایت

MatlabSite.com



Figure 3.1: weights from standout network that are most responsive to each of the 10 digits, 0 to 9 (left)auto-encoder and (right) discriminative neural net

over the possible models and corresponding parameters. We show that such an integration can be

approximated as following:

p(y|x,X,Y) =∑M

∫θ

p(y|x, θ,M)p(M|θ,X,Y)p(θ|X,Y)

≈∑M

p(y|x, θ∗,M)Q(M|x, θ∗,X,Y) (3.8)

where θ is the unknown parameter that includes both W and π. M denotes a particular network

architecture. In the above approximation, the integration over the parameters θ is replaced by a single

parameter setting θ∗. This can be understood as the complicated Bayesian predictive distribution

p(y|x,X,Y) is approximated by a mixture distribution with 2N mixture components. The predictive

distribution has a complex dependency on the input vector x. We also let the mixing proportion

distribution Q depends on x in our approximation. The learning is done by minimizing the free energy

for such a mixture model, see section (3.7).

3.6 Stochastic Adaptive Mixture of Local Experts

A neural network of N hidden units can be viewed as 2N possible models, given standout vector M . Each

of the 2N models acts like a separate “expert” network that perform well for a subset of the input space.

Training all 2N models separately can easily over-fit to the data. Weight-sharing among the models

can prevent over-fitting. In that case, expert networks have a distributed representation. Therefore,

the standout network, very much like a gating network, also produces a distributed representation

to stochastically choose which expert to turn on for a given input. This means, in this distributed

representation, 2N models are chosen by N binary numbers.

The standout network partitions the input space into different regions that are suitable for each

experts. We can visualize the effect of the standout network by showing the units that output high

standout probability for one class but, not others. The standout network learns that some hidden units

are very important for one class and will tend to keep those. These hidden units are then more likely to

be dropped when the input comes from a different class, see figure (3.1).

3.7 Learning

For a specific configuration m of the mask variables, let L(m,w) denote the likelihood of a training set

or a minibatch, where w is the set of neural network parameters. It may include a prior as well. The

dependence of L on the input and output have been suppressed for notational simplicity. Given the

current dropout parameters, π, the standout network acts like a binary belief network that generates a

distribution over the mask variables for the training set or minibatch, denoted P (m|π,w). Again, we have

متلب سایت

MatlabSite.com



0 50 100 150 200 250

0

50

100

150

200

250

Standout Network weights

0 50 100 150 200 250

0

50

100

150

200

250

Neural Network weights

Figure 3.2: First layer standout network filters and neural network filters learnt from MNIST data usingour method.

suppressed the dependence on the input to the neural network. As described above, this distribution

should not be viewed as the distribution over hidden variables in a latent variable model, but as an

approximation to a Bayesian posterior distribution over model architectures.

The goal is to adjust π and w to make P (m|π,w) close to the true posterior over architectures as

given by L(m,w), while also adjusting L(m,w) so as maximize the data likelihood w.r.t. w. Since

both the approximate posterior P (m|π,w) and the likelihood L(m,w) depend on the neural network

parameters, we use a crude approximation that we found works well in practice. If the approximate

posterior were as close as possible to the true posterior, then the derivative of the free energy F (P,L)

w.r.t P would be zero and we can ignore terms of the form ∂P/∂w. So, we adjust the neural network

parameters using the approximate derivative,

−∑m

P (m|π,w) ∂

∂wlogL(m,w), (3.9)

which can be computed by sampling from P (m|π,w). For a given setting of the neural network pa-

rameters, the standout network can in principal be adjusted to be closer to the Bayesian posterior by

following the derivative of the free energy F (P,L) w.r.t. π. This is difficult in practice, so we use an

approximation where we assume the approximate posterior is correct and sample a configuration of m

from it. Then, for each hidden unit, we consider mj = 0 and mj = 1 and determine the partial contri-

bution to the free energy. The standout network parameters are adjusted for that hidden unit so as to

decrease the partial contribution to the free energy. Namely, the standout network updates are obtained

by sampling the mask variables using the current standout network, performing forward propagation in

the neural network, and computing the data likelihood. The mask variables are sequentially perturbed

by combining the standout network probability for the mask variable with the data likelihood under the

neural network, using a partial forward propagation. The resulting mask variables are used as complete

data for updating the standout network.

The above learning technique is approximate, but works well in practice and achieves models that

outperform standard dropout and other feature learning techniques, as described below.

متلب سایت

MatlabSite.com



Algorithm 1: Standout learning algorithm: alg1 and alg2

1 Notation: H(·) is Heaviside step function ;

Input: w, π, α, β

2 alg1: initialize w, π randomly; alg2: initialize w randomly, set π = w;

3 while not stopping criteria do

4 for hidden unit j = 1, 2, ... do

5 P (mj = 1|{ai : i < j}) = f(α∑

i:i<j πj,iai + β);

6 mj ∼ P (mj = 1|{ai : i < j});7 aj = mjg

(∑i:i<j wj,iai

);

8 end

9 Update neural network parameter w using ∂∂w logL(m,w);

/* alg1 */

10 for hidden unit j = 1, 2, ... do

11 tj = H(L(m,w|mj = 1)− L(m,w|mj = 0)

)12 end

13 Update standout network π using target t ;

/* alg2 */

14 Update standout network π using π ← w ;

15 end

3.8 Exploratory experiments

Here, we study different aspects of our method using MNIST digits (see below for more details). We

trained a shallow one hidden layer auto-encoder on MNIST using the approximate learning algorithm.

We can visualize the effect of the standout network by showing the units that output low dropout

probability for one class but not others. The standout network learns that some hidden units are

important for one class and tends to keep those.

We trained a shallow one hidden layer auto-encoder on MNIST using the approximate learning

algorithm. The resulting first layer filters of both the standout network and auto-encoder are shown in

figure (3.2). We noticed that the weights in both networks are very similar. It is also very expensive

to perturb hidden units one by one when training the standout network. It motivates us to share the

parameter W and π and train them together using only the M step. To account for different scales and

shifts, we set π = αw + β, where α and β are learnt. This technique works very well in practice and

yields similar results to having two separate set of parameters. Namely, for unsupervised learning on

MNIST, we obtain 153 error for shared parameters and 158 error for unshared parameters.

3.9 Choice of standout function and hyper-parameters α and β

Concretely, we found empirically that the standout network parameters trained in this way are quite

similar (although not identical) to the neural network parameters, up to an affine transformation. This

motivated our second algorithm alg2 in psuedocode(1), where the neural network parameters are trained

as described in learning section 3.7, but the standout parameters are set to an affine transformation of

the neural network parameters with hyper-parameters alpha and beta. These hyper-parameters are

متلب سایت

MatlabSite.com



determined as explained below. We found that this technique works very well in practice, for the

MNIST and NORB datasets (see below). For example, for unsupervised learning on MNIST using the

architecture described below, we obtained 153 errors for tied parameters and 158 errors for separately

learnt parameters. This tied parameter learning algorithm is used for the experiments in the rest of

the paper. In the above description of our method, we mentioned two hyper-parameters that need to

be considered: the scale parameter α and the bias parameter β. Here we explore the choice of these

parameters, by presenting some experimental results obtained by training a dropout model as described

below using MNIST handwritten digit images.

α controls the sensitivity of the standout function to the weighted sum of inputs that is used to

determine the hidden activity. In particular, α scales the weighted sum of the activities from the layer

before. In contrast, the bias β shifts the standout probability to be high or low, and ultimately controls

the sparsity of the hidden unit activities. A model with a more negative β will have most of its hidden

activities concentrated near zero.

Figure (3.3a) illustrates how choices of α and β change the dependence of the standout probability

on the input. It shows a histogram of hidden unit activities after training networks with different α’s

and β’s on MNIST images.

We also considered different forms of the standout function besides the logistic function, as shown in

figure 3.3(b). The effect of different functional forms can be observed in the histogram of the activities

after training on the MNIST images. The logistic standout function creates a sparse distribution of

activation values, whereas the functions such as 1− 4(1− σ(·))σ(·) produce a multi-modal distribution

over the activation values shown in 3.3(c).

3.10 Analysis of a two-layer network

Consider a network with a single hidden layer with linear outputs. Let yi be the ith output value in the

output vector. The squared error cost function is

J =1

2

∑i

(yi − yi)2 (3.10)

Let g(·) be the hidden variable activation function and f(·) be the standout function. Let hj =

g(∑

i uijxi) be the hidden activation and yi =∑

j wijhjMj be the output, where the u’s and the

w’s are the weights for the hidden variables and the outputs, and Mj are the binary random variables

from equation (3.2). The expected loss can be written as:

E[J ] =1

2

D∑i=1

(∑j

w2ijh

2jVar[Mj ]

+ (∑j

wijhjE[Mj ]− yi)2

), (3.11)

where the statistics of M are computed under P (M |W ). Note that Mi and Mj are conditional indepen-

dent given the input x, so that Cov(Mi,Mj |x) = 0, for i = j. We define pj = p(Mj = 1|x) = E[Mj ] =

s(∑

i uijxi).

متلب سایت

MatlabSite.com



We take the gradient of the expected loss with respect to decoder weights, wij :

∂E[J ]

∂wij= wijh

2jVar[Mj ] + hjpj

(∑k

wikhkpk − xi

)(3.12)

We also derive the gradient for encoder weights, uij .

∂E[J ]

∂uij=∑d

w2djhjVar[Mj ]g

′(hj)xi

+∑d

(∑k

wdkhkpk − xd

)wdjpjg

′(hj)xi

+∑d

w2djh

2j

[s′(pj)− 2pjs

′(pj)]xi

+∑d

(∑k

wdkhkpk − xd

)wdjhjs

′(hj)xi (3.13)

To compare, we write down the expected gradient used in the standout algorithm.

E

[∂J

∂wij

]= wijh

2jVar[Mj ] + hjpj

(∑k

wikhkpk − xi

)(3.14)

E

[∂J

∂uij

]=∑d

w2djhjVar[Mj ]g

′(hj)xi

+∑d

(∑k

wdkhkpk − xd

)wdjpjg

′(hj)xi (3.15)

Comparing equations (3.11) and (3.14), we found that the gradient of the expected loss is the same

as the expected gradient for decoder weights. Whereas for encoder weights, the expected gradient of

equation (3.14) is only the first two terms in the gradient of the expected loss from equation (3.13). We

observed that the standout algorithm computes the correct gradient for decoder weights according to

the expected loss. However, the algorithm does not properly account for the encoder weights as we can

see the difference between equation (3.13) and (3.14). In other words, the gradient updates from the

algorithm, described in the beginning of this section, does not take into account of the partial derivatives

of the parameters with respect to the standout function. Nevertheless, we ignored the extra terms in

the gradient of expected loss for computational simplicity reasons.

The key observation here is that the stochastic noise, we added to the neural network through the

standout network, acts as an adaptive regularization during training. We consider the first term in both

expected gradient update equations (3.14), (3.15). They take on the form of our model parameter W

multiplying a coefficient that is a function of the variance and hidden activation. It is equivalent to

having an quadratic penalty term in the actual cost function. Intuitively, the parameters of the model

will be penalized more if a hidden unit has more variability across the dataset. On the other hand, if

the hidden unit is confident about its prediction, it will be penalized less. This explicit terms in the

gradients corresponding to out intuition about the standout network from the previous section.

متلب سایت

MatlabSite.com



3.11 Optimization and momentum

Algorithm 2: Stochastic gradient descent with Neterov momentum

Input: θ0, v0 = 0, η0, γ0

1 initialization;

2 for k = 1, 2, ... do

3 adjust γk and ηk following some scheme;

4 δk = ∂J(θk−1+vk−1)∂θ ;

5 vk = γkvk−1 + ηkδk;

6 θk = θk−1 + vk;

7 end

We optimize the model parameters using the stochastic gradient descent with the Nesterov momen-

tum technique [19]. The Nesterov momentum method can effectively speed up learning when applied

to large models in comparison to to standard momentum. When using the Nesterov momentum, the

cost function J and derivatives ∂J∂θ are evaluated at θ + vk [24], where vk = γvk−1 + η ∂J

∂θ is the velocity

and θ is the model parameter. γ < 1 is the momentum coefficient and η is the learning rate. Nesterov

momentum takes into account the velocity in parameter space when computing updates. Therefore, it

further reduces oscillations compared to standard momentum.

We schedule the momentum coefficient γ to further speed up the learning process. γ starts at 0.5 in

the first epoch and linearly increases to 0.99. The momentum stays at 0.99 during the majority portion

of the learning and is then linearly ramped down to 0.5 during the end of learning. ***insert scheduling

plots here***

It is important for the momentum coefficient γ to be bounded. It can be shown that when γk is

bounded between 0 and 1, vk is bounded to be a finite value. Furthermore, in order for stochastic opti-

mization to converge to some local optima, the learning rate η has to satisfy the following requirements

[26]: the infinite sum of ηk summing over k shall be unbounded; the infinite sum of (ηk)2 needs be

bounded to be some finite value. These three requirements can be summarized as:

0 < γk < 1, ∀k, (3.16)

∞∑k=0

ηk = ∞, (3.17)

∞∑k=0

(ηk)2 < ∞, (3.18)

Intuitively, the equation (3.17) ensures that the optimization is able to move far away from the

initialization and is likely to fall into a basin. The equation (3.18) ensures that the parameters will

converge to the local optima instead of bouncing around the mode. It is worth mentioning that the

exponential decay scheduling used in the experiments does not satisfy the equation 3.17. Nevertheless,

the experimental results shows that such schedule perform well in practice.

متلب سایت

MatlabSite.com



0.0 0.2 0.4 0.6 0.8 1.00

2

4

6

8

10

12

14

16log histogram of hidden activities

AE α=0,β= +∞

dropout α=0,β=0

standout α=1,β=0

standout α=1,β=−1

standout α=1,β= +1

standout α=2,β= +2

standout α=0.5,β=1

(a)

0.0 0.2 0.4 0.6 0.8 1.00

2

4

6

8

10

12

14

16log histogram of hidden activities

standout σ(z)

standout −σ(z)

standout 1−4σ(z)σ(−z)

(b)

−6 −4 −2 0 2 4 60.0

0.2

0.4

0.6

0.8

1.0

various standout functions

σ(z)

1−4σ(z)σ(−z)

1−σ(z)

(c)

Figure 3.3: (a) Histogram of hidden unit activities for various choices of hyper-parameters using thelogistic standout function, including those configurations that are equivalent to dropout and no dropout-based regularization (AE). (b) Histograms of hidden unit activities for various standout functions. (c)Various standout function f(·) plots

متلب سایت

MatlabSite.com


Chapter 4

Experimental Results

We consider both unsupervised learning and discriminative learning tasks, and compare results ob-

tained using standout to those obtained using restricted Boltzmann machines (RBMs) and auto-encoders

trained using dropout, for unsupervised feature learning tasks. We also investigate classification perfor-

mance by applying standout during discriminative training using the MNIST and NORB [17] datasets.

4.1 Nonlinearity for feedforward network

We used the ReLU [23] activation function for all the of results reported here, both on unsupervised

and discriminative tasks. The ReLU function can be written as g(x) = max(0, x). We found that its

use significantly speeds up training by up to 10-fold, compared to the commonly used logistic activation

function. The speed-up we observed can be explained in two ways: first, computations are saved from

the simple max operation to compute ReLU compared to the exponential function that needs to be

applied to compute the logistic function. Also, ReLUs do not suffer from the vanishing gradient problem

that logistic functions have for very large inputs.

4.2 Computation time

We used the publicly available gnumpy library [32] to implement our models. The models mentioned

in this work are trained on a single Nvidia GTX 580 GPU. As in psuedocode(1), the first algorithm is

relatively slow, since the number of computations is O(n2) where n is the number of hidden units. The

second algorithm is much faster and takes O(kn) time, where k is the number of configurations of the

hyper-parameters alpha and beta that are searched over. In particular, for a 784-1000-784 auto-encoder

model with mini-batches of size 100 and 50,000 training cases on a GTX 580 GPU, learning takes 1.66

seconds per epoch for standard dropout and 1.73 seconds for our second algorithm.

Model Regularization Seconds per epoachAE 784-1000-784 dropout 1.66AE 784-1000-784 standout 1.73

Figure 4.1: Computation time for mini-batches of 100 with 50,000 MNIST training cases

The computational cost of the improved representations produced by our algorithm is that a hyper

17

متلب سایت

MatlabSite.com


Chapter 4. Experimental Results 18

0 50 100 150 200 250

0

50

100

150

200

250

MNIST dataset

Figure 4.2: Random images from MNIST dataset

parameter search is needed. We note that some other recently developed dropout-related methods, such

as maxout, also involve an additional computational factor.

4.3 Handwritten Digits Recognition: MNIST

The MNIST handwritten digit dataset is generally considered as a well-studied problem, which offers

the ability to ensure that new algorithms produce sensible results when compared to the many other

techniques that have been benchmarked. It consists of ten classes of handwritten digits, ranging from 0

to 9. There are, in total, 60,000 training images and 10,000 test images. Each image is 28 × 28 pixels

in size. Following the common convention, we randomly separate the original training set into 50,000

training cases and 10,000 cases used to validate the choice of hyper-parameters. We concatenate all the

pixels in an image in a raster scan fashion to create a 784-dimensional vector. The task is to predict the

10 class labels from the 784-dimensional input vector.

We show that standout networks can extract meaningful features in a self-taught fashion. The

features extracted using the standout network not only outperformed other off-the-shelf feature learning

methods but also it is very computational efficient comparing to more complicated methods like sparse

coding.

We use the following procedures for feature learning. We first extract the features using one of the

unsupervised learning algorithms in figure (4.4). The usefulness of the extracted features were then

evaluated by training a linear classifier to predict the object class from the extracted features. This

process is similar to that employed in other feature learning research [25].

متلب سایت

MatlabSite.com



arch. act. func. err.raw pixel 784 7.2%

RBM784-1000

σ(·) 1.81%weight decay

DAE 784-1000-784 ReLU(·) 1.95%dropout 784-1000-784

ReLU(·) 1.70%AE 50% hidden dropoutstandout 784-1000-784

ReLU(·) 1.53%AE standout

error rateRBM + FT 1.24%DAE + FT 1.3%shallow dropout AE + FT 1.10%deep dropout AE + FT 0.89%standout shallow AE + FT 1.06%standout deep AE + FT 0.80%

We trained a number of architectures on MNIST, including standard auto-encoders, dropout auto-

encoders and standout auto-encoders. As described previously, we compute the expected value of each

hidden activity and use that as the feature when training a classifier. We also examined RBM’s, where

we the soft probability for each hidden unit was used as a feature. Different classifiers can be used to

give similar performance; we used a linear SVM because it is fast and straightforward to apply. However,

on subset of problems we tried logistic classifiers and they achieved indistinguishable classification rates.

Results for the different architectures and learning methods are compared in table 4.3. The auto-

encoder trained using our proposed standout technique with α = 1 and β = 0 performed the best on

MNIST.

In deep learning, a common practice is to use the encoder weights learnt by an unsupervised learning

method to initialize the early layers of a multilayer discriminative model. The backpropagation algorithm

is then used to learn the weights for the last hidden layer and fine tune the weights in the layers before.

This procedure is often referred to as discriminative fine-tuning. We initialized neural networks using

the models described above. The regularization method that we used for unsupervised learning (RBM,

dropout, standout) is also used for corresponding discriminative fine-tuning. For example, if a neural

network is initialized using an auto-encoder trained with standout, the neural network will also be fine

tuned using standout for all its hidden units, with the same standout function and hyper-parameters as

the auto-encoder.

During discriminative fine-tuning, we fixed the weights for all layers except the last one for the first 10

epoch, from then onwards, the weights are jointly updated updated. Congruent with previous authors,

we find that classification performance is usually improved by the use of discriminative fine tuning.

Impressively, we also found that a two-hidden-layer neural network with 1000 ReLU units in its first,

and its second hidden layers trained with standout, is able to achieve 80 errors on MNIST data after

fine tuning (error rate of 0.82%). This performance is better than the current best non-convolutional

result [11] and the training procedure is simpler.

متلب سایت

MatlabSite.com



0 100 200 300 400 500 600

0

100

200

300

400

500

600

NORB dataset

Figure 4.3: Random images from NORB dataset

4.4 3D object Recognition: NORB

The small NORB normalized-uniform dataset contains 24,300 training examples and 24,300 test exam-

ples. It consists of 50 different objects from five different classes: cars, trucks, planes, animals, and

humans. Each data point is represented by a stereo image pair of size 96 × 96 pixels. The training

and test set used different object instances, and images are created under different lighting conditions,

elevations and azimuths. In order to perform well in NORB, the recognition task demands learning

algorithms to learn features that can can be generalized to test set and be able to handle large input

dimension. This makes NORB significantly more challenging than the MNIST dataset. The objects

in the NORB dataset are 3D, under difference out-of-plane rotation, and so on. Therefore, the models

trained on NORB have to learn and store implicit representations of 3D structure, lighting and so on.

We formulate the data vector following the same “foveal” vision preprocessing technique used in [22], so

that the final training data vector has 8976 dimensions. Data points are subtracted by the mean and

divided by the standard deviation along each input dimension across the whole training set to normalize

the contrast. The goal is to predict the five class labels for the previously unseen 24,300 test examples.

The training set is separated into 20,000 for training and 4,300 for validation.

For this dataset, we initially tried to train a fully connected discriminative neural network from

scratch, but it performed poorly, even with the help of dropout. (We trained a two hidden layer neural

network on NORB to obtain 13% error rate and saw no improvement by using dropout). Such disap-

pointing performance motives us to investigate unsupervised feature learning and pre-training strategies

with standout method, using the same precedure described in the MNIST section.

متلب سایت

MatlabSite.com



arch. act. func. err.raw pixel 8976 23.6%

RBM8976-4000

σ(·) 10.6%weight decay

DAE 8976-4000-8976 ReLU(·) 9.5%dropout 8976-4000-8976

ReLU(·) 10.1%AE 50% hidden dropoutdropout 8976-4000-8976

ReLU(·) 8.9%AE * 22% hidden dropoutstandout 8976-4000-8976

ReLU(·) 7.3%AE standout

Figure 4.4: Performance of unsupervised feature learning, dropout AE * is optimized over its dropoutprobability using [31]

500 1000 1500 2000 2500 3000 3500 4000 4500

numer of hidden units

6

8

10

12

14

16

18

20

test

err

or

rate

(%

)

Classification error rate as a function of number of hidden units

DAE

Dropout AE

Deterministic standout AE

Standout AE

Figure 4.5: Classification error rate as a function of number of hidden units on NORB.

We performed extensive experiments on the NORB dataset with larger models. The hyperparameter

used for the best result was obtained by setting α = 1 and β = 1. Overall, we observed similar trends to

the ones we observed for MNIST. The standout method consistently performs better than other methods,

as shown in table (4.4).

On NORB dataset, we similarly performed discrimitive fine-tuning and observed performance gain as

on the MNIST dataset. We achieved 6.2% error rate by fine tuning the simple shallow auto-encoder from

table (4.4). Furthermore, a two-hidden-layer neural network with 4000 ReLU units in both hidden layers

that is pre-trained using standout, achieved 5.8% error rate after fine tuning. It is worth mentioning

that a small weight decay of 0.0005 is applied to this network during fine-tuning to further prevent

error rateDBN [29] 8.3%DBM [29] 7.2%third order RBM [22] 6.5%dropout shallow AE + FT 7.5%dropout deep AE + FT 7.0%standout shallow AE + FT 6.2%standout deep AE + FT 5.8%

Figure 4.6: Performance of fine-tuned classifiers, where FT is fine-tuning

متلب سایت

MatlabSite.com



0 50 100 150 200 250 300

0

50

100

150

200

250

300

CIFAR-10 dataset

Figure 4.7: Random images from CIFAR-10 dataset

overfitting. It outperforms other models that do not exploit spatial structure. As far as we know, this

result is better than any previously published results without distortion or jitter. It even outperforms

carefully designed convolutional neural networks found in [12].

Figure 4.6 reports the classification accuracy obtained by different models, including state-of-the-art

deep networks.

4.5 Tiny Images Object Recogntion: CIFAR-10

CIFAR-10 dataset [14] is a labeled subset of data collected from 80 million tiny images dataset [33]. The

dataset contains 10 different object classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship,

truck. The dataset is divided into 50,000 training images and 10,000 test images. Each image is the

size of 32x32 pixels in 3 color channels. There are, in total 3072 dimension inputs associated with each

image. We did a very simple pre-processing to prepare the CIFAR-10 dataset. We subtracted the mean

and divided the standard deviation from each pixel across the whole training dataset. The test set is

pre-processed using the same mean and standard deviation computed from the training set.

Though some object classes in this dataset have a lot of variability in pose and viewing angle,

there are only 50,000 training data points. It is common to provide domain prior knowledge into

the topology of neural network models to the help object recognition task. We use a feed-forward

deep convolutional architecture for the CIFAR-10 dataset. Convolutional neural network is a special

class of neural networks that is designed to have translational invariance property. There are two

major components in a convolutional neural network: the convolution layer and the pooling layer. The

متلب سایت

MatlabSite.com



error rate2 layer NN, softmax 44%3 layer conv. net, avg. pooling, 64-64-64-10 20.3%3 layer conv. net, max pooling, 64-64-64-10 19.2%3 layer conv. net, avg. pooling, 64-64-64-10, standout 21.0%3 layer conv. net, max/avg. pooling, 64 filters + locally connected layer [11] 16.6%3 layer conv. net, max/avg. pooling, 64 filters + locally connected layer, dropout [11] 15.6%3 layer conv. net, stochastic pooling, 64 filters + fully connected layer, dropout [35] 15.16%3 layer conv. net, avg. pooling, 128-128-128-10, standout 18.4%3 layer conv. net, avg. pooling, 128-128-128-1000-10 24.5%3 layer conv. net, avg. pooling, 128-128-128-1000-10, standout 17.0%3 layer conv. net, avg. pooling, 256-256-256-1000-10, standout 15.2%

Figure 4.8: Performance of CIFAR-10 classifiers

neurons in the convolution layer are connected to a small receptive field in the layer below. At each

spatial location, there is a bank of neurons looking over the same receptive field with different weights.

These groups are repeated across all the of neurons at different location with shared weights. The

pooling layers, which are stacked on top of the convolution layer, contain neurons reporting the average

or maximum activity of the local pool of convolutional neurons. Usually in a deep convolutional neural

network, convolution and pooling layers are alternatively stacked for three stages with a total of six

layers. A fully connected hidden layer followed by a softmax layer on the last pooling layer is used to

make model prediction.

We applied standout to all the neurons in the convolution layers and the fully connected layers.

The experiment results are shown in table (4.5). The models are trained discriminatively from scratch.

Unlike the models trained in [11], we do not have a local contrast normalization after each pooling layer.

Also, all the pooling layers are the same type. We notice that standout can help reduce overfitting when

training the large models. Empirically, the classification performance increases with the number of filters

in the filter banks. We expect that the performance can be further improved through using a larger

number of filters. Additionally, contrast normalization layer such as the one used in [11] can also improve

the performance. It is worth mentioning that, in all the experiments, we found that aggressive stochastic

noise, such as a 50% dropout and standout with large positive β, works well with average pooling but not

with max pooling. The large stochastic noise makes the optimization diverge for convolutional neural

net with max pooling scheme.

4.6 Discussion

The standout method is able to outperform other feature learning methods in both datasets by a no-

ticeable margin. The stochasticity introduced by standout removes hidden units that have little effects

with high probability during the training phase. The learning process can focus on refining features and

pattern that are important to represent the data points. By inspecting the weights from auto-encoders

regularized by dropout and standout, the weights from standout auto-encoder are more refined and

shaper than those learnt by dropout. It is commonly hypothesized that sharper filters or dictionary

elements generate more meaningful features.

The effect of the number of hidden units is studied using network size {500, 1000, 1500, ..., 4500}.

متلب سایت

MatlabSite.com



Figure(4.5) shows that all algorithms can generally perform better by increasing the number of hid-

den units. One notable trend for dropout regularization is that dropout achieves significantly better

performance with a large number of hidden units since all units have an equal chance to be omitted.

Comparatively, standout can achieve similar performance with only half number of hidden units as in

dropout. This is because useful hidden units will be kept around more often, while less effective units

are dropped out.

We also believe that randomness in training is crucial in order to regularize the network. Such

regularization would not be attainable just by using a different activation function. In order to verify

this hypothesis, we simply trained a deterministic auto-encoder with hidden activation function as,

equation(3.7), using standard backpropagation algorithm. The result is shown in figure(4.5) as the

deterministic standout curve.

It is believed that sparse features can help improve the performance of linear classifiers. An auto-

encoder trained using the combination of ReLU and standout naturally generate a set of sparse features.

We indeed observed the competitive performance in figure(4.4). Here, we are interested in the whether

training a sparse auto-encoder will yield similar performance results as standout method. We applied

soft L1 penalty on the hidden units and trained an auto-encoder to match the implicit sparsity of the

standout model. The final features extracted using the sparse auto-encoder achieved 10.2% error on

NORB. We expect to improve the performance of the sparse auto-encoder by further tuning its hyper

parameters. On the other hand, the hyper parameters for standout models are easier to tune and has

little effect on the final performance. Moreover, the sparse features learnt using standout method are

also computationally efficient compared to more sophisticated encoding algorithms like the sparse coding

used in [8]. In order to find the code for data points with more than 4000 dimensions and 4000 dictionary

elements, the sparse coding algorithm quickly become impractical.

Surprisingly, shallow network with standout regularization in table(4.4) outperforms some of the

much larger and deeper networks shown in table(4.4). Some of those deeper models have three or four

times parameters than the shallow network we trained here. This particular result show that a simple

model with proper regularization can still achieve high performance compared to the more complicated

counterparts.

متلب سایت

MatlabSite.com


Chapter 5

Conclusion

In this work, we introduced an adaptive dropout algorithm, called standout, which allows different hid-

den variables to have different dropout probabilities, by overlaying the original deep network with a

probabilistic binary belief network that shares parameters. We showed that standout is useful for both

unsupervised feature learning and supervised classification tasks. Models trained using standout con-

sistently achieve better generalization performance on classification tasks, compared to existing feature

learning methods on three sets of experiments based on MNIST and NORB datasets. The results on

standout reinforce the notion that stochasticity during learning process is crucial for better generaliza-

tion and learning more useful features. These results provide additional support for the ‘regularization

by noise’ hypothesis that has been used to regularize other deep architectures, including RBMs, denois-

ing auto-encoders, and dropout. Based on our experiments on a deep auto-encoder, we find that with

proper regularization, it is possible to pre-train a stacked auto-encoder jointly from scratch, without

greedy layer-by-layer training, and obtain good results. Thus, this work points to new ways of training

deep architectures.

An obvious missing piece in this research is a good theoretical understanding of why the standout

function provides better regularization compared to the fixed dropout probability of 0.5. While we

have motivated our approach intuitively, we have not yet provided a theoretical justification for our

approach; currently, its value is based on its experimental performance. Generally, ‘regularization by

noise’ methods may be viewed as certain kinds of approximations to Bayesian learning. In this context,

dropout assumes that different hidden units have the same uncertainty a posteriori, whereas standout

allows different hidden units to have different amounts of uncertainty. In this view, the overlayed binary

belief network can be thought of as a probabilistic approximation to a regularizing posterior distribution

over architectures. We are currently exploring this interpretation.

25

متلب سایت

MatlabSite.com


Bibliography

[1] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from

examples without local minima. Neural networks, 2(1):53–58, 1989.

[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks.

Advances in neural information processing systems, 19:153, 2007.

[3] Yoshua Bengio. Learning deep architectures for ai. Foundations and Trends R© in Machine Learning,

2(1):1–127, 2009.

[4] C.M. Bishop. Training with noise is equivalent to tikhonov regularization. Neural computation,

7(1):108–116, 1995.

[5] A. Coates and A.Y. Ng. The importance of encoding versus training with sparse coding and vector

quantization. In International Conference on Machine Learning, volume 8, page 10, 2011.

[6] Adam Coates, Andrew Y Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised

feature learning. In International Conference on Artificial Intelligence and Statistics, pages 215–223,

2011.

[7] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,

signals and systems, 2(4):303–314, 1989.

[8] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In International

Conference on Machine Learning, 2010.

[9] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural

computation, 18(7):1527–1554, 2006.

[10] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks.

Science, 313(5786):504–507, 2006.

[11] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural

networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[12] Kevin Jarrett, Koray Kavukcuoglu, MarcAurelio Ranzato, and Yann LeCun. What is the best

multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International

Conference on, pages 2146–2153. IEEE, 2009.

[13] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural

networks. Advances in Neural Information Processing Systems, 25, 2012.

26

متلب سایت

MatlabSite.com


BIBLIOGRAPHY 27

[14] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Mas-

ter’s thesis, Department of Computer Science, University of Toronto, 2009.

[15] Quoc V Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S Corrado, Jeff

Dean, and Andrew Y Ng. Building high-level features using large scale unsupervised learning. arXiv

preprint arXiv:1112.6209, 2011.

[16] Yann Le Cun. Learning process in an asymmetric threshold network. In Disordered systems and

biological organization, pages 233–240. Springer, 1986.

[17] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition

with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR

2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–97. IEEE,

2004.

[18] David JC MacKay. Information theory, inference and learning algorithms. Cambridge university

press, 2003.

[19] J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. In

Proc. 28th Int. Conf. on Machine Learning, 2011.

[20] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity.

The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

[21] Marvin Minsky and Seymour Papert. Perceptron: an introduction to computational geometry. The

MIT Press, Cambridge, expanded edition, 19:88, 1969.

[22] V. Nair and G. Hinton. 3d object recognition with deep belief nets. Advances in Neural Information

Processing Systems, 22:1339–1347, 2009.

[23] V. Nair and G.E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proc.

27th International Conference on Machine Learning, pages 807–814. Omnipress Madison, WI, 2010.

[24] Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2).

In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.

[25] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit

invariance during feature extraction. In Proceedings of the Twenty-eight International Conference

on Machine Learning (ICML11), 2011.

[26] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathe-

matical Statistics, pages 400–407, 1951.

[27] Sam Roweis and Zoubin Ghahramani. A unifying review of linear gaussian models. Neural compu-

tation, 11(2):305–345, 1999.

[28] David E Rumelhart, Geoffrey E Hintont, and Ronald J Williams. Learning representations by

back-propagating errors. Nature, 323(6088):533–536, 1986.

[29] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep boltzmann machines. In

International Conference on Artificial Intelligence and Statistics. Citeseer, 2010.

متلب سایت

MatlabSite.com


BIBLIOGRAPHY 28

[30] J. Sietsma and R.J.F. Dow. Creating artificial neural networks that generalize. Neural Networks,

4(1):67–79, 1991.

[31] Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical bayesian optimization of machine

learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960–2968,

2012.

[32] Tijmen Tieleman. Gnumpy: an easy way to use gpu boards in python. Department of Computer

Science, University of Toronto, 2010.

[33] Antonio Torralba, Robert Fergus, and William T Freeman. 80 million tiny images: A large data set

for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE

Transactions on, 30(11):1958–1970, 2008.

[34] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol. Stacked denoising autoencoders:

Learning useful representations in a deep network with a local denoising criterion. The Journal of

Machine Learning Research, 11:3371–3408, 2010.

[35] Matthew D Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural

networks. arXiv preprint arXiv:1301.3557, متلب سایت.2013

MatlabSite.com


& 4 a e ÇÚÒÇÈ ÏÚËhÉÕÓ -...

Documents