& 4 a e ÇÚÒÇÈ ÏÚËhÉÕÓ -...
TRANSCRIPT
Adaptive dropout for training deep neural networks
by
Jimmy Lei Ba
A thesis submitted in conformity with the requirementsfor the degree of Master of Applied Science
Graduate Department of Electrical & Computer EngineeringUniversity of Toronto
c© Copyright 2014 by Jimmy Lei Ba
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Abstract
Adaptive dropout for training deep neural networks
Jimmy Lei Ba
Master of Applied Science
Graduate Department of Electrical & Computer Engineering
University of Toronto
2014
Recently, it was shown that deep neural networks perform very well if the activities of hidden units
are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a
method called “standout” in which a binary belief network is overlaid on a neural network and is used to
regularize of its hidden units by selectively setting activities to zero. This “adaptive dropout network”
can be trained jointly with the neural network by approximately computing local expectations of binary
dropout variables and computing derivatives using back-propagation. Interestingly, experiments suggest
that a good dropout network regularizes activities according to magnitude. When evaluated on the
MNIST and NORB datasets, we found that our method achieves lower classification error rates than other
feature learning methods, including standard dropout, and RBM. We also present the discriminative
learning results using our method on the MNIST, NORB and CIFAR-10 datasets.
ii
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Acknowledgements
I would like to give thanks to my great supervisor Brendan Frey for all the support and encouragement.
iii
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Contents
1 Introduction 1
2 Background 3
2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Auto-encoder Model and Hidden Representation . . . . . . . . . . . . . . . . . . . 4
2.2 Learning with Error Back-propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4.1 Unsupervised feature learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4.2 Discriminative learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Learning with Adaptive Dropout Network 7
3.1 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Inferring dropout probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Standout: Binary belief networks for dropout prediction . . . . . . . . . . . . . . . . . . . 8
3.4 Making predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.5 Approximation to Bayesian Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.6 Stochastic Adaptive Mixture of Local Experts . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.7 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.8 Exploratory experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.9 Choice of standout function and hyper-parameters α and β . . . . . . . . . . . . . . . . . 12
3.10 Analysis of a two-layer network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.11 Optimization and momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Experimental Results 17
4.1 Nonlinearity for feedforward network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Computation time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Handwritten Digits Recognition: MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.4 3D object Recognition: NORB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.5 Tiny Images Object Recogntion: CIFAR-10 . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5 Conclusion 25
Bibliography 26
iv
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 1
Introduction
For decades, deep networks (> 1 hidden layer) with broad hidden layers and full connectivity could
not be trained to produce useful results, because of overfitting, slow convergence and other issues. One
approach that has proven to be successful for unsupervised learning of both probabilistic generative
models and auto-encoders is to train a deep network layer by layer in a greedy fashion [10]. Each layer
of connections is learnt using contrastive divergence in a restricted Boltzmann machine (RBM) [9] or
backpropagation through a one-layer auto-encoder [2]. The hidden activities are used to train the next
layer. When the parameters of a deep network are initialized in this way, further fine tuning can be used
to improve the model, e.g., for classification [3]. The unsupervised feature learning algorithm in the
pre-training stage is a crucial component for achieving competitive overall performance on classification
tasks. For instance, Coates et al. [5] have shown that improved results can be obtained by mixing and
matching different unsupervised learning algorithms.
Recently, a technique called dropout was shown to significantly improve the performance of deep
neural networks on various tasks [11], including state of the art computer vision problem [13]. Dropout
consists of randomly setting hidden unit activities to zero with a probability of 0.5, during training. Each
training example can thus be viewed as providing learning gradients for a different, randomly sampled
architecture, so that the final neural network can be viewed as efficiently representing a huge ensemble
of neural networks, with good generalization capability. Experimental results on several tasks show that
dropout frequently and significantly improves the classification performance of deep architecture.
The method of introducing noise while training a feedforward network has been studied previously.
Bishop et al. [4] conclude that adding a small amount of Gaussian noise to the input is equivalent
to weight decay. Sietsma et al. [30] studied the effects of adding noise to different components of a
network, but was not able to obtain improved learning performance. More recently, Vincent et al. [34]
studied applying noise to the input of an auto-encoder for feature learning. In contrast to this previous
work, dropout is directly aimed at improving generalization performance of deep and broad architectures
by noisily corrupting hidden units. Indeed, dropout helps fully connected neural networks generalize
better. Unfortunately, training discriminative deep neural networks with no prior knowledge of spatial
structure of data yields little benefit from dropout when input data has high variations in the viewpoint
and lighting (section 2.4.1). Alternatively, unsupervised learning of deep networks, which discovers the
underlying structure, can help such discriminative tasks.
In this work, we describe a generalization of dropout, where the dropout probability for each hidden
1
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 1. Introduction 2
variable is computed using a binary belief network that shares parameters with the deep network. Our
method works well both for unsupervised and supervised learning of deep networks. We present results
on the MNIST, NORB and CIFAR-10 datasets showing that our “standout” technique can learn better
feature detectors for handwritten digits and object recognition tasks.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 2
Background
This section gives a survey review on the previous methods that are fundamental for the rest of the
thesis.
2.1 Neural Networks
The idea of constructing a neural structure consisting of many neurons that model pairs of input and
output relations has captured interest across many fields. As pointed out in a book by Mackay [18],
neural network models are biologically plausible and can help us understand how the brain works. When
approaching this idea from an engineering point of view, one can then build a learning machine to perform
pattern recognition tasks.
Systems formed by a net of neurons were first formally studied in McCulloch and Pitts [20]. The
authors suggest a method of analyzing such systems through the use of Boolean algebra. Their idea of
realizing certain logical expressions using neural networks led to the consideration of certain structures
which they called random nets. The discovery of “back-propagation” in the context of neural networks
by Rumelhart et al. [28] drastically improves the learning efficiency of such models enabling them to
be used in practice. The vest modern state or the art neural network models are still learned using the
“back-propagation” algorithm.
In a neural network, its architecture is determined by the topological relationship of its neurons. The
parameters in the neural network model are the “weights” of the connections between different neurons.
We can define the activation of each neuron as a function of the weighted sum from its incoming signal:
aj = g(∑
i wiai), wi is the weight on the connection from neuron i to neuron j; where g(·) is the activationfunction. Typically, the activation function takes the form of a logistic function (eg. g(x) = 1
1+e−x ).
The modern neural network topology is often constructed with interconnected neurons that are formed
into layers. The first layer is usually a set of neurons directly forwarding a vector of input features. The
neurons in the last layer compute a vector of output prediction. There is commonly one or more hidden
layers in between. The weights are initially set with random values. They are adjusted by learning
algorithms to improve the performance of the neural network.
A loss function L(w) is associated with a neural network as an evaluation function for its performance.
Commonly, we model the deviation between the prediction of a neural network and a target value under
the squared error loss function, L(w) =∑
t ||p(xt)− yt||2. Here, p(xt) is the model prediction given the
3
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 2. Background 4
Figure 2.1: An auto-encoder network.
tth input xt; and yt is its corresponding target value. For classification problems, the cross-entropy loss
should be used since it is more robust against outliers in a given training data than squared error loss.
2.1.1 Auto-encoder Model and Hidden Representation
Auto-encoder is a special type of neural network that mirrors its input to the output. The data is
provided as both the input and the target of the network. The squared error between the input and
output is used as the loss function. It is also regarded as a reconstruction error, since the prediction is
the reconstruction of its input in an auto-encoder network.
It is useful to decompose an auto-encoder into its recognition and generation stages. An auto-encoder
transforms the input vector x into the hidden layer representation h using its recognition weights. This
hidden representation has many properties related to other models. It is recognized early that PCA
could be implemented using a linear auto-encoder [1]. It is also shown that factor analysis and Gaussian
mixture models can be represented using auto-encoders under certain loss functions by Roweis and
Ghahramani [27]. Learning a hidden representation has an utterly important role in machine learning
after the discovery of the greedy layer-by-layer pre-training technique [10]. Much of the recent machine
learning research has been devoted to learning useful hidden representations, especially the recognition
algorithm. Given a set of input data, it is suggested that learning a meaningful hidden representation
disregarding the label first could drastically improve the performance of the later classification tasks.
This can be understood that the similarity between data points for under the classification labels may
not directly translate into the similarity of the input data. For example in computer vision, objects
belonging to the same class, such as cars, could be very far away in the pixel space under the Euclidean
metric due to the variations in pose and shape of the car. However, if we can transform the objects
into a vector of hidden representation that each elements of the vector detects particular features such
as wheels, doors and windshields, then the cars will be grouped together in such hidden representation
space as they all have consist of wheels and doors.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 2. Background 5
2.2 Learning with Error Back-propagation
For many years, it has always been a challenge to train neural networks that have one or more hidden
layers. Multi-layer neural networks have many desired properties over the Perceptron models. The first
and foremost is that multi-layer neural network are universal function approximators. The addition of
the hidden layers in multi-layer neural networks overcomes the limitations of single layer networks that
are only capable of learning linearly separable patterns as it has been pointed out in [21]. It was first
proved in [7] that every continuous function, mapping intervals of real numbers to another interval of
output numbers can be approximated with arbitrarily precision by using sigmoid activation function
in a multi-layer neural network. However, the universal approximation theorem does not answer how
to adjust the weights in neural networks to improve a given loss function. It is until discovering error
back-propagation [28] that learning by differentiating a loss function with respect to weights provided
an efficient algorithm to train the neural networks. It is worth pointing out that similar ideas were also
independently published by LeCun[16] around the same time.
The central idea of error back-propagation is to compute the partial derivatives of the weights in a
neural network by applying the chain-rule repeatedly. In a layered neural network, the partial derivatives
in the former layers are accumulated products that relate to the partial derivatives of the latter layers.
In order to compute the partial derivatives of the weights in the nth layer, the partial derivatives from
the (n + 1)th layer have to be computed first. The learning rule computes a backward pass of the
partial derivatives along the network from its output. Concretely, given a loss function L, we would
like to compute the gradient with respect to the weights ∂L∂w . The algorithm starts by computing ∂L
∂hj
for the neurons at the output of a network and propagates gradient backwards through the network.
We can then apply the chain rule to compute ∂L∂wi
= ∂L∂hj
∂hj
∂wiusing the neuron hj to which weight wi
is connecting. We can repeat this procedure on all the weights in the network. Knowing the gradient
information enables us to adjust the weights using any gradient-based optimization methods to improve
the performance of neural networks.
2.3 Stochastic Gradient Descent
In recent years, machine learning research has been increasingly focused on realistic large scale learning
problems in computer vision, natural language processing and audio speech processing. When a dataset
has billions of training cases, the batch-based gradient methods become very inefficient. It may take the
gradient decent algorithm a long time just to compute a small update step for the parameters, while
stochastic gradient descent algorithms are already making good progress on the prediction performance.
The class of Robbins-Monro algorithms was first proposed in [26] and was later improved using the
momentum technique. Stochastic gradient descent and its related class of algorithms process a mini
batch of the data at every iteration. The model parameters are updated by taking small noisy gradient
steps that are estimated from the mini batch in a loss function. Often times, these algorithms are run
by randomly sampling the data bathes from the entire training data with replacement.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 2. Background 6
2.4 Deep learning
2.4.1 Unsupervised feature learning
Having good low-level features is crucial for obtaining competitive performance in classification and
other high-level machine learning tasks. Learning algorithms that can take advantage of unlabeled data
is appealing due to an increasing amount of unlabeled data. Furthermore, on more challenging datasets,
such as NORB, a fully connected discriminative neural network trained from scratch tends to perform
poorly, even with the help of dropout.
Unsupervised feature learning usually follows the following procedures: we first extract the features
using one of the unsupervised learning algorithms, such as restricted Boltzmann machine (RBM) [10],
denoising auto-encoder [34] or sparse coding [5]. The usefulness of the extracted features is then evaluated
by training a linear classifier to predict the object class from the extracted features. This process is similar
to that employed in other feature learning research [6][25].
2.4.2 Discriminative learning
In deep learning, a common practice is to use the encoder weights learned by an unsupervised learn-
ing method to initialize the early layers of a multilayer discriminative model. The back-propagation
algorithm is then used to learn the weights for the last hidden layer and fine tune the weights in the
layers before. This procedure is often referred to as “discriminative fine-tuning”. It has been shown
that fine-tuning by back-propagation can greatly improve the classification results obtained using linear
classifiers only [22]. Many of the recent advances in object classification tasks are due to discriminative
fine-tuning the weights learned by unsupervised learning algorithms [15].
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 3
Learning with Adaptive Dropout
Network
A technique called dropout was recently shown to significantly improve the performance of deep neural
networks on various tasks [11], including state of the art computer vision problem [13]. Dropout consists
of randomly setting hidden unit activities to zero with a probability of 0.5, during training. Each
training example can thus be viewed as providing learning gradients for a different, randomly sampled
architecture, so that the final neural network can be viewed as representing a huge ensemble of neural
networks, with good generalization capability. Experimental results on several tasks show that dropout
frequently and significantly improves the classification performance of deep architectures.
3.1 Dropout
In a deep and broad feedforward network, a large number of overly-complicated relationships are learnt at
different rates and with different numbers of over-fit parameters, making overfitting hard to avoid. The
dropout technique [11] regularizes learning by temporarily dropping out hidden units with probability
0.5 when each training case is presented. This is equivalent to randomly setting hidden activities to
zero. This method forces the network to make predictions based on ensembles of hidden unit activities,
since any single hidden unit will not contribute to the prediction 50% of the time. As a result, dropout
is an effective way to prevent overfitting and uselessly complex co-adaptations of units.
In our notation, the activity of unit i in layer l is a(l)i and the activation function of the hidden units
is g(·). We can formally write the feedforward computation for models using dropout as:
a(l)j = g(
∑i
w(l)j,ia
(l−1)i ) ·m(l)
j , (3.1)
m(l)j ∼ Bernoulli(P ), (3.2)
where w(l)j,i is the weight on the connection between unit i in layer l − 1 and unit j in layer l. m
(l)j is a
binary random masking variable that is used to omit hidden activities from the network with probability
P , where P = 0.5 was used in [11].
7
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 3. Learning with Adaptive Dropout Network 8
3.2 Inferring dropout probability
The original dropout technique uses a constant probability for omitting a unit, so naturally, a question
we considered is whether or not it may help to let this probability be different for different hidden
variables. In particular, there may be hidden variables that can individually make confident predictions
for the presence or absence of an important feature or a combination of features. Standard dropout
will ignore this confidence and drop the unit out 50% of the time. Viewed another way, suppose after
standard dropout is applied, it is found that several hidden units are highly correlated in the pre-dropout
activities. They could be combined into a single hidden unit with a lower dropout probability, freeing
up hidden units for other purposes.
During the learning phase, mask random variables m are sampled for each training data point
according to P (m) and minimize the negative log likelihood given the sample. We can view the learning
objective as minimizing an expected negative log likelihood:
minw
EP (m)[L(w|m)] =
∫m
P (m)L(w|m)dm, (3.3)
where, L(w|m) = −∑t
logP (y(t)|x(t), w,m)
We can bound the above equation (3.3) by applying Jensen’s inequality.
E[L(w)] ≥ −∑t
log
∫m
P (m)P (y(t)|x(t), w,m)dm
= −∑t
logP (y(t)|x(t), w) (3.4)
From equation (3.4), we see that we are minimizing a stochastic upper bound on a marginal negative
likelihood in which the mask random variables are integrated out. We can further minimize the gap
between the upper bound equation (3.3) and (3.4) by adjusting P (m) to be as close as to the posterior
distribution P (m|y, x). After the model is trained, we evaluate the test data points using the predictive
distribution:
P (y∗|x∗, w) =∫m
P (m)P (y∗|x∗, w,m)dm (3.5)
Notice that we could adjust the dropout probability according to be the posterior distribution of the
random masks, while learning the model parameters w. The corresponding cost function is the marginal
likelihood in equation (3.4) by summing over the possible mask configuration. However, the true posterior
is generally intractable for all of the 2N models. We need to approximate P (m|y, x) with a q(m)
distribution.
3.3 Standout: Binary belief networks for dropout prediction
To ensure that confident hidden units stand out in contrast to unconfident hidden units, which should be
dropped out more frequently, we consider using another network that decides the probability of a hidden
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 3. Learning with Adaptive Dropout Network 9
unit being kept (not dropped out). Namely, we can have a second network inferring the probability of
the binary random masks. We call such network “the standout network”. In fact, we can parameterize
the standout network by a set of parameters π:
P (mj = 1|{ai : i < j}) = f(∑i:i<j
πj,iai), (3.6)
where πj,i is the weight from unit i to unit j in the standout network or the adaptive dropout network;
f(·) is a sigmoidal function, f : R → [0, 1]. We use the logistic function, f(z) = 1/(1 + exp(−z)). A
unit that is highly responsive to a particular training case will have a higher probability of being kept,
whereas units that have low and moderate activities are likely be omitted through the sampling process.
This will drive units to lock onto the patterns that repeatedly appear.
The binary unitsm in the standout network can be viewed as a binary belief network that overlays the
deep network, except for α and β, which adjust the scale of the deep network parameters so that suitable
predictions can be made for the standout probabilities. Such standout network can stochastically adjust
deep network architecture given the input vector. We can also restrict the standout network that has
only one bias parameter for each unit which does not depend on the input vector, so that the network
will learn a marginal dropout probability for all hidden units.
During the sampling process, the random masking variables m alter each training example presented
to the network during the learning phase, as described below. Hidden activities that are extreme will have
less variation in the sampling process during learning. However, moderate activities will be randomly set
to zero quite frequently, leading to higher regularization, or if viewed another way, an effective increase
in the dataset through the introduction of random perturbations.
We will use the logistic function as the standout function, unless otherwise stated. β and α are
hyperparameters that adjust the bias of the standout probability and its sensitivity to the hidden activity.
To simplify the discussion below, we assume α = 1 and β = 0, but we will return to discussing these
hyper-parameters later.
3.4 Making predictions
To process an input, the stochastic feedforward process is replaced by taking the expectation of equation
(3.1):
E[a(l)j ] = g(
∑i
w(l)j,ia
(l−1)i ) · p(M (l)
j = 1), (3.7)
where p(M(l)j = 1) is computed using equation (3.6). Given that an activity is close to zero, the final
expected activation will be even closer to zero. When an activity is close to one, the expected activation
will not be penalized as much. As a result, this deterministic process pushes unit activities to zero.
3.5 Approximation to Bayesian Prediction
The standout network we introduced here induces a probability distribution over all the possible network
architectures given the weights. In a Bayesian framework, when we make prediction, we want to integrate
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 3. Learning with Adaptive Dropout Network 10
Figure 3.1: weights from standout network that are most responsive to each of the 10 digits, 0 to 9 (left)auto-encoder and (right) discriminative neural net
over the possible models and corresponding parameters. We show that such an integration can be
approximated as following:
p(y|x,X,Y) =∑M
∫θ
p(y|x, θ,M)p(M|θ,X,Y)p(θ|X,Y)
≈∑M
p(y|x, θ∗,M)Q(M|x, θ∗,X,Y) (3.8)
where θ is the unknown parameter that includes both W and π. M denotes a particular network
architecture. In the above approximation, the integration over the parameters θ is replaced by a single
parameter setting θ∗. This can be understood as the complicated Bayesian predictive distribution
p(y|x,X,Y) is approximated by a mixture distribution with 2N mixture components. The predictive
distribution has a complex dependency on the input vector x. We also let the mixing proportion
distribution Q depends on x in our approximation. The learning is done by minimizing the free energy
for such a mixture model, see section (3.7).
3.6 Stochastic Adaptive Mixture of Local Experts
A neural network of N hidden units can be viewed as 2N possible models, given standout vector M . Each
of the 2N models acts like a separate “expert” network that perform well for a subset of the input space.
Training all 2N models separately can easily over-fit to the data. Weight-sharing among the models
can prevent over-fitting. In that case, expert networks have a distributed representation. Therefore,
the standout network, very much like a gating network, also produces a distributed representation
to stochastically choose which expert to turn on for a given input. This means, in this distributed
representation, 2N models are chosen by N binary numbers.
The standout network partitions the input space into different regions that are suitable for each
experts. We can visualize the effect of the standout network by showing the units that output high
standout probability for one class but, not others. The standout network learns that some hidden units
are very important for one class and will tend to keep those. These hidden units are then more likely to
be dropped when the input comes from a different class, see figure (3.1).
3.7 Learning
For a specific configuration m of the mask variables, let L(m,w) denote the likelihood of a training set
or a minibatch, where w is the set of neural network parameters. It may include a prior as well. The
dependence of L on the input and output have been suppressed for notational simplicity. Given the
current dropout parameters, π, the standout network acts like a binary belief network that generates a
distribution over the mask variables for the training set or minibatch, denoted P (m|π,w). Again, we have
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 3. Learning with Adaptive Dropout Network 11
0 50 100 150 200 250
0
50
100
150
200
250
Standout Network weights
0 50 100 150 200 250
0
50
100
150
200
250
Neural Network weights
Figure 3.2: First layer standout network filters and neural network filters learnt from MNIST data usingour method.
suppressed the dependence on the input to the neural network. As described above, this distribution
should not be viewed as the distribution over hidden variables in a latent variable model, but as an
approximation to a Bayesian posterior distribution over model architectures.
The goal is to adjust π and w to make P (m|π,w) close to the true posterior over architectures as
given by L(m,w), while also adjusting L(m,w) so as maximize the data likelihood w.r.t. w. Since
both the approximate posterior P (m|π,w) and the likelihood L(m,w) depend on the neural network
parameters, we use a crude approximation that we found works well in practice. If the approximate
posterior were as close as possible to the true posterior, then the derivative of the free energy F (P,L)
w.r.t P would be zero and we can ignore terms of the form ∂P/∂w. So, we adjust the neural network
parameters using the approximate derivative,
−∑m
P (m|π,w) ∂
∂wlogL(m,w), (3.9)
which can be computed by sampling from P (m|π,w). For a given setting of the neural network pa-
rameters, the standout network can in principal be adjusted to be closer to the Bayesian posterior by
following the derivative of the free energy F (P,L) w.r.t. π. This is difficult in practice, so we use an
approximation where we assume the approximate posterior is correct and sample a configuration of m
from it. Then, for each hidden unit, we consider mj = 0 and mj = 1 and determine the partial contri-
bution to the free energy. The standout network parameters are adjusted for that hidden unit so as to
decrease the partial contribution to the free energy. Namely, the standout network updates are obtained
by sampling the mask variables using the current standout network, performing forward propagation in
the neural network, and computing the data likelihood. The mask variables are sequentially perturbed
by combining the standout network probability for the mask variable with the data likelihood under the
neural network, using a partial forward propagation. The resulting mask variables are used as complete
data for updating the standout network.
The above learning technique is approximate, but works well in practice and achieves models that
outperform standard dropout and other feature learning techniques, as described below.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 3. Learning with Adaptive Dropout Network 12
Algorithm 1: Standout learning algorithm: alg1 and alg2
1 Notation: H(·) is Heaviside step function ;
Input: w, π, α, β
2 alg1: initialize w, π randomly; alg2: initialize w randomly, set π = w;
3 while not stopping criteria do
4 for hidden unit j = 1, 2, ... do
5 P (mj = 1|{ai : i < j}) = f(α∑
i:i<j πj,iai + β);
6 mj ∼ P (mj = 1|{ai : i < j});7 aj = mjg
(∑i:i<j wj,iai
);
8 end
9 Update neural network parameter w using ∂∂w logL(m,w);
/* alg1 */
10 for hidden unit j = 1, 2, ... do
11 tj = H(L(m,w|mj = 1)− L(m,w|mj = 0)
)12 end
13 Update standout network π using target t ;
/* alg2 */
14 Update standout network π using π ← w ;
15 end
3.8 Exploratory experiments
Here, we study different aspects of our method using MNIST digits (see below for more details). We
trained a shallow one hidden layer auto-encoder on MNIST using the approximate learning algorithm.
We can visualize the effect of the standout network by showing the units that output low dropout
probability for one class but not others. The standout network learns that some hidden units are
important for one class and tends to keep those.
We trained a shallow one hidden layer auto-encoder on MNIST using the approximate learning
algorithm. The resulting first layer filters of both the standout network and auto-encoder are shown in
figure (3.2). We noticed that the weights in both networks are very similar. It is also very expensive
to perturb hidden units one by one when training the standout network. It motivates us to share the
parameter W and π and train them together using only the M step. To account for different scales and
shifts, we set π = αw + β, where α and β are learnt. This technique works very well in practice and
yields similar results to having two separate set of parameters. Namely, for unsupervised learning on
MNIST, we obtain 153 error for shared parameters and 158 error for unshared parameters.
3.9 Choice of standout function and hyper-parameters α and β
Concretely, we found empirically that the standout network parameters trained in this way are quite
similar (although not identical) to the neural network parameters, up to an affine transformation. This
motivated our second algorithm alg2 in psuedocode(1), where the neural network parameters are trained
as described in learning section 3.7, but the standout parameters are set to an affine transformation of
the neural network parameters with hyper-parameters alpha and beta. These hyper-parameters are
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 3. Learning with Adaptive Dropout Network 13
determined as explained below. We found that this technique works very well in practice, for the
MNIST and NORB datasets (see below). For example, for unsupervised learning on MNIST using the
architecture described below, we obtained 153 errors for tied parameters and 158 errors for separately
learnt parameters. This tied parameter learning algorithm is used for the experiments in the rest of
the paper. In the above description of our method, we mentioned two hyper-parameters that need to
be considered: the scale parameter α and the bias parameter β. Here we explore the choice of these
parameters, by presenting some experimental results obtained by training a dropout model as described
below using MNIST handwritten digit images.
α controls the sensitivity of the standout function to the weighted sum of inputs that is used to
determine the hidden activity. In particular, α scales the weighted sum of the activities from the layer
before. In contrast, the bias β shifts the standout probability to be high or low, and ultimately controls
the sparsity of the hidden unit activities. A model with a more negative β will have most of its hidden
activities concentrated near zero.
Figure (3.3a) illustrates how choices of α and β change the dependence of the standout probability
on the input. It shows a histogram of hidden unit activities after training networks with different α’s
and β’s on MNIST images.
We also considered different forms of the standout function besides the logistic function, as shown in
figure 3.3(b). The effect of different functional forms can be observed in the histogram of the activities
after training on the MNIST images. The logistic standout function creates a sparse distribution of
activation values, whereas the functions such as 1− 4(1− σ(·))σ(·) produce a multi-modal distribution
over the activation values shown in 3.3(c).
3.10 Analysis of a two-layer network
Consider a network with a single hidden layer with linear outputs. Let yi be the ith output value in the
output vector. The squared error cost function is
J =1
2
∑i
(yi − yi)2 (3.10)
Let g(·) be the hidden variable activation function and f(·) be the standout function. Let hj =
g(∑
i uijxi) be the hidden activation and yi =∑
j wijhjMj be the output, where the u’s and the
w’s are the weights for the hidden variables and the outputs, and Mj are the binary random variables
from equation (3.2). The expected loss can be written as:
E[J ] =1
2
D∑i=1
(∑j
w2ijh
2jVar[Mj ]
+ (∑j
wijhjE[Mj ]− yi)2
), (3.11)
where the statistics of M are computed under P (M |W ). Note that Mi and Mj are conditional indepen-
dent given the input x, so that Cov(Mi,Mj |x) = 0, for i = j. We define pj = p(Mj = 1|x) = E[Mj ] =
s(∑
i uijxi).
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 3. Learning with Adaptive Dropout Network 14
We take the gradient of the expected loss with respect to decoder weights, wij :
∂E[J ]
∂wij= wijh
2jVar[Mj ] + hjpj
(∑k
wikhkpk − xi
)(3.12)
We also derive the gradient for encoder weights, uij .
∂E[J ]
∂uij=∑d
w2djhjVar[Mj ]g
′(hj)xi
+∑d
(∑k
wdkhkpk − xd
)wdjpjg
′(hj)xi
+∑d
w2djh
2j
[s′(pj)− 2pjs
′(pj)]xi
+∑d
(∑k
wdkhkpk − xd
)wdjhjs
′(hj)xi (3.13)
To compare, we write down the expected gradient used in the standout algorithm.
E
[∂J
∂wij
]= wijh
2jVar[Mj ] + hjpj
(∑k
wikhkpk − xi
)(3.14)
E
[∂J
∂uij
]=∑d
w2djhjVar[Mj ]g
′(hj)xi
+∑d
(∑k
wdkhkpk − xd
)wdjpjg
′(hj)xi (3.15)
Comparing equations (3.11) and (3.14), we found that the gradient of the expected loss is the same
as the expected gradient for decoder weights. Whereas for encoder weights, the expected gradient of
equation (3.14) is only the first two terms in the gradient of the expected loss from equation (3.13). We
observed that the standout algorithm computes the correct gradient for decoder weights according to
the expected loss. However, the algorithm does not properly account for the encoder weights as we can
see the difference between equation (3.13) and (3.14). In other words, the gradient updates from the
algorithm, described in the beginning of this section, does not take into account of the partial derivatives
of the parameters with respect to the standout function. Nevertheless, we ignored the extra terms in
the gradient of expected loss for computational simplicity reasons.
The key observation here is that the stochastic noise, we added to the neural network through the
standout network, acts as an adaptive regularization during training. We consider the first term in both
expected gradient update equations (3.14), (3.15). They take on the form of our model parameter W
multiplying a coefficient that is a function of the variance and hidden activation. It is equivalent to
having an quadratic penalty term in the actual cost function. Intuitively, the parameters of the model
will be penalized more if a hidden unit has more variability across the dataset. On the other hand, if
the hidden unit is confident about its prediction, it will be penalized less. This explicit terms in the
gradients corresponding to out intuition about the standout network from the previous section.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 3. Learning with Adaptive Dropout Network 15
3.11 Optimization and momentum
Algorithm 2: Stochastic gradient descent with Neterov momentum
Input: θ0, v0 = 0, η0, γ0
1 initialization;
2 for k = 1, 2, ... do
3 adjust γk and ηk following some scheme;
4 δk = ∂J(θk−1+vk−1)∂θ ;
5 vk = γkvk−1 + ηkδk;
6 θk = θk−1 + vk;
7 end
We optimize the model parameters using the stochastic gradient descent with the Nesterov momen-
tum technique [19]. The Nesterov momentum method can effectively speed up learning when applied
to large models in comparison to to standard momentum. When using the Nesterov momentum, the
cost function J and derivatives ∂J∂θ are evaluated at θ + vk [24], where vk = γvk−1 + η ∂J
∂θ is the velocity
and θ is the model parameter. γ < 1 is the momentum coefficient and η is the learning rate. Nesterov
momentum takes into account the velocity in parameter space when computing updates. Therefore, it
further reduces oscillations compared to standard momentum.
We schedule the momentum coefficient γ to further speed up the learning process. γ starts at 0.5 in
the first epoch and linearly increases to 0.99. The momentum stays at 0.99 during the majority portion
of the learning and is then linearly ramped down to 0.5 during the end of learning. ***insert scheduling
plots here***
It is important for the momentum coefficient γ to be bounded. It can be shown that when γk is
bounded between 0 and 1, vk is bounded to be a finite value. Furthermore, in order for stochastic opti-
mization to converge to some local optima, the learning rate η has to satisfy the following requirements
[26]: the infinite sum of ηk summing over k shall be unbounded; the infinite sum of (ηk)2 needs be
bounded to be some finite value. These three requirements can be summarized as:
0 < γk < 1, ∀k, (3.16)
∞∑k=0
ηk = ∞, (3.17)
∞∑k=0
(ηk)2 < ∞, (3.18)
Intuitively, the equation (3.17) ensures that the optimization is able to move far away from the
initialization and is likely to fall into a basin. The equation (3.18) ensures that the parameters will
converge to the local optima instead of bouncing around the mode. It is worth mentioning that the
exponential decay scheduling used in the experiments does not satisfy the equation 3.17. Nevertheless,
the experimental results shows that such schedule perform well in practice.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 3. Learning with Adaptive Dropout Network 16
0.0 0.2 0.4 0.6 0.8 1.00
2
4
6
8
10
12
14
16log histogram of hidden activities
AE α=0,β= +∞
dropout α=0,β=0
standout α=1,β=0
standout α=1,β=−1
standout α=1,β= +1
standout α=2,β= +2
standout α=0.5,β=1
(a)
0.0 0.2 0.4 0.6 0.8 1.00
2
4
6
8
10
12
14
16log histogram of hidden activities
standout σ(z)
standout −σ(z)
standout 1−4σ(z)σ(−z)
(b)
−6 −4 −2 0 2 4 60.0
0.2
0.4
0.6
0.8
1.0
various standout functions
σ(z)
1−4σ(z)σ(−z)
1−σ(z)
(c)
Figure 3.3: (a) Histogram of hidden unit activities for various choices of hyper-parameters using thelogistic standout function, including those configurations that are equivalent to dropout and no dropout-based regularization (AE). (b) Histograms of hidden unit activities for various standout functions. (c)Various standout function f(·) plots
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 4
Experimental Results
We consider both unsupervised learning and discriminative learning tasks, and compare results ob-
tained using standout to those obtained using restricted Boltzmann machines (RBMs) and auto-encoders
trained using dropout, for unsupervised feature learning tasks. We also investigate classification perfor-
mance by applying standout during discriminative training using the MNIST and NORB [17] datasets.
4.1 Nonlinearity for feedforward network
We used the ReLU [23] activation function for all the of results reported here, both on unsupervised
and discriminative tasks. The ReLU function can be written as g(x) = max(0, x). We found that its
use significantly speeds up training by up to 10-fold, compared to the commonly used logistic activation
function. The speed-up we observed can be explained in two ways: first, computations are saved from
the simple max operation to compute ReLU compared to the exponential function that needs to be
applied to compute the logistic function. Also, ReLUs do not suffer from the vanishing gradient problem
that logistic functions have for very large inputs.
4.2 Computation time
We used the publicly available gnumpy library [32] to implement our models. The models mentioned
in this work are trained on a single Nvidia GTX 580 GPU. As in psuedocode(1), the first algorithm is
relatively slow, since the number of computations is O(n2) where n is the number of hidden units. The
second algorithm is much faster and takes O(kn) time, where k is the number of configurations of the
hyper-parameters alpha and beta that are searched over. In particular, for a 784-1000-784 auto-encoder
model with mini-batches of size 100 and 50,000 training cases on a GTX 580 GPU, learning takes 1.66
seconds per epoch for standard dropout and 1.73 seconds for our second algorithm.
Model Regularization Seconds per epoachAE 784-1000-784 dropout 1.66AE 784-1000-784 standout 1.73
Figure 4.1: Computation time for mini-batches of 100 with 50,000 MNIST training cases
The computational cost of the improved representations produced by our algorithm is that a hyper
17
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 4. Experimental Results 18
0 50 100 150 200 250
0
50
100
150
200
250
MNIST dataset
Figure 4.2: Random images from MNIST dataset
parameter search is needed. We note that some other recently developed dropout-related methods, such
as maxout, also involve an additional computational factor.
4.3 Handwritten Digits Recognition: MNIST
The MNIST handwritten digit dataset is generally considered as a well-studied problem, which offers
the ability to ensure that new algorithms produce sensible results when compared to the many other
techniques that have been benchmarked. It consists of ten classes of handwritten digits, ranging from 0
to 9. There are, in total, 60,000 training images and 10,000 test images. Each image is 28 × 28 pixels
in size. Following the common convention, we randomly separate the original training set into 50,000
training cases and 10,000 cases used to validate the choice of hyper-parameters. We concatenate all the
pixels in an image in a raster scan fashion to create a 784-dimensional vector. The task is to predict the
10 class labels from the 784-dimensional input vector.
We show that standout networks can extract meaningful features in a self-taught fashion. The
features extracted using the standout network not only outperformed other off-the-shelf feature learning
methods but also it is very computational efficient comparing to more complicated methods like sparse
coding.
We use the following procedures for feature learning. We first extract the features using one of the
unsupervised learning algorithms in figure (4.4). The usefulness of the extracted features were then
evaluated by training a linear classifier to predict the object class from the extracted features. This
process is similar to that employed in other feature learning research [25].
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 4. Experimental Results 19
arch. act. func. err.raw pixel 784 7.2%
RBM784-1000
σ(·) 1.81%weight decay
DAE 784-1000-784 ReLU(·) 1.95%dropout 784-1000-784
ReLU(·) 1.70%AE 50% hidden dropoutstandout 784-1000-784
ReLU(·) 1.53%AE standout
error rateRBM + FT 1.24%DAE + FT 1.3%shallow dropout AE + FT 1.10%deep dropout AE + FT 0.89%standout shallow AE + FT 1.06%standout deep AE + FT 0.80%
We trained a number of architectures on MNIST, including standard auto-encoders, dropout auto-
encoders and standout auto-encoders. As described previously, we compute the expected value of each
hidden activity and use that as the feature when training a classifier. We also examined RBM’s, where
we the soft probability for each hidden unit was used as a feature. Different classifiers can be used to
give similar performance; we used a linear SVM because it is fast and straightforward to apply. However,
on subset of problems we tried logistic classifiers and they achieved indistinguishable classification rates.
Results for the different architectures and learning methods are compared in table 4.3. The auto-
encoder trained using our proposed standout technique with α = 1 and β = 0 performed the best on
MNIST.
In deep learning, a common practice is to use the encoder weights learnt by an unsupervised learning
method to initialize the early layers of a multilayer discriminative model. The backpropagation algorithm
is then used to learn the weights for the last hidden layer and fine tune the weights in the layers before.
This procedure is often referred to as discriminative fine-tuning. We initialized neural networks using
the models described above. The regularization method that we used for unsupervised learning (RBM,
dropout, standout) is also used for corresponding discriminative fine-tuning. For example, if a neural
network is initialized using an auto-encoder trained with standout, the neural network will also be fine
tuned using standout for all its hidden units, with the same standout function and hyper-parameters as
the auto-encoder.
During discriminative fine-tuning, we fixed the weights for all layers except the last one for the first 10
epoch, from then onwards, the weights are jointly updated updated. Congruent with previous authors,
we find that classification performance is usually improved by the use of discriminative fine tuning.
Impressively, we also found that a two-hidden-layer neural network with 1000 ReLU units in its first,
and its second hidden layers trained with standout, is able to achieve 80 errors on MNIST data after
fine tuning (error rate of 0.82%). This performance is better than the current best non-convolutional
result [11] and the training procedure is simpler.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 4. Experimental Results 20
0 100 200 300 400 500 600
0
100
200
300
400
500
600
NORB dataset
Figure 4.3: Random images from NORB dataset
4.4 3D object Recognition: NORB
The small NORB normalized-uniform dataset contains 24,300 training examples and 24,300 test exam-
ples. It consists of 50 different objects from five different classes: cars, trucks, planes, animals, and
humans. Each data point is represented by a stereo image pair of size 96 × 96 pixels. The training
and test set used different object instances, and images are created under different lighting conditions,
elevations and azimuths. In order to perform well in NORB, the recognition task demands learning
algorithms to learn features that can can be generalized to test set and be able to handle large input
dimension. This makes NORB significantly more challenging than the MNIST dataset. The objects
in the NORB dataset are 3D, under difference out-of-plane rotation, and so on. Therefore, the models
trained on NORB have to learn and store implicit representations of 3D structure, lighting and so on.
We formulate the data vector following the same “foveal” vision preprocessing technique used in [22], so
that the final training data vector has 8976 dimensions. Data points are subtracted by the mean and
divided by the standard deviation along each input dimension across the whole training set to normalize
the contrast. The goal is to predict the five class labels for the previously unseen 24,300 test examples.
The training set is separated into 20,000 for training and 4,300 for validation.
For this dataset, we initially tried to train a fully connected discriminative neural network from
scratch, but it performed poorly, even with the help of dropout. (We trained a two hidden layer neural
network on NORB to obtain 13% error rate and saw no improvement by using dropout). Such disap-
pointing performance motives us to investigate unsupervised feature learning and pre-training strategies
with standout method, using the same precedure described in the MNIST section.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 4. Experimental Results 21
arch. act. func. err.raw pixel 8976 23.6%
RBM8976-4000
σ(·) 10.6%weight decay
DAE 8976-4000-8976 ReLU(·) 9.5%dropout 8976-4000-8976
ReLU(·) 10.1%AE 50% hidden dropoutdropout 8976-4000-8976
ReLU(·) 8.9%AE * 22% hidden dropoutstandout 8976-4000-8976
ReLU(·) 7.3%AE standout
Figure 4.4: Performance of unsupervised feature learning, dropout AE * is optimized over its dropoutprobability using [31]
500 1000 1500 2000 2500 3000 3500 4000 4500
numer of hidden units
6
8
10
12
14
16
18
20
test
err
or
rate
(%
)
Classification error rate as a function of number of hidden units
DAE
Dropout AE
Deterministic standout AE
Standout AE
Figure 4.5: Classification error rate as a function of number of hidden units on NORB.
We performed extensive experiments on the NORB dataset with larger models. The hyperparameter
used for the best result was obtained by setting α = 1 and β = 1. Overall, we observed similar trends to
the ones we observed for MNIST. The standout method consistently performs better than other methods,
as shown in table (4.4).
On NORB dataset, we similarly performed discrimitive fine-tuning and observed performance gain as
on the MNIST dataset. We achieved 6.2% error rate by fine tuning the simple shallow auto-encoder from
table (4.4). Furthermore, a two-hidden-layer neural network with 4000 ReLU units in both hidden layers
that is pre-trained using standout, achieved 5.8% error rate after fine tuning. It is worth mentioning
that a small weight decay of 0.0005 is applied to this network during fine-tuning to further prevent
error rateDBN [29] 8.3%DBM [29] 7.2%third order RBM [22] 6.5%dropout shallow AE + FT 7.5%dropout deep AE + FT 7.0%standout shallow AE + FT 6.2%standout deep AE + FT 5.8%
Figure 4.6: Performance of fine-tuned classifiers, where FT is fine-tuning
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 4. Experimental Results 22
0 50 100 150 200 250 300
0
50
100
150
200
250
300
CIFAR-10 dataset
Figure 4.7: Random images from CIFAR-10 dataset
overfitting. It outperforms other models that do not exploit spatial structure. As far as we know, this
result is better than any previously published results without distortion or jitter. It even outperforms
carefully designed convolutional neural networks found in [12].
Figure 4.6 reports the classification accuracy obtained by different models, including state-of-the-art
deep networks.
4.5 Tiny Images Object Recogntion: CIFAR-10
CIFAR-10 dataset [14] is a labeled subset of data collected from 80 million tiny images dataset [33]. The
dataset contains 10 different object classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship,
truck. The dataset is divided into 50,000 training images and 10,000 test images. Each image is the
size of 32x32 pixels in 3 color channels. There are, in total 3072 dimension inputs associated with each
image. We did a very simple pre-processing to prepare the CIFAR-10 dataset. We subtracted the mean
and divided the standard deviation from each pixel across the whole training dataset. The test set is
pre-processed using the same mean and standard deviation computed from the training set.
Though some object classes in this dataset have a lot of variability in pose and viewing angle,
there are only 50,000 training data points. It is common to provide domain prior knowledge into
the topology of neural network models to the help object recognition task. We use a feed-forward
deep convolutional architecture for the CIFAR-10 dataset. Convolutional neural network is a special
class of neural networks that is designed to have translational invariance property. There are two
major components in a convolutional neural network: the convolution layer and the pooling layer. The
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 4. Experimental Results 23
error rate2 layer NN, softmax 44%3 layer conv. net, avg. pooling, 64-64-64-10 20.3%3 layer conv. net, max pooling, 64-64-64-10 19.2%3 layer conv. net, avg. pooling, 64-64-64-10, standout 21.0%3 layer conv. net, max/avg. pooling, 64 filters + locally connected layer [11] 16.6%3 layer conv. net, max/avg. pooling, 64 filters + locally connected layer, dropout [11] 15.6%3 layer conv. net, stochastic pooling, 64 filters + fully connected layer, dropout [35] 15.16%3 layer conv. net, avg. pooling, 128-128-128-10, standout 18.4%3 layer conv. net, avg. pooling, 128-128-128-1000-10 24.5%3 layer conv. net, avg. pooling, 128-128-128-1000-10, standout 17.0%3 layer conv. net, avg. pooling, 256-256-256-1000-10, standout 15.2%
Figure 4.8: Performance of CIFAR-10 classifiers
neurons in the convolution layer are connected to a small receptive field in the layer below. At each
spatial location, there is a bank of neurons looking over the same receptive field with different weights.
These groups are repeated across all the of neurons at different location with shared weights. The
pooling layers, which are stacked on top of the convolution layer, contain neurons reporting the average
or maximum activity of the local pool of convolutional neurons. Usually in a deep convolutional neural
network, convolution and pooling layers are alternatively stacked for three stages with a total of six
layers. A fully connected hidden layer followed by a softmax layer on the last pooling layer is used to
make model prediction.
We applied standout to all the neurons in the convolution layers and the fully connected layers.
The experiment results are shown in table (4.5). The models are trained discriminatively from scratch.
Unlike the models trained in [11], we do not have a local contrast normalization after each pooling layer.
Also, all the pooling layers are the same type. We notice that standout can help reduce overfitting when
training the large models. Empirically, the classification performance increases with the number of filters
in the filter banks. We expect that the performance can be further improved through using a larger
number of filters. Additionally, contrast normalization layer such as the one used in [11] can also improve
the performance. It is worth mentioning that, in all the experiments, we found that aggressive stochastic
noise, such as a 50% dropout and standout with large positive β, works well with average pooling but not
with max pooling. The large stochastic noise makes the optimization diverge for convolutional neural
net with max pooling scheme.
4.6 Discussion
The standout method is able to outperform other feature learning methods in both datasets by a no-
ticeable margin. The stochasticity introduced by standout removes hidden units that have little effects
with high probability during the training phase. The learning process can focus on refining features and
pattern that are important to represent the data points. By inspecting the weights from auto-encoders
regularized by dropout and standout, the weights from standout auto-encoder are more refined and
shaper than those learnt by dropout. It is commonly hypothesized that sharper filters or dictionary
elements generate more meaningful features.
The effect of the number of hidden units is studied using network size {500, 1000, 1500, ..., 4500}.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 4. Experimental Results 24
Figure(4.5) shows that all algorithms can generally perform better by increasing the number of hid-
den units. One notable trend for dropout regularization is that dropout achieves significantly better
performance with a large number of hidden units since all units have an equal chance to be omitted.
Comparatively, standout can achieve similar performance with only half number of hidden units as in
dropout. This is because useful hidden units will be kept around more often, while less effective units
are dropped out.
We also believe that randomness in training is crucial in order to regularize the network. Such
regularization would not be attainable just by using a different activation function. In order to verify
this hypothesis, we simply trained a deterministic auto-encoder with hidden activation function as,
equation(3.7), using standard backpropagation algorithm. The result is shown in figure(4.5) as the
deterministic standout curve.
It is believed that sparse features can help improve the performance of linear classifiers. An auto-
encoder trained using the combination of ReLU and standout naturally generate a set of sparse features.
We indeed observed the competitive performance in figure(4.4). Here, we are interested in the whether
training a sparse auto-encoder will yield similar performance results as standout method. We applied
soft L1 penalty on the hidden units and trained an auto-encoder to match the implicit sparsity of the
standout model. The final features extracted using the sparse auto-encoder achieved 10.2% error on
NORB. We expect to improve the performance of the sparse auto-encoder by further tuning its hyper
parameters. On the other hand, the hyper parameters for standout models are easier to tune and has
little effect on the final performance. Moreover, the sparse features learnt using standout method are
also computationally efficient compared to more sophisticated encoding algorithms like the sparse coding
used in [8]. In order to find the code for data points with more than 4000 dimensions and 4000 dictionary
elements, the sparse coding algorithm quickly become impractical.
Surprisingly, shallow network with standout regularization in table(4.4) outperforms some of the
much larger and deeper networks shown in table(4.4). Some of those deeper models have three or four
times parameters than the shallow network we trained here. This particular result show that a simple
model with proper regularization can still achieve high performance compared to the more complicated
counterparts.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Chapter 5
Conclusion
In this work, we introduced an adaptive dropout algorithm, called standout, which allows different hid-
den variables to have different dropout probabilities, by overlaying the original deep network with a
probabilistic binary belief network that shares parameters. We showed that standout is useful for both
unsupervised feature learning and supervised classification tasks. Models trained using standout con-
sistently achieve better generalization performance on classification tasks, compared to existing feature
learning methods on three sets of experiments based on MNIST and NORB datasets. The results on
standout reinforce the notion that stochasticity during learning process is crucial for better generaliza-
tion and learning more useful features. These results provide additional support for the ‘regularization
by noise’ hypothesis that has been used to regularize other deep architectures, including RBMs, denois-
ing auto-encoders, and dropout. Based on our experiments on a deep auto-encoder, we find that with
proper regularization, it is possible to pre-train a stacked auto-encoder jointly from scratch, without
greedy layer-by-layer training, and obtain good results. Thus, this work points to new ways of training
deep architectures.
An obvious missing piece in this research is a good theoretical understanding of why the standout
function provides better regularization compared to the fixed dropout probability of 0.5. While we
have motivated our approach intuitively, we have not yet provided a theoretical justification for our
approach; currently, its value is based on its experimental performance. Generally, ‘regularization by
noise’ methods may be viewed as certain kinds of approximations to Bayesian learning. In this context,
dropout assumes that different hidden units have the same uncertainty a posteriori, whereas standout
allows different hidden units to have different amounts of uncertainty. In this view, the overlayed binary
belief network can be thought of as a probabilistic approximation to a regularizing posterior distribution
over architectures. We are currently exploring this interpretation.
25
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
Bibliography
[1] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from
examples without local minima. Neural networks, 2(1):53–58, 1989.
[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks.
Advances in neural information processing systems, 19:153, 2007.
[3] Yoshua Bengio. Learning deep architectures for ai. Foundations and Trends R© in Machine Learning,
2(1):1–127, 2009.
[4] C.M. Bishop. Training with noise is equivalent to tikhonov regularization. Neural computation,
7(1):108–116, 1995.
[5] A. Coates and A.Y. Ng. The importance of encoding versus training with sparse coding and vector
quantization. In International Conference on Machine Learning, volume 8, page 10, 2011.
[6] Adam Coates, Andrew Y Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised
feature learning. In International Conference on Artificial Intelligence and Statistics, pages 215–223,
2011.
[7] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,
signals and systems, 2(4):303–314, 1989.
[8] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In International
Conference on Machine Learning, 2010.
[9] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural
computation, 18(7):1527–1554, 2006.
[10] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks.
Science, 313(5786):504–507, 2006.
[11] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural
networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[12] Kevin Jarrett, Koray Kavukcuoglu, MarcAurelio Ranzato, and Yann LeCun. What is the best
multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International
Conference on, pages 2146–2153. IEEE, 2009.
[13] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural
networks. Advances in Neural Information Processing Systems, 25, 2012.
26
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
BIBLIOGRAPHY 27
[14] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Mas-
ter’s thesis, Department of Computer Science, University of Toronto, 2009.
[15] Quoc V Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S Corrado, Jeff
Dean, and Andrew Y Ng. Building high-level features using large scale unsupervised learning. arXiv
preprint arXiv:1112.6209, 2011.
[16] Yann Le Cun. Learning process in an asymmetric threshold network. In Disordered systems and
biological organization, pages 233–240. Springer, 1986.
[17] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition
with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR
2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–97. IEEE,
2004.
[18] David JC MacKay. Information theory, inference and learning algorithms. Cambridge university
press, 2003.
[19] J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. In
Proc. 28th Int. Conf. on Machine Learning, 2011.
[20] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity.
The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[21] Marvin Minsky and Seymour Papert. Perceptron: an introduction to computational geometry. The
MIT Press, Cambridge, expanded edition, 19:88, 1969.
[22] V. Nair and G. Hinton. 3d object recognition with deep belief nets. Advances in Neural Information
Processing Systems, 22:1339–1347, 2009.
[23] V. Nair and G.E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proc.
27th International Conference on Machine Learning, pages 807–814. Omnipress Madison, WI, 2010.
[24] Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2).
In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
[25] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit
invariance during feature extraction. In Proceedings of the Twenty-eight International Conference
on Machine Learning (ICML11), 2011.
[26] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathe-
matical Statistics, pages 400–407, 1951.
[27] Sam Roweis and Zoubin Ghahramani. A unifying review of linear gaussian models. Neural compu-
tation, 11(2):305–345, 1999.
[28] David E Rumelhart, Geoffrey E Hintont, and Ronald J Williams. Learning representations by
back-propagating errors. Nature, 323(6088):533–536, 1986.
[29] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep boltzmann machines. In
International Conference on Artificial Intelligence and Statistics. Citeseer, 2010.
متلب سایت
MatlabSite.com
MatlabSite.com متلب سایت
BIBLIOGRAPHY 28
[30] J. Sietsma and R.J.F. Dow. Creating artificial neural networks that generalize. Neural Networks,
4(1):67–79, 1991.
[31] Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical bayesian optimization of machine
learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960–2968,
2012.
[32] Tijmen Tieleman. Gnumpy: an easy way to use gpu boards in python. Department of Computer
Science, University of Toronto, 2010.
[33] Antonio Torralba, Robert Fergus, and William T Freeman. 80 million tiny images: A large data set
for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 30(11):1958–1970, 2008.
[34] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol. Stacked denoising autoencoders:
Learning useful representations in a deep network with a local denoising criterion. The Journal of
Machine Learning Research, 11:3371–3408, 2010.
[35] Matthew D Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural
networks. arXiv preprint arXiv:1301.3557, متلب سایت.2013
MatlabSite.com
MatlabSite.com متلب سایت