brain-inspired methods for training deep neural networks
BAR-ILAN UNIVERSITY
Brain-inspired methods for training
deep neural networks
Alexander Kalmanovich
Submitted in partial fulfillment of the requirements for the Master's
Degree in the Gonda Multidisciplinary Center for Brain Research,
Bar-Ilan University
Ramat Gan, Israel 2015
This work was carried out under the supervision of
Prof. Gal Chechik
From the Gonda Multidisciplinary Brain Research Center,
Bar-Ilan University
Contents
1. Abstract
2. Introduction
3. Methods
3.1 Feed forward neural network
3.2 Multi class classification
3.3 Hyper parameter selection with cross validation
3.4 The cross entropy loss function
3.5 Datasets
3.6 Implementation
4. DNNs with non-recurrent lateral connections
4.1 An update rule with lateral connections
4.2 Back propagation for a network with lateral connections
4.3 Tuning hyper parameters for networks with lateral-connections
4.4 Experiments on data with non-uniform class distribution
4.5 Results for experiments with lateral connections in DNNs
4.6 Conclusions
5. Gradual training of denoising auto encoders
5.1 Training denoising auto encoders
5.2 Gradual training of deep DAEs
5.3 Training procedure
5.4 Results for experiments with gradually trained DAEs
5.4.1 Unsupervised learning for denoising
5.4.2 Gradual-training DAE for initializing a network in a supervised task
6. Discussion
7. References
8. Appendix
8.1 Derivation of loss functions
8.2 Implementing DNN with lateral connections using standard DNN formulation
1. Abstract
Following recent advances in machine learning based on principles of learning and information processing
in the human brain, we investigate two brain-inspired approaches for training neural networks: (1) lateral
connections in deep neural networks (DNN) and (2) gradual training of deep denoising auto encoders
(DAE).
First, we used DNNs with non-recurrent lateral connections in hidden layers. However, we found no
significant benefit in using lateral connections over a network without lateral connections.
We then investigated a training scheme for a deep DAE, where DAE layers are gradually added and
keep adapting as additional layers are added. We show that in the regime of mid-sized datasets, this gradual
training provides a small but consistent improvement over stacked training in both reconstruction quality
and classification error on the MNIST and CIFAR datasets.
2. Introduction
Deep neural networks (DNN) are powerful learning models that have recently achieved excellent
performance on visual and speech recognition problems (Hinton et al. 2012; Krizhevsky, Sutskever, and
Hinton 2012; Razavian et al. 2014). Many advances in DNNs are based on principles of learning and
information processing inspired by the human brain. Leading architectures of neural networks (Krizhevsky,
Sutskever, and Hinton 2012; Zeiler and Fergus 2013; Szegedy et al. 2014) have local receptive fields and
hierarchical representation from local to global features, and share some properties with sensory pathways
in biological neural systems. For instance, local receptive fields are a hallmark of the V1 area in the visual
cortex (Hubel and Wiesel 1959), and transition from local to global features is a hallmark of the dorsal and
ventral visual pathways (Felleman and Van Essen 1991). We conjecture that building an artificial system
for accurate image recognition, a task that humans perform much better than artificial systems, could
continue to benefit from drawing inspiration from principles of learning and development in neural systems
(Ranzato 2014).
In this work, we investigated two brain-inspired approaches for learning with DNNs: (1) non-recurrent
lateral connections in DNNs and (2) gradual training of deep denoising auto encoders.
Lateral connections between neurons in the primary visual cortex play an important role in visual
information processing (Stettler et al. 2002). In comparison, DNNs typically have either a feed forward
architecture or a recurrent architecture. Feed forward DNNs have no lateral connections, and recurrent
architectures typically have strong recurrences, making them harder to train.
We test here DNNs with non-recurrent lateral connections. We hypothesized that lateral inhibition
could be advantageous for a unit to propagate its signal to higher layers, while silencing other “competitor”
units in its layer. However, we found no significant benefit in using the type of lateral connections that we
tested.
In the second part of this work we investigated a training scheme of a denoising auto encoder (DAE),
where DAE layers are gradually added and keep adapting as additional layers are added. In the regime of
small datasets, DNNs are difficult to train since the amount of information provided by labels may be small
compared to the number of free parameters. This can be addressed by initializing networks through an
unsupervised phase. This approach was introduced using a greedy layer-wise unsupervised learning
algorithm for Deep Belief Networks (Hinton and Salakhutdinov 2006; Bengio et al. 2007). A similar
approach for training multi-layer (deep) DAE has been introduced by Vincent et al. (2010), where a deep
DAE is built by training a single hidden layer at each step, while freezing weights of lower layers.
Areas in the primate visual cortex mature in a gradual order (Guillery 2005; Bourne and Rosa 2006),
and early layers in the mammalian visual system keep adapting for prolonged periods, and their synapses
remain plastic long after representations have been formed in high brain areas (Liu, Murray, and Jones
2004). We therefore turned to explore alternative training schedules for deep DAEs, which avoid freezing
early weights.
We test here ‘gradual training’, where training occurs layer-by-layer, but lower layers keep adapting
throughout training. We compare gradual training to stacked training and to a hybrid approach, all under a
fixed budget of training update steps. We then test gradual training as an initialization for supervised
learning, and quantify its performance as a function of dataset size. Gradual training provides a small but
consistent improvement in reconstruction error and classification error in the regime of mid-sized datasets.
3. Methods
3.1 Feed forward neural network
The machine learning model we use to solve a classification problem is a feed forward neural network
(Minsky and Papert 1969; Rumelhart, Hinton, and Williams 1988). In a standard feed forward DNN, the
units are layered, such that each unit is fully connected to all units in the previous layer but not to other
units in the same layer. The activation of unit $i$ at layer $l$ is $a_i^l = \sum_j w_{ij}^{l-1} z_j^{l-1}$, where $w_{ij}^{l}$ denotes the
weight of the connection from unit $j$ at layer $l$ to unit $i$ at layer $l + 1$. The output of unit $i$ at layer $l$ is
$z_i^l = g(a_i^l)$, where $g$ is the activation function. All experiments described below use units with the sigmoid
activation function $g(x) = 1/(1 + e^{-x})$. We denote the input layer as layer 1, so the forward pass proceeds from layer $l$
to layer $l + 1$.
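The forward pass described above can be sketched as follows. This is a minimal illustration in plain Python, not the MATLAB code used in this work; the function names and the toy weights are assumptions made for the example, and bias terms are omitted because the activation formula above does not include them.

```python
import math


def sigmoid(x):
    """Logistic sigmoid g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, x):
    """Forward pass through a fully connected sigmoid network.

    weights[l][i][j] is the weight from unit j of one layer to unit i of
    the next layer; x holds the outputs of the input layer.
    Returns the list of output vectors of all layers, input included.
    """
    outputs = [list(x)]
    for w in weights:
        prev = outputs[-1]
        # Activation of each unit: weighted sum of the previous layer's outputs.
        activations = [sum(w_ij * z_j for w_ij, z_j in zip(row, prev)) for row in w]
        outputs.append([sigmoid(a) for a in activations])
    return outputs


# Toy network (arbitrary weights): 2 inputs -> 2 hidden units -> 1 output unit.
toy_weights = [
    [[1.0, -1.0], [0.5, 0.5]],
    [[2.0, -2.0]],
]
layer_outputs = forward(toy_weights, [1.0, 1.0])
```

Here `layer_outputs[-1]` plays the role of the network output, and each unit receives input only from the layer below, as in a standard feed forward DNN.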
3.2 Multi class classification
In the multi-class classification problem we aim to solve, a training set consists of $n$ images $x_1, \ldots, x_n$ with
corresponding labels $c_1, \ldots, c_n$. Each label belongs to one of $K$ possible classes: $c_i \in \{1, \ldots, K\}$. A
classifier DNN has $d$ units in the input layer and $K$ units in the output layer. For example, when the input is
a raw image, $d$ is the number of pixels in the image (all images are of the same size in our case). During
training, when considering a sample $x_i$ that belongs to class $k$, we define a target output vector $t_i$ such that
$t$ is zero except for $t_k = 1$. The loss is then computed over the output of the network $y_i$ and $t_i$. A DNN is
trained on the training set with the goal of predicting a label for previously unseen images in the test set.
Given an input sample $x_i$, we define the DNN classifier prediction as $\mathrm{pred}_i = \arg\max_k (y_k)$, where $1 \le k \le K$
and $y_k$ denotes the value of output unit $k$ ($z_k^L$ for a network with $L$ layers). The classification error rate is
defined as the portion of images classified incorrectly: $\mathrm{error} = \frac{1}{n} \sum_{i=1}^{n} [\mathrm{pred}_i \ne c_i]$.
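As a small illustration of these definitions, the sketch below builds one-hot target vectors, takes the arg-max prediction, and computes the classification error rate. It is plain Python with hypothetical function names and made-up output values, intended only to make the definitions concrete.

```python
def one_hot(k, num_classes):
    """Target vector that is zero except for entry k (classes are 1-based)."""
    return [1.0 if j == k else 0.0 for j in range(1, num_classes + 1)]


def predict(y):
    """1-based index of the output unit with the largest value."""
    return max(range(len(y)), key=lambda k: y[k]) + 1


def error_rate(outputs, labels):
    """Portion of samples whose predicted class differs from the label."""
    wrong = sum(1 for y, label in zip(outputs, labels) if predict(y) != label)
    return wrong / len(labels)


# Made-up network outputs for 4 samples and 3 classes.
toy_outputs = [
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
toy_labels = [2, 1, 3, 2]
toy_error = error_rate(toy_outputs, toy_labels)  # one of four samples is misclassified
```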
3.3 Hyper parameter selection with cross validation
When training a neural network, some of the parameters, called hyper-parameters, are not tuned directly as
part of the error minimization problem, but rather are selected using cross validation. In the experiments
below, we tune the following hyper parameters: number of hidden units, number of hidden layers, learning
rate, seed for weight random initialization, momentum (Polyak 1964) and weight decay (Moody et al.
1995). We used 5-fold cross validation to select these hyper parameters.
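For readers unfamiliar with the procedure, a minimal sketch of how 5-fold cross validation partitions the training samples is given below. The helper name, the shuffling seed and the toy sample count are assumptions for illustration; the exact splitting code used in our experiments is not reproduced here.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_splits(n_samples=20, k=5))
```

Each hyper parameter configuration would then be trained on every `train` part and scored on the matching `validation` part, with the scores averaged over the folds.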
Following Bengio (2012), we initialize the weights $w^l$ randomly from a uniform distribution with
range $(-r, r)$, where $r = \sqrt{6/(n_{in} + n_{out})}$, $n_{in}$ is the number of units at layer $l$ and $n_{out}$ is the number of
units at layer $l + 1$.
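A minimal sketch of this initialization, assuming the range half-width is the square root of 6 divided by the sum of the two layer sizes, is shown below. The function name is hypothetical; the layer sizes in the example (784 input pixels for a 28-by-28 image, 400 hidden units) are taken from values mentioned elsewhere in this work.

```python
import math
import random


def init_weights(n_in, n_out, rng):
    """Draw an n_out-by-n_in weight matrix uniformly from (-r, r)."""
    r = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-r, r) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)  # the seed is itself one of the tuned hyper parameters
w = init_weights(784, 400, rng)
bound = math.sqrt(6.0 / (784 + 400))
```

The range shrinks as the layers grow, which keeps the summed input to each sigmoid unit from saturating at the start of training.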
3.4 The cross entropy loss function
We use the cross entropy (XENT) loss function for DNN as in (Vincent et al. 2010), defined as
$$\mathrm{XENT}(\vec{t}, \vec{y}) = -\sum_k [t_k \log y_k + (1 - t_k) \log(1 - y_k)]$$
where $\vec{t}$ is the expected output vector of the DNN, $\vec{y}$ is the actual output vector of the DNN, $k$ iterates over
the vector dimensions and $y_k = z_k^L$ for a DNN with $L$ layers ($z_k^L$ is as defined in Section 3.1).
The cross entropy loss is advantageous over the mean square error (MSE) loss in a DNN with sigmoid
output units. The derivative of $\mathrm{MSE}(\vec{t}, \vec{y}) = \frac{1}{2} \sum_k (t_k - y_k)^2$ with respect to the activation of the output
unit $a_k$ is $(y_k - t_k) y_k (1 - y_k)$. Clearly, this derivative vanishes for $y$ values that are close to zero or to
one, causing the gradient steps to vanish. In contrast, the derivative of the cross entropy is $y_k - t_k$ and does
not suffer from this problem. See Appendix 8.1 for detailed derivation.
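The difference between the two gradients can be checked numerically for a single sigmoid output unit. The sketch below is illustrative Python with hypothetical function names; it compares the analytic derivatives stated above with finite-difference estimates for a saturated unit whose output is far from its target.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def mse_loss(a, t):
    return 0.5 * (t - sigmoid(a)) ** 2


def xent_loss(a, t):
    y = sigmoid(a)
    return -(t * math.log(y) + (1.0 - t) * math.log(1.0 - y))


def mse_grad(a, t):
    """Analytic derivative of the MSE loss with respect to the activation a."""
    y = sigmoid(a)
    return (y - t) * y * (1.0 - y)


def xent_grad(a, t):
    """Analytic derivative of the cross entropy loss with respect to a."""
    return sigmoid(a) - t


def numeric_grad(loss, a, t, eps=1e-6):
    """Central finite-difference estimate of d loss / d a."""
    return (loss(a + eps, t) - loss(a - eps, t)) / (2.0 * eps)


# Saturated and wrong: the target is 1 but the activation is strongly negative.
a_sat, target = -8.0, 1.0
g_mse = mse_grad(a_sat, target)    # tiny in magnitude: the unit barely learns
g_xent = xent_grad(a_sat, target)  # close to -1: a strong corrective signal
```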
3.5 Datasets
To evaluate our proposed algorithms, we conducted experiments on three benchmark datasets: MNIST
(Lecun et al. 1998), CIFAR-10 and CIFAR-100 (Krizhevsky and Hinton 2009). MNIST contains 70,000
28-by-28 grayscale images, each containing a single hand-written digit. CIFAR-10 and CIFAR-100 contain
60,000 natural RGB images of 32-by-32 pixels from 10 or 100 categories respectively.
3.6 Implementation
We conduct all experiments using the "MEDAL" framework, a MATLAB implementation of DNNs and
auto encoders (Stansbury 2012). MEDAL was chosen because development in MATLAB is fast and
convenient, though performance is slower than an implementation in a native language such as C.
4. DNNs with non-recurrent lateral connections
The idea of neural networks with lateral connections is not new. Smid (1994) showed that lateral
connections can reduce the required number of hidden units in a DNN in order to approximate a function,
but this work did not involve learning network weights with back propagation. Kothari and Agyepong
(1996) experimented with simple lateral connectivity in the form of a chain, where unit $i$ in a hidden layer
is connected to unit $i + 1$, showing that a small network with one hidden layer of 12 units can approximate
a function better with chain lateral connections than without. Larochelle et al. (2009) have used recurrent
lateral connections in denoising autoencoders (DAE) and showed that DAEs with lateral connections
outperform DAEs without lateral connections in a classification task on MNIST and OCR-letters datasets.
In this work we tested DNNs with non-recurrent lateral connections. This is motivated by the claim that
in the brain, lateral inhibition can be advantageous for a unit to propagate its signal to higher layers, while
silencing other “competitor” units in its layer (Stettler et al. 2002; Weliky et al. 1995).
4.1 An update rule with lateral connections
The activation function of a unit in the network was expanded to include the lateral (intra layer) connections:
$a_i^l = \sum_j w_{ij}^{l-1} z_j^{l-1} + \sum_{k \ne i} v_{ik}^l z_k^l$, where $v_{ik}^l$ is the weight on the connection from unit $k$ to unit $i$ at
layer $l$ (compare to the definition of the activation in Section 3).
To avoid recurrence, we restricted the lateral connectivity of units in the same layer as follows. First,
self-connections were not allowed ($\forall i: v_{ii}^l = 0$). Second, each unit was set to be either an emitter unit –
with outgoing lateral connections, or a receiver unit – with incoming lateral connections. When computing
the activations of the DNN during the forward pass, the activations of emitter units are computed first, and
then they are used in the computation of activation of the receiver units.
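The two-stage computation described above can be sketched as follows: emitter outputs are computed first from the feed forward input alone, and receiver units then add the weighted outputs of the emitters they are connected to. This is illustrative Python under our own naming (`lateral_layer_forward`, the dictionary layout of the lateral weights and the toy values are assumptions), not the MEDAL-based implementation.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def lateral_layer_forward(w, v, emitters, z_prev):
    """Outputs of one hidden layer with non-recurrent lateral connections.

    w[i][j]  : feed forward weight from unit j of the previous layer to unit i.
    v[i][k]  : lateral weight from emitter k to receiver i (nested dicts).
    emitters : set of emitter unit indices; all other units are receivers.
    """
    n_units = len(w)
    feed_forward = [sum(w_ij * z_j for w_ij, z_j in zip(row, z_prev)) for row in w]
    z = [None] * n_units
    # Stage 1: emitters receive no lateral input.
    for i in emitters:
        z[i] = sigmoid(feed_forward[i])
    # Stage 2: receivers add the lateral input coming from emitter outputs.
    for i in range(n_units):
        if i in emitters:
            continue
        lateral = 0.0
        for k, weight in v.get(i, {}).items():
            if k not in emitters:
                raise ValueError("lateral connections must originate at emitter units")
            lateral += weight * z[k]
        z[i] = sigmoid(feed_forward[i] + lateral)
    return z


# Toy layer: unit 0 is an emitter, units 1 and 2 are receivers inhibited by it.
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_prev = [1.0, -1.0]
emitters = {0}
v_inhibit = {1: {0: -0.5}, 2: {0: -0.5}}
z_lateral = lateral_layer_forward(w, v_inhibit, emitters, z_prev)
z_plain = lateral_layer_forward(w, {}, emitters, z_prev)  # no lateral weights
```

Because receivers never feed back into emitters, a single ordered pass suffices and no iteration to a fixed point is needed.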
4.2 Back propagation for a network with lateral connections
In a common feed forward neural network (NN), a unit is affected only by units at the previous layer through
its incoming weights (Bishop 2006; Rumelhart, Hinton, and Williams 1988):
$$\frac{\partial a_j^{l+1}}{\partial a_i^l} = \frac{\partial \sum_m w_{jm}^l z_m^l}{\partial a_i^l} = w_{ji}^l \, g'(a_i^l)$$
where $a$, $z$, $g$ and $w$ are as defined in Section 3.
Adding lateral connections also adds to the complexity of implementation of the update rule, since a
unit is now affected also by units in the same layer. To simplify implementation, we describe a way to view
this network as a standard feed forward network with additional hidden layer with dummy (identity) units,
but without lateral connections. This reformulation makes it easy to introduce lateral connections into
existing implementations of forward-only NN code.
To reformulate this network using standard NN formulation, we introduce dummy units, defined as a
unit that emits an output that is identical to its input. To illustrate, consider a simple network with one
hidden layer as depicted in Figure 1A. The network in Figure 1B is an equivalent network that has no lateral
(intra layer) connections, but does have connections between non-consecutive layers. For example, the
network has a connection between a unit of the input layer and unit $b$ of layer H2. Each dummy unit has
only one weight $w = 1$ (all other weights are zero). The weights on the connections from layer H1 to layer
H2 are fixed, and are not updated during learning.
The resulting network shown in Figure 1C is a standard feed forward network that is equivalent to the
network of Figure 1A. It differs from a classic feed forward network in two aspects: it has units with two
different activation function types and the dummy layer’s weights are fixed. It can be implemented easily
by slightly modifying an existing implementation of a common feed forward network (see Appendix 8.2).
Figure 1: Red arrows denote lateral connections. $\tilde{a}$, $\tilde{b}$ and $\tilde{c}$ are dummy units. Green arrows
denote a connection with a weight $w = 1$. H denotes the hidden layers. (a) A network with intra
layer connections. Unit $a$ is an emitter unit. Units $b$ and $c$ are receiver units. (b) A network without
intra layer connections, but with connections between non adjacent layers. (c) A feed forward
network with dummy units. All three networks are equivalent.
4.3 Tuning hyper parameters for networks with lateral-connections
We considered several ways to set the weights of the lateral connections during training. These include
fixing the weights to a constant value, updating them with error back propagation like the other weights,
or updating them with weight sharing, where all outgoing connections of an emitter unit have the same
weight. In the experiments described below, we took the first approach and set the lateral weights to
a fixed value selected using cross validation. This reduces the number of free parameters in the network,
and allows a more direct comparison with networks that do not have lateral weights.
The connectivity of lateral connections is determined by several hyper parameters: the fraction of
emitter units in a layer, the number of incoming connections to a receiver unit (like the incoming node
degree in a graph) and the value of the weight on the lateral connections (when all lateral connections share
the same weight). The values we considered for each hyper parameter are listed in Table 1.
Hyper parameter   Description                                         Considered values
nHidden           Number of hidden layers                             {1, 2}
nUnits            Number of hidden units                              {400, 500, 1000}
lRate             Learning rate                                       {0.01, 0.001}
Momentum          Momentum                                            0.9
wDecay            Weight decay                                        {0.0001, 0.001}
nBatch            Batch size                                          {10, 20}
emitterPortion    Portion of emitter units in a layer                 {0.2, 0.4, 0.6}
nEmitters         Number of incoming connections to a receiver unit   {1, 3, 6}
wLateral          Weight on each lateral connection                   {0, -0.2, -0.5, 0.5}
Table 1 : List of hyper parameters we tuned for the laterally-connected neural network. The
last 3 parameters control the lateral connectivity. When wLateral is zero, the neural network
becomes identical to a classical network without lateral connections and serves as a control.
4.4 Experiments on data with non-uniform class distribution
When a standard DNN is trained with error back propagation, units in the same hidden layer evolve
independently of each other, since the only information that guides each unit is the back propagated error
signal from higher layers. One potential drawback of this independence is the herd effect: all units may
independently evolve to reduce the largest source of error in the training data (Christian and Lebiere 1990;
Kothari and Agyepong 1996). Network weights are initialized randomly to break symmetry and prevent
this potential problem (Bengio 2012), but it may not be enough as the initial variability tends to dissipate
as the network is trained.
When the class distribution is uniform in the training set, there is no single largest source of error. We
therefore tested our hypothesis in a setup where the distribution of classes is highly unbalanced. In this
setting we expected most of the network units to represent the common classes in the distribution, while the rare
classes would be underrepresented.
4.5 Results for experiments with lateral connections in DNNs
We tested the effect of lateral connections on a network that learns data with highly unbalanced classes. We
generated 20 random non-uniform distributions over the 10 digit classes, by choosing 5 of the 10 classes to
have 3000 train cases (the common digits), and the rest to have 600 train cases (the rare digits), for a total
of 18000 samples. This was compared with a dataset of 6000 samples selected with a uniform distribution
over classes. The non-uniform distribution had an entropy of 2.97 bits compared to a uniform distribution
with an entropy of 3.32 bits. We then tested the performance of a neural network without lateral connections
on these training sets.
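The sample counts and entropy values quoted above can be verified directly from the per-class counts. The short Python check below is only an illustration; the function name is our own.

```python
import math


def class_entropy_bits(counts):
    """Shannon entropy (in bits) of the class distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


non_uniform = [3000] * 5 + [600] * 5  # 5 common digits, 5 rare digits
uniform = [600] * 10                  # 600 train cases per digit

h_non_uniform = class_entropy_bits(non_uniform)  # approximately 2.97 bits
h_uniform = class_entropy_bits(uniform)          # log2(10), approximately 3.32 bits
```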
Figure 2 compares the performance achieved on non-uniform distributions (with 18K samples) with
that achieved over a uniform class distribution (with 6K samples). It shows that for most of the non-uniform
distributions the classification error is lower, most likely because more samples were available during
training. However, for some non-uniform distributions the classification error is higher than the error on the uniform
distribution, and we hypothesized that this was because most of the network units became tuned to the common
digits.
Figure 2: Performance of DNN without lateral connections on various non-uniform
distributions, sorted by classification error. Digit distribution is specified by its 5 common digits
(3000 train cases each), while the other digits are rare (600 train cases each). Error bars are
over 7 random seeds that determine which train cases are included from the original MNIST
database. The dashed horizontal red line shows performance on a uniform distribution, in
which each class has 600 train cases (the same as the rare digits in the non-uniform
distribution).
We then quantified the effect of adding lateral connections on distributions that exhibited higher
classification error than uniform distributions and on some other distributions for control. Table 2 describes
the results. The model with lateral connections was obtained by choosing the best configuration of hyper
parameters on the validation set. We found no significant difference in using lateral connections over a
network without lateral connections.
Digit distribution   DNN+Lateral    DNN
01348                4.83 ± 0.17    4.46 ± 0.11
15789                4.22 ± 0.1     4.08 ± 0.12
01347                4.67 ± 0.19    4.48 ± 0.16
23479                3.64 ± 0.09    3.65 ± 0.09
Table 2: Classification error (percent) of a classical DNN and of a DNN with lateral
connections (DNN+Lateral) on different distributions (see Table 1). Digit
distribution is specified by its 5 common digits. Standard deviation of error was measured
on 8 different splits of the training set into train and validation sets. The following hyper
parameters were used in all the configurations (see Table 1): nHidden=1, nUnits=400,
lRate=0.01, momentum=0.9, wDecay=0.0001, nBatch=20.
We then repeated the experiment with a more balanced distribution of 3000 train cases for the common
digits and 1000 train cases for the rare digits (entropy of 3.13 bits). Table 3 describes the results. Using
lateral connections yielded a small but not significant improvement in classification performance in most
distributions.
Digit distribution   DNN+Lateral    DNN
01346                3.49 ± 0.14    3.60 ± 0.12
15789                3.15 ± 0.13    3.25 ± 0.13
01347                3.50 ± 0.11    3.57 ± 0.12
23479                3.00 ± 0.08    2.87 ± 0.07
01348                3.51 ± 0.09    3.53 ± 0.11
Table 3: Classification error (percent) of a standard DNN and of a DNN with lateral
connections (DNN+Lateral) on different distributions (1000 samples for each rare
digit and 3000 samples for each common digit). Digit distribution is specified by its 5
common digits. Standard deviation of error was measured on 10 different splits of the training
set into train and validation sets. Hyper parameters are the same as in Table 2.
4.6 Conclusions
We have experimented with classification of data with highly unbalanced classes using DNNs. We have
shown that the classification error on some unbalanced distributions is worse than on the uniform distribution, even
though the number of samples per class in the non-uniform distribution is greater than or equal to that in the uniform
distribution.
We have attempted to improve classification performance using DNNs with non-recurrent lateral
connections. However, we found no significant benefit in using lateral connections over a network without
lateral connections. Perhaps for lateral connections to be beneficial, it is necessary to assign in some way
the classes to which network units that emit lateral connections are tuned.
5. Gradual training of denoising auto encoders
A central approach in learning meaningful representations is to train a deep network for reconstructing
corrupted data. The idea is simple: given unlabeled data, a deep-network is given input-output pairs, where
the input consists of a corrupted version of an input sample and the output consists of the original non-corrupted
version which the network aims to reconstruct. Indeed, denoising autoencoders (DAE) (Vincent
et al. 2008) have been shown to extract meaningful features that make it possible to correct corrupted input data
(Xie, Xu, and Chen 2012). These representations can later be used to initialize a deep network for a
supervised learning task. It has been shown that in the small-data regime, good initializations can cut down
the training time and improve the classification accuracy of the supervised task (Vincent et al. 2008;
Larochelle, Erhan, and Vincent 2009; Erhan et al. 2010; Vincent et al. 2010).
Going beyond a single layer, it has been shown that training a multi-layer (deep) DAE can be achieved
efficiently by stacking single-layer DAEs and training them layer-by-layer (Vincent et al. 2010).
Specifically, a stacked denoising autoencoder (SDAE) is trained as follows (Figure 3). First, a single-layer
auto encoder is trained over the corrupted input data $\tilde{x}$ and its weights are tuned (Figure 3a). Then, the
weights to the first hidden layer $w_1$ are frozen, and the data is transformed to the hidden representation
(Figure 3b). This transformed input $h_1(x)$ is then used to create a corrupted input to a second autoencoder
and so on (Figure 3c).
Stacked training has been shown to outperform training de-novo of a full deep network, presumably
because it provides better error signals to lower layers of the network (Erhan et al. 2009). However, stacked
training is greedy in the following sense: When the first layer is trained, it is tuned such that its features can
be directly used for reconstructing the corrupted input. Later on, however, these features are used as input
to train more complex features. Compare this with natural neural systems, where
early layers in the mammalian visual system keep adapting for prolonged periods, and their synapses
remain plastic long after representations have been formed in high brain areas (Liu, Murray, and Jones
2004). We therefore turned to explore alternative training schedules for deep DAEs, which avoid freezing
early weights.
We test here ‘gradual training’, where training occurs layer-by-layer, but lower layers keep adapting
throughout training. We compare gradual training to stacked training and to a hybrid approach, all under a
fixed budget of training update steps. We then test gradual training as an initialization for supervised
learning, and quantify its performance as a function of dataset size. Gradual training provides a small but
consistent improvement in reconstruction error and classification error in the regime of mid-sized datasets.
5.1 Training denoising auto encoders
For completeness, we detail here the procedure for training stacked denoising autoencoders described by
Vincent et al. (2010). Figure 3 describes the architecture and the main training phases. For
training the first layer with a training sample $x$, masking noise is used to create a corrupted noisy version
$\tilde{x}$ (Figure 3a, “corrupt” arrow). A forward pass is taken, computing the hidden representation $h_1 = \mathrm{Sigmoid}(w_1 \tilde{x})$
and the output $y = \mathrm{Sigmoid}(w_2' h_1)$. All weights are updated by propagating the error
gradient back through the network. Specifically, the loss function is often taken to be the cross entropy
between $y$ and $x$ (Figure 3a, dotted arrow). This is repeated for other samples in a stochastic gradient descent
(SGD) fashion, and combined with momentum and weight decay to speed up training. Importantly, on each
pass through the training set, the same sample $x$ is corrupted with noise randomly and thus the DAE is
presented with many different corrupted versions of the same sample: $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n$, where $n$ is the number
of passes through the entire training set.
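One training step of a single-layer DAE, as described above, can be sketched in plain Python: corrupt the sample with masking noise, run the forward pass, and measure the cross entropy between the reconstruction and the clean sample. The function names, the toy sizes, the weight range and the 25% corruption level are assumptions made for this illustration; the weight update itself (back propagation with momentum and weight decay) is omitted.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def layer(w, z):
    """Sigmoid layer: w[i][j] connects input j to unit i."""
    return [sigmoid(sum(w_ij * z_j for w_ij, z_j in zip(row, z))) for row in w]


def mask_noise(x, fraction, rng):
    """Masking noise: set a randomly chosen fraction of the entries to zero."""
    n_corrupt = int(round(fraction * len(x)))
    corrupted = list(x)
    for idx in rng.sample(range(len(x)), n_corrupt):
        corrupted[idx] = 0.0
    return corrupted


def xent(x, y):
    """Cross entropy between the clean sample x and the reconstruction y."""
    return -sum(x_k * math.log(y_k) + (1.0 - x_k) * math.log(1.0 - y_k)
                for x_k, y_k in zip(x, y))


def dae_forward(w1, w2_dec, x, fraction, rng):
    x_tilde = mask_noise(x, fraction, rng)  # fresh corruption on every call
    h1 = layer(w1, x_tilde)                 # hidden representation
    y = layer(w2_dec, h1)                   # reconstruction of the clean input
    return x_tilde, y, xent(x, y)


rng = random.Random(0)
x = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
w1 = [[rng.uniform(-0.5, 0.5) for _ in x] for _ in range(3)]
w2_dec = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in x]
x_tilde, y, loss = dae_forward(w1, w2_dec, x, fraction=0.25, rng=rng)
```

Note that the loss compares the output with the clean sample `x`, not with the corrupted `x_tilde`, which is what forces the network to denoise.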
To train a deep network, multiple DAEs are stacked using greedy layer-wise training (Vincent et al.
2010). After the first DAE is trained, the learned encoding weights $w_1$ are fixed, and the data is mapped to
the hidden layer representation $h_1$ (Figure 3b, blank arrow). The second DAE is trained based on $h_1(x)$
using the same procedure as the first layer (Figure 3b). Importantly, the corrupting noise is applied to the
hidden representation $h_1(x)$ to create $\tilde{h}_1$, with the motivation being that injecting noise to the hidden layer
introduces variability of the more-abstract representation that was already learned by the network. Training
of subsequent layers follows the same procedure, injecting noise at higher and higher layers.
Often, this layer-wise training procedure is followed by a full back-propagation phase, where noise is
injected to the original input $x$ and all layers are updated jointly. Then, the SDAE can be used to initialize
a deep network for a supervised classification task by replacing the top reconstruction layer with a (usually
multi-class) classification layer; this initialization is called pretraining.
The rationale behind unsupervised pretraining of a deep network for a supervised task is as follows.
Searching the parameter space of deep architectures is a difficult task because the training criterion is non-
convex and involves many local minima. Random initialization of a deep architecture falls with very high
probability in the basin of attraction of a poor local minimum (Erhan et al. 2009). Unsupervised pre-training
initializes a deep architecture in a basin of attraction of gradient descent corresponding to better
generalization performance (Erhan et al. 2010).
Figure 3: Stacked training of a stacked DAE with 3 hidden layers. $x$ denotes an input sample and $y$
denotes the network output. Black layers are the ones used for computing the loss. Gray arrows denote
weights that are updated through back propagation, while blank arrows denote weights which
are not changed during training. $w'$ weights are discarded in subsequent training phases. Crosses
illustrate corrupted units. (a) Training the 1st hidden layer. (b) Training the 2nd hidden layer. Noise is
injected to $h_1$, creating $\tilde{h}_1$. (c) Training the 3rd hidden layer.
5.2 Gradual training of deep DAEs
We describe an alternative, gradual scheme for training autoencoders. The basic idea is to train the deep
autoencoder layer-by-layer, but keep adapting the lower layers continuously. Noise injection is only applied
at the input level (Figure 4). The motivation for this procedure has two aspects. First, it allows lower weights
to take into account the higher representations during training, reducing the greedy nature of stacked
training. Second, denoising is applied to the input, rather than to a hidden representation learned in a greedy
way.
More specifically, the first layer is trained in the same way as in stacked training, producing the weights
$w_1$. Then, when adding the second layer autoencoder, its weights $w_2$ are tuned jointly with $w_1$. This is done
by using the weights $w_1$ to initialize the first layer and randomly initializing the weights of the second.
Given a training sample $x$, we generate a noisy version $\tilde{x}$, feed it to the 2-layered DAE, and compute the
activation at the subsequent layers $h_1 = \mathrm{Sigmoid}(w_1 \tilde{x})$, $h_2 = \mathrm{Sigmoid}(w_2 h_1)$ and $y = \mathrm{Sigmoid}(w_3' h_2)$.
Importantly, the loss function is now computed over the input $x$, and is used to update all the weights
including $w_1$ (Figure 4b). Similarly, if a 3rd layer is trained, it involves tuning $w_1$ and $w_2$ in addition to $w_3$
and $w_4'$ (Figure 4c).
There are therefore two main differences between gradual and stacked training of SDAE. First, in
gradual training, weights of lower layers are never fixed as in stacked training, but rather trained jointly
when tuning weights of a newly-added layer. Second, each training phase reconstructs a noisy version of
the input rather than a noisy version of a hidden-layer representation.
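The second training phase (Figure 4b) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `gradual_step`, the row-vector layout, the absence of biases, momentum and weight decay, and the plain gradient step are all simplifications introduced here. The flag `update_w1` is a hypothetical switch; setting it to `False` freezes the first layer while still reconstructing the clean input from a noisy input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradual_step(x, x_noisy, w1, w2, w3_dec, lr=0.1, update_w1=True):
    """One gradient step for a 2-hidden-layer DAE (weights are modified in place).

    x       : clean inputs, shape (batch, d_in), values in [0, 1]
    x_noisy : corrupted copy of x (noise is injected at the input only)
    w1, w2  : encoder weights; w3_dec : decoding weights of the top layer
    """
    # Forward pass on the noisy input.
    h1 = sigmoid(x_noisy @ w1)
    h2 = sigmoid(h1 @ w2)
    y = sigmoid(h2 @ w3_dec)

    # Cross-entropy loss against the clean input x, averaged over the batch.
    eps = 1e-12
    loss = -np.mean(np.sum(x * np.log(y + eps) + (1 - x) * np.log(1 - y + eps), axis=1))

    # Backward pass: for sigmoid outputs with cross entropy, dLoss/da = y - x.
    d_out = (y - x) / x.shape[0]
    d_h2 = (d_out @ w3_dec.T) * h2 * (1 - h2)
    d_h1 = (d_h2 @ w2.T) * h1 * (1 - h1)

    # Compute all gradients before changing any weight.
    g_w1, g_w2, g_w3 = x_noisy.T @ d_h1, h1.T @ d_h2, h2.T @ d_out
    w3_dec -= lr * g_w3
    w2 -= lr * g_w2
    if update_w1:            # gradual training: lower layer keeps adapting
        w1 -= lr * g_w1
    return loss
```

Calling `gradual_step` repeatedly over mini-batches with `update_w1=True` gives the joint tuning of the lower and upper layers described above.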
Figure 4: Gradual training of denoising auto encoder with 3 hidden layers. (a) Training 1st hidden
layer. (b) Training layers 1 + 2. (c) Training layers 1 + 2 + 3. In all panels, $x$ denotes an input sample
and $y$ the network output. The loss is computed over the black layers. Gray arrows denote weights
that are updated through back propagation. $W'$ denotes weights used for decoding, which are discarded
in subsequent training phases. Crosses illustrate corrupted units.
5.3 Training procedure
Performance was evaluated on a test subset of 10,000 samples. When quantifying performance as a function
of dataset size, we created training subsets of different sizes while keeping the class distribution uniform,
as in the original training data.
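One way to draw such class-balanced training subsets is sketched below. The helper name `balanced_subset` is hypothetical, and the sketch assumes that every class has at least `subset_size // n_classes` samples available.

```python
import numpy as np

def balanced_subset(labels, subset_size, rng=None):
    """Return sorted indices of a subset with the same number of samples per class."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    classes = np.unique(labels)
    per_class = subset_size // len(classes)
    picked = [
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(picked))
```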
Hyper parameters were selected using a second level of cross validation (10-fold CV for MNIST, 5-fold
for CIFAR), keeping a uniform distribution over classes. In the experiments below, we tune the
following hyper parameters: number of units in hidden layers (same for all layers: 1000, 1500, 2000, 2500),
learning rate ($10^{-1}$, $10^{-2}$, $10^{-3}$, $5 \times 10^{-4}$, $10^{-4}$, $5 \times 10^{-5}$, $10^{-5}$), batch size for SGD (10, 20), seed for
weight random initialization, momentum (0.9, 0.7) (Polyak 1964) and weight decay ($10^{-3}$, $10^{-4}$, $10^{-5}$)
(Moody et al. 1995). The best performing configuration on the validation set was sought in a semi-automatic
fashion (Vincent et al. 2010) by running experiments in parallel on a large computation cluster with manual
guidance to avoid wasting resources on unnecessary parts of the configuration space. We used early
stopping by monitoring reconstruction error or classification error on the validation set, and stopped training
after 35 epochs without improvement. We used the parameters (weights) which yielded the best performance
over the validation set. Reported results are the average over 3 different random train-validation splits.
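The early-stopping rule can be sketched as follows. This is an illustrative outline only: `train_epoch` and `validate` are hypothetical callbacks supplied by the caller, and the `max_epochs` cap is an assumption added so that the loop always terminates.

```python
import copy

def train_with_early_stopping(train_epoch, validate, params, patience=35, max_epochs=1000):
    """Train until `patience` consecutive epochs bring no validation improvement.

    train_epoch(params) runs one epoch and updates `params` in place.
    validate(params) returns the validation error (lower is better).
    Returns the parameters with the best validation error, and that error.
    """
    best_err = float("inf")
    best_params = copy.deepcopy(params)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_epoch(params)
        err = validate(params)
        if err < best_err:
            best_err = err
            best_params = copy.deepcopy(params)   # snapshot of the best weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_params, best_err
```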
Since gradual training involves updating lower layers, every presentation of a sample involves more
weight updates than in a single-layered DAE. We compare stacked and gradual training on a common
ground, by using the same ‘budget’ for weight update steps. For example, when training the second layer
for $n$ epochs in gradual training, we allocate $2n$ training epochs for stacked training. The overall budget for
update steps was determined using early stopping, such that the reconstruction error on the validation set
in the last 10 epochs did not improve more than 0.5% in all training schemes.
Images were presented to the DAE network as a vector of concatenated rows of image pixels.
RGB images (CIFAR10 and CIFAR100) were presented as a concatenated vector of 3 images (one for each
RGB color). Masking noise at a given percentage was applied by randomly choosing that percentage of the pixels and setting them
to zero. In CIFAR datasets, we zeroed all 3 RGB colors of the pixel (resulting in black color).
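A possible NumPy sketch of this corruption is given below. The function name `masking_noise` and the `channels` argument are assumptions made for illustration; the sketch assumes the color planes are concatenated one after another in the input vector, as described above.

```python
import numpy as np

def masking_noise(x, percent, channels=1, rng=None):
    """Zero `percent`% of the pixels of a flattened image.

    x        : 1-D image vector; for RGB, the 3 color planes are concatenated.
    channels : 1 for grayscale (e.g. MNIST), 3 for RGB (e.g. CIFAR).
    A masked pixel is zeroed in every channel, so RGB pixels become black.
    Returns a new array; the input is left unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n_pixels = x.size // channels
    n_masked = int(round(n_pixels * percent / 100.0))
    masked = rng.choice(n_pixels, size=n_masked, replace=False)
    planes = x.reshape(channels, n_pixels).copy()
    planes[:, masked] = 0.0
    return planes.reshape(-1)
```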
5.4 Results for experiments with gradually trained DAEs
We evaluate gradual and stacked training in an unsupervised task of image denoising. We then test these
training methods as an initialization for supervised learning, and quantify their performance as a function of
dataset size.
5.4.1 Unsupervised learning for denoising
We start by evaluating gradual training in an unsupervised task of image denoising. Here, the network is
trained to minimize a cross-entropy loss over corrupted images. In addition to stacked and gradual training,
we also tested a hybrid method that spends some epochs on tuning only the second layer (as in stacked
training), and then spends the rest of the training budget on both layers (as in gradual training). We define
the Stacked-vs-Gradual fraction $0 \le f \le 1$ as the fraction of weight updates that occur during ‘stacked’-type
training. $f = 1$ is equivalent to pure stacked training while $f = 0$ is equivalent to pure gradual training.
Given a budget of $n$ training epochs, we train the 2nd hidden layer with gradual training for $n(1 - f)$ epochs,
and with stacked training for $2nf$ epochs.
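The epoch allocation implied by the Stacked-vs-Gradual fraction can be written as a small helper. The function name is hypothetical, and rounding to whole epochs is left to the caller.

```python
def split_epoch_budget(n, f):
    """Split a budget of n epochs for the 2nd hidden layer.

    f is the Stacked-vs-Gradual fraction (0 <= f <= 1).
    Returns (stacked_epochs, gradual_epochs) = (2*n*f, n*(1 - f)).
    Stacked epochs are doubled because each one updates only the new layer,
    whereas a gradual epoch also updates the lower layer.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    return 2 * n * f, n * (1 - f)
```

For example, with the MNIST budget of 80 epochs, `f = 0` gives 80 gradual epochs, `f = 1` gives 160 stacked epochs, and `f = 0.5` gives 80 stacked epochs followed by 40 gradual epochs.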
Figure 5 shows the test-set cross entropy error when training 2-layered DAEs, as a function of the
Stacked-vs-Gradual fraction. Pure gradual training achieved significantly lower reconstruction error than any
mix of stacked and gradual training with the same budget of update steps. See Figure 7 for examples of
image denoising by deep DAEs.
Figure 5: Reconstruction error of unsupervised training methods measured by cross-entropy loss.
Error bars are over 3 train-validation splits. The shown cross-entropy error is relative to the
minimum possible error, computed as the cross-entropy error of the original uncorrupted test set
with itself. All compared methods used the same budget of update operations.
(a) MNIST dataset. Images were corrupted with 15% masking noise. Network has 2 hidden layers
with 1000 units each. The 1st hidden layer is trained for 50 epochs. Total epoch budget for the 2nd
hidden layer is 80 epochs. (b) CIFAR-10 dataset. Images were corrupted with 10% masking noise.
Network architecture: 2 hidden layers, each with 1500 units. The 1st hidden layer is trained for 25
epochs. Total epoch budget for 2nd hidden layer is 70 epochs. (c) CIFAR-100 dataset. Noise
corruption level is 10%. Network architecture is 2 hidden layers with 2500 units each. 1st hidden
layer is trained for 35 epochs. Total epoch budget for 2nd hidden layer is 70 epochs.
We also evaluated the reconstruction error after an additional full tuning phase in which
all weights are updated jointly for 80 epochs for MNIST and 70 epochs for CIFAR. In this training scheme,
pure gradual training ($f = 0$) also improved the reconstruction error over full stacked training ($f = 1$)
across all datasets (see Figure 6).
Figure 6: Reconstruction error of unsupervised training methods followed by full tuning stage,
measured by cross-entropy loss. Error bars are over 3 train-validation splits. The shown cross-
entropy error is relative to the minimum possible error, computed as the cross-entropy error of the
original uncorrupted test set with itself. Hyper parameters and network architecture are the same as
in Figure 5. In all networks weights were updated jointly for additional 80 epochs for MNIST and
70 epochs for CIFAR.
Figure 7: Examples of unsupervised image denoising. Left column shows original images from test
set. Middle column shows images from left column corrupted with masking noise (10% for
CIFAR10 and CIFAR100 and 15% for MNIST). Text caption shows cross entropy error of original
image and corrupted image (error is relative to the minimum possible error, computed as the cross-
entropy error of the original image with itself). Right column shows images denoised using DAEs
trained with pure gradual training ($f = 0$) which are shown in Figure 5. Text caption shows cross
entropy error of original image and denoised image.
5.4.2 Gradual-training DAE for initializing a network in a supervised task
We use DAEs trained in the previous experiment for initializing a deep network to solve a supervised
classification task. The network architecture is the same as SDAE architecture, except for the top layer. The
first two hidden layers are initialized with the first two layer weights of the SDAE ($W_1$ and $W_2$ in Figure
4b). We then add a top classification layer with output units matching the classes in the dataset, with
randomly initialized weights.
We train these networks on several subsets of each dataset to quantify the benefit of unsupervised
pretraining as a function of train-set size. Figure 8 traces the classification error as a function of training set
size, showing in text the percentage of relative improvement. These results suggest that initialization with
gradually-trained DAEs yields better classification accuracy than when initializing with stacked-trained
DAEs, and that this effect is mostly relevant for datasets with fewer than 50K samples.
The gradual training procedure described above differs from stacked training in two aspects: noise
injection at the input level and joint training of weights. To test which of these two contributes to the
superior performance we conducted the following experiment. We trained a network to reconstruct a noisy
version of the input, as in gradual training, but kept the weights of the 1st hidden layer fixed as in stacked
training.
The results of this experiment varied across datasets. In MNIST, injecting noise to the input while
freezing the first layer performed worse than gradual training, both in terms of cross entropy (in the
reconstruction task) and in terms of classification accuracy (in the supervised task). In CIFAR however,
training with freezing the first layer actually reduced reconstruction error compared with gradual training,
while achieving the same performance in the supervised task.
Figure 8: Classification error of supervised training initialized based on DAEs. Error bars are over
3 train-validation splits. Each curve shows a different pre-training type (see Figure 5). Text labels
show the percentage of error improvement of Stacked-vs-Gradual 0 pretraining (Figure 5) compared
to Stacked-vs-Gradual 1 pretraining (not shown in Figure 5). (a) MNIST. Two hidden layers with
1000 units each. (b) CIFAR-10. Two hidden layers with 1500 units each. (c) CIFAR-100. Two
hidden layers with 2500 units each.
We repeated the experiment shown in Figure 8 with DNNs initialized based on DAEs trained with a final full
tuning stage (Figure 6). The results of this experiment varied across datasets (see Figure 9). For MNIST
and CIFAR10 classification error of Stacked-vs-Gradual 1 pretrained network was worse than Stacked-vs-
Gradual 0 pretraining, while on CIFAR100 Stacked-vs-Gradual 0 improved classification error.
Figure 9: Classification error of supervised training initialized based on DAEs trained with a final full
tuning stage. Error bars are over 3 train-validation splits. Each curve shows a different pre-training
type (see Figure 6). Text labels show the percentage of error improvement of Stacked-vs-Gradual 0
pretraining compared to Stacked-vs-Gradual 1 pretraining (see Figure 6).
(a) MNIST. Two hidden layers with 1000 units each. (b) CIFAR-10. Two hidden layers with 1500
units each. (c) CIFAR-100. Two hidden layers with 2500 units each.
6. Discussion
In this work we investigated two brain-inspired approaches for learning. First, we formulated DNNs with
non-recurrent lateral connections as feed forward DNNs with dummy units. We tested the effect of lateral
connections on a network that learns from data with highly unbalanced classes, but found the effect of adding
lateral connections to be not significant.
Second, we tested a ‘gradual training’ scheme for denoising auto encoders, which improves the
reconstruction error under a fixed training budget, as compared to stacked training. It also provided a small
but consistent improvement in classification error in the regime of mid-sized training sets. Stacked
and gradual training can be viewed as two extreme adaptation schemes: stacked training
reflects a zero learning rate for the lower layer, and gradual training reflects a full learning rate. It
remains to test intermediate training schedules in which the learning rate is gradually reduced as a layer
is presented with examples.
7. References
Bengio, Yoshua. 2012. “Practical Recommendations for Gradient-Based Training of Deep Architectures.”
In Neural Networks: Tricks of the Trade, 437–78. Springer.
Bengio, Yoshua, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. “Greedy Layer-Wise Training
of Deep Networks.” Advances in Neural Information Processing Systems 19: 153.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Vol. 1. Springer New York.
Bourne, James A, and Marcello G P Rosa. 2006. “Hierarchical Development of the Primate Visual Cortex,
as Revealed by Neurofilament Immunoreactivity: Early Maturation of the Middle Temporal Area
(MT).” Cerebral Cortex 16 (3). Oxford Univ Press: 405–14.
Fahlman, Scott E, and Christian Lebiere. 1990. “The Cascade-Correlation Learning Architecture.”
In Advances in Neural Information Processing Systems 2.
Erhan, Dumitru, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy
Bengio. 2010. “Why Does Unsupervised Pre-Training Help Deep Learning?” The Journal of Machine
Learning Research 11. JMLR.org: 625–60.
Erhan, Dumitru, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent.
2009. “The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-
Training.” International Conference on Artificial Intelligence and Statistics, 153–60.
Felleman, Daniel J, and David C Van Essen. 1991. “Distributed Hierarchical Processing in the Primate
Cerebral Cortex.” Cerebral Cortex 1 (1). Oxford Univ Press: 1–47.
Guillery, R W. 2005. “Is Postnatal Neocortical Maturation Hierarchical?” Trends in Neurosciences 28 (10):
512–17.
Hinton, Geoffrey, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew
Senior, et al. 2012. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The
Shared Views of Four Research Groups.” Signal Processing Magazine, IEEE 29 (6). IEEE: 82–97.
Hinton, Geoffrey, and Ruslan R Salakhutdinov. 2006. “Reducing the Dimensionality of Data with Neural
Networks.” Science 313 (5786). American Association for the Advancement of Science: 504–7.
Hubel, David H, and Torsten N Wiesel. 1959. “Receptive Fields of Single Neurones in the Cat’s Striate
Cortex.” The Journal of Physiology 148 (3). Blackwell Publishing: 574.
Kothari, Ravi, and Kwabena Agyepong. 1996. “On Lateral Connections in Feed-Forward Neural
Networks.” In Neural Networks, 1996., IEEE International Conference on, 1:13–18.
Krizhevsky, Alex, and Geoffrey Hinton. 2009. “Learning Multiple Layers of Features from Tiny Images.”
Computer Science Department, University of Toronto, Tech. Rep. Citeseer.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “Imagenet Classification with Deep
Convolutional Neural Networks.” In Advances in Neural Information Processing Systems, 1097–
1105.
Larochelle, Hugo, Dumitru Erhan, and Pascal Vincent. 2009. “Deep Learning Using Robust Interdependent
Codes.” In International Conference on Artificial Intelligence and Statistics, 312–19.
Lecun, Y., L. Bottou, Y. Bengio, and P. Haffner. 1998. “Gradient-Based Learning Applied to Document
Recognition.” Proceedings of the IEEE 86 (11): 2278–2324.
Liu, Xiao-Bo, Karl D Murray, and Edward G Jones. 2004. “Switching of NMDA Receptor 2A and 2B
Subunits at Thalamic and Cortical Synapses during Early Postnatal Development.” The Journal of
Neuroscience : The Official Journal of the Society for Neuroscience 24 (40): 8885–95.
Minsky, Marvin, and Seymour A. Papert. 1969. “Perceptrons.” MIT Press.
Moody, J E, S J Hanson, Anders Krogh, and John A Hertz. 1995. “A Simple Weight Decay Can Improve
Generalization.” Advances in Neural Information Processing Systems 4: 950–57.
Polyak, Boris Teodorovich. 1964. “Some Methods of Speeding up the Convergence of Iteration Methods.”
USSR Computational Mathematics and Mathematical Physics 4 (5). Elsevier: 1–17.
Ranzato, Marc’Aurelio. 2014. “On Learning Where To Look.” Computer Vision and Pattern Recognition;
Learning. arXiv Preprint arXiv:1405.5488., April. http://arxiv.org/abs/1405.5488.
Razavian, Ali Sharif, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. “CNN Features
off-the-Shelf: An Astounding Baseline for Recognition.” arXiv Preprint arXiv:1403.6382.
Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1988. “Learning Representations by Back-
Propagating Errors.” Cognitive Modeling.
Smid, J. 1994. “Layered Neural Networks with Horizontal Connections Can Reduce the Number of Units.”
In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), 3:1346–50.
IEEE.
Stansbury, Dustin E. 2012. “Matlab Environment for Deep Architecture Learning.”
https://github.com/dustinstansbury/medal.
Stettler, Dan D, Aniruddha Das, Jean Bennett, and Charles D Gilbert. 2002. “Lateral Connectivity and
Contextual Interactions in Macaque Primary Visual Cortex.” Neuron 36 (4): 739–50.
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. “Going Deeper with Convolutions.” arXiv
Preprint arXiv:1409.4842.
Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. “Extracting and
Composing Robust Features with Denoising Autoencoders.” In Proceedings of the 25th International
Conference on Machine Learning, 1096–1103.
Vincent, Pascal, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010.
“Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local
Denoising Criterion.” The Journal of Machine Learning Research 11: 3371–3408.
Weliky, Michael, Karl Kandler, David Fitzpatrick, and Lawrence C. Katz. 1995. “Patterns of Excitation
and Inhibition Evoked by Horizontal Connections in Visual Cortex Share a Common Relationship to
Orientation Columns.” Neuron 15 (3): 541–52.
Xie, Junyuan, Linli Xu, and Enhong Chen. 2012. “Image Denoising and Inpainting with Deep Neural
Networks.” In Advances in Neural Information Processing Systems, 341–49.
Zeiler, Matthew D, and Rob Fergus. 2013. “Visualizing and Understanding Convolutional Neural
Networks.” arXiv Preprint arXiv:1311.2901.
8. Appendix
8.1 Derivation of loss functions
Using the notations in Section 3 we apply the chain rule to differentiate the cross entropy loss function with
respect to the activation $a_i$ of unit $i$. $y_i$ denotes network output and $x_i$ denotes target output.
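\begin{align*}
y_i &= g(a_i) \\
\mathrm{XENT}(x, y) &= -\sum_i \left[ x_i \log y_i + (1 - x_i) \log(1 - y_i) \right] \\
\frac{\partial\, \mathrm{XENT}(x, y)}{\partial y_i} &= -\left( \frac{x_i}{y_i} - \frac{1 - x_i}{1 - y_i} \right) = -\frac{x_i (1 - y_i) - y_i (1 - x_i)}{y_i (1 - y_i)} \\
&= -\frac{x_i - x_i y_i - y_i + x_i y_i}{y_i (1 - y_i)} = \frac{y_i - x_i}{y_i (1 - y_i)} \\
\frac{\partial\, \mathrm{XENT}(x, y)}{\partial a_i} &= \frac{\partial\, \mathrm{XENT}(x, y)}{\partial y_i} \frac{\partial y_i}{\partial g} \frac{\partial g}{\partial a_i} = \frac{y_i - x_i}{y_i (1 - y_i)} \, y_i (1 - y_i) = y_i - x_i
\end{align*}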
When using the mean square error function (MSE) the derivative is:
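\begin{align*}
\mathrm{MSE}(x, y) &= \frac{1}{2} \sum_i \left[ x_i - y_i \right]^2 \\
\frac{\partial\, \mathrm{MSE}(x, y)}{\partial y_i} &= y_i - x_i \\
\frac{\partial\, \mathrm{MSE}(x, y)}{\partial a_i} &= \frac{\partial\, \mathrm{MSE}(x, y)}{\partial y_i} \frac{\partial y_i}{\partial g} \frac{\partial g}{\partial a_i} = (y_i - x_i) \, y_i (1 - y_i)
\end{align*}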
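The two closed-form results of this section, $y_i - x_i$ for cross entropy and $(y_i - x_i)\, y_i (1 - y_i)$ for MSE, can be checked numerically with central finite differences. The sketch below assumes that $g$ is the sigmoid function, which is what the factor $y_i (1 - y_i)$ implies; all function names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def xent(x, y):
    return -np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))

def mse(x, y):
    return 0.5 * np.sum((x - y) ** 2)

def numeric_grad(loss, x, a, eps=1e-6):
    """Central finite-difference estimate of d loss(x, sigmoid(a)) / d a."""
    grad = np.zeros_like(a)
    for i in range(a.size):
        step = np.zeros_like(a)
        step[i] = eps
        grad[i] = (loss(x, sigmoid(a + step)) - loss(x, sigmoid(a - step))) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
a = rng.normal(size=5)           # unit activations a_i
x = rng.random(5)                # targets x_i in [0, 1)
y = sigmoid(a)                   # outputs y_i = g(a_i)

xent_grad = y - x                # closed form for cross entropy
mse_grad = (y - x) * y * (1 - y) # closed form for MSE

xent_ok = np.allclose(numeric_grad(xent, x, a), xent_grad, atol=1e-5)
mse_ok = np.allclose(numeric_grad(mse, x, a), mse_grad, atol=1e-5)
```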
8.2 Implementing DNN with lateral connections using standard DNN formulation
We describe here in detail how to implement a DNN with lateral connections using a standard DNN
implementation in the MEDAL framework. We use the notations of Section 3 and Figure 1.
DNN Initialization:
For each layer with lateral connections (H1), add another auxiliary hidden layer (H2):
1. Set all weights and biases to zero.
1.1 For each receiver unit, create a dummy unit in H1. If a unit does not receive or emit lateral
connections, consider it a receiver for this purpose (it can be thought of as a receiver with lateral
weight equal to zero).
1.2 For each emitter unit, create a dummy unit in H2.
2. Set dummy weights to an initial value of 1 (their value will not change, as they are not updated
during the back propagation stage).
3. Set weights of the lateral connections.
Forward pass:
After calculating the activation of all units in a layer, restore the output of each dummy unit to its value before
the activation function was applied, which implements an identity activation function for dummy units.
Back propagation:
1. After calculating derivative of activation for all units, we restore derivative for dummy hidden units
to 1.
2. Weights between H1 and H2 are not changed (dummy weights and lateral connections), but
$\delta_i^l = \frac{\partial E}{\partial a_i^l}$ is calculated as usual for each unit $i$ at layer $l$. Therefore we disable parameter (weight)
updates for weights between H1 and H2.
3. Biases for real (not dummy) units are updated by back propagation both at H1 and H2.
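The forward and backward rules for dummy units can be illustrated with a short NumPy sketch. This is not MEDAL code: the function names, the boolean mask `is_dummy`, and the `frozen` flag are hypothetical, and sigmoid is assumed as the activation of the real units.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_layer(a, is_dummy):
    """Apply sigmoid to real units and the identity to dummy units.

    a        : pre-activations, shape (batch, units)
    is_dummy : boolean mask of shape (units,), True for dummy units
    """
    out = sigmoid(a)
    out[:, is_dummy] = a[:, is_dummy]   # restore the value before activation
    return out

def activation_derivative(out, is_dummy):
    """Derivative of the activation: y(1 - y) for real units, 1 for dummy units."""
    d = out * (1 - out)
    d[:, is_dummy] = 1.0
    return d

def sgd_update(w, grad, lr, frozen=False):
    """Gradient step that is skipped for frozen weights (e.g. between H1 and H2)."""
    return w if frozen else w - lr * grad
```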