brain-inspired methods for training deep neural networks

BAR-ILAN UNIVERSITY

Brain-inspired methods for training

deep neural networks

Alexander Kalmanovich

Submitted in partial fulfillment of the requirements for the Master's

Degree in the Gonda Multidisciplinary Center for Brain Research,

Bar-Ilan University

Ramat Gan, Israel 2015

This work was carried out under the supervision of

Prof. Gal Chechik

From the Gonda Multidisciplinary Brain Research Center,

Bar-Ilan University

Contents

1. Abstract
2. Introduction
3. Methods
    3.1 Feed forward neural network
    3.2 Multi class classification
    3.3 Hyper parameter selection with cross validation
    3.4 The cross entropy loss function
    3.5 Datasets
    3.6 Implementation
4. DNNs with non-recurrent lateral connections
    4.1 An update rule with lateral connections
    4.2 Back propagation for a network with lateral connections
    4.3 Tuning hyper parameters for networks with lateral-connections
    4.4 Experiments on data with non-uniform class distribution
    4.5 Results for experiments with lateral connections in DNNs
    4.6 Conclusions
5. Gradual training of denoising auto encoders
    5.1 Training denoising auto encoders
    5.2 Gradual training of deep DAEs
    5.3 Training procedure
    5.4 Results for experiments with gradually trained DAEs
        5.4.1 Unsupervised learning for denoising
        5.4.2 Gradual-training DAE for initializing a network in a supervised task
6. Discussion
7. References
8. Appendix
    8.1 Derivation of loss functions
    8.2 Implementing DNN with lateral connections using standard DNN formulation

1. Abstract

Following recent advances in machine learning based on principles of learning and information processing

in the human brain, we investigate two brain-inspired approaches for training neural networks: (1) lateral

connections in deep neural networks (DNN) and (2) gradual training of deep denoising auto encoders

(DAE).

First we used DNNs with non-recurrent lateral connections in hidden layers. However, we found no

significant benefit in using lateral connections over a network without lateral connections.

We then investigated a training scheme for a deep DAE, in which DAE layers are added gradually and lower layers keep adapting as additional layers are added. We show that in the regime of mid-sized datasets, this gradual training provides a small but consistent improvement over stacked training in both reconstruction quality and classification error on the MNIST and CIFAR datasets.

2. Introduction

Deep neural networks (DNN) are powerful learning models that have recently achieved excellent

performance on visual and speech recognition problems (Hinton et al. 2012; Krizhevsky, Sutskever, and

Hinton 2012; Razavian et al. 2014). Many advances in DNNs are based on principles of learning and

information processing inspired by the human brain. Leading architectures of neural networks (Krizhevsky,

Sutskever, and Hinton 2012; Zeiler and Fergus 2013; Szegedy et al. 2014) have local receptive fields and

hierarchical representation from local to global features, and share some properties with sensory pathways

in biological neural systems. For instance, local receptive fields are a hallmark of the V1 area in the visual

cortex (Hubel and Wiesel 1959), and transition from local to global features is a hallmark of the dorsal and

ventral visual pathways (Felleman and Van Essen 1991). We conjecture that building an artificial system for accurate image recognition, a task that humans perform much better than artificial systems, could continue to benefit from drawing inspiration from principles of learning and development in neural systems (Ranzato 2014).

In this work, we investigated two brain-inspired approaches for learning with DNNs: (1) non-recurrent

lateral connections in DNNs and (2) gradual training of deep denoising auto encoders.

Lateral connections between neurons in the primary visual cortex play an important role in visual

information processing (Stettler et al. 2002). In comparison, DNNs typically have either a feed forward

architecture or a recurrent architecture. Feed forward DNNs have no lateral connections and recurrent

architectures typically have strong recurrencies making them harder to train.

We test here DNNs with non-recurrent lateral connections. We hypothesized that lateral inhibition

could be advantageous for a unit to propagate its signal to higher layers, while silencing other “competitor”

units in its layer. However, we found no significant benefit in using the type of lateral connections that we

tested.

In the second part of this work we investigated a training scheme of a denoising auto encoder (DAE),

where DAE layers are gradually added and keep adapting as additional layers are added. In the regime of

small datasets, DNNs are difficult to train since the amount of information provided by labels may be small

compared to the number of free parameters. This can be addressed by initializing networks through an

unsupervised phase. This approach was introduced using a greedy layer-wise unsupervised learning

algorithm for Deep Belief Networks (Hinton and Salakhutdinov 2006; Bengio et al. 2007). A similar

approach for training multi-layer (deep) DAE has been introduced by Vincent et al. (2010), where a deep

DAE is built by training a single hidden layer at each step, while freezing weights of lower layers.

Areas in the primate visual cortex mature in a gradual order (Guillery 2005; Bourne and Rosa 2006)

and early layers in mammalian visual system keep adapting for prolonged periods, and their synapses

remain plastic long after representations have been formed in high brain areas (Liu, Murray, and Jones

2004). We therefore turned to explore alternative training schedules for deep DAEs, which avoid freezing

early weights.

We test here ‘gradual training’, where training occurs layer-by-layer, but lower layers keep adapting

throughout training. We compare gradual training to stacked training and to a hybrid approach, all under a

fixed budget of training update steps. We then test gradual training as an initialization for supervised

learning, and quantify its performance as a function of dataset size. Gradual training provides a small but

consistent improvement in reconstruction error and classification error in the regime of mid-sized datasets.

3. Methods

3.1 Feed forward neural network

The machine learning model we use to solve a classification problem is a feed forward neural network

(Minsky and Papert 1969; Rumelhart, Hinton, and Williams 1988). In a standard feed forward DNN, the

units are layered, such that each unit is fully connected to all units in the previous layer but not to other

units in the same layer. The activation of unit $i$ at layer $l+1$ is $a_i^{l+1} = \sum_j w_{ij}^{l} z_j^{l}$, where $w_{ij}^{l}$ denotes the weight of the connection from unit $j$ at layer $l$ to unit $i$ at layer $l+1$. The output of unit $i$ at layer $l$ is $z_i^{l} = g(a_i^{l})$, where $g$ is the activation function. All experiments described below use units with a sigmoid activation function: $g(a) = 1/(1 + e^{-a})$. We denote the input layer as layer 1, so the forward pass proceeds from layer $l$ to layer $l+1$.
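The forward pass described above can be sketched in a few lines of Python. This is only an illustration, not the MATLAB code used in this work: the function names, the list-of-lists weight layout and the absence of bias terms are assumptions made for the example.

```python
import math

def sigmoid(a):
    """Logistic activation g(a) = 1 / (1 + exp(-a)), written to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def forward(x, weights):
    """Propagate input x through fully connected sigmoid layers.

    weights[l][i][j] is the weight from unit j of one layer to unit i of the
    next layer (hypothetical layout, no bias terms).
    Returns the outputs z of every layer, with the input as the first entry.
    """
    outputs = [list(x)]
    for w in weights:
        z_prev = outputs[-1]
        # a_i = sum_j w_ij * z_j, followed by z_i = g(a_i)
        a = [sum(w_ij * z_j for w_ij, z_j in zip(row, z_prev)) for row in w]
        outputs.append([sigmoid(a_i) for a_i in a])
    return outputs
```

For example, `forward([1.0, 0.0], [w1, w2])` with a 3-by-2 matrix `w1` and a 1-by-3 matrix `w2` returns the input, the three hidden outputs and the single network output.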

3.2 Multi class classification

In the multi-class classification problem we aim to solve, a training set consists of $N$ images $x_1, \ldots, x_N$ with corresponding labels $c_1, \ldots, c_N$. Each label belongs to one of $K$ possible classes: $c_n \in \{1, \ldots, K\}$. A classifier DNN has $D$ units in the input layer and $K$ units in the output layer. For example, when the input is a raw image, $D$ is the number of pixels in the image (all images are of the same size in our case). During training, when considering a sample $x_n$ that belongs to class $k$, we define a target output vector $t_n$ such that $t$ is zero except for $t_k = 1$. The loss is then computed over the output of the network $y_n$ and $t_n$. A DNN is trained on the training set with the goal of predicting a label for a previously unseen image in the test set. Given an input sample $x_n$, we define the DNN classifier prediction as $\mathrm{pred}_n = \arg\max_k (y_k)$, where $1 \le k \le K$ and $y_k$ denotes the value of output unit $k$ ($z_k^{L}$ for a network with $L$ layers). The classification error rate is defined as the portion of images classified incorrectly: $\mathrm{error} = \frac{1}{N} \sum_{n=1}^{N} [\mathrm{pred}_n \neq c_n]$.
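As an illustration of these definitions, the following Python sketch builds a target vector, takes the arg-max prediction and computes the error rate. The function names are hypothetical, and classes are indexed from 0 here rather than from 1 as in the text.

```python
def one_hot(k, num_classes):
    """Target vector t: zero everywhere except t[k] = 1 (k is 0-based here)."""
    return [1.0 if i == k else 0.0 for i in range(num_classes)]

def predict(y):
    """Predicted class: the index of the largest output unit."""
    return max(range(len(y)), key=lambda k: y[k])

def error_rate(outputs, labels):
    """Fraction of samples whose predicted class differs from the label."""
    wrong = sum(1 for y, c in zip(outputs, labels) if predict(y) != c)
    return wrong / len(labels)

# Three samples, two classes; the third sample is misclassified.
outputs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
print(error_rate(outputs, labels))
```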

3.3 Hyper parameter selection with cross validation

When training a neural network, some of the parameters, called hyper-parameters, are not tuned directly as

part of the error minimization problem, but rather are selected using cross validation. In the experiments

below, we tune the following hyper parameters: number of hidden units, number of hidden layers, learning

rate, seed for weight random initialization, momentum (Polyak 1964) and weight decay (Moody et al.

1995). We used 5-fold cross validation to select these hyper parameters.
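A minimal sketch of 5-fold cross validation is given below, assuming a user-supplied callback `evaluate(train_idx, val_idx)` that trains a network with one hyper parameter configuration and returns its validation error. The callback and both function names are hypothetical; the actual selection code used in this work may differ.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Split range(n_samples) into k disjoint validation folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average the result of evaluate(train_idx, val_idx) over k folds."""
    folds = k_fold_indices(n_samples, k, seed)
    errors = []
    for i, val_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        errors.append(evaluate(train_idx, val_idx))
    return sum(errors) / k
```

The configuration with the lowest averaged validation error would then be selected.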

Following Bengio (2012), we initialize the weights $w_{ij}^{l}$ randomly from a uniform distribution with range $(-r, r)$, where $r = \sqrt{6/(n_{in} + n_{out})}$, $n_{in}$ is the number of units at layer $l$ and $n_{out}$ is the number of units at layer $l+1$.
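The initialization rule can be sketched as follows. The function name and the row-per-output-unit layout are assumptions for illustration only.

```python
import math
import random

def init_weights(n_in, n_out, rng=None):
    """Draw an n_out x n_in weight matrix uniformly from (-r, r),
    with r = sqrt(6 / (n_in + n_out))."""
    rng = rng or random.Random(0)
    r = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-r, r) for _ in range(n_in)] for _ in range(n_out)]

# Example: weights between a 784-unit input layer and a 400-unit hidden layer.
w = init_weights(784, 400)
```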

3.4 The cross entropy loss function

We use the cross entropy (XENT) loss function for DNN as in (Vincent et al. 2010), defined as

$$L_{XENT}(t, y) = -\sum_k \left[ t_k \log y_k + (1 - t_k) \log(1 - y_k) \right]$$

where $t$ is the expected output vector of the DNN, $y$ is the actual output vector of the DNN, $k$ iterates over the vector dimensions and $y_k = z_k^{L}$ for a DNN with $L$ layers ($z_k^{L}$ is as defined in section 3.1).

The cross entropy loss is advantageous over the mean square error (MSE) loss in a DNN with sigmoid output units. The derivative of $L_{MSE}(t, y) = \frac{1}{2} \sum_k (t_k - y_k)^2$ with respect to the activation of the output unit $a_k$ is $(y_k - t_k) y_k (1 - y_k)$. Clearly, this derivative vanishes for $y$ values that are close to zero or to one, causing the gradient steps to vanish. In contrast, the derivative of the cross entropy is $y_k - t_k$ and does not suffer from this problem. See Appendix 8.1 for detailed derivation.
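The difference between the two gradients is easy to see numerically. The sketch below evaluates both expressions for a single sigmoid output unit that is confidently wrong; the function names and the chosen values are illustrative.

```python
def mse_grad(y, t):
    """Derivative of the MSE loss w.r.t. the activation of a sigmoid output unit."""
    return (y - t) * y * (1.0 - y)

def xent_grad(y, t):
    """Derivative of the cross entropy loss w.r.t. the same activation."""
    return y - t

# A confidently wrong output: the target is 0 but the unit outputs nearly 1.
y, t = 0.999, 0.0
print(mse_grad(y, t))   # very small, because the factor y * (1 - y) is nearly zero
print(xent_grad(y, t))  # close to 1, so the error signal is preserved
```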

3.5 Datasets

To evaluate our proposed algorithms, we conducted experiments on three benchmark datasets: MNIST

(Lecun et al. 1998), CIFAR-10 and CIFAR-100 (Krizhevsky and Hinton 2009). MNIST contains 70,000

28-by-28 grayscale images, each containing a single hand-written digit. CIFAR-10 and CIFAR-100 contain

60,000 natural RGB images of 32-by-32 pixels from 10 or 100 categories respectively.

3.6 Implementation

We conduct all experiments using the "MEDAL" framework, a MATLAB implementation of DNNs and auto encoders (Stansbury 2012). MEDAL was chosen since development in MATLAB is fast and convenient, though performance is slower than that of an implementation in a compiled language such as C.

4. DNNs with non-recurrent lateral connections

The idea of neural networks with lateral connections is not new. Smid (1994) showed that lateral

connections can reduce the required number of hidden units in a DNN in order to approximate a function,

but this work did not involve learning network weights with back propagation. Kothari and Agyepong

(1996) experimented with simple lateral connectivity in the form of a chain, where unit $i$ in a hidden layer is connected to unit $i + 1$, showing that a small network with one hidden layer of 12 units can approximate a function better with chain lateral connections than without. Larochelle et al. (2009) used recurrent

lateral connections in denoising autoencoders (DAE) and showed that DAEs with lateral connections

outperform DAEs without lateral connections in a classification task on MNIST and OCR-letters datasets.

In this work we tested DNNs with non-recurrent lateral connections. This is motivated by the claim that

in the brain, lateral inhibition can be advantageous for a unit to propagate its signal to higher layers, while

silencing other “competitor” units in its layer (Stettler et al. 2002; Weliky et al. 1995).

4.1 An update rule with lateral connections

The activation of a unit in the network was expanded to include the lateral (intra layer) connections: $a_i^{l} = \sum_j w_{ij}^{l-1} z_j^{l-1} + \sum_{k \neq i} v_{ik}^{l} z_k^{l}$, where $v_{ik}^{l}$ is the weight on the connection from unit $k$ to unit $i$ at layer $l$ (compare to the definition of the activation in Section 3.1).

To avoid recurrence, we restricted the lateral connectivity of units in the same layer as follows. First, self-connections were not allowed ($\forall i: v_{ii}^{l} = 0$). Second, each unit was set to be either an emitter unit –

with outgoing lateral connections, or a receiver unit – with incoming lateral connections. When computing

the activations of the DNN during the forward pass, the activations of emitter units are computed first, and

then they are used in the computation of activation of the receiver units.
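The two-stage computation for a single layer can be sketched as below. The function name, the argument layout and the use of emitter outputs (rather than their activations) as the lateral signal are assumptions of this illustration.

```python
import math

def sigmoid(a):
    """Logistic activation, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def lateral_layer_forward(z_prev, w, v, emitters):
    """Outputs of one hidden layer with non-recurrent lateral connections.

    z_prev   : outputs of the previous layer
    w[i][j]  : feed-forward weight from previous-layer unit j to unit i
    v[i][k]  : lateral weight from emitter k to receiver i (0 if absent)
    emitters : set of unit indices that are emitters; the rest are receivers
    """
    n = len(w)
    ff = [sum(w_ij * z_j for w_ij, z_j in zip(w[i], z_prev)) for i in range(n)]
    z = [0.0] * n
    # Emitters are computed first: they have no incoming lateral connections.
    for i in emitters:
        z[i] = sigmoid(ff[i])
    # Receivers are computed next and add lateral input from emitter outputs only.
    for i in range(n):
        if i not in emitters:
            lateral = sum(v[i][k] * z[k] for k in emitters)
            z[i] = sigmoid(ff[i] + lateral)
    return z
```

With a negative lateral weight, an active emitter lowers the output of its receivers, which is the inhibitory effect this section sets out to test.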

4.2 Back propagation for a network with lateral connections

In a common feed forward neural network (NN), a unit is affected only by units at the previous layer through its incoming weights (Bishop 2006; Rumelhart, Hinton, and Williams 1988):

$$\frac{\partial a_i^{l+1}}{\partial a_j^{l}} = \frac{\partial \sum_{j'} w_{ij'}^{l} z_{j'}^{l}}{\partial a_j^{l}} = w_{ij}^{l} \, g'(a_j^{l})$$

where $a$, $z$, $g$ and $w$ are as defined in Section 3.1.

Adding lateral connections also adds to the complexity of implementation of the update rule, since a

unit is now affected also by units in the same layer. To simplify implementation, we describe a way to view

this network as a standard feed forward network with additional hidden layer with dummy (identity) units,

but without lateral connections. This reformulation makes it easy to introduce lateral connections into existing implementations of feed forward NN code.

To reformulate this network using standard NN formulation, we introduce dummy units, defined as a

unit that emits an output that is identical to its input. To illustrate, consider a simple network with one

hidden layer as depicted in Figure 1A. The network in Figure 1B is an equivalent network that has no lateral (intra layer) connections, but does have connections between non-consecutive layers. For example, the network has a connection between a unit of the input layer and unit $b$ of layer H2. Each dummy unit has only one weight $w = 1$ (all other weights are zero). The weights on the connections from layer H1 to layer

H2 are fixed, and are not updated during learning.

The resulting network shown in Figure 1C is a standard feed forward network that is equivalent to the

network of Figure 1A. It differs from a classic feed forward network in two aspects: it has units with two

different activation function types and the dummy layer’s weights are fixed. It can be implemented easily

by slightly modifying an existing implementation of a common feed forward network (see Appendix 8.2).

Figure 1: Red arrows denote lateral connections. $\tilde{a}$, $\tilde{b}$ and $\tilde{c}$ are dummy units. Green arrows denote a connection with a weight $w = 1$. H denotes the hidden layers. (a) A network with intra layer connections. Unit $a$ is an emitter unit. Units $b$ and $c$ are receiver units. (b) A network without intra layer connections, but with connections between non adjacent layers. (c) A feed forward network with dummy units. All three networks are equivalent.

4.3 Tuning hyper parameters for networks with lateral-connections

We considered several ways to set the weights of the lateral connections during training. These include

fixing the weights to a constant value, updating weights as the other weights with error back propagation,

or updating weights with weight sharing when the weights of all outgoing connections of an emitter unit

are the same. In the experiment described below, we took the first approach and set the lateral weights to a fixed value selected using cross validation. This reduces the number of free parameters in the network,

and allows a more direct comparison with networks that do not have lateral weights.

The connectivity of lateral connections is determined by several hyper parameters: the fraction of

emitter units in a layer, the number of incoming connections to a receiver unit (like the incoming node

degree in a graph) and the value of the weight on the lateral connections (when all lateral connections share

the same weight). The values we considered for each hyper parameter are listed in Table 1.

Hyper parameter   Description                                         Considered values
nHidden           Number of hidden layers                             {1, 2}
nUnits            Number of hidden units                              {400, 500, 1000}
lRate             Learning rate                                       {0.01, 0.001}
Momentum          Momentum                                            0.9
wDecay            Weight decay                                        {0.0001, 0.001}
nBatch            Batch size                                          {10, 20}
emitterPortion    Portion of emitter units in a layer                 {0.2, 0.4, 0.6}
nEmitters         Number of incoming connections to a receiver unit   {1, 3, 6}
wLateral          Weight on each lateral connection                   {0, -0.2, -0.5, 0.5}

Table 1 : List of hyper parameters we tuned for the laterally-connected neural network. The

last 3 parameters control the lateral connectivity. When wLateral is zero, the neural network

becomes identical to a classical network without lateral connections and serves as a control.
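To give a sense of the size of this search space, the sketch below enumerates the full Cartesian grid of the values in Table 1. The dictionary layout and function name are illustrative, and whether every grid point was actually evaluated is not stated here, so the exhaustive enumeration is an assumption.

```python
from itertools import product

# Values copied from Table 1; the dict layout itself is an illustrative choice.
GRID = {
    "nHidden": [1, 2],
    "nUnits": [400, 500, 1000],
    "lRate": [0.01, 0.001],
    "Momentum": [0.9],
    "wDecay": [0.0001, 0.001],
    "nBatch": [10, 20],
    "emitterPortion": [0.2, 0.4, 0.6],
    "nEmitters": [1, 3, 6],
    "wLateral": [0, -0.2, -0.5, 0.5],
}

def configurations(grid):
    """Yield one dict per point of the full Cartesian grid."""
    names = list(grid)
    for values in product(*(grid[n] for n in names)):
        yield dict(zip(names, values))

configs = list(configurations(GRID))
# Configurations with wLateral == 0 correspond to the control network.
controls = [c for c in configs if c["wLateral"] == 0]
print(len(configs), len(controls))
```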

4.4 Experiments on data with non-uniform class distribution

When a standard DNN is trained with error back propagation, units in the same hidden layer evolve

independently of each other, since the only information that guides each unit is the back propagated error

signal from higher layers. One potential drawback of this independence is the herd effect: all units may

independently evolve to reduce the largest source of error in the training data (Christian and Lebiere 1990;

Kothari and Agyepong 1996). Network weights are initialized randomly to break symmetry and prevent

this potential problem (Bengio 2012), but it may not be enough as the initial variability tends to dissipate

as the network is trained.

When the class distribution is uniform in the training set, there is no single largest source of error. We

therefore tested our hypothesis in a setup where the distribution of classes is highly unbalanced. In this setting we expected most of the network units to represent the common classes in a distribution, while the rare classes would be underrepresented.

4.5 Results for experiments with lateral connections in DNNs

We tested the effect of lateral connections on a network that learns data with highly unbalanced classes. We generated 20 random non-uniform distributions over the 10 MNIST digit classes by choosing 5 of the 10 classes to have 3000 train cases each (the common digits) and the rest to have 600 train cases each (the rare digits), for a total of 18000 samples. This was compared with a dataset of 6000 samples selected with a uniform distribution

over classes. The non-uniform distribution had an entropy of 2.97 bits compared to a uniform distribution

with an entropy of 3.32 bits. We then tested the performance of a neural network without lateral connections

on these training sets.
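The entropy values quoted above follow directly from the class counts, as the following sketch shows. The function name is illustrative.

```python
import math

def class_entropy(counts):
    """Shannon entropy (in bits) of the class distribution given per-class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

non_uniform = [3000] * 5 + [600] * 5   # 5 common and 5 rare digits, 18000 samples
uniform = [600] * 10                   # 6000 samples

print(round(class_entropy(non_uniform), 2))  # 2.97
print(round(class_entropy(uniform), 2))      # 3.32
```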

Figure 2 compares the performance achieved on non-uniform distributions (with 18K samples) with

that achieved over a uniform class distribution (with 6K samples). It shows that for most of the non-uniform distributions the classification error is lower, most likely because more samples were available during training. However, for some non-uniform distributions the classification error is higher than the error on the uniform distribution, and we hypothesized that this was because most of the network units became tuned to the common digits.

Figure 2: Performance of DNN without lateral connections on various non-uniform

distributions, sorted by classification error. Each digit distribution is specified by its 5 common digits (3000 train cases each), while the other digits are rare (600 train cases each). Error bars are

over 7 random seeds that determine which train cases are included from the original MNIST

database. The dashed horizontal red line shows performance on a uniform distribution, in

which each class has 600 train cases (the same as the rare digits in the non-uniform

distribution).

We then quantified the effect of adding lateral connections on distributions that exhibited higher

classification error than uniform distributions and on some other distributions for control. Table 2 describes

the results. The model with lateral connections was obtained by choosing the best configuration of hyper

parameters on the validation set. We found no significant difference in using lateral connections over a

network without lateral connections.

Digit distribution   DNN+Lateral    DNN
01348                4.83 ± 0.17    4.46 ± 0.11
15789                4.22 ± 0.1     4.08 ± 0.12
01347                4.67 ± 0.19    4.48 ± 0.16
23479                3.64 ± 0.09    3.65 ± 0.09

Table 2: Classification error (percents) of classical DNN and DNN with lateral

connections (DNN+Lateral) is shown on different distributions (see Table 1). Digit

distribution is specified by its 5 common digits. Standard deviation of error was measured

on 8 different splits of a training set to train and validation sets. The following hyper

parameters were used in all the configurations (see Table 1): nHidden=1, nUnits=400,

lRate=0.01, momentum=0.9, wDecay=0.0001, nBatch=20.

We then repeated the experiment with a more balanced distribution of 3000 train cases for the common

digits and 1000 train cases for the rare digits (entropy of 3.13 bits). Table 3 describes the results. Using

lateral connections yielded a small but not significant improvement in classification performance in most

distributions.

Digit distribution   DNN+Lateral    DNN
01346                3.49 ± 0.14    3.60 ± 0.12
15789                3.15 ± 0.13    3.25 ± 0.13
01347                3.50 ± 0.11    3.57 ± 0.12
23479                3.00 ± 0.08    2.87 ± 0.07
01348                3.51 ± 0.09    3.53 ± 0.11

Table 3: Classification error (percents) of standard DNN and of DNN with lateral

connections (DNN+Lateral) is shown on different distributions (1000 samples for each rare

digit and 3000 samples for each common digit). Digit distribution is specified by its 5

common digits. Standard deviation of error was measured on 10 different splits of a training

set to train and validation sets. Hyper parameters are the same as in Table 2.

4.6 Conclusions

We have experimented with classification of data with highly unbalanced classes using DNNs. We have shown that the classification error on some unbalanced distributions is worse than on a uniform distribution, even though the number of samples per class in the non-uniform distribution is greater than or equal to that in the uniform distribution.

We have attempted to improve classification performance using DNNs with non-recurrent lateral connections. However, we found no significant benefit in using lateral connections over a network without lateral connections. Perhaps for lateral connections to be beneficial, it is necessary to assign in some way the classes to which network units that emit lateral connections are tuned.

5. Gradual training of denoising auto encoders

A central approach in learning meaningful representations is to train a deep network for reconstructing

corrupted data. The idea is simple: given unlabeled data, a deep-network is given input-output pairs, where

the input consists of a corrupted version of an input sample and the output consists of the original non-

corrupted version which the network aims to reconstruct. Indeed, denoising autoencoders (DAE) (Vincent

et al. 2008) have been shown to extract meaningful features that make it possible to correct corrupted input data

(Xie, Xu, and Chen 2012). These representations can later be used to initialize a deep network for a

supervised learning task. It has been shown that in the small-data regime, good initializations can cut down

the training time and improve the classification accuracy of the supervised task (Vincent et al. 2008;

Larochelle, Erhan, and Vincent 2009; Erhan et al. 2010; Vincent et al. 2010).

Going beyond a single layer, it has been shown that training a multi-layer (deep) DAE can be achieved

efficiently by stacking single-layer DAEs and training them layer-by-layer (Vincent et al. 2010).

Specifically, a stacked denoising autoencoder (SDAE) is trained as follows (Figure 3). First, a single-layer

auto encoder is trained over the corrupted input data $\tilde{x}$ and its weights are tuned (Figure 3a). Then, the weights to the first hidden layer $w_1$ are frozen, and the data is transformed to the hidden representation (Figure 3b). This transformed input $h_1$ is then used to create a corrupted input to a second autoencoder

and so on (Figure 3c).

Stacked training has been shown to outperform training de-novo of a full deep network, presumably

because it provides better error signals to lower layers of the network (Erhan et al. 2009). However, stacked

training is greedy in the following sense: When the first layer is trained, it is tuned such that its features can

be directly used for reconstructing the corrupted input. Later on however, these features are used as input

to train more complex features. Comparing this with the process of reduced plasticity in natural neural

systems, early layers in mammalian visual system keep adapting for prolonged periods, and their synapses

remain plastic long after representations have been formed in high brain areas (Liu, Murray, and Jones

2004). We therefore turned to explore alternative training schedules for deep DAEs, which avoid freezing

early weights.

We test here ‘gradual training’, where training occurs layer-by-layer, but lower layers keep adapting

throughout training. We compare gradual training to stacked training and to a hybrid approach, all under a

fixed budget of training update steps. We then test gradual training as an initialization for supervised

learning, and quantify its performance as a function of dataset size. Gradual training provides a small but

consistent improvement in reconstruction error and classification error in the regime of mid-sized datasets.

5.1 Training denoising auto encoders

For completeness, we detail here the procedure for training stacked denoising autoencoders described by Vincent et al. (2010). Figure 3 describes the architecture and the main training phases. For training the first layer with a training sample $x$, masking noise is used to create a corrupted noisy version $\tilde{x}$ (Figure 3a, "corrupt" arrow). A forward pass is taken, computing the hidden representation $h_1 = \mathrm{Sigmoid}(w_1^{\top} \tilde{x})$ and the output $y = \mathrm{Sigmoid}({w_2'}^{\top} h_1)$. All weights are updated by propagating the error gradient back through the network. Specifically, the loss function is often taken to be the cross entropy between $y$ and $x$ (Figure 3a, dotted arrow). This is repeated for other samples in a stochastic gradient descent (SGD) fashion, and combined with momentum and weight decay to speed training. Importantly, on each pass through the training set, the same sample $x$ is corrupted with noise randomly and thus the DAE is

presented with many different corrupted versions of the same sample: $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n$, where $n$ is the number of passes through the entire training set.
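The corruption step can be sketched as follows, taking masking noise to mean that a randomly chosen fraction of the input entries is set to zero. The function name and the 50% fraction in the example are illustrative assumptions; the noise level used in the experiments is not implied by this sketch.

```python
import random

def masking_noise(x, fraction, rng):
    """Return a copy of x with a random `fraction` of its entries set to zero."""
    n_masked = int(round(fraction * len(x)))
    masked = set(rng.sample(range(len(x)), n_masked))
    return [0.0 if i in masked else x_i for i, x_i in enumerate(x)]

rng = random.Random(0)
x = [0.2, 0.9, 0.5, 0.7]
# A fresh corruption is drawn on every pass over the training set,
# so the same sample yields several different noisy versions.
corrupted_versions = [masking_noise(x, 0.5, rng) for _ in range(3)]
```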

To train a deep network, multiple DAEs are stacked using greedy layer-wise training (Vincent et al.

2010). After the first DAE is trained, the learned encoding weights $w_1$ are fixed, and the data is mapped to the hidden layer representation $h_1$ (Figure 3b, blank arrow). The second DAE is trained based on $h_1(x)$ using the same procedure as the first layer (Figure 3b). Importantly, the corrupting noise is applied to the hidden representation $h_1(x)$ to create $\tilde{h}_1$, with the motivation being that injecting noise to the hidden layer

introduces variability of the more-abstract representation that was already learned by the network. Training

of subsequent layers follows the same procedure, injecting noise at higher and higher layers.

Often, this layer-wise training procedure is followed by a full back-propagation phase, where noise is

injected to the original input $x$ and all layers are updated jointly. Then, the SDAE can be used to initialize a deep network for a supervised classification task by replacing the top reconstruction layer with a (usually multi-class) classification layer; this initialization is called pretraining.

The rationale behind unsupervised pretraining of a deep network for a supervised task is as follows.

Searching the parameter space of deep architectures is a difficult task because the training criterion is non-

convex and involves many local minima. Random initialization of a deep architecture falls with very high

probability in the basin of attraction of a poor local minimum (Erhan et al. 2009). Unsupervised pre-training

initializes a deep architecture in a basin of attraction of gradient descent corresponding to better

generalization performance (Erhan et al. 2010).

Figure 3: Stacked training of a stacked DAE with 3 hidden layers. $x$ denotes an input sample and $y$ denotes the network output. Black layers are the ones used for computing the loss. Gray arrows denote weights that are updated through back propagation, while blank arrows denote weights which are not changed during training. $w'$ weights are discarded in subsequent training phases. Crosses illustrate corrupted units. (a) Training the 1st hidden layer. (b) Training the 2nd hidden layer. Noise is injected to $h_1$, creating $\tilde{h}_1$. (c) Training the 3rd hidden layer.

5.2 Gradual training of deep DAEs

We describe an alternative, gradual scheme for training autoencoders. The basic idea is to train the deep

autoencoder layer-by-layer, but keep adapting the lower layers continuously. Noise injection is only applied

at the input level (Figure 4). The motivation for this procedure has two aspects. First, it allows lower weights

to take into account the higher representations during training, reducing the greedy nature of stacked

training. Second, denoising is applied to the input, rather than to a hidden representation learned in a greedy

way.

More specifically, the first layer is trained in the same way as in stacked training, producing the weights
$W_1$. Then, when adding the second-layer autoencoder, its weights $W_2$ are tuned jointly with $W_1$. This is done
by using the weights $W_1$ to initialize the first layer and randomly initializing the weights of the second.
Given a training sample $x$, we generate a noisy version $\tilde{x}$, feed it to the 2-layered DAE, and compute the
activations at the subsequent layers: $h_1 = \mathrm{Sigmoid}(W_1^\top \tilde{x})$, $h_2 = \mathrm{Sigmoid}(W_2^\top h_1)$ and
$y = \mathrm{Sigmoid}(W_3'^\top h_2)$. Importantly, the loss function is now computed over the input $x$, and is used to
update all the weights including $W_1$ (Figure 4b). Similarly, if a 3rd layer is trained, it involves tuning
$W_1$ and $W_2$ in addition to $W_3$ and $W_4'$ (Figure 4c).

There are therefore two main differences between gradual and stacked training of SDAE. First, in

gradual training, weights of lower layers are never fixed as in stacked training, but rather trained jointly

when tuning weights of a newly-added layer. Second, each training phase reconstructs the original input from
a noisy version of it, rather than reconstructing a hidden-layer representation from its noisy version.
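To make the two differences concrete, the following is a minimal Python sketch of the loss computed when the second hidden layer is added under each scheme. It is an illustration only, not the implementation used in this work: biases are omitted, weights are plain nested lists, and the names `stacked_loss`, `gradual_loss`, `W2_dec` and `W3_dec` are hypothetical. Each function also returns the names of the weights that would be updated by back propagation in that phase.

```python
import math
import random

def sigmoid_layer(W, v):
    """Return Sigmoid(W^T v) for W stored as len(v) rows of n_out columns."""
    n_out = len(W[0])
    pre = [sum(W[i][j] * v[i] for i in range(len(v))) for j in range(n_out)]
    return [1.0 / (1.0 + math.exp(-a)) for a in pre]

def mask(v, fraction, rng):
    """Masking noise: set a random `fraction` of the entries of v to zero."""
    zeroed = set(rng.sample(range(len(v)), round(fraction * len(v))))
    return [0.0 if i in zeroed else value for i, value in enumerate(v)]

def cross_entropy(target, output, eps=1e-12):
    """Cross-entropy of `output` against `target`, both with entries in [0, 1]."""
    return -sum(t * math.log(o + eps) + (1.0 - t) * math.log(1.0 - o + eps)
                for t, o in zip(target, output))

def stacked_loss(x, W1, W2, W2_dec, noise, rng):
    """Stacked, 2nd layer: W1 is frozen; noise and loss live at the hidden layer h1."""
    h1 = sigmoid_layer(W1, x)                      # clean hidden representation
    h2 = sigmoid_layer(W2, mask(h1, noise, rng))   # noise injected into h1
    h1_rec = sigmoid_layer(W2_dec, h2)             # reconstruct h1
    return cross_entropy(h1, h1_rec), ("W2", "W2_dec")

def gradual_loss(x, W1, W2, W3_dec, noise, rng):
    """Gradual, 2nd layer: noise and loss live at the input; W1 keeps adapting."""
    h1 = sigmoid_layer(W1, mask(x, noise, rng))    # noise injected into the input
    h2 = sigmoid_layer(W2, h1)
    y = sigmoid_layer(W3_dec, h2)                  # reconstruct the input
    return cross_entropy(x, y), ("W1", "W2", "W3_dec")

def random_weights(n_in, n_out, rng):
    """Small random weights (the range is an arbitrary choice for this sketch)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(0)
x = [rng.random() for _ in range(8)]               # toy input with 8 "pixels"
W1 = random_weights(8, 5, rng)
W2 = random_weights(5, 4, rng)
loss_s, trained_s = stacked_loss(x, W1, W2, random_weights(4, 5, rng), 0.25, rng)
loss_g, trained_g = gradual_loss(x, W1, W2, random_weights(4, 8, rng), 0.25, rng)
print("stacked updates:", trained_s)
print("gradual updates:", trained_g)
```

In the stacked variant the target of the loss is the hidden representation and `W1` does not appear among the updated weights; in the gradual variant the target is the clean input and `W1` is updated together with the new layer.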

Figure 4: Gradual training of a denoising auto encoder with 3 hidden layers. (a) Training the 1st hidden
layer. (b) Training layers 1 + 2. (c) Training layers 1 + 2 + 3. In all panels, $x$ denotes an input sample
and $y$ the network output. The loss is computed over the black layers. Gray arrows denote weights
that are updated through back propagation. $W'$ denotes weights used for decoding, which are discarded
in subsequent training phases. Crosses illustrate corrupted units.

5.3 Training procedure

Performance was evaluated on a test subset of 10,000 samples. When quantifying performance as a function
of dataset size, we created training subsets of different sizes while keeping the class distribution uniform,
as in the original training data.
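One simple way to draw such class-balanced subsets is sketched below in Python. This is an assumed procedure for illustration (the function name `stratified_subset` is hypothetical); it draws the same number of samples from every class and rounds the per-class count down when the requested size is not divisible by the number of classes.

```python
import random
from collections import defaultdict

def stratified_subset(labels, size, seed=0):
    """Return indices of a subset of `size` samples with equal counts per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    per_class = size // len(by_class)        # rounds down if not divisible
    chosen = []
    for label in sorted(by_class):
        # sample without replacement inside each class
        chosen.extend(rng.sample(by_class[label], per_class))
    rng.shuffle(chosen)                      # avoid class-ordered batches
    return chosen

labels = [i % 10 for i in range(1000)]       # toy labels: 10 balanced classes
subset = stratified_subset(labels, 200, seed=1)
print(len(subset))
```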

Hyper parameters were selected using a second level of cross validation (10-fold CV for MNIST, 5-fold
for CIFAR), keeping a uniform distribution over classes. In the experiments below, we tuned the
following hyper parameters: number of units in hidden layers (same for all layers: 1000, 1500, 2000, 2500),
learning rate ($10^{-1}$, $10^{-2}$, $10^{-3}$, $5 \times 10^{-4}$, $10^{-4}$, $5 \times 10^{-5}$, $10^{-5}$), batch size for SGD (10, 20), seed for
random weight initialization, momentum (0.9, 0.7) (Polyak 1964) and weight decay ($10^{-3}$, $10^{-4}$, $10^{-5}$)
(Moody et al. 1995). The best performing configuration on the validation set was sought in a semi-automatic
fashion (Vincent et al. 2010) by running experiments in parallel on a large computation cluster, with manual
guidance to avoid wasting resources on unnecessary parts of the configuration space. We used early
stopping by monitoring reconstruction error or classification error on the validation set, and stopped training
after 35 epochs without improvement. We used the parameters (weights) that yielded the best performance
on the validation set. Reported results are the average over 3 different random train-validation splits.
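The early-stopping rule can be sketched as follows. This is a minimal Python illustration under the assumption that "improvement" means a strictly lower validation error; the function name `early_stopping` is hypothetical, and a real training loop would also store the weights at the best epoch.

```python
def early_stopping(validation_errors, patience=35):
    """Scan per-epoch validation errors and apply the patience rule.

    Returns (best_epoch, epochs_run), with epochs counted from 0.
    Training stops once `patience` consecutive epochs bring no improvement.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            # new best: in real training the weights would be checkpointed here
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch + 1
    return best_epoch, len(validation_errors)

# Toy curve: the error improves for 20 epochs and then stays flat.
errors = [1.0 - 0.01 * e for e in range(20)] + [0.9] * 100
print(early_stopping(errors))
```

On this toy curve the best epoch is 19, and training stops after epoch 54, that is, after 35 further epochs without improvement.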

Since gradual training involves updating lower layers, every presentation of a sample involves more

weight updates than in a single-layered DAE. We compare stacked and gradual training on a common

ground, by using the same ‘budget’ for weight update steps. For example, when training the second layer

for $n$ epochs in gradual training, we allocate $2n$ training epochs for stacked training. The overall budget for

update steps was determined using early stopping, such that the reconstruction error on the validation set

in the last 10 epochs did not improve more than 0.5% in all training schemes.

Images were presented to the DAE network as a vector composed of the concatenated rows of image pixels.
RGB images (CIFAR10 and CIFAR100) were presented as a concatenated vector of 3 images (one for each
RGB color). Masking noise at a given percentage level was applied by randomly choosing that percentage of
the pixels and setting them to zero. In the CIFAR datasets, we zeroed all 3 RGB colors of a chosen pixel
(resulting in a black pixel).
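The masking procedure can be sketched in Python as follows. The function name `mask_pixels` and the rounding of the number of masked pixels are assumptions made for this illustration; the layout (one block of pixel values per color, concatenated) follows the description above.

```python
import random

def mask_pixels(image, n_pixels, fraction, rng, channels=1):
    """Zero a random `fraction` of the pixels of a flattened image.

    `image` is a flat list made of `channels` consecutive blocks of
    `n_pixels` values each. A chosen pixel is zeroed in every channel.
    """
    n_masked = round(fraction * n_pixels)
    chosen = rng.sample(range(n_pixels), n_masked)
    corrupted = list(image)                  # leave the original untouched
    for pixel in chosen:
        for channel in range(channels):
            corrupted[channel * n_pixels + pixel] = 0.0
    return corrupted

rng = random.Random(0)
gray = mask_pixels([1.0] * 784, 784, 0.15, rng)                # MNIST-sized vector
rgb = mask_pixels([1.0] * 3072, 1024, 0.10, rng, channels=3)   # CIFAR-sized vector
print(gray.count(0.0), rgb.count(0.0))
```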

5.4 Results for experiments with gradually trained DAEs

We evaluate gradual and stacked training in an unsupervised task of image denoising. We then test these
training methods as an initialization for supervised learning, and quantify their performance as a function of
dataset size.

5.4.1 Unsupervised learning for denoising

We start by evaluating gradual training in an unsupervised task of image denoising. Here, the network is

trained to minimize a cross-entropy loss over corrupted images. In addition to stacked and gradual training,

we also tested a hybrid method that spends some epochs on tuning only the second layer (as in stacked

training), and then spends the rest of the training budget on both layers (as in gradual training). We define

the Stacked-vs-Gradual fraction $0 \le f \le 1$ as the fraction of weight updates that occur during ‘stacked’-type
training. $f = 1$ is equivalent to pure stacked training while $f = 0$ is equivalent to pure gradual training.
Given a budget of $n$ training epochs, we train the 2nd hidden layer with gradual training for $n(1 - f)$ epochs,
and with stacked training for $2nf$ epochs.
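The allocation of epochs can be written as a short Python helper. The name `split_budget` is hypothetical. The cost model in the comments (one gradual epoch counted as two units of update work, one stacked epoch as one unit) is our reading of the statement that $n$ gradual epochs correspond to $2n$ stacked epochs.

```python
def split_budget(n_epochs, f):
    """Split a budget of n_epochs between gradual and stacked training.

    f is the Stacked-vs-Gradual fraction: f = 0 is pure gradual training,
    f = 1 is pure stacked training.
    Returns (gradual_epochs, stacked_epochs).
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    gradual_epochs = n_epochs * (1 - f)
    stacked_epochs = 2 * n_epochs * f
    return gradual_epochs, stacked_epochs

for f in (0.0, 0.25, 0.5, 1.0):
    gradual, stacked = split_budget(80, f)
    # Assumed cost model: a gradual epoch costs 2 units, a stacked epoch 1 unit,
    # so the total update work is 2 * 80 = 160 units for every value of f.
    print(f, gradual, stacked, 2 * gradual + stacked)
```

Under this cost model the total work is the same for every $f$, and the share of it spent in stacked-type training is exactly $f$.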

Figure 5 shows the test-set cross entropy error when training 2-layered DAEs, as a function of the
Stacked-vs-Gradual fraction. Pure gradual training achieved significantly lower reconstruction error than any
mix of stacked and gradual training with the same budget of update steps. See Figure 7 for examples of
image denoising by deep DAEs.
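The reconstruction errors reported in this section are given relative to the minimum possible error, that is, the cross-entropy of the uncorrupted data with itself (see the caption of Figure 5). The Python sketch below shows one way to compute such a relative error. It assumes that "relative" means the difference between the two cross-entropies; the function names are hypothetical.

```python
import math

def cross_entropy(target, output, eps=1e-12):
    """Cross-entropy between two vectors with entries in [0, 1]."""
    total = 0.0
    for t, o in zip(target, output):
        o = min(max(o, eps), 1.0 - eps)      # clip to avoid log(0)
        total -= t * math.log(o) + (1.0 - t) * math.log(1.0 - o)
    return total

def relative_cross_entropy(target, output):
    """Cross-entropy minus its minimum, attained when output equals target."""
    return cross_entropy(target, output) - cross_entropy(target, target)

original = [0.2, 0.7, 1.0, 0.0]
reconstruction = [0.3, 0.6, 0.9, 0.1]
print(relative_cross_entropy(original, reconstruction))
```

With this definition a perfect reconstruction has a relative error of zero even when the pixel values are not binary, whereas the raw cross-entropy of a non-binary image with itself is positive.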

Figure 5: Reconstruction error of unsupervised training methods measured by cross-entropy loss.

Error bars are over 3 train-validation splits. The shown cross-entropy error is relative to the

minimum possible error, computed as the cross-entropy error of the original uncorrupted test set

with itself. All compared methods used the same budget of update operations.

(a) MNIST dataset. Images were corrupted with 15% masking noise. Network has 2 hidden layers

with 1000 units each. The 1st hidden layer is trained for 50 epochs. Total epoch budget for the 2nd

hidden layer is 80 epochs. (b) CIFAR-10 dataset. Images were corrupted with 10% masking noise.

Network architecture: 2 hidden layers, each with 1500 units. The 1st hidden layer is trained for 25

epochs. Total epoch budget for 2nd hidden layer is 70 epochs. (c) CIFAR-100 dataset. Noise

corruption level is 10%. Network architecture is 2 hidden layers with 2500 units each. 1st hidden

layer is trained for 35 epochs. Total epoch budget for 2nd hidden layer is 70 epochs.

We also evaluated the reconstruction error after an additional full tuning phase, in which
all weights are updated jointly for 80 epochs for MNIST and 70 epochs for CIFAR. In this training scheme,
pure gradual training ($f = 0$) also improved the reconstruction error over full stacked training ($f = 1$)
across all datasets (see Figure 6).

Figure 6: Reconstruction error of unsupervised training methods followed by a full tuning stage,

measured by cross-entropy loss. Error bars are over 3 train-validation splits. The shown cross-

entropy error is relative to the minimum possible error, computed as the cross-entropy error of the

original uncorrupted test set with itself. Hyper parameters and network architecture are the same as

in Figure 5. In all networks weights were updated jointly for additional 80 epochs for MNIST and

70 epochs for CIFAR.

Figure 7: Examples of unsupervised image denoising. The left column shows original images from the test
set. The middle column shows the images from the left column corrupted with masking noise (10% for
CIFAR10 and CIFAR100 and 15% for MNIST). The text caption shows the cross entropy error between the
original image and the corrupted image (the error is relative to the minimum possible error, computed as the
cross-entropy error of the original image with itself). The right column shows images denoised using the DAEs
trained with pure gradual training ($f = 0$) that are shown in Figure 5. The text caption shows the cross
entropy error between the original image and the denoised image.

5.4.2 Gradual-training DAE for initializing a network in a supervised task

We use DAEs trained in the previous experiment for initializing a deep network to solve a supervised

classification task. The network architecture is the same as the SDAE architecture, except for the top layer. The
first two hidden layers are initialized with the weights of the first two layers of the SDAE ($W_1$ and $W_2$ in Figure
4b). We then add a top classification layer with output units matching the classes in the dataset, with

randomly initialized weights.
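This initialization step can be sketched in Python as follows. The dictionary layout, the function name `init_classifier` and the range of the random top-layer weights are assumptions made for this illustration only.

```python
import random

def init_classifier(dae_weights, n_classes, seed=0, scale=0.01):
    """Build classifier weights from a pretrained 2-layer DAE.

    The two hidden layers are copied from the DAE, the decoding weights are
    dropped, and a randomly initialized classification layer is added on top.
    """
    rng = random.Random(seed)
    W1, W2 = dae_weights["W1"], dae_weights["W2"]
    n_top = len(W2[0])                       # units in the 2nd hidden layer
    W_out = [[rng.uniform(-scale, scale) for _ in range(n_classes)]
             for _ in range(n_top)]
    return {
        "W1": [row[:] for row in W1],        # copies, so fine-tuning does not
        "W2": [row[:] for row in W2],        # modify the stored DAE weights
        "W_out": W_out,
    }

# Toy DAE: 6 inputs, hidden layers of 4 and 3 units, plus a decoder that is dropped.
dae = {
    "W1": [[0.1] * 4 for _ in range(6)],
    "W2": [[0.2] * 3 for _ in range(4)],
    "W3_dec": [[0.3] * 6 for _ in range(3)],
}
classifier = init_classifier(dae, n_classes=10)
print(sorted(classifier))
```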

We train these networks on several subsets of each dataset to quantify the benefit of unsupervised

pretraining as a function of train-set size. Figure 8 traces the classification error as a function of training set

size, showing in text the percentage of relative improvement. These results suggest that initialization with
gradually-trained DAEs yields better classification accuracy than initialization with stacked-trained
DAEs, and that this effect is mostly relevant for datasets with fewer than 50K samples.

The gradual training procedure described above differs from stacked training in two aspects: noise

injection at the input level and joint training of weights. To test which of these two contributes to the

superior performance we conducted the following experiment. We trained a network to reconstruct a noisy

version of the input, as in gradual training, but kept the weights of the 1st hidden layer fixed as in stacked

training.

The results of this experiment varied across datasets. In MNIST, injecting noise to the input while

freezing the first layer performed worse than gradual training, both in terms of cross entropy (in the

reconstruction task) and in terms of classification accuracy (in the supervised task). In CIFAR however,

training with freezing the first layer actually reduced reconstruction error compared with gradual training,

while achieving the same performance in the supervised task.

Figure 8: Classification error of supervised training initialized based on DAEs. Error bars are over

3 train-validation splits. Each curve shows a different pre-training type (see Figure 5). Text labels

show the percentage of error improvement of Stacked-vs-Gradual 0 pretraining (Figure 5) compared

to Stacked-vs-Gradual 1 pretraining (not shown in Figure 5). (a) MNIST. Two hidden layers with

1000 units each. (b) CIFAR-10. Two hidden layers with 1500 units each. (c) CIFAR-100. Two

hidden layers with 2500 units each.

We repeated the experiment shown in Figure 8 with DNNs initialized from DAEs trained with a final full
tuning stage (Figure 6). The results of this experiment varied across datasets (see Figure 9). For MNIST
and CIFAR10, the classification error of the Stacked-vs-Gradual 1 pretrained network was worse than that of the
Stacked-vs-Gradual 0 pretrained network, while on CIFAR100 Stacked-vs-Gradual 0 pretraining improved the
classification error.

Figure 9: Classification error of supervised training initialized based on DAEs trained with final full

tuning stage. Error bars are over 3 train-validation splits. Each curve shows a different pre-training

type (see Figure 6). Text labels show the percentage of error improvement of Stacked-vs-Gradual 0

pretraining compared to Stacked-vs-Gradual 1 pretraining (see Figure 6).

(a) MNIST. Two hidden layers with 1000 units each. (b) CIFAR-10. Two hidden layers with 1500

units each. (c) CIFAR-100. Two hidden layers with 2500 units each.

6. Discussion

In this work we investigated two brain-inspired approaches for learning. First, we formulated DNNs with
non-recurrent lateral connections as feed forward DNNs with dummy units. We tested the effect of lateral
connections on a network that learns data with highly unbalanced classes, but found the effect of adding
lateral connections to be not significant.

Second, we tested a ‘gradual training’ scheme for denoising auto encoders, which improves the
reconstruction error under a fixed training budget, as compared to stacked training. It also provided a small
but consistent improvement in classification error in the regime of mid-sized training sets. Stacked and
gradual training can be viewed as two extreme adaptation schemes: stacked training reflects a zero
learning rate for the lower layers, and gradual training reflects a full learning rate. It
remains to test intermediate training schedules, in which the learning rate of a layer is gradually reduced as
the layer is presented with more examples.

7. References

Bengio, Yoshua. 2012. “Practical Recommendations for Gradient-Based Training of Deep Architectures.”

In Neural Networks: Tricks of the Trade, 437–78. Springer.

Bengio, Yoshua, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. “Greedy Layer-Wise Training
of Deep Networks.” Advances in Neural Information Processing Systems 19: 153.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Vol. 1. Springer New York.

Bourne, James A, and Marcello G P Rosa. 2006. “Hierarchical Development of the Primate Visual Cortex,

as Revealed by Neurofilament Immunoreactivity: Early Maturation of the Middle Temporal Area

(MT).” Cerebral Cortex 16 (3). Oxford Univ Press: 405–14.

Fahlman, Scott E., and Christian Lebiere. 1990. “The Cascade-Correlation Learning Architecture.”

In Advances in Neural Information Processing Systems 2.

Erhan, Dumitru, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy

Bengio. 2010. “Why Does Unsupervised Pre-Training Help Deep Learning?” The Journal of Machine

Learning Research 11. JMLR.org: 625–60.

Erhan, Dumitru, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. 2009.
“The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training.”
International Conference on Artificial Intelligence and Statistics, 153–60.

Felleman, Daniel J, and David C Van Essen. 1991. “Distributed Hierarchical Processing in the Primate

Cerebral Cortex.” Cerebral Cortex 1 (1). Oxford Univ Press: 1–47.

Guillery, R W. 2005. “Is Postnatal Neocortical Maturation Hierarchical?” Trends in Neurosciences 28 (10):

512–17.

Hinton, Geoffrey, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew

Senior, et al. 2012. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The

Shared Views of Four Research Groups.” Signal Processing Magazine, IEEE 29 (6). IEEE: 82–97.

Hinton, Geoffrey, and Ruslan R Salakhutdinov. 2006. “Reducing the Dimensionality of Data with Neural

Networks.” Science 313 (5786). American Association for the Advancement of Science: 504–7.

Hubel, David H, and Torsten N Wiesel. 1959. “Receptive Fields of Single Neurones in the Cat’s Striate

Cortex.” The Journal of Physiology 148 (3). Blackwell Publishing: 574.

Kothari, Ravi, and Kwabena Agyepong. 1996. “On Lateral Connections in Feed-Forward Neural

Networks.” In Neural Networks, 1996., IEEE International Conference on, 1:13–18.

Krizhevsky, Alex, and Geoffrey Hinton. 2009. “Learning Multiple Layers of Features from Tiny Images.”

Computer Science Department, University of Toronto, Tech. Rep. Citeseer.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “Imagenet Classification with Deep

Convolutional Neural Networks.” In Advances in Neural Information Processing Systems, 1097–

1105.

Larochelle, Hugo, Dumitru Erhan, and Pascal Vincent. 2009. “Deep Learning Using Robust Interdependent

Codes.” In International Conference on Artificial Intelligence and Statistics, 312–19.

LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. 1998. “Gradient-Based Learning Applied to Document

Recognition.” Proceedings of the IEEE 86 (11): 2278–2324.

Liu, Xiao-Bo, Karl D Murray, and Edward G Jones. 2004. “Switching of NMDA Receptor 2A and 2B

Subunits at Thalamic and Cortical Synapses during Early Postnatal Development.” The Journal of

Neuroscience : The Official Journal of the Society for Neuroscience 24 (40): 8885–95.

Minsky, Marvin, and Seymour A. Papert. 1969. “Perceptrons.” MIT Press.

Moody, J E, S J Hanson, Anders Krogh, and John A Hertz. 1995. “A Simple Weight Decay Can Improve

Generalization.” Advances in Neural Information Processing Systems 4: 950–57.

Polyak, Boris Teodorovich. 1964. “Some Methods of Speeding up the Convergence of Iteration Methods.”

USSR Computational Mathematics and Mathematical Physics 4 (5). Elsevier: 1–17.

Ranzato, Marc’Aurelio. 2014. “On Learning Where To Look.” arXiv Preprint arXiv:1405.5488.
http://arxiv.org/abs/1405.5488.

Razavian, Ali Sharif, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. “CNN Features

off-the-Shelf: An Astounding Baseline for Recognition.” arXiv Preprint arXiv:1403.6382.

Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1988. “Learning Representations by Back-

Propagating Errors.” Cognitive Modeling.

Smid, J. 1994. “Layered Neural Networks with Horizontal Connections Can Reduce the Number of Units.”

In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), 3:1346–50.

IEEE.

Stansbury, Dustin E. 2012. “Matlab Environment for Deep Architecture Learning.”

https://github.com/dustinstansbury/medal.

Stettler, Dan D, Aniruddha Das, Jean Bennett, and Charles D Gilbert. 2002. “Lateral Connectivity and

Contextual Interactions in Macaque Primary Visual Cortex.” Neuron 36 (4): 739–50.

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru

Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. “Going Deeper with Convolutions.” arXiv

Preprint arXiv:1409.4842.

Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. “Extracting and

Composing Robust Features with Denoising Autoencoders.” In Proceedings of the 25th International

Conference on Machine Learning, 1096–1103.

Vincent, Pascal, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010.

“Stacked Denoising Autoencoders : Learning Useful Representations in a Deep Network with a Local

Denoising Criterion.” The Journal of Machine Learning Research 11: 3371–3408.

Weliky, Michael, Karl Kandler, David Fitzpatrick, and Lawrence C. Katz. 1995. “Patterns of Excitation

and Inhibition Evoked by Horizontal Connections in Visual Cortex Share a Common Relationship to

Orientation Columns.” Neuron 15 (3): 541–52.

Xie, Junyuan, Linli Xu, and Enhong Chen. 2012. “Image Denoising and Inpainting with Deep Neural

Networks.” In Advances in Neural Information Processing Systems, 341–49.

Zeiler, Matthew D, and Rob Fergus. 2013. “Visualizing and Understanding Convolutional Neural

Networks.” arXiv Preprint arXiv:1311.2901.

8. Appendix

8.1 Derivation of loss functions

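Using the notations in Section 3, we apply the chain rule to differentiate the cross entropy loss function with
respect to the activation $a_i$ of unit $i$. $y_i$ denotes the network output and $x_i$ denotes the target output.

\[
\begin{aligned}
y_i &= g(a_i) \\
\mathrm{XENT}(x, y) &= -\sum_i \left[ x_i \log y_i + (1 - x_i) \log(1 - y_i) \right] \\
\frac{\partial\, \mathrm{XENT}(x, y)}{\partial y_i}
  &= -\left[ \frac{x_i}{y_i} - \frac{1 - x_i}{1 - y_i} \right]
   = -\frac{x_i (1 - y_i) - y_i (1 - x_i)}{y_i (1 - y_i)} \\
  &= -\frac{x_i - x_i y_i + y_i x_i - y_i}{y_i (1 - y_i)}
   = -\frac{x_i - y_i}{y_i (1 - y_i)} \\
\frac{\partial\, \mathrm{XENT}(x, y)}{\partial a_i}
  &= \frac{\partial\, \mathrm{XENT}(x, y)}{\partial y_i}
     \frac{\partial y_i}{\partial g}
     \frac{\partial g}{\partial a_i}
   = -\frac{x_i - y_i}{y_i (1 - y_i)} \, y_i (1 - y_i)
   = y_i - x_i
\end{aligned}
\]

When using the mean square error function (MSE) the derivative is:

\[
\begin{aligned}
\mathrm{MSE}(x, y) &= \frac{1}{2} \sum_i \left[ x_i - y_i \right]^2 \\
\frac{\partial\, \mathrm{MSE}(x, y)}{\partial y_i} &= y_i - x_i \\
\frac{\partial\, \mathrm{MSE}(x, y)}{\partial a_i}
  &= \frac{\partial\, \mathrm{MSE}(x, y)}{\partial y_i}
     \frac{\partial y_i}{\partial g}
     \frac{\partial g}{\partial a_i}
   = (y_i - x_i) \, y_i (1 - y_i)
\end{aligned}
\]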
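Both results can be checked numerically. The Python sketch below compares the closed-form derivatives with central finite differences, assuming that $g$ is the sigmoid function (so that its derivative is $y_i (1 - y_i)$, as used in the last step of each derivation). The test values and helper names are arbitrary choices for this illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def xent(x, a):
    """Cross-entropy loss as a function of the pre-activations a."""
    y = [sigmoid(v) for v in a]
    return -sum(xi * math.log(yi) + (1.0 - xi) * math.log(1.0 - yi)
                for xi, yi in zip(x, y))

def mse(x, a):
    """Mean square error loss as a function of the pre-activations a."""
    y = [sigmoid(v) for v in a]
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def numeric_grad(loss, x, a, i, h=1e-6):
    """Central finite difference of loss with respect to a[i]."""
    up = a[:i] + [a[i] + h] + a[i + 1:]
    down = a[:i] + [a[i] - h] + a[i + 1:]
    return (loss(x, up) - loss(x, down)) / (2.0 * h)

x = [0.0, 1.0, 0.3]            # target outputs
a = [0.5, -1.2, 2.0]           # pre-activations of the output units
y = [sigmoid(v) for v in a]

xent_analytic = [yi - xi for xi, yi in zip(x, y)]
mse_analytic = [(yi - xi) * yi * (1.0 - yi) for xi, yi in zip(x, y)]
xent_numeric = [numeric_grad(xent, x, a, i) for i in range(len(a))]
mse_numeric = [numeric_grad(mse, x, a, i) for i in range(len(a))]

xent_gap = max(abs(p - q) for p, q in zip(xent_analytic, xent_numeric))
mse_gap = max(abs(p - q) for p, q in zip(mse_analytic, mse_numeric))
print(xent_gap, mse_gap)       # both gaps should be close to zero
```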

8.2 Implementing DNN with lateral connections using standard DNN formulation

We describe here in detail how to implement a DNN with lateral connections using a standard DNN
implementation in the MEDAL framework. We use the notations of Section 3 and Figure 1.

DNN Initialization:

For each layer with lateral connections (H1), add another auxiliary hidden layer (H2):

1. Set all weights and biases to zero.

1.1 For each receiver unit, create dummy unit in H1. If a unit does not receive or emit lateral

connections, consider it as a receiver for this purpose (can be thought of as a receiver with lateral

weight equal to zero).

1.2 For each emitter unit, create dummy unit in H2.

2. Set dummy weights to initial value of 1 (their value will not be changed, as they will not be updated

during back propagation stage).

3. Set weights of the lateral connections.

Forward pass:

After calculating the activation of all units in a layer, restore the activation of each dummy unit to its
value before activation, which implements an identity activation function for the dummy units.

Back propagation:

1. After calculating derivative of activation for all units, we restore derivative for dummy hidden units

to 1.

2. Weights between H1 and H2 (dummy weights and lateral connections) are not changed, but
$\delta_i^l = \frac{\partial E}{\partial a_i^l}$ is calculated as usual for each unit $i$ at layer $l$. Therefore we disable the
parameter (weight) update for weights between H1 and H2.

3. Biases for real (not dummy) units are updated by back propagation both at H1 and H2.
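The treatment of dummy units in the forward pass and in step 1 of back propagation can be sketched in Python as follows. This is an illustration only (the implementation in this work used the Matlab MEDAL framework); the function name `activate_layer` and the boolean mask `is_dummy` are hypothetical, and a sigmoid activation is assumed for the real units.

```python
import math

def activate_layer(pre_activations, is_dummy):
    """Apply the layer activation, then restore dummy units to identity.

    Returns (activations, derivatives). Real units use the sigmoid and its
    derivative; dummy units keep their pre-activation value and get a
    derivative of 1, which is what an identity activation would produce.
    """
    # Standard step: the activation function is applied to every unit.
    activations = [1.0 / (1.0 + math.exp(-a)) for a in pre_activations]
    derivatives = [y * (1.0 - y) for y in activations]
    # Correction step for dummy units only.
    for i, dummy in enumerate(is_dummy):
        if dummy:
            activations[i] = pre_activations[i]   # value before activation
            derivatives[i] = 1.0                  # derivative of the identity
    return activations, derivatives

acts, derivs = activate_layer([0.0, 2.5, -1.0], [False, True, False])
print(acts, derivs)
```

Applying the standard activation first and correcting the dummy units afterwards mirrors the procedure described above, which leaves the layer code of a standard DNN implementation unchanged.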
