Exploring diversity of task and data in artificial neural networks

Daniel Goodwin
MIT Media Lab
dgoodwin@mit.edu

David Rolnick
MIT Math Department
drolnick@mit.edu

Abstract

In this project, we used a known convolutional network as a substrate for exploring the impact of diversity in artificial neural network development. Our testbed was 16,000 annotated images on the two separate tasks of age and gender classification. Based on neurobiology and cognitive science, we define diversity as variance in data, task and shared internal representation. From this, we ran experiments testing hypotheses of sparsity, noise and shared knowledge between neural systems. In effect, we explored a microcosm of a cognitive architecture.

1. Introduction

One avenue to exploring the future of AI is to consider how successful sub-symbolic systems such as convolutional networks may be used as a substrate for exploring pieces of a cognitive architecture. For this class project, we have explored the importance of diversity in learning from the human perspective and examined whether such principles could be expressed in multilayer convolutional networks.

1.1. Motivation and Prior Work

As children we are exposed to many different lessons from many different domains. We aggregate these multiple learnings into a form of general intelligence with which we improve our efficiency in future learning. There are two pieces of prior work that are of particular relevance. In one of his papers utilizing the LEABRA cognitive architecture, Randall O'Reilly demonstrates a modeled multipart neural architecture that includes four neurons that are dedicated to identifying the task. If O'Reilly et al. represents a more theoretical end of the artificial neural network spectrum, then the multitask learning research is on the more applied side.

The idea of training a multilayer convolutional network on both a primary task and a secondary task was presented in 2008 by Collobert et al. [9] and in 2013 by Seltzer et al. [8], achieving state-of-the-art performance in natural language processing and speech recognition, respectively. The goal of this project is to chart a course down the center of the spectrum between theoretical and applied: can we utilize existing convnet infrastructure to show that a neural network can benefit from diversity just as the human brain does?

Figure 1. The human ventral stream has a many-layered hierarchy before the various cortical subregions work on specific subtasks such as facial recognition or physical expectation.

Figure 2. An application of O'Reilly's LEABRA architecture [3] that (a) includes an extra four neurons to encode and recognize the variety of tasks.

1.2. Defining Diversity

For the context of this research, diversity is defined as:

1. Variation of training and testing data In the context of a single machine learning task, how do variance and novelty contribute to the overall performance of the algorithm?

2. Multiple tasks Can a single architecture simultaneously learn and perform multiple tasks? Specifically, can the nature of learning multiple tasks improve the individual performance of each task?

3. Variance in internal state Can internal diversity in a learned representation or within a computational unit contribute to improved performance of the network? This variance could be either in the neurons recruited for a specific task or in a modulation of the output.

Figure 3. Synaptic pruning is a lifelong process starting after a peak synaptic density early in development. Taken from [2].

1.3. Neuroscience Background

There are several concepts in neurobiology that inspire the three aspects of diversity enumerated above. Novelty is essential for developing critical neural circuitry, perhaps best illustrated by the classic 1970 work of Blakemore and Cooper showing that without experiencing horizontal lines early in development, the feline visual system is unable to perceive horizontal lines later in life [6].

There are a multitude of shared architectures in the brain, perhaps best illustrated by the hierarchical perceptual systems. The ventral stream, for example, processes raw information through many layers (RGC → LGB → V1 → V2 → V4) before the cortical subareas perform the task-specific cognition such as facial recognition or object detection. Designing shared architectures across multiple tasks and domains will inevitably be an essential step in artificial neural networks.

Finally, we know that noise contributes to the information storage and processing ability of the brain, both at the scale of networks and of individual cells. For example, John White and Alan Kay showed in a 1998 paper that varying modeled noise in the ion channels could induce qualitatively different behavior modes in a neuron, indicating an improved richness of representation that a network could contain [1].

2. Methods

Using the Adience Benchmark dataset, we had access to 16,000 cell phone images of faces that have been annotated by age, gender and head tilt and aligned to a consistent face position. The distribution of data used in this paper is shown in Figure 4. We used the 6-layer convnet architecture from [4], implemented the network in Caffe, and trained on a machine with 256GB of RAM and an NVIDIA GPU. With this setup, training 8000 iterations for two networks on 5GB of data took about 70 minutes, and we completed over 20 training experiments throughout this project.

Figure 4. Distribution of face data in the Adience [5] dataset.

Using this well-annotated dataset, we explored how the two separate cognitive tasks of age classification and gender identification could benefit from differing amounts of data variance, filter sparsity, filter noise and number of shared lower-level layers. Below is an explanation of each modification based on the diversity principles enumerated above.

2.1. Shared Network layers

Extending ideas from [8] and [9], we used the same neural network architecture for the two separate tasks of age and gender classification, sharing some subset of the first convolutional layers. This is diagrammed in Figure [?] and implemented in Caffe.
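As a concrete illustration, the weight sharing can be expressed in pycaffe by copying the learned blobs of the shared convolutional layers from one network into the other. This is a minimal sketch rather than our exact training script; the prototxt/caffemodel file names and the layer names conv1 and conv2 are illustrative assumptions.

```python
import caffe

# Trained Age network (source of the shared "core") and a Gender network
# about to be trained; file and layer names here are illustrative only.
age_net = caffe.Net('age_net.prototxt', 'age_net.caffemodel', caffe.TEST)
gender_net = caffe.Net('gender_net.prototxt', caffe.TRAIN)

shared_layers = ['conv1', 'conv2']  # the subset of first conv layers to share

for name in shared_layers:
    # params[name][0] holds the filter weights, params[name][1] the biases
    gender_net.params[name][0].data[...] = age_net.params[name][0].data
    gender_net.params[name][1].data[...] = age_net.params[name][1].data

gender_net.save('gender_net_init.caffemodel')  # train from this initialization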

2.2. Progressive Sparsity

Extending ideas from Dropout [7] and Optimal Brain Damage [11], we explored a method of sparsity in the convolutional layers. Part of this project was discovering how many variants of this technique already exist in the field: our options were either to reduce redundant filters through a low-rank approximation such as [10] or to selectively zero out individual weights per filter. For rapid prototyping in the context of a class project, the fastest way to explore this effect was to identify the smallest-weighted elements per filter and clamp them to zero. That is, as the network trained through gradient descent on the backpropagation error, the pre-identified smallest weights that had been set to zero were kept at zero.

Figure 5. Showing the system

Figure 6. Sparsification loop

The system, shown in Figure 6, is explained in the steps below; a minimal code sketch of the clamping follows the list. Note that the sparsification schedule, also shown in Figure 6, was chosen to be [0%, 5%, 10%, 15%] across the four training iterations.

1. Initialize networks

(a) Train Age network (training data: 6000 female faces; testing data: 1000 male faces)

(b) Initialize Gender network with the first few convolutional layers of the Age network

(c) Train Gender network (training data: 3000 male faces and 3000 female faces; testing data: 1000 faces of both genders)

2. For training age network:

(a) Copy the trained core layers from the Gender network back into the Age network

(b) Identify the smallest-magnitude x% of weights per filter

(c) In each step of gradient descent during training, clamp those weights to zero

Figure 7. Testing results on an all-male image set based on varying the training data from all-male (blue line) to all-female (purple line). The legend format is N_male,training / N_female,training / N_male,testing / N_female,testing.

3. For training gender network:

(a) Copy the trained core layers from the Age network back into the Gender network

(b) Identify the smallest-magnitude x% of weights per filter

(c) In each step of gradient descent during training, clamp those weights to zero

4. Repeat steps (2) and (3) for 4 iterations.
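Below is a minimal NumPy sketch of the per-filter clamping described in steps 2(b)-(c) and 3(b)-(c). It is not our Caffe implementation, which hooks into the solver loop; the array shapes are illustrative, and only the schedule values are taken from the text above.

```python
import numpy as np

# Sparsification schedule across the four alternating training rounds
SCHEDULE = [0.00, 0.05, 0.10, 0.15]

def make_sparsity_mask(weights, fraction):
    """Return a boolean mask that is False for the smallest-magnitude
    `fraction` of weights in each filter (axis 0 indexes output filters)."""
    mask = np.ones_like(weights, dtype=bool)
    for i, filt in enumerate(weights):
        k = int(fraction * filt.size)
        if k == 0:
            continue
        smallest = np.argpartition(np.abs(filt).ravel(), k - 1)[:k]
        filt_mask = np.ones(filt.size, dtype=bool)
        filt_mask[smallest] = False
        mask[i] = filt_mask.reshape(filt.shape)
    return mask

def clamp(weights, mask):
    """Zero out the masked weights; applied after every gradient-descent
    step so the pruned weights stay at zero."""
    weights *= mask
    return weights

# Example: round 2 of the schedule on a conv layer of 32 filters of shape 3x5x5
w = np.random.randn(32, 3, 5, 5)
m = make_sparsity_mask(w, SCHEDULE[2])
w = clamp(w, m)
```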

2.3. Noise

In an implementation very similar to the sparsification process, progressive amounts of noise are added. In this case, the progressive noise parameter is a function of the statistics of the weights per filter: a 20% noise parameter means a Gaussian distribution centered at the mean of the filter's weights with a variance of 20% of the mean weight. The progressive noise addition follows the same steps as sparsification, except that noise is added at every step of the gradient descent.
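A minimal sketch of the per-filter noise injection, under the same illustrative assumptions as the sparsification sketch (NumPy arrays standing in for Caffe blobs); how the sampled noise combines with the existing weights is our reading of the description above.

```python
import numpy as np

def add_filter_noise(weights, noise_fraction, rng):
    """Add noise to each filter, drawn from a Gaussian centered at the
    filter's mean weight with variance `noise_fraction` times that mean."""
    noisy = weights.copy()
    for i, filt in enumerate(weights):
        mu = filt.mean()
        sigma = np.sqrt(abs(noise_fraction * mu))
        noisy[i] = filt + rng.normal(loc=mu, scale=sigma, size=filt.shape)
    return noisy

# Example: a 20% noise parameter on a conv layer of 32 filters of shape 3x5x5
rng = np.random.default_rng(0)
w = np.random.randn(32, 3, 5, 5)
w_noisy = add_filter_noise(w, 0.20, rng)
```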

3. Results

The summary table of relevant experiments can be seen in Table 1. Below is an explanation of the results.

3.1. Diversity of Data

The Age classification network was trained with varying proportions of male and female faces and tested on an all-male corpus; the results are plotted in Figure 7. Performance proved very sensitive to including even a few male faces: the marginal improvement from adding the first 1500 male faces was bigger than the effect of adding the next 4500 combined.
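For concreteness, the varying training splits can be built with a sketch like the one below; the annotation tuple format and the helper name are illustrative and do not reflect the actual Adience fold files.

```python
import random

def build_split(annotations, n_male, n_female, seed=0):
    """Sample a split with the requested number of male and female faces.
    `annotations` is assumed to be a list of (image_path, gender, age)."""
    rng = random.Random(seed)
    males = [a for a in annotations if a[1] == 'm']
    females = [a for a in annotations if a[1] == 'f']
    return rng.sample(males, n_male) + rng.sample(females, n_female)

# Example: the most skewed Age-network split used in Figure 7
# train_set = build_split(annotations, n_male=0, n_female=6000)
```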

3.2. Sharing layers of the Core between tasks

To highlight the maximal effect of learning from multiple tasks and data, we trained the Age network only on women and tested it on men, whereas the Gender network was trained on an even split of men and women. The hypothesis is that we should see an improvement in the Age network as it benefits from the representations learned by the Gender network. As can be seen in Figure 8, the testing performance of the shared network architecture shows better classification for the Age network, and the slope of the testing accuracy indicates that it would have continued to improve with more iterations.

Figure 8. The left column of plots is for the Age network, the right column is for the Gender network, showing the blue training loss and the red testing accuracy across different shared core architectures. The top row is independent, i.e., the Age network and Gender network are completely separate. The bottom row shows the benefit of sharing two of the three convolutional layers.

3.3. Sparsification

To confirm that Caffe was indeed performing as expected, we set the sparsification to 100%, zeroing out every weight in the first two layers of the network. As expected, this made the network constantly predict the same classification label. We then explored the effect of differing levels of sparsity. Figure 9 shows the impact of the sparsification event, and also the rapid rate of recovery for the network. This is of particular interest given that the training parameters of the network, consistent across all experiments, are such that the learning rate decreases by an order of magnitude every 2000 iterations.
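For reference, that schedule corresponds to a step-decay policy; a one-line Python equivalent is below, where the base rate of 1e-3 is an illustrative value rather than the one from our solver configuration.

```python
def learning_rate(iteration, base_lr=1e-3, gamma=0.1, stepsize=2000):
    """Step decay: drop the learning rate by 10x every `stepsize` iterations."""
    return base_lr * gamma ** (iteration // stepsize)

# learning_rate(0) -> 1e-3, learning_rate(2000) -> 1e-4, learning_rate(4000) -> 1e-5
```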

3.4. Adding noise

Adding noise to the first three convolutional layers, while sharing the first two convolutional layers between the two tasks, showed a slight increase in performance. The results can be seen in Figure 10.

Figure 9. The effect of progressively sparsifying a network as the system alternates between training the Age and Gender networks. In this case, every 2000 iterations the training is paused for one network and continued for the other. When a network resumes training, the system chooses an additional set of weights to sparsify.

Figure 10. The effect of adding noise to the system at every step of the gradient descent.

Network                   Age Test Acc.   Gender Test Acc.
Independent               41%             91%
Share2                    44%             90%
Share3                    41%             90%
Share2+sparsity2          35%             88%
Share2+noise3             44%             90%
Share2+noise3+sparse2     32%             86%

Table 1. Summary of results: shared layers with progressive noise yielded the best result.

4. Discussion and Conclusion

This work is a first step towards more focused exploration of the relevant parameters discovered through this project. On one hand, most results here are not entirely surprising: sparsification may not hurt testing performance, but it does not necessarily add to it; adding a bit of noise improves the stochastic gradient descent; and a shared representation between two tasks learning on two different sets of data can demonstrably improve results. While we see some signal pointing towards the benefit of noise and shared representation, more work would be necessary to demonstrate statistical significance.

But this work uncovered a lot of subtlety for future exploration. Could an optimized sparsity algorithm help the network generalize to novel data? Could an additional task have nothing to do with images but still show a benefit to the age classification task, a step toward the Learning Using Privileged Information paradigm that to date has only been demonstrated with SVMs? What is the most biomimetic way of training the core network while performing two differing tasks?

Finally, and personally speaking, this was our first attempt at using a deep convolutional network to prototype a cognitive architecture. These two networks could be thought of as the fusiform face area (face detection) and the lateral occipital complex (shape detection); they are two networks that sit at roughly the same level of the cognitive processing hierarchy. An interesting path would be to explore incorporating hippocampal circuitry into these multi-part networks.

5. Acknowledgements

We would like to thank all the people involved in the MIT Media Lab's Special Seminar: Future AI for the stimulating discussions. We would also like to thank Aditya Khosla for his support with access to powerful computing machines and his deep knowledge of Caffe.

References

[1] White et al., "Noise from voltage-gated ion channels may influence neuronal dynamics in the entorhinal cortex," J. Neurophysiol., 80(1):262-269, July 1998.

[2] Blakemore, "The social brain in adolescence," Nature Reviews Neuroscience, 9:267-277, April 2008. doi:10.1038/nrn2353.

[3] Rougier, N.P., Noelle, D., Braver, T.S., Cohen, J.D., and O'Reilly, R.C., "Prefrontal Cortex and the Flexibility of Cognitive Control: Rules Without Symbols," Proceedings of the National Academy of Sciences, 102:7338-7343, 2005.

[4] Gil Levi and Tal Hassner, "Age and Gender Classification using Convolutional Neural Networks," IEEE Workshop on Analysis and Modeling of Faces and Gestures (AMFG), at the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, June 2015.

[5] E. Eidinger, R. Enbar, and T. Hassner, "Age and Gender Estimation of Unfiltered Faces." Dataset: http://www.openu.ac.il/home/hassner/Adience/data.html#agegender

[6] Blakemore, Colin, and Grahame F. Cooper, "Development of the brain depends on the visual environment," (1970): 477-478.

[7] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," J. Mach. Learn. Res., 2014.

[8] M. L. Seltzer and J. Droppo, "Multi-task learning in deep neural networks for improved phoneme recognition," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, 2013, pp. 6965-6969.

[9] R. Collobert and J. Weston, "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning," International Conference on Machine Learning (ICML), 2008.

[10] Jaderberg, Max, Andrea Vedaldi, and Andrew Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.

[11] Yann Le Cun, John S. Denker, and Sara A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems 2, David S. Touretzky (Ed.), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990, pp. 598-605.
