![Page 1: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/1.jpg)
Deep Learning, differentiable programming, and software 2.0
(or white is the new black?) Mounir Boukadoum, UQAM, Dep. CS
Ruslan Salakhutdinov
Soumith Chintala
Chris Olah
Ian Goodfellow
Andrej Karpathy
Ilya Sutskever
Alex Krizhevsky
![Page 2: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/2.jpg)
Young fields [often] start in a very ad‐hoc manner. Later, the mature field is understood very differently … It seems quite likely that deep learning is in this ad‐hoc state.
Chris Olah, Google Brain
https://colah.github.io/posts/2015‐09‐NN‐Types‐FP/
Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. … In contrast, Software 2.0 is written in neural network weights. No human is involved in writing this code.
Andrej Karpathy, OpenAI
https://medium.com/@karpathy/software‐2‐0‐a64152b37c35
![Page 3: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/3.jpg)
Deep learning has enabled spectacular achievements in solving complex problems of perception and prediction, but…
[With Deep Neural Networks,] machine learning has become alchemy
Ali Rahimi, Google (talk at NIPS 2017)
https://www.youtube.com/watch?v=Qi1Yry33TQE
Success comes mostly from trial and error; is there a unifying theory behind current knowledge and practices?
![Page 4: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/4.jpg)
So, is there white-box deep learning? At least three ways to approach the issue
• Neuroscience: reproduction of human intelligence (biological analogies)
• Probabilities: inference from available data (latent variable manipulation)
• Data representations: transformations in manifolds? (differential calculus)
Currently, deep learning is mostly the third approach, using trial and error; could there be a white box model behind the black box appearance?
https://colah.github.io/posts/2015‐09‐NN‐Types‐FP/
![Page 5: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/5.jpg)
Artificial neural network (ANN) 101
Loose metaphor of biological neural networks
Interconnected neurons with similar computation types => computational graph
Neuron -> node with I/O edges
Synapse -> weighted connection
![Page 6: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/6.jpg)
A special type of graph
2‐bit adder with NAND gates ANN equivalent
But in ANNs:• The task is automatically learned from the data
– The neural weights and type(s) of neural outputs set the function
• There is generalization capacity and resilience to imprecision and fragmentary inputs
http://neuralnetworksanddeeplearning.com/chap1.html
Ackerman and Freer, arXiv:1703.09406
Many types of computational graphs exist
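The NAND adder and its ANN equivalent can be sketched in a few lines. This is a minimal illustration, assuming the construction from the cited neuralnetworksanddeeplearning.com chapter: a perceptron with weights (-2, -2) and bias 3 behaves as a NAND gate, and five of them add two bits. The function names are hypothetical, and the weights here are set by hand rather than learned.

```python
def nand_neuron(x1: int, x2: int) -> int:
    """Perceptron with weights (-2, -2) and bias 3: outputs 0 only when both inputs are 1."""
    return 1 if (-2 * x1 - 2 * x2 + 3) > 0 else 0


def add_bits(x1: int, x2: int) -> tuple:
    """Add two bits with five NAND neurons; returns (sum_bit, carry_bit)."""
    n1 = nand_neuron(x1, x2)
    n2 = nand_neuron(x1, n1)
    n3 = nand_neuron(x2, n1)
    sum_bit = nand_neuron(n2, n3)    # behaves as XOR(x1, x2)
    carry_bit = nand_neuron(n1, n1)  # behaves as AND(x1, x2)
    return sum_bit, carry_bit


for a in (0, 1):
    for b in (0, 1):
        s, c = add_bits(a, b)
        print(f"{a} + {b} -> sum={s}, carry={c}")
```

In a trained ANN the same graph would be kept and the weights and biases would come from the data instead.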
![Page 7: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/7.jpg)
Two fundamental topologies
Feedforward architectures are good for static problems, recurrent ones for dynamic/contextual problems (currently studied as “unfolded” feedforward architectures)
[Figure: neural network taxonomy, non-recurrent vs. recurrent (+ BSB, BAM, etc.)]
![Page 8: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/8.jpg)
Three ways to set the neural weights (learning)
All based on the available data:• Supervised learning: the data are labeled
• Unsupervised learning: the data are not labeled; labelling is done based on patterns/similarities (categorisation)
• Reinforcement learning: the data are not labeled; labelling is done based on the generated output value (expectation versus outcome)
![Page 9: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/9.jpg)
Generic two-step operation
1. Training (learning): patterns to learn -> training algorithm -> neural weights
Done in advance
• By programming (C++, Python, Lua, Java, etc.)
• Using a NN simulator (Matlab, SNNS, etc.)
Cross‐validation frequently used for consistent results!
2. Using: pattern to classify -> ANN (with the trained neural weights) -> corresponding output
![Page 10: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/10.jpg)
Multi-layer perceptron
Seminal architecture of deep learning
• 1‐2 hidden layers: shallow; more than two layers: deep
Essentially a projection operator: given a vector at the input, provides another vector at the output
Dynamic/contextual problems handled by recurrent networks that are unfolded
![Page 11: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/11.jpg)
MLP learning process
Builds a persistent and hierarchical representation of the data information
• Hidden layers progressively learn deeper intermediate representations
[Figure: features learned at Layer 1, Layer 2 and Layer 3; parts combine to form objects; high‐level linguistic representations]
Lee, Largman, Pham & Ng, NIPS 2009; Lee, Grosse, Ranganath & Ng, ICML 2009
![Page 12: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/12.jpg)
MLP learning details
Supervised; tries to minimize the average difference between the labels of a training set and the network's outputs for the same inputs
• Minimization of the average squared error, expressed as a function of the neural weights: $E = f(\mathbf{w})$
The intuitive way, solving $\nabla E = 0$, doesn’t work (it requires knowing the data statistics!); a metaheuristic is used instead, with $E$ estimated from individual training samples (stochastic gradient descent)
• Many variants exist
In any case, the process requires differentiable error functions!
![Page 13: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/13.jpg)
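Stochastic gradient descent
$E(\mathbf{w} + \Delta\mathbf{w}) \approx E(\mathbf{w}) + \nabla E \cdot \Delta\mathbf{w}$ for small $\Delta\mathbf{w}$
Therefore, $\Delta E \approx \nabla E \cdot \Delta\mathbf{w}$
If $\mathbf{w}$ is evolved in the opposite direction to $\nabla E$ for each learning trial, then $\Delta\mathbf{w} = -\eta \nabla E$ (with a small step size $\eta > 0$) and $\Delta E \approx -\eta \lVert \nabla E \rVert^2 \le 0$
=> E decreases monotonically!
Only the input and final output of the network are known at each training trial; those of the hidden layers must also be determined
=> Error backpropagation algorithm (based on the chain rule for derivatives)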
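The descent argument can be checked numerically. The sketch below is an illustration under stated assumptions: the error surface is a hypothetical two-weight quadratic (not a neural network), the gradient is written by hand, and the step size 0.1 is arbitrary. It is plain (full) gradient descent; the stochastic variant would estimate the gradient from one training sample at a time.

```python
def error(w):
    """Hypothetical error surface E(w) with its minimum at w = (1, -2)."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2


def gradient(w):
    """Analytic gradient of the error surface above."""
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)]


def descend(w, eta=0.1, steps=25):
    """Move w against the gradient and record E after every step."""
    history = [error(w)]
    for _ in range(steps):
        g = gradient(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]  # delta_w = -eta * grad E
        history.append(error(w))
    return w, history


w_final, errors = descend([4.0, 3.0])
print(w_final)
print(errors[0], errors[-1])
```

With a small enough step size the recorded errors never increase, which is the behaviour the derivation predicts; a step size that is too large would break the small-step assumption.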
![Page 14: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/14.jpg)
The more layers, the deeper the learning (or so it seems)
2011: 25.8% error with shallow net
2012: 16.4% error with 8 layers
2014: 7.3% error with 19 layers
2015: 3.57% error with 152 layers
Double the human performance, but black box operation!
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". arXiv 2015.
ImageNet Large Scale Visual Recognition Competition (ILSVRC)
ImageNet: 1000 objects, 1.2 million images

| Competition | Network | Depth | Top-5 error (%) |
|---|---|---|---|
| ILSVRC’10 | | shallow | 28.2 |
| ILSVRC’11 | | shallow | 25.8 |
| ILSVRC’12 | AlexNet | 8 layers | 16.4 |
| ILSVRC’13 | | 8 layers | 11.7 |
| ILSVRC’14 | VGG | 19 layers | 7.3 |
| ILSVRC’14 | GoogLeNet | 22 layers | 6.7 |
| ILSVRC’15 | ResNet | 152 layers | 3.57 |
![Page 15: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/15.jpg)
In sum…
Deep learning is essentially many‐layer MLPs trained by error backpropagation with (mostly) no side effects
At least three technologies and extensions:
• Autoencoders and deep belief networks (unsupervised learning)
• Convolutional MLPs (supervised learning)
• Generative adversarial networks (supervised learning)
• Extensions (e.g., unfolded recurrent architectures)
No white‐box model yet!
Bengio (Montréal), Hinton (Toronto), LeCun (New York)
![Page 16: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/16.jpg)
Black box = fad?
We [must] think through artificial intelligence from foundational principles rather than from the empirics of past data
Martin Reeves & Mihnea Moldoveanu, Scientific American, Sep. 2017
Starting May 25, [2018,] the European Union will require algorithms to explain their output, making deep learning illegal.
Reported by Pedro Domingos, U. Washington Seattle, Jan. 2018
What if there is a white box waiting to be uncovered, after all?
![Page 17: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/17.jpg)
Back to representations…
Each layer processes the output of its predecessor to create a new data representation (function composition!)
If all the nodes are differentiable, task training by error backpropagation is feasible!
Could this be the start of a white box ANN formalism?
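Function composition of layers can be made concrete with a small sketch. It assumes sigmoid layers of the form σ(Wx + b); the helper names `layer` and `compose` and all weight values are hypothetical, chosen only for illustration.

```python
import math
from functools import reduce


def layer(weights, biases):
    """Return the function x -> sigmoid(Wx + b) for one fully connected layer."""
    def apply(x):
        return [
            1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
            for row, b in zip(weights, biases)
        ]
    return apply


def compose(*functions):
    """compose(f, g)(x) == g(f(x)): functions are applied left to right."""
    return lambda x: reduce(lambda value, f: f(value), functions, x)


hidden = layer([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0])  # 2 inputs -> 2 hidden units
output = layer([[1.0, -1.0]], [0.5])                     # 2 hidden units -> 1 output
mlp = compose(hidden, output)

print(mlp([1.0, 2.0]))
```

Each layer only sees the representation produced by its predecessor, and since every layer is differentiable, so is the composite.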
![Page 18: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/18.jpg)
The functional programming connection Three main ANN characteristics:
• Function composition→ output based on embedded transformations
• End‐to‐end differentiability→ optimization
• Weight‐tying→ sub‐network reusability
Can it be that deep learning is just functional programming with reusable blocks, configured by error backpropagation training?
How so?
![Page 19: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/19.jpg)
Ng et al., proc. ICML 09, pp 609‐616
Transfer learning
![Page 20: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/20.jpg)
Current ANN models
Rectangle = vector; arrow = function. (a) fixed-sized input to fixed-sized output (e.g., image classification); (b) Sequence output (e.g., image captioning); (c) Sequence input (e.g., sentiment analysis); (d) sequence to sequence (e.g., translation); (e) sync’ed sequence to sequence (e.g., video frame tagging). Green layer length is arbitrary, being the result of unfolding a recurrent architecture.
http://karpathy.github.io/2015/05/21/rnn‐effectiveness/
[Figure: panels a) to e), each stacking Input, Hidden/State and Output layers]
![Page 21: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/21.jpg)
How about memory?
Long Short‐Term Memory (LSTM) adds a neural structure that enables storing, retrieving or erasing the neural state based on context rather than sequentially
[Figure: LSTM, a special neuron with 1 input, 3 controls and 1 output: a memory cell with an input gate (input control), a forget gate (forget control) and an output gate (output control)]
The control signals typically come from perceptrons
Gated Recurrent Unit (GRU) is a close relative
But the LSTM and GRU access mechanisms are not differentiable!
![Page 22: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/22.jpg)
Making memory access differentiable
Necessary for learning where to write and read
Not obvious as memory addresses are fundamentally discrete
How about reading and writing everywhere, just to different extents?
• Approach taken in Neural Turing Machines and several other recent models
https://distill.pub/2016/augmented‐rnns/
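Reading and writing "everywhere, to different extents" can be sketched with a weight per memory slot. This is a simplified illustration, not the Neural Turing Machine's actual update (which uses separate erase and add vectors): the write below is a hypothetical convex blend, and the memory holds scalars for brevity.

```python
def soft_read(memory, weights):
    """Read from every slot at once: a weighted sum instead of a single address."""
    return sum(w * m for w, m in zip(weights, memory))


def soft_write(memory, weights, value):
    """Write to every slot at once: each slot moves toward value in proportion to its weight."""
    return [m + w * (value - m) for w, m in zip(weights, memory)]


memory = [0.0, 0.0, 0.0, 0.0]
focus = [0.05, 0.85, 0.05, 0.05]  # hypothetical attention weights, summing to 1

memory = soft_write(memory, focus, 10.0)
recalled = soft_read(memory, focus)
print(memory, recalled)
```

Because both operations are smooth functions of the weights, gradients can flow back to whatever network produced `focus`, so the network can learn where to read and write.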
![Page 23: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/23.jpg)
Making memory differentiable
The idea is to link the memory states to an attention mechanism:
Given a memory context $c_j$ and a sequence of memory items $h_i$, $i = 1..n$:
• A “distance” $a_{ij} = f(h_i, c_j)$ can be defined for each pair $(h_i, c_j)$ (f can be implemented with a basic feed‐forward network, making it part of the overall ANN)
• The relative weight (attention) of each $h_i$ with respect to $c_j$ is then $\alpha_i = \exp(a_{ij}) / \sum_{k=1}^{n} \exp(a_{kj})$
and a composite attention of all $h_i$ with respect to $c_j$ can be defined as $c = \sum_{i=1}^{n} \alpha_i h_i$
$c_j$ is no longer associated with a single item $h_i$, and the steps to distribute it across the whole memory are all differentiable!
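The attention computation can be sketched as follows. The scoring function is an assumption: a plain dot product stands in for f(hi, cj), which the slide says could be a small feed-forward network. The names `score` and `attend` are hypothetical.

```python
import math


def score(h, c):
    """Hypothetical stand-in for f(h_i, c_j): a dot product between item and context."""
    return sum(a * b for a, b in zip(h, c))


def attend(items, context):
    """Return the softmax attention weights and the weighted blend of all items."""
    scores = [score(h, context) for h in items]
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(items[0])
    blended = [sum(a * h[d] for a, h in zip(alphas, items)) for d in range(dim)]
    return alphas, blended


items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, blended = attend(items, [2.0, 0.0])
print(alphas, blended)
```

The weights always sum to one, so the result is a soft selection over the whole memory rather than a hard lookup of one item.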
![Page 24: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/24.jpg)
ANNs as functional graphs
MLPs, CNNs and RNNs are all expressible as graphs where the nodes perform layer computations and the arcs represent layer interconnections
Given differentiable nodes, end‐to‐end graph training by error backpropagation is possible
Two major gains in doing so:
• General purpose computation systems that are automatically configurable for desired outcomes!
• White box modeling through functional similarities and abstractions
http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/ https://pseudoprofound.wordpress.com/2016/08/03/differentiable‐programming/
![Page 25: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/25.jpg)
Functional similarities
Weight‐tying (multiple reuse of the same neuron as in CNNs and RNNs) resembles function abstraction
Structural patterns of composition resemble higher‐order functions (e.g., map, fold, unfold, zip)
![Page 26: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/26.jpg)
Encoding Recurrent Neural Networks are folds
fold = Encoding RNN (Haskell: foldl a)
Generating Recurrent Neural Networks are unfolds
unfold = Generating RNN (Haskell: unfoldr a s)
http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/
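The fold and unfold patterns can be illustrated in Python, with `functools.reduce` playing the role of a left fold. The cell functions below are hypothetical arithmetic stand-ins for a learned RNN cell; only the wiring pattern is the point.

```python
from functools import reduce


def cell(state, x):
    """Hypothetical shared cell: new state from the old state and one input."""
    return 0.5 * state + x


def emit(state):
    """Hypothetical generating cell: returns (output, next_state)."""
    return state, 0.5 * state


def encode(inputs, initial_state=0.0):
    """Fold: consume a whole sequence and keep only the final state."""
    return reduce(cell, inputs, initial_state)


def generate(state, steps):
    """Unfold: produce a sequence from a single starting state."""
    outputs = []
    for _ in range(steps):
        output, state = emit(state)
        outputs.append(output)
    return outputs


summary = encode([1.0, 2.0, 3.0])
sequence = generate(summary, 3)
print(summary, sequence)
```

The same `cell` is reused at every step, which is the weight-tying that makes the RNN a fold over one function rather than a chain of unrelated layers.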
![Page 27: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/27.jpg)
General Recurrent Neural Networks are accumulating maps.
Accumulating Map = RNN (Haskell: mapAccumR a s)
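An accumulating map threads a state through the sequence like a fold while also emitting one output per element like a map. A minimal sketch, assuming a hypothetical cell that returns a (new_state, output) pair; note that it walks the list left to right, whereas Haskell's mapAccumR walks right to left.

```python
def map_accum(cell, state, inputs):
    """Thread a state through inputs, collecting one output per element."""
    outputs = []
    for x in inputs:
        state, y = cell(state, x)
        outputs.append(y)
    return state, outputs


def running_cell(state, x):
    """Hypothetical cell: the state is a running sum, the output is twice the state."""
    new_state = state + x
    return new_state, 2 * new_state


final_state, outputs = map_accum(running_cell, 0, [1, 2, 3])
print(final_state, outputs)
```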
![Page 28: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/28.jpg)
Convolutional Neural Networks are a close relative of map.
Windowed Map = Convolutional Layer (Haskell: zipWith a xs (tail xs))
Two Dimensional Convolutional Network
http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/
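The windowed map is the one-dimensional, window-of-two case: the same neuron is applied to every pair of neighbours. The sketch mirrors `zipWith a xs (tail xs)`; the neuron is a hypothetical one with fixed weights (-1, 1), not a learned filter.

```python
def window_map(neuron, xs):
    """Apply the same two-input neuron to every pair of adjacent elements."""
    return [neuron(a, b) for a, b in zip(xs, xs[1:])]


def edge_neuron(a, b):
    """Hypothetical shared weights (-1, 1): responds to changes between neighbours."""
    return b - a


signal = [1, 1, 4, 4, 2]
print(window_map(edge_neuron, signal))
```

A real convolutional layer uses wider windows, several learned filters and usually two spatial dimensions, but the reuse of one function across all positions is the same.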
![Page 29: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/29.jpg)
Recursive Neural Networks (“TreeNets”) are catamorphisms, a generalization of folds.
Catamorphism = TreeNet (Haskell: cata a)
http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/
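A catamorphism folds a tree bottom-up with one shared combining function. The sketch assumes binary trees encoded as nested 2-tuples with leaves as plain values; `embed` and `combine` are hypothetical stand-ins for a learned leaf embedding and a learned merge neuron.

```python
def cata(combine, embed, tree):
    """Collapse a binary tree bottom-up, reusing the same combine function at every node."""
    if isinstance(tree, tuple):
        left, right = tree
        return combine(cata(combine, embed, left), cata(combine, embed, right))
    return embed(tree)


def embed(word):
    """Hypothetical leaf embedding: the word length."""
    return float(len(word))


def combine(left, right):
    """Hypothetical merge: the average of the two child representations."""
    return 0.5 * (left + right)


parse_tree = (("not", "bad"), "film")
print(cata(combine, embed, parse_tree))
```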
![Page 30: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/30.jpg)
Examples of building block combinations
English to French translation by combining an encoding RNN and a generating RNN, to essentially perform a fold followed by an unfold (Sutskever, et al. (2014)).
Image captions with a convolutional network and a generating RNN. The CNN does feature detection and the generating RNN unfolds the resulting vector into a description sentence (Vinyals, et al. (2014)).
![Page 31: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/31.jpg)
Functional Names of Common Layers

| Deep Learning Name | Functional Name |
|---|---|
| Learned Vector | Constant |
| Embedding Layer | List Indexing |
| Encoding RNN | Fold |
| Generating RNN | Unfold |
| General RNN | Accumulating Map |
| Bidirectional RNN | Zipped Left/Right Accumulating Maps |
| Conv Layer | “Window Map” |
| TreeNet | Catamorphism |
| Inverse TreeNet | Anamorphism |
http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/
![Page 32: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/32.jpg)
Creating differentiable functional graphs
Make algorithmic elements continuous and differentiable
[Figure: NTM on copy task (Graves et al. 2014)]
Create/implement a functional language where all primitives are differentiable and expressible in neural form (save basic arithmetic operations), so that we have: y = f(x) = σ(Wx + b)
Structural models already exist (Neural Turing Machine; Stack‐augmented RNN; Stack, queue, deque); what is missing is the neural programming language
Adapted from http://www.cs.nuim.ie/~gunes/files/Baydin‐MSR‐Slides‐20160201.pdf
![Page 33: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/33.jpg)
Basic differentiable structures based on y = f(x) = σ(Wx + b)
Functional expressions (no mutable data inside), as in declarative languages (Lisp, Haskell, Erlang, etc.)

    function h(x)
        return f(g(x))
    endfunction

Neural form: h(x) = σ(W1 σ(W2x + b2) + b1)

    function f(x, a)
        if x > 1.0
            return a + 1
        else
            return a
        endif
    endfunction

Neural form: +(x, y) = σ(Wx + W′y + b) and f(x, a) = if(x, 1.0, +(a, 1.0), a)
Needed language constructs
Differentiable if, implemented with a TreeNet neural network
https://pseudoprofound.wordpress.com/2016/08/03/differentiable‐programming/
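One common way to make a conditional differentiable is to replace the hard branch with a smooth blend of both branches. The sketch below is an assumption for illustration: it uses a sigmoid gate with a hypothetical `sharpness` parameter, and it is not the TreeNet construction the slide refers to.

```python
import math


def soft_if(x, threshold, then_value, else_value, sharpness=10.0):
    """Smooth stand-in for `if x > threshold`: blend both branches with a sigmoid gate."""
    gate = 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))
    return gate * then_value + (1.0 - gate) * else_value


def f(x, a):
    """Soft version of: if x > 1.0 then a + 1 else a."""
    return soft_if(x, 1.0, a + 1.0, a)


print(f(3.0, 5.0))   # close to 6
print(f(-1.0, 5.0))  # close to 5
print(f(1.0, 5.0))   # halfway between the two branches
```

Far from the threshold the soft version agrees with the hard one; near it, the output varies smoothly, which is what lets gradients reach the condition itself.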
![Page 34: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/34.jpg)
Functional language constructs
Primitive functions f: T -> S to carry out the basic σ(Wx + b) building blocks, with W and b learned from the data.
Mechanism to create composite functions from primitive functions, e.g., mlp(x) = f(g(x))
Higher‐level functions that take functions as inputs, generate functions as outputs, or both
Memory constructs (lists? Monads?)
=> λ‐calculus!
λ‐calculus syntax
All expressions are of the form:
e ::= x // variable
  | λx.e1 // function definition
  | e1 e2 // function application
  | (e1) // disambiguation
http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/
![Page 35: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/35.jpg)
Examples of higher-order functions
map(Fun, List)
• Applies Fun to each element of List, returning a list of results that may be of a different type
filter(Pred, List)
• Returns a sublist of List that contains the elements of List that satisfy the predicate Pred
foldl(Fun, Acc, List)
• Calls Fun on each successive element of List together with the accumulator, starting with Acc, and returns a final accumulator of the same type as Acc
Etc.
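For readers more familiar with Python, the three functions above have direct counterparts: the built-ins `map` and `filter`, and `functools.reduce` with an initial value for `foldl`. The sample data is arbitrary.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# map: the result type (str) differs from the element type (int)
labels = list(map(lambda n: f"item-{n}", numbers))

# filter: keep only the elements that satisfy the predicate
evens = list(filter(lambda n: n % 2 == 0, numbers))

# foldl: combine the accumulator with each element in turn, starting from 0
total = reduce(lambda acc, n: acc + n, numbers, 0)

print(labels, evens, total)
```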
![Page 36: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/36.jpg)
More higher-order functions
all(Pred, List) any(Pred, List) takewhile(Pred, List)
dropwhile(Pred, List) flatten(DeepList) flatmap(Fun, List)
foreach(Fun, List) partition(Pred,List) zip(List1,List2)
unzip(List) …
![Page 37: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/37.jpg)
Software is dead, long live software?
Current software is imperative (sequence of instructions, each one imparting a behaviour to a point in program space)
• But for most real‐world problems, it is easier to state desired behaviour (e.g., via input‐output examples) than to write executable code
V2.0 would be declarative: the “programmer” specifies the outcome and a composition of neural building blocks is searched for to provide it
• Deep learning searches in continuous manifolds (for dimensionality reduction and to make gradient descent possible)
Software development should switch from writing programs, maintaining repositories and doing run‐time analysis to collecting, analyzing and preparing data for a neural network
![Page 38: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/38.jpg)
From classical to differentiable machines
Classical program: Sequence of executable instructions to perform a specified task
Differentiable program: Sequence of problem domain declarations on how to perform a specified task
• Functional blocks for white box operation
• Differentiable nodes for auto‐configuration by error backpropagation learning
![Page 39: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/39.jpg)
How about existing frameworks?
Currently two types of computational graphs:
Symbolic
• Typical representatives: Theano, TensorFlow, CGT
• Fine‐grained
• Graph analysis and optimizations
Modular
• Typical representatives: Torch, Caffe
• Coarse‐grained
• Manually designed modules
Similarities
• Model definition using a (constrained) symbolic language
• Automatic handling of backpropagation in the final model (no need to code derivatives along)
(Kenneth Tran. “Evaluation of Deep Learning Toolkits”. https://github.com/zer0n/deepframeworks)
![Page 40: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/40.jpg)
You are limited to symbolic graph building, with the mini‐language
For example, instead of this in pure Python (for y = A^k):
[Figure: code listing]
You build this symbolic graph:
[Figure: code listing]
But no direct functional building as such
http://deeplearning.net/software/theano/library/scan.html
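The pure Python side of the comparison can be sketched as below, assuming A is a vector raised elementwise to the power k as in the linked scan documentation. The second function is an assumption added for contrast: it expresses the same loop as a fold, which is the "direct functional building" that the symbolic mini-language does not offer. The symbolic Theano version is not reproduced here.

```python
from functools import reduce


def power_loop(A, k):
    """Imperative version: multiply an accumulator by A, k times, elementwise."""
    result = [1.0] * len(A)
    for _ in range(k):
        result = [r * a for r, a in zip(result, A)]
    return result


def power_fold(A, k):
    """Functional version: the same computation written as a fold over k steps."""
    def step(prior, _):
        return [p * a for p, a in zip(prior, A)]
    return reduce(step, range(k), [1.0] * len(A))


print(power_loop([1.0, 2.0, 3.0], 3))
print(power_fold([1.0, 2.0, 3.0], 3))
```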
![Page 41: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/41.jpg)
Current efforts
Neural programmers (a bit similar to genetic programming)
Neural Programmer‐Interpreters (with by‐example supervision)
Neural Turing Machines
DiffSharp (high‐order differentiation)
Autograd (automatic differentiation of NumPy and Python code)
DNNGraph (Haskell model to Caffe and Torch scripts)
Etc.
All in the last couple of years, but is gradient descent really necessary?
How about copying biology?
![Page 42: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/42.jpg)
A biologically-inspired neural building block
![Page 43: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/43.jpg)
A loop-based neural architecture
Gisiger & Boukadoum, Neural networks, 2018
![Page 44: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/44.jpg)
Delayed-response task (DRT)
Tests the ability to respond to stimuli based on short‐term memory
Three major steps, repeated over a number of trials:
• Cue: sensory information to retain (e.g., image, dot on a screen, auditory stimulus)
• Delay: the cue is withdrawn for an arbitrary delay
• Response: cue‐related action (e.g., identify a cue image in a set, or point to the location where the dot initially appeared)
Although seemingly simple, the task requires complex mental processing:
1. Sensing the cue information, say a visual representation (VR);
2. committing the cue information to short‐term memory;
3. protecting it from interference by external and internal distractions;
4. using the information stored in working memory to produce the correct motor response (PM);
5. discarding this information at the end of the trial in preparation for the next one (Reset).
![Page 45: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/45.jpg)
Implementing DRT with a loop-based network
![Page 46: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/46.jpg)
LSTM perspective
![Page 47: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/47.jpg)
![Page 48: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/48.jpg)
Many obstacles remain
Need for more parallel processing and better energy efficiency
• Both at the hardware and software level
Need for training with less data
Lift the algebraically expressible data restriction (vectors, matrices, tensors…)
Gradient descent learning is convex optimization; non‐convex techniques have not been studied due to apparent NP‐hardness
Serious side effects!
![Page 49: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/49.jpg)
Noise effects (and hacker opportunities!)
http://arxiv.org/pdf/1312.6199v4.pdf
https://codewords.recurse.com/issues/five/why‐do‐neural‐networks‐think‐a‐panda‐is‐a‐vulture
https://medium.com/@ageitgey/machine‐learning‐is‐fun‐part‐8‐how‐to‐intentionally‐trick‐neural‐networks‐b55da32b7196
![Page 50: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/50.jpg)
The algorithm is deceptively simple
1. Feed in the photo to hack
2. Get the neural network’s prediction and see how far off it is from the target answer
3. Tweak the photo using back‐propagation to make the prediction closer to the target answer
4. Repeat steps 1–3 with the same photo until the network gives us the answer we want
Adding an imperceptibly small vector whose elements have the same sign as the gradient of the cost function with respect to the input can drastically change the image classification.
https://arxiv.org/abs/1412.6572
[Figure: x (“panda”, 57.7% confidence) + 0.007 × sign(∇x J(θ, x, y)) (“nematode”, 8.2% confidence) = x + ε sign(∇x J(θ, x, y)) (“gibbon”, 99.3% confidence)]
https://medium.com/@ageitgey/machine‐learning‐is‐fun‐part‐8‐how‐to‐intentionally‐trick‐neural‐networks‐b55da32b7196
Sometimes, it doesn’t work!
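The sign-of-the-gradient perturbation can be reproduced on a toy model. Everything below is a hypothetical stand-in: a four-input logistic classifier with hand-picked weights replaces the image network, the input replaces the photo, and ε = 0.1 is far larger than the 0.007 of the panda example because the toy input has only four "pixels".

```python
import math

WEIGHTS = [2.0, -3.0, 1.5, -0.5]  # hypothetical "trained" weights
BIAS = 0.0


def predict(x):
    """Probability of class 1 under a logistic model."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient_wrt_input(x, y):
    """Gradient of the cross-entropy loss J with respect to the input: (p - y) * w."""
    p = predict(x)
    return [(p - y) * w for w in WEIGHTS]


def sign(g):
    return (g > 0) - (g < 0)


def fgsm(x, y, epsilon):
    """Step each input component by epsilon in the direction that increases the loss."""
    grad = loss_gradient_wrt_input(x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


x = [0.3, 0.1, 0.2, 0.4]          # classified as class 1
adversarial = fgsm(x, 1.0, 0.1)   # true label y = 1
print(predict(x), predict(adversarial))
```

No component moves by more than ε, yet the predicted class flips, because every component is nudged in the direction that hurts the model most.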
![Page 51: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/51.jpg)
Overfitting
https://ml.berkeley.edu/blog/2017/07/13/tutorial‐4/
1 = 1
2 = 3
3 = 5
4 = 7
5 = 217341
$f(x) = 9055.5x^4 - 90555x^3 + 316942.5x^2 - 452773x + 217331$
Answer: 217341!
The consequences can be disastrous!
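The overfitting example can be verified directly. The sketch reads the five numbers on the slide as the coefficients of a quartic with alternating signs (an assumption, since the signs are not visible in the transcript, though it is the only choice that reproduces the four given points) and compares it with the simple rule 2x - 1.

```python
def quartic(x):
    """Degree-4 polynomial through all five points, evaluated with Horner's rule."""
    return (((9055.5 * x - 90555.0) * x + 316942.5) * x - 452773.0) * x + 217331.0


def simple_rule(x):
    """The rule most people would infer from the first four points."""
    return 2 * x - 1


for x in range(1, 6):
    print(x, quartic(x), simple_rule(x))
```

Both models fit the first four points exactly; only the fifth reveals that the more flexible model has learned something very different from the underlying pattern.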
![Page 52: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/52.jpg)
Data order
• Capture of invariant “spatial motifs” possible
22 1A a@a 1 aa a1.a 123 aa1
33 2B b@b 2 bb b2.b 234 bb2
44 3C c@c 3 cc c3.c 345 cc3
55 4D d@d 4 dd d4.d 456 dd4
66 5E e@e 5 ee e5.e 567 ee5
77 6F f@f 6 ff f6.f 678 ff6
88 7G g@g 7 gg g7.g 789 gg7
99 8H h@h 8 hh h8.h 890 hh8
111 9I i@i 9 ii i9.i 901 ii9
• Capture of invariant “spatial motifs” doubtful if the row or column order is arbitrary
![Page 53: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/53.jpg)
In summary…
Efforts are under way to make white the new black
Until then, deep learning remains a black box, and neural network parameter tuning an art
Currently, the choice is between 80‐90% accurate, non‐DL models that we understand, or 99% accurate DL models that we don’t!
![Page 54: Deep Learning and Differentiable programmingdic.uqam.ca/upload/files/seminaires/Deep Learning and Differentiabl… · principles rather than from the empirics of past data Martin](https://reader034.vdocuments.us/reader034/viewer/2022050507/5f9848829e4e48561319d9fb/html5/thumbnails/54.jpg)