
MATHEMATICS OF DEEP LEARNING

RAJA GIRYES

TEL AVIV UNIVERSITY

Mathematics of Deep Learning Tutorial

EUSIPCO Conference, Budapest, Hungary

August 29, 2016

AGENDA

• History and introduction to deep learning.

• A sample of existing theory for deep learning.

• Data structure based theory for deep learning.

• Neural networks with random Gaussian weights.

• Generalization error of deep neural networks.

• Deep learning as metric learning.

• Solving minimization problems via deep learning.


HISTORY AND INTRODUCTION TO DEEP LEARNING

FIRST LEARNING PROGRAM


1956

“field of study that gives computers the ability to learn without being explicitly programmed”. [Arthur Samuel, 1959]


IMITATING THE BRAIN

Wiesel and Hubel, 1959

HUMAN VISUAL SYSTEM


In the visual cortex there are two types of neurons: simple and complex.

IMITATING THE HUMAN BRAIN

Fukushima 1980

CONVOLUTIONAL NEURAL NETWORKS

• Introduction of convolutional neural networks [LeCun et al., 1989]

• Training by backpropagation

SCENE PARSING

• Deep learning usage before 2012: scene parsing [Farabet et al., 2012, 2013]

2012 IMAGENET DEEP LEARNING BREAKTHROUGH

• Imagenet dataset

• 1.4 Million images

• 1000 categories

• 1.2 Million for training

• 150000 for testing

• 50000 for validation

Today deep learning achieves 3.5% error with 152 layers [He, Zhang, Ren & Sun, 2016]

[Krizhevsky, Sutskever & Hinton, 2012]

DEEP NETWORK STRUCTURE

• What does each layer of the network learn?

[Krizhevsky, Sutskever & Hinton, 2012]


LAYERS STRUCTURE

• First layers detect simple patterns that correspond to simple objects.

[Zeiler & Fergus, 2014]

LAYERS STRUCTURE

• Deeper layers detect more complex patterns corresponding to more complex objects.

[Zeiler & Fergus, 2014]

LAYERS STRUCTURE

[Zeiler & Fergus, 2014]

WHY DO THINGS WORK BETTER TODAY?

• More data – larger datasets, more access (internet)

• Better hardware (GPU)

• Better learning regularization (dropout)

• Deep learning's impact and success are not unique to image classification.

• But it is still unclear why deep neural networks are so remarkably successful and how they are doing it.


DEEP LEARNING FOR SPEECH RECOGNITION


OBJECT DETECTION

[Szegedy et al., 2015]

GAME PLAYING

[Mnih et al., 2013, 2015]

GO GAME

• AlphaGo - the first computer program to ever beat a professional player at the game of Go

• Program created by Google DeepMind

• Game strategy was learned using deep learning [Silver et al., 2016].


CUTTING EDGE PERFORMANCE IN MANY OTHER APPLICATIONS

• Disease diagnosis [Zhou, Greenspan & Shen, 2016].

• Language translation [Sutskever et al., 2014].

• Video classification [Karpathy et al., 2014].

• Face recognition [Schroff et al., 2015].

• Handwriting recognition [Poznanski & Wolf, 2016].

• Sentiment classification [Socher et al., 2013].

• Image denoising [Burger et al., 2012].

• Super-resolution [Kim et al., 2016], [Bruna et al., 2016].

• and many other applications…

DEEP NEURAL NETWORKS (DNN)

• One layer of a neural net

• Concatenation of the layers creates the whole net

Φ(X_1, X_2, …, X_K) = ψ( ψ( ⋯ ψ(V X_1) X_2 ⋯ ) X_K )

A single layer maps the input V ∈ ℝ^d to ψ(V X) ∈ ℝ^m, where X is a linear operation and ψ is a non-linear function.

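To make the composition concrete, here is a minimal numpy sketch of such a forward pass; the layer widths, the ReLU choice and the 1/√d scaling are illustrative assumptions of the sketch, not taken from the slides.

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def dnn_forward(V, Xs, psi=relu):
    # Phi(X_1, ..., X_K): psi( ... psi(psi(V X_1) X_2) ... X_K )
    out = V
    for X in Xs:
        out = psi(out @ X)
    return out

rng = np.random.default_rng(0)
dims = [10, 32, 16, 4]                     # input dimension d and layer output sizes (illustrative)
Xs = [rng.standard_normal((dims[i], dims[i + 1])) / np.sqrt(dims[i])
      for i in range(len(dims) - 1)]       # one linear operation X per layer
V = rng.standard_normal(dims[0])           # input V in R^d
print(dnn_forward(V, Xs).shape)            # (4,)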

CONVOLUTIONAL NEURAL NETWORKS (CNN)

• In many cases, 𝑋 is selected to be a convolution.

• This operator is shift invariant.

• CNNs are commonly used with images, since images are typically shift invariant.

A layer maps V ∈ ℝ^d to ψ(V X) ∈ ℝ^m, where X is a linear operation (here a convolution) and ψ is a non-linear function.

THE NON-LINEAR PART

• Usually 𝜓 = 𝑔 ∘ 𝑓.

• 𝑓 is the (point-wise) activation function

• 𝑔 is a pooling or an aggregation operator.

ReLU: f(x) = max(x, 0)
Sigmoid: f(x) = 1 / (1 + e^{-x})
Hyperbolic tangent: f(x) = tanh(x)

Max pooling: max_i V_i
Mean pooling: (1/n) ∑_{i=1}^{n} V_i
ℓ_p pooling: ( ∑_{i=1}^{n} |V_i|^p )^{1/p}
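A small numpy sketch of these components follows; the test vector and the pooling window size are illustrative assumptions.

import numpy as np

relu = lambda x: np.maximum(x, 0.0)           # f(x) = max(x, 0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))  # f(x) = 1 / (1 + e^{-x})
tanh = np.tanh                                # f(x) = tanh(x)

def max_pool(v, n):
    return v.reshape(-1, n).max(axis=1)       # max_i V_i over windows of size n

def mean_pool(v, n):
    return v.reshape(-1, n).mean(axis=1)      # (1/n) sum_i V_i

def lp_pool(v, n, p=2):
    return (np.abs(v.reshape(-1, n)) ** p).sum(axis=1) ** (1.0 / p)

v = np.array([-1.0, 2.0, 0.5, -3.0, 4.0, 1.0])
print(relu(v), max_pool(relu(v), 3), lp_pool(v, 3, p=2))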

A SAMPLE OF EXISTING THEORY FOR DEEP LEARNING

WHY DO DNNs WORK?

• What is so special about the DNN structure?

• What is the role of the depth of the DNN?

• What is the role of pooling?

• What is the role of the activation function?

• How many training samples do we need?

• What is the capability of DNNs?

• What happens to the data throughout the layers?

REPRESENTATION POWER

• Neural nets serve as universal approximators for Borel measurable functions [Cybenko 1989, Hornik 1991].

• In particular, let the non-linearity ψ be a bounded, non-constant continuous function, I_d be the d-dimensional hypercube, and C(I_d) be the space of continuous functions on I_d. Then for any f ∈ C(I_d) and ε > 0, there exist m > 0, X ∈ ℝ^{d×m}, B ∈ ℝ^m and W ∈ ℝ^m such that the neural network

F(V) = ψ(V X + B) W^T

approximates f with precision ε:

|F(V) − f(V)| < ε,  ∀V ∈ I_d.
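The theorem only asserts existence of such X, B and W. As a minimal numerical illustration (the target function, the sampling of I_d and the least-squares fit of the outer weights W are assumptions of this sketch, not the construction in the theorem), one can draw a random sigmoid hidden layer and fit W by least squares:

import numpy as np

rng = np.random.default_rng(0)
d, m = 1, 200                                 # input dimension and hidden width
f = lambda v: np.sin(3 * v) + 0.5 * v ** 2    # target function on I_d = [0, 1]
V = rng.uniform(0, 1, size=(5000, d))         # samples from the hypercube

X = rng.standard_normal((d, m)) * 5.0         # hidden weights
B = rng.uniform(-5, 5, size=m)                # hidden biases
psi = lambda t: 1.0 / (1.0 + np.exp(-t))      # bounded, non-constant non-linearity

H = psi(V @ X + B)                            # psi(V X + B), shape (n, m)
W, *_ = np.linalg.lstsq(H, f(V[:, 0]), rcond=None)   # fit the outer weights

V_test = np.linspace(0, 1, 1000).reshape(-1, 1)
err = np.max(np.abs(psi(V_test @ X + B) @ W - f(V_test[:, 0])))
print(f"max approximation error on I_d: {err:.4f}")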

ESTIMATION ERROR

• The estimation error of a function f by a neural network scales as [Barron 1994]

O(C_f / N) + O((N d / L) · log L)

where C_f measures the smoothness of the approximated function, N is the number of neurons in the DNN, L is the number of training examples, and d is the input dimension.

DEPTH OF THE NETWORK

• DNN allow representing restricted Boltzmann machines with a number of parameters exponentially greater than the number of the network parameters [Montúfar & Morton, 2015]

• Each DNN layer with ReLU divides the space by a hyper-plane.

• Therefore the depth of the network divides the space into a number of sets that is exponential in the number of parameters [Montúfar, Pascanu, Cho & Bengio, 2014].

DEPTH EFFICIENCY OF CNN

• A function realized by a CNN of polynomial size, with ReLU and max-pooling, requires super-polynomial size to be approximated by a shallow network [Telgarsky, 2016], [Cohen et al., 2016].

• The standard convolutional network design has a learning bias towards the statistics of natural images [Cohen et al., 2016].

ROLE OF POOLING

• The pooling stage provides shift invariance [Boureau et al. 2010], [Bruna, LeCun & Szlam, 2013].

• A connection is drawn between the pooling stage and the phase retrieval methods [Bruna, Szlam & LeCun, 2014].

• This allows calculating Lipschitz constants of each DNN layer ψ(· X) and empirically recovering the input of a layer from its output.

• However, the Lipschitz constants calculated are very loose and no theoretical guarantees are given for the recovery.

SUFFICIENT STATISTIC AND INVARIANCE

• Given a certain task at hand:

• Minimal sufficient statistic guarantees that we can replace raw data with a representation with smallest complexity and no performance loss.

• Invariance guarantees that the statistic is constant with respect to uninformative transformations of the data.

• CNNs are shown to have these properties for many tasks [Soatto & Chiuso, 2016].

• Good structures of deep networks can generate representations that are good for learning with a small number of examples [Anselmi et al., 2016].

SCATTERING TRANSFORMS

• Scattering transform - a cascade of wavelet transform convolutions with nonlinear modulus and averaging operators.

• Scattering coefficients are stable encodings of geometry and texture [Bruna & Mallat, 2013]

[Figure: original image with d pixels; recovery from the first scattering moments (O(log d) coefficients); recovery from the 1st & 2nd scattering moments (O(log² d) coefficients). Images from slides of Joan Bruna in the ICCV 2015 tutorial.]

SCATTERING TRANSFORMS AND DNN

• More layers create features that can be made invariant to increasingly more complex deformations.

• Deep layers in DNN encode complex, class-specific geometry.

• Deeper architectures are able to better capture invariant properties of objects and scenes in images [Bruna & Mallat, 2013], [Wiatowski & Bölcskei, 2016].

SCATTERING TRANSFORMS AS A METRIC

• Scattering transforms may be used as a metric.

• Inverse problems can be solved by minimizing the distance in the scattering transform domain.

• This leads to remarkable results in super-resolution [Bruna, Sprechmann & LeCun, 2016].

SCATTERING SUPER RESOLUTION

[Figure: original image, best linear estimate, state-of-the-art method, and scattering estimate. Images from slides of Joan Bruna in the CVPR 2016 tutorial.]

[Bruna, Sprechmann & LeCun, 2016]

MINIMIZATION

• The local minima in deep networks are not far from the global minimum.

• Saddle points are the main problem of deep learning optimization.

• Deeper networks have more local minima but less saddle points. [Saxe, McClelland & Ganguli, 2014], [Dauphin, Pascanu, Gulcehre, Cho, Ganguli & Bengio, 2014] [Choromanska, Henaff, Mathieu, Ben Arous & LeCun, 2015]

[Choromanska et al., 2015]

GLOBAL OPTIMALITY IN DEEP LEARNING

• Deep learning is a positively homogeneous factorization problem, i.e., ∃p ≥ 0 such that ∀α ≥ 0 DNN obey

Φ(αX_1, αX_2, …, αX_K) = α^p Φ(X_1, X_2, …, X_K).

• With proper regularization, local minima are global.

• If the network is large enough, global minima can be found by local descent.

[Figure: guarantees of the proposed framework.]

[Haeffele & Vidal, 2015]

DATA STRUCTURE BASED THEORY FOR DEEP LEARNING

OUTLINE

• DNN keep the important information of the data.

• Gaussian mean width is a good measure for the complexity of the data.

• Random Gaussian weights are good for classifying the average points in the data.

• Important goal of training: classify the boundary points between the different classes in the data.

• Generalization error depends on the DNN input margin.

• Deep learning can be viewed as metric learning.

• DNN may solve optimization problems.

ASSUMPTIONS – GAUSSIAN WEIGHTS

• Infusion of random weights reveals internal properties of a system.

• X_1, …, X_i, …, X_K are random Gaussian matrices.

• Random matrices are also central in compressed sensing, phase retrieval and sparse recovery, and in deep learning theory [Saxe et al. 2014], [Dauphin et al. 2014], [Choromanska et al. 2015], [Arora et al. 2014].

ASSUMPTIONS – NO POOLING

• Pooling provides invariance [Boureau et al. 2010, Bruna et al. 2013].

• We assume that all equivalent points in the data were merged together, and omit this stage.

• This reveals the role of the other components in the DNN.

• ψ is an element-wise activation function, e.g., ReLU max(v, 0), sigmoid 1/(1 + e^{-v}), or tanh(v).

ASSUMPTIONS – LOW DIMENSIONAL DATA

• Υ is a low dimensional set and the input V ∈ Υ.

• Examples: Gaussian mixture models (GMM), signals with sparse representations, low dimensional manifolds, low rank matrices.

Gaussian Mean Width

WHAT HAPPENS TO SPARSE DATA IN DNN?

• Let Υ be sparsely represented data.

• Example: Υ = {V ∈ ℝ³ : ‖V‖₀ ≤ 1}.

• ΥX is still sparsely represented data.

• Example: ΥX = {V ∈ ℝ³ : ∃W ∈ ℝ³, V = W X, ‖W‖₀ ≤ 1}.

• ψ(ΥX) is not sparsely represented, but it is still low dimensional.

GAUSSIAN MEAN WIDTH

• Gaussian mean width: ω(Υ) = E[ sup_{V,W ∈ Υ} ⟨V − W, g⟩ ],  g ~ N(0, I).

• ⟨V − W, g⟩ measures the width of the set Υ in the direction of g.
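A simple Monte Carlo sketch of this quantity; the k-sparse example set, the finite sampling of Υ and the sample sizes are assumptions of the illustration (the supremum is replaced by a maximum over the sampled points, so the estimate is a lower bound).

import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 3
n_points, n_trials = 2000, 200

# a finite sample of Upsilon: unit-norm k-sparse vectors in R^d
pts = np.zeros((n_points, d))
for i in range(n_points):
    idx = rng.choice(d, size=k, replace=False)
    v = rng.standard_normal(k)
    pts[i, idx] = v / np.linalg.norm(v)

# omega(Upsilon) = E sup_{V,W} <V - W, g>  ~  E [ max_V <V, g> - min_W <W, g> ]
widths = []
for _ in range(n_trials):
    g = rng.standard_normal(d)
    proj = pts @ g
    widths.append(proj.max() - proj.min())

print("estimated omega :", np.mean(widths))
print("sqrt(k log d)   :", np.sqrt(k * np.log(d)))   # the O(sqrt(k log d)) scaling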

MEASURE FOR LOW DIMENSIONALITY

• Gaussian mean width: ω(Υ) = E[ sup_{V,W ∈ Υ} ⟨V − W, g⟩ ],  g ~ N(0, I).

• ω²(Υ) is a measure for the dimensionality of the data.

• Examples:

• If Υ ⊂ 𝔹^d is a Gaussian mixture model with k Gaussians, then ω²(Υ) = O(k).

• If Υ ⊂ 𝔹^d is data with k-sparse representations, then ω²(Υ) = O(k log d).

GAUSSIAN MEAN WIDTH IN DNN

Theorem 1: If ω²(Υ)/m is small, then ω²(ψ(ΥX)) ≈ ω²(Υ).

• A layer maps Υ ⊂ ℝ^d to ψ(ΥX) ⊂ ℝ^m, where X is a linear operation and ψ is a non-linear function.

• A small ω²(Υ) at the input therefore implies a small ω²(ψ(ΥX)) at the output.

• It is thus sufficient to provide proofs only for a single layer.

Stability

ASSUMPTIONS

• X is a random Gaussian matrix.

• ψ is an element-wise activation function: ReLU max(v, 0), sigmoid 1/(1 + e^{-v}), or tanh(v).

• V ∈ Υ, where Υ is a subset of the unit sphere 𝕊^{d−1}.

• The layer output dimension satisfies m = O(δ^{-6} ω²(Υ)).

ISOMETRY IN A SINGLE LAYER

Theorem 2: 𝜓(∙ 𝑋) is a 𝛿-isometry in the Gromov-Hausdorff sense between the sphere 𝕊𝑑−1 and the Hamming cube [Plan & Vershynin, 2014, Giryes, Sapiro & Bronstein 2016].

• The rows of X create a tessellation of the space; this stands in line with [Montúfar et al. 2014].

• If two points belong to the same tile, then their distance is < δ.

• Each layer of the network keeps the main information of the data.

• This structure can be used for hashing.

DNN AND HASHING

• A single layer performs locality-sensitive hashing.

• Deep network with random weights may be designed to do better [Choromanska et al., 2016].

• It is possible to train DNN for hashing, which provides cutting-edge results [Masci et al., 2012], [Lai et al., 2015].

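A minimal sketch of this idea (the construction below is a standard random-hyperplane hash, used here as an assumed stand-in for a single random layer, not as the slides' exact scheme): the sign pattern of V X identifies the tile of the tessellation that V falls in, and nearby points on the sphere collide in most bits.

import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 256                       # input dimension, number of random hyperplanes

X = rng.standard_normal((d, m))      # rows of X tessellate the sphere

def hash_code(v):
    # which side of each hyperplane the point falls on (its "tile")
    return (v @ X > 0).astype(np.uint8)

v = rng.standard_normal(d); v /= np.linalg.norm(v)
w_near = v + 0.05 * rng.standard_normal(d); w_near /= np.linalg.norm(w_near)
w_far = rng.standard_normal(d); w_far /= np.linalg.norm(w_far)

ham = lambda a, b: int(np.count_nonzero(a != b))
print("bits differing (near point):", ham(hash_code(v), hash_code(w_near)))
print("bits differing (far point) :", ham(hash_code(v), hash_code(w_far)))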

DNN STABLE EMBEDDING

Theorem 3: There exists an algorithm 𝒜 such that

‖V − 𝒜(ψ(V X))‖ < O(ω(Υ)/√m) = O(δ³)

[Plan & Vershynin, 2013], [Giryes, Sapiro & Bronstein, 2016].

• After K layers we have an error O(K δ³).

• This stands in line with [Mahendran and Vedaldi, 2015].

• DNN keep the important information of the data.

RECOVERY FROM DNN OUTPUT

[Mahendran and Vedaldi, 2015]

DNN with Gaussian Weights

ASSUMPTIONS

• X is a random Gaussian matrix.

• ψ is the ReLU, max(v, 0).

• V ∈ Υ.

• m = O(δ^{-4} ω⁴(Υ)).

DISTANCE DISTORTION

Theorem 4: for V, W ∈ Υ,

‖ψ(V X) − ψ(W X)‖² ≈ ½‖V − W‖² − (‖V‖‖W‖/π)( sin∠(V, W) − ∠(V, W) cos∠(V, W) ),

up to a distortion of at most δ.

• The smaller ∠(V, W), the smaller the distance we get between the points.
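A quick numerical check of the distance formula above, under the stated assumptions (random Gaussian X, ReLU ψ); the 1/√m scaling of X and the test pairs are assumptions of this sketch, chosen so that output and input distances are directly comparable.

import numpy as np

rng = np.random.default_rng(0)
d, m = 40, 20000                               # large m so the average behavior shows

def predicted_sq_dist(v, w):
    nv, nw = np.linalg.norm(v), np.linalg.norm(w)
    theta = np.arccos(np.clip(v @ w / (nv * nw), -1.0, 1.0))
    return 0.5 * np.linalg.norm(v - w) ** 2 - (nv * nw / np.pi) * (np.sin(theta) - theta * np.cos(theta))

X = rng.standard_normal((d, m)) / np.sqrt(m)   # random Gaussian layer
relu = lambda t: np.maximum(t, 0.0)

for angle in [0.1, 0.5, 1.5, 3.0]:             # pairs with increasing input angle
    v = np.zeros(d); v[0] = 1.0
    w = np.zeros(d); w[0], w[1] = np.cos(angle), np.sin(angle)
    emp = np.linalg.norm(relu(v @ X) - relu(w @ X)) ** 2
    print(f"angle={angle:.1f}  empirical={emp:.3f}  predicted={predicted_sq_dist(v, w):.3f}")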

ANGLE DISTORTION

Theorem 5: for V, W ∈ Υ,

| cos∠(ψ(V X), ψ(W X)) − cos∠(V, W) − (1/π)( sin∠(V, W) − ∠(V, W) cos∠(V, W) ) | ≤ δ.

[Figure: behavior of ∠(ψ(V X), ψ(W X)) as a function of ∠(V, W).]

DISTANCE AND ANGLES DISTORTION

Points with small angles between them become closer than points with larger angles between them


POOLING AND CONVOLUTIONS

• We test this behavior empirically on convolutional neural networks (CNN) with random weights and the MNIST, CIFAR-10 and ImageNet datasets.

• The behavior predicted in the theorems also holds in the presence of pooling and convolutions.

TRAINING DATA SIZE

• Stability in the network implies that close points in the input are close also at the output

• Having a good network for an ε-net of the input set Υ guarantees a good network for all the points in Υ.

• Using Sudakov minoration, the required number of data points is exp(ω²(Υ)/ε²).

• Though this is not a tight bound, it introduces the Gaussian mean width ω(Υ) as a measure for the complexity of the input data and the required number of training samples.

Role of Training

ROLE OF TRAINING

• Having a theory for Gaussian weights, we test the behavior of DNN after training.

• We looked at the MNIST, CIFAR-10 and ImageNet datasets.

• We will present here only the ImageNet results.

• We use a state-of-the-art pre-trained network for ImageNet [Simonyan & Zisserman, 2014].

• We compute inter- and intra-class distances.

INTER BOUNDARY POINTS DISTANCE RATIO

• V is a random point and W is its closest point from a different class.

• V̂ is the output of V, and Ẑ is the closest point to V̂ at the output from a different class.

• Compute the distance ratio ‖V̂ − Ẑ‖ / ‖W − V‖.

INTRA BOUNDARY POINTS DISTANCE RATIO

• Let V be a point and W its farthest point from the same class.

• Let V̂ be the output of V, and Ẑ the farthest point from V̂ at the output from the same class.

• Compute the distance ratio ‖V̂ − Ẑ‖ / ‖W − V‖.

BOUNDARY DISTANCE RATIO

[Figure: histograms of the ratio ‖V̂ − Ẑ‖ / ‖W − V‖ for inter-class and intra-class boundary points.]

AVERAGE POINTS DISTANCE RATIO

• V, W and Z are three random points.

• V̂, Ŵ and Ẑ are the outputs of V, W and Z, respectively.

• Compute the distance ratios ‖V̂ − Ŵ‖ / ‖V − W‖ and ‖V̂ − Ẑ‖ / ‖V − Z‖.

AVERAGE DISTANCE RATIO

[Figure: histograms of the ratios ‖V̂ − Ŵ‖ / ‖V − W‖ and ‖V̂ − Ẑ‖ / ‖V − Z‖ for inter-class and intra-class pairs.]

ROLE OF TRAINING

• On average, distances are preserved in both the trained and the random networks.

• The difference is with respect to the boundary points:

• The inter-class distances become larger.

• The intra-class distances shrink.

Generalization Error

ASSUMPTIONS

• ψ is a general non-linearity (ReLU, pooling, …).

• The last layer is a softmax / linear classifier.

• Two classes; the decision boundary in the feature space is w^T Φ(X_1, X_2, …, X_K) = 0.

• The input V ∈ Υ.

CLASSIFIER TYPES

• Denote the output of the DNN by 𝑍.

• A linear classifier W^T is of the form Z W^T ≷ b, where b is a certain threshold.

• A softmax classifier predicts the probability of class i:

σ(Z)_i = e^{Z_i} / (e^{Z_1} + e^{Z_2}).

CLASSIFICATION OBJECTIVES

• Denote the output of the DNN by 𝑍.

• Denote by 𝑡𝑖 the expected output of 𝑍𝑖

• Categorical cross entropy: −∑_i t_i log(Z_i).

• Hinge loss: max(0, 1 − Z_i t_i).

• Weight decay: a penalty on the weight matrices, ∑_i ‖X_i‖_F².

GENERALIZATION ERROR (GE)

• In training, we reduce the classification error ℓ_training of the training data as the number of training examples L increases.

• However, we are interested in reducing the error ℓ_test of the (unknown) testing data as L increases.

• The difference between the two is the generalization error

GE = ℓ_test − ℓ_training.

• It is important to understand the GE of DNN.

REGULARIZATION TECHNIQUES

• Weight decay – penalizing DNN weights [Krogh & Hertz, 1992].

• Dropout - randomly drop units (along with their connections) from the neural network during training [Hinton et al., 2012], [Baldi & Sadowski, 2013], [Srivastava et al., 2014].

• DropConnect – dropout extension [Wan et al., 2013]

• Batch normalization [Ioffe & Szegedy, 2015].

• Stochastic gradient descent (SGD) [Hardt, Recht & Singer, 2016].

• Path-SGD [Neyshabur et al., 2015].

• And more [Rifai et al., 2011], [Salimans & Kingma, 2016], [Sun et al, 2016].


A SAMPLE OF GE BOUNDS

• Using the VC dimension it can be shown that

GE ≤ O( √( #DNN params · log L / L ) )

[Shalev-Shwartz and Ben-David, 2014].

• The GE was also bounded using the DNN weights:

GE ≤ 2^K ‖w‖₂ ∏_i ‖X_i‖_{2,2} / √L

[Neyshabur et al., 2015].


• Note that in both cases the GE grows with the depth


DNN INPUT MARGIN

• Theorem 6: If for every training sample V_i the input margin satisfies γ_in(V_i) > γ, then GE ≤ √( N_{γ/2}(Υ) / L ).

• 𝑁𝛾/2(Υ) is the covering number of the data Υ.

• 𝑁𝛾/2(Υ) gets smaller as 𝛾 gets larger.

• Bound is independent of depth.

• Our theory relies on the robustness framework [Xu & Mannor, 2012].

[Sokolic, Giryes, Sapiro, Rodrigues, 2016]

INPUT MARGIN BOUND

• Maximizing the input margin directly is hard

• Our strategy: relate the input margin to the output margin γ_out(V_i) and other DNN properties.

• Theorem 7:

γ_in(V_i) ≥ γ_out(V_i) / sup_{V∈Υ} ‖(V/‖V‖₂) J(V)‖₂ ≥ γ_out(V_i) / ∏_{1≤i≤K} ‖X_i‖₂ ≥ γ_out(V_i) / ∏_{1≤i≤K} ‖X_i‖_F,

where J(V) is the Jacobian of the DNN at point V.

[Sokolic, Giryes, Sapiro, Rodrigues, 2016]

OUTPUT MARGIN

• Theorem 7: γ_in(V_i) ≥ γ_out(V_i) / sup_{V∈Υ} ‖(V/‖V‖₂) J(V)‖₂ ≥ γ_out(V_i) / ∏_{1≤i≤K} ‖X_i‖₂ ≥ γ_out(V_i) / ∏_{1≤i≤K} ‖X_i‖_F.

• The output margin is easier to maximize – an SVM problem.

• It is maximized by many cost functions, e.g., the hinge loss.

GE AND WEIGHT DECAY

• Theorem 7: γ_in(V_i) ≥ γ_out(V_i) / sup_{V∈Υ} ‖(V/‖V‖₂) J(V)‖₂ ≥ γ_out(V_i) / ∏_{1≤i≤K} ‖X_i‖₂ ≥ γ_out(V_i) / ∏_{1≤i≤K} ‖X_i‖_F.

• Bounding the weights increases the input margin.

• Weight decay regularization therefore decreases the GE.

• Related to the regularization used by [Haeffele & Vidal, 2015].

JACOBIAN BASED REGULARIZATION

• Theorem 7: γ_in(V_i) ≥ γ_out(V_i) / sup_{V∈Υ} ‖(V/‖V‖₂) J(V)‖₂ ≥ γ_out(V_i) / ∏_{1≤i≤K} ‖X_i‖₂ ≥ γ_out(V_i) / ∏_{1≤i≤K} ‖X_i‖_F.

• J(V) is the Jacobian of the DNN at point V.

• J(·) is piecewise constant.

• Using the Jacobian of the DNN leads to a better (tighter) bound.

• This suggests a new, Jacobian based regularization technique.
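An illustrative sketch of the quantities in Theorem 7 for a toy two-layer ReLU network with a two-class linear readout; the layer sizes, the weight scaling and the two-class margin computation are assumptions of this example, and it shows the bound chain rather than an actual training regularizer.

import numpy as np

rng = np.random.default_rng(0)
d, h, c = 20, 30, 2
X1 = rng.standard_normal((d, h)) / np.sqrt(d)
X2 = rng.standard_normal((h, c)) / np.sqrt(h)
relu = lambda t: np.maximum(t, 0.0)

def forward(v):
    return relu(v @ X1) @ X2

def jacobian(v):
    # For a ReLU net the Jacobian is piecewise constant: the product of the
    # weight matrices with the inactive hidden units masked out.
    mask = (v @ X1 > 0).astype(float)
    return (X1 * mask) @ X2                       # shape (d, c)

v = rng.standard_normal(d)
z = forward(v)
gamma_out = abs(z[0] - z[1]) / np.sqrt(2)         # output distance to the boundary z_0 = z_1
J = jacobian(v)
sup_J = np.linalg.norm(J, 2)                      # spectral norm of the Jacobian
prod_norms = np.linalg.norm(X1, 2) * np.linalg.norm(X2, 2)
print("gamma_in lower bounds:", gamma_out / sup_J, ">=", gamma_out / prod_norms)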

RESULTS

• Better performance with fewer training samples.

• CCE: the categorical cross entropy.

• WD: weight decay regularization.

• LM: Jacobian based regularization for large margin.

• Note that hinge loss generalizes better than CCE and that LM is better than WD as predicted by our theory.

MNIST Dataset

[Sokolic, Giryes, Sapiro, Rodrigues, 2016]


DNN as Metric Learning

ASSUMPTIONS

• X is fully connected and trained.

• ψ is the hyperbolic tangent.

• The network has two layers: V ∈ ℝ^d ↦ ψ(ψ(V X_1) X_2).

METRIC LEARNING BASED TRAINING

• Cosine objective:

min_{X_1,X_2} ∑_{i,j ∈ training set} ( ⟨V̂_i, V̂_j⟩ / (‖V̂_i‖ ‖V̂_j‖) − ϑ_{i,j} )²,

where V̂_i denotes the network output for V_i and

ϑ_{i,j} = λ + (1 − λ) ⟨V_i, V_j⟩ / (‖V_i‖ ‖V_j‖)  if i, j are in the same class (metric preservation term),

ϑ_{i,j} = −1  if i, j are in different classes (classification term).
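A direct numpy transcription of this objective follows; it is a sketch under the reconstruction above, assuming V_hat holds the network outputs and averaging instead of summing over the pairs.

import numpy as np

def cosine_metric_loss(V, V_hat, labels, lam=0.5):
    # Same-class pairs: target lam + (1 - lam) * cos(V_i, V_j)  (metric preservation)
    # Different-class pairs: target -1                          (classification)
    def cos_matrix(A):
        A = A / np.linalg.norm(A, axis=1, keepdims=True)
        return A @ A.T

    cos_in, cos_out = cos_matrix(V), cos_matrix(V_hat)
    same = labels[:, None] == labels[None, :]
    target = np.where(same, lam + (1 - lam) * cos_in, -1.0)
    return np.mean((cos_out - target) ** 2)

# tiny usage example with random data standing in for inputs and network outputs
rng = np.random.default_rng(0)
V = rng.standard_normal((8, 5))
V_hat = rng.standard_normal((8, 3))
labels = np.array([0, 0, 1, 1, 0, 1, 0, 1])
print(cosine_metric_loss(V, V_hat, labels))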

METRIC LEARNING BASED TRAINING

• Euclidean objective:

min_{X_1,X_2}  (λ/|training set|) ∑_{i,j ∈ training set} l_ij ( ‖V̂_i − V̂_j‖ − t_ij )_+   (classification term)
             + ((1 − λ)/|neighbours|) ∑_{V_i, V_j neighbours} | ‖V̂_i − V̂_j‖ − ‖V_i − V_j‖ |   (metric learning term),

where

l_ij = 1 if i, j are in the same class and −1 otherwise,

t_ij = the average intra-class distance if i, j are in the same class, and the average inter-class distance otherwise.

ROBUSTNESS OF THIS NETWORK

• Metric learning objectives impose stability

• Similar to what we have in the random case

• Close points at the input are close at the output

• Using the theory of (T, ε)-robustness [Xu & Mannor, 2012], the generalization error scales as √(T/L),

where T is the covering number and L = |training set|.

• Also here, the number of training samples scales as exp(ω²(Υ)/ε²).

RESULTS

• Better performance with fewer training samples (MNIST dataset):

#Training/class    30        50        70        100
original pixels    81.91%    86.18%    86.86%    88.49%
LeNet              87.51%    89.89%    91.24%    92.75%
Proposed 1         92.32%    94.45%    95.67%    96.19%
Proposed 2         94.14%    95.20%    96.05%    96.21%

[Figure: ROC curve on the Faces in the Wild dataset.]

[Huang, Qiu, Sapiro, Calderbank, 2015]

Minimization by DNN

ASSUMPTIONS

• The observation is V = Z A + E, where V ∈ ℝ^d, A is a linear operator, E is noise, and Z ∈ Υ.

• ψ is a projection onto Υ.

• The estimator iterates Ẑ⁺ = ψ(V X + Ẑ S) with linear operators X and S, producing Ẑ, an estimate of Z.

ℓ0-MINIMIZATION

• Iterative hard thresholding (IHT) algorithm [Blumensath & Davies, 2009]: Ẑ⁺ = ψ( μ V Aᵀ + Ẑ (I − μ A Aᵀ) ), where μ is the step size.

• ψ is the hard thresholding operation: it keeps the largest k entries.

• Z is a k-sparse vector and V = Z A + E.

• The iterations produce a k-sparse estimate of Z, aiming at solving min_Ẑ ‖V − Ẑ A‖₂² s.t. ‖Ẑ‖₀ ≤ k.
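A compact numpy sketch of the IHT iteration in the row-vector convention used above; the problem sizes, the noise level and the step-size choice are assumptions of the example.

import numpy as np

def hard_threshold(z, k):
    # psi: keep the k largest-magnitude entries, zero the rest
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

def iht(V, A, k, mu, n_iter=300):
    Z = np.zeros(A.shape[0])                      # Z is a row vector, V = Z A + E
    M = np.eye(A.shape[0]) - mu * A @ A.T
    for _ in range(n_iter):
        Z = hard_threshold(mu * V @ A.T + Z @ M, k)
    return Z

rng = np.random.default_rng(0)
d, m, k = 100, 40, 4
A = rng.standard_normal((d, m))
Z_true = np.zeros(d); Z_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
V = Z_true @ A + 0.01 * rng.standard_normal(m)

mu = 1.0 / np.linalg.norm(A, 2) ** 2              # step size with 1/mu >= ||A||^2
Z_hat = iht(V, A, k, mu)
print("relative error:", np.linalg.norm(Z_hat - Z_true) / np.linalg.norm(Z_true))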

ℓ1-MINIMIZATION

• Projected gradient descent algorithm for ℓ₁ minimization: Ẑ⁺ = ψ( μ V Aᵀ + Ẑ (I − μ A Aᵀ) ), where μ is the step size.

• ψ projects onto the ℓ₁ ball {Ẑ : ‖Ẑ‖₁ ≤ R}.

• V = Z A + E; the estimate aims at solving min_Ẑ ‖V − Ẑ A‖₂² s.t. ‖Ẑ‖₁ ≤ R.

UNCONSTRAINED ℓ1-MINIMIZATION

• Iterative soft thresholding algorithm (ISTA) [Daubechies, Defrise & De Mol, 2004], [Beck & Teboulle, 2009]: Ẑ⁺ = ψ( μ V Aᵀ + Ẑ (I − μ A Aᵀ) ).

• ψ is the soft thresholding operation with threshold λμ.

• The step size μ obeys 1/μ ≥ ‖A‖².

• V = Z A + E; the iterations converge to the minimizer of min_Ẑ ‖V − Ẑ A‖₂² + λ‖Ẑ‖₁.

ISTA CONVERGENCE

• Reconstruction mean squared error (MSE) as a function of the number of iterations

[Figure: reconstruction MSE E‖Z − Ẑ_t‖ as a function of the number of iterations t.]

LEARNED ISTA (LISTA)

• Replace the fixed ISTA operators with learned linear operators X and S: Ẑ⁺ = ψ( V X + Ẑ S ), where ψ is the soft thresholding operation with threshold λ.

• V = Z A + E, Z ∈ Υ; the network outputs Ẑ, an estimate of Z.

[Gregor & LeCun, 2010]
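The following sketch shows the unrolled view: each layer applies Ẑ⁺ = ψ(V X + Ẑ S) with soft thresholding. Plugging in X = μAᵀ and S = I − μAAᵀ recovers plain ISTA, whereas in LISTA X, S and the threshold would instead be learned from data; the training loop is omitted and the problem sizes are assumptions of the example.

import numpy as np

def soft_threshold(z, thr):
    # psi: the proximal mapping of thr * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def unrolled_layers(V, X, S, thr, n_layers=300):
    Z = np.zeros(S.shape[0])
    for _ in range(n_layers):                 # each "layer" is one iteration
        Z = soft_threshold(V @ X + Z @ S, thr)
    return Z

rng = np.random.default_rng(0)
d, m, k = 100, 40, 4
A = rng.standard_normal((d, m))
Z_true = np.zeros(d); Z_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
V = Z_true @ A + 0.01 * rng.standard_normal(m)

mu = 1.0 / np.linalg.norm(A, 2) ** 2          # step size, 1/mu >= ||A||^2
X_ista, S_ista = mu * A.T, np.eye(d) - mu * A @ A.T
Z_hat = unrolled_layers(V, X_ista, S_ista, thr=0.05 * mu)   # threshold = lambda * mu
print("relative error:", np.linalg.norm(Z_hat - Z_true) / np.linalg.norm(Z_true))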

LISTA CONVERGENCE

• Replacing I − μAAᵀ and μAᵀ in ISTA with the learned S and X improves convergence [Gregor & LeCun, 2010].

• Extensions to other models [Sprechmann, Bronstein & Sapiro, 2015], [Remez, Litani & Bronstein, 2015], [Tompson, Schlachter, Sprechmann & Perlin, 2016].

[Figure: E‖Z − Ẑ_t‖ as a function of the number of iterations t, for ISTA and LISTA.]

PROJECTED GRADIENT DESCENT (PGD)

• Iteration: Ẑ⁺ = ψ( μ V Aᵀ + Ẑ (I − μ A Aᵀ) ), where μ is the step size.

• ψ projects onto the set Υ = {Ẑ : f(Ẑ) ≤ R}.

• V = Z A + E; the estimate aims at solving min_Ẑ ‖V − Ẑ A‖₂² s.t. f(Ẑ) ≤ R.

THEORY FOR PGD

• Theorem 8: Let Z ∈ ℝ^d, f: ℝ^d → ℝ a proper function with f(Z) ≤ R, C_f(x) the tangent cone of f at point x, A ∈ ℝ^{d×m} a random Gaussian matrix, and V = Z A + E. Then the estimate of PGD at iteration t, Ẑ_t, obeys

‖Ẑ_t − Z‖ ≤ κ_f ρ^t ‖Z‖,

where ρ = sup_{U,W ∈ C_f(x) ∩ ℬ^d} U (I − μ A Aᵀ) Wᵀ, and κ_f = 1 if f is convex and κ_f = 2 otherwise [Oymak, Recht & Soltanolkotabi, 2016].

PGD CONVERGENCE RATE

• ρ = sup_{U,W ∈ C_f(x) ∩ ℬ^d} U (I − μ A Aᵀ) Wᵀ is the convergence rate of PGD.

• Let ω be the Gaussian mean width of C_f(x) ∩ ℬ^d.

• If μ = 1/(√m + √d)² ≃ 1/d, then ρ = 1 − O( (√m − ω)/(√m + √d) ).

• If μ = 1/m, then ρ = O( ω/√m ).

• For the k-sparse model, ω² = O(k log d).

• For a GMM with k Gaussians, ω² = O(k).

• How may we cause ω to become smaller, in order to get a better convergence rate?

INACCURATE PROJECTION

• The PGD iterations project onto Υ = {Z : f(Z) ≤ R}.

• A smaller Υ ⇒ a smaller ω ⇒ faster convergence, as ρ = 1 − O((√m − ω)/(√m + √d)) or O(ω/√m).

• Let us assume that our signal belongs to a smaller set Υ̃ = {Z : f̃(Z) ≤ R̃} with ω̃ ≪ ω.

• Ideally, we would like to project onto Υ̃ instead of Υ; this will lead to faster convergence.

• What if such a projection is not feasible?

INACCURATE PROJECTION

• We will estimate the projection onto Υ̃ by a linear projection P, followed by a projection onto Υ.

• Assumptions:

• ‖P(Z) − Z‖ ≤ ε.

• ‖℘_{C_f̃(Z)}(U) − ℘_{C_f(Z P)}(U P)‖ ≤ ε, ∀U ∈ ℝ^d, where ℘_{C_f̃(Z)}(U) is the projection of U onto the tangent cone of f̃ at point Z, and ℘_{C_f(Z P)}(U P) is the projection of U P onto the tangent cone of f at point Z P.

INACCURATE PGD (IPGD)

• Iteration: Ẑ⁺ = ψ( μ V Aᵀ P + Ẑ (I − μ A Aᵀ) P ), where μ is the step size.

• ψ projects onto the set Υ = {Ẑ : f(Ẑ) ≤ R}.

• V = Z A + E, Z ∈ Υ̃; the estimate aims at solving min_Ẑ ‖V − Ẑ A‖₂² s.t. f(Ẑ) ≤ R.

THEORY FOR IPGD

• Theorem 9: Let Z ∈ ℝ^d, f: ℝ^d → ℝ a proper function with f(Z) ≤ R, C_f(x) the tangent cone of f at point x, A ∈ ℝ^{d×m} a random Gaussian matrix, and V = Z A + E. Then the estimate of IPGD at iteration t, Ẑ_t, obeys

‖Ẑ_t − Z‖ ≤ κ_f (ρ̃ + εγ)^t ‖Z‖ + [ (1 − κ_f (ρ̃ + εγ)^t) / (1 − κ_f (ρ̃ + εγ)) ] ε ‖Z‖,

where ρ̃ = sup_{U,W ∈ C_f̃(x) ∩ ℬ^d} U (I − μ A Aᵀ) Wᵀ, γ = ‖I − μ A Aᵀ‖, and κ_f is as in Theorem 8 [Giryes, Eldar, Bronstein & Sapiro, 2016].

CONVERGENCE RATE COMPARISON

• PGD convergence: κ_f ρ^t.

• IPGD convergence: κ_f (ρ̃ + εγ)^t + [ (1 − κ_f (ρ̃ + εγ)^t) / (1 − κ_f (ρ̃ + εγ)) ] ε ≃^(a) κ_f ρ̃^t + ε ≪^(b) κ_f ρ^t.

(a) Assuming that ε is negligible compared to ρ̃.

(b) For small values of t (early iterations).

• Faster convergence, since ρ̃ ≪ ρ (because ω̃ ≪ ω).

MODEL BASED COMPRESSED SENSING

• Υ̃ is the set of sparse vectors whose sparsity patterns obey a tree structure.

• Projecting onto Υ̃ improves the convergence rate compared to projecting onto the set of sparse vectors Υ [Baraniuk et al., 2010].

• However, the projection onto Υ̃ is more demanding than onto Υ.

• Note that the probability of selecting atoms from lower tree levels is smaller than from upper ones.

[Figure: tree with selection probabilities 1 at the root, 0.5 at the second level and 0.25 at the third level.]

• P will be a projection onto certain tree levels – zeroing the values at the lower levels.

MODEL BASED COMPRESSED SENSING

The non-zero entries are drawn from a zero-mean Gaussian distribution with variance 1 at the first two levels, 0.1² at the third level, and 0.01² at the rest of the levels.

SPECTRAL COMPRESSED SENSING

• Υ is the set of vectors with sparse representation in a 4-times redundant DCT dictionary such that:

• The active atoms are selected uniformly at random such that minimum distance between neighboring atoms is 5.

• The value of each representation coefficient ~𝑁(0,1) i.i.d.

• We set the neighboring coefficients at distance 1 and 2 of each active atom to be ~N(0, 0.1²) and ~N(0, 0.01²), respectively.

• We set P to be a pooling-like operation that keeps in each window of size 5 only the largest value.

SPECTRAL COMPRESSED SENSING


LEARNING THE PROJECTION

• If we have no explicit information about Υ it might be desirable to learn the projection.

• Instead of learning P, it is possible to replace (I − μAAᵀ)P and μAᵀP with two learned matrices S and X, respectively.

• This leads to a scheme very similar to that of LISTA, and provides a theoretical foundation for the success of LISTA.

LEARNED IPGD

• Iteration: Ẑ⁺ = ψ( V X + Ẑ S ), with learned linear operators X and S.

• ψ projects onto the set Υ = {Ẑ : f(Ẑ) ≤ R}.

• V = Z A + E, Z ∈ Υ̃; the estimate aims at solving min_Ẑ ‖V − Ẑ A‖₂² s.t. f(Ẑ) ≤ R.

LISTA

• ψ is a proximal mapping:

ψ(U) = argmin_{Ẑ ∈ ℝ^d} ‖U − Ẑ‖₂² + λ f(Ẑ).

• Iteration: Ẑ⁺ = ψ( V X + Ẑ S ), with learned linear operators X and S.

• V = Z A + E; the estimate aims at solving min_Ẑ ‖V − Ẑ A‖₂² + λ f(Ẑ).

LISTA MIXTURE MODEL

• Approximating the projection onto Υ with one linear projection may not be accurate enough.

• This requires more LISTA layers/iterations.

• Instead, one may use several LISTA networks, where each approximates a different part of Υ.

• Training 18 LISTA networks, each with 3 layers, provides the same accuracy as one LISTA network with 10 layers.

RELATED WORKS

• In [Bruna et al. 2016] a different route is taken to explain the faster convergence of LISTA. It is shown that learning may give a gain due to a better preconditioning of A.


TAKE HOME MESSAGE

• DNN keep the important information of the data.

• Gaussian mean width is a good measure for the complexity of the data.

• Random Gaussian weights are good for classifying the average points in the data.

• Important goal of training: classify the boundary points between the different classes in the data.

• Generalization error depends on the DNN input margin.

• Deep learning can be viewed as metric learning.

• DNN may solve optimization problems.

ACKNOWLEDGEMENTS

Guillermo Sapiro – Duke University
Robert Calderbank – Duke University
Qiang Qiu – Duke University
Jiaji Huang – Baidu SVAIL
Alex M. Bronstein – Technion
Jure Sokolic – UCL
Miguel Rodrigues – UCL
Yonina C. Eldar – Technion
Joan Bruna – NYU
René Vidal – Johns Hopkins

QUESTIONS?

WEB.ENG.TAU.AC.IL/~RAJA


FULL REFERENCES 1

• A. L. Samuel, “Some studies in machine learning using the game of checkers,” IBM Journal, vol. 3, no. 3, pp. 535–554, 1959.

• D. H. Hubel & T. N. Wiesel, “Receptive fields of single neurones in the cat's striate cortex”, J Physiol., vol. 148, no. 3, pp. 574-591, 1959.

• D. H. Hubel & T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat's visual cortex”, J Physiol., vol. 160, no. 1, pp. 106-154, 1962.

• K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”, Biological Cybernetics, vol. 36, no. 4, pp. 193-202, 1980.

• Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard & L. D. Jackel, “Backpropagation Applied to Handwritten Zip Code Recognition”, Neural Computation, vol. 1, no. 4, pp. 541-551, 1989.

• Y. LeCun, L. Bottou, Y. Bengio & P. Haffner, “Gradient Based Learning Applied to Document Recognition”, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

• C. Farabet, C. Couprie, L. Najman & Y. LeCun, “Learning Hierarchical Features for Scene Labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 8, pp. 1915-1929, Aug. 2013.

• O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge”, International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015

• A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks”, NIPS, 2012.

• K. He, X. Zhang, S. Ren, and J. Sun. “Deep residual learning for image recognition”, CVPR, 2016.


FULL REFERENCES 2

• M.D. Zeiler, R. Fergus, “Visualizing and Understanding Convolutional Networks”, ECCV, 2014.

• D. Yu and L. Deng, “Automatic Speech Recognition: A Deep Learning Approach”, Springer, 2014.

• J. Bellegarda & C. Monz, “State of the art in statistical methods for language and speech processing,” Computer Speech and Language, vol. 35, pp. 163–184, Jan. 2016.

• C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, “Going Deeper with Convolutions”, CVPR, 2015.

• V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra & M. Riedmiller, “Playing Atari with Deep Reinforcement Learning”, NIPS deep learning workshop, 2013.

• V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg & D. Hassabis, “Human-level control through deep reinforcement learning”, Nature vol. 518, pp. 529–533, Feb. 2015.

• D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel & D. Hassabis, “Mastering the Game of Go with Deep Neural Networks and Tree Search”, Nature, vol. 529, pp. 484–489, 2016.

• S. K. Zhou, H. Greenspan, D. Shen, “Deep Learning for Medical Image Analysis”, Academic Press, 2017.

• I. Sutskever, O. Vinyals & Q. Le, “Sequence to Sequence Learning with Neural Networks”, NIPS 2014.

• A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, “Large-scale Video Classification with Convolutional Neural Networks”, CVPR, 2014.


FULL REFERENCES 3

• F. Schroff, D. Kalenichenko & J. Philbin, “FaceNet: A Unified Embedding for Face Recognition and Clustering”, CVPR, 2015.

• A. Poznanski & L. Wolf, “CNN-N-Gram for Handwriting Word Recognition”, CVPR, 2016.

• R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng & C. Potts, “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”, EMNLP, 2013.

• H. C. Burger, C. J. Schuler & S. Harmeling, “Image denoising: Can plain Neural Networks compete with BM3D?”, CVPR, 2012.

• J. Kim, J. K. Lee, K. M. Lee, “Accurate Image Super-Resolution Using Very Deep Convolutional Networks”, CVPR, 2016.

• J. Bruna, P. Sprechmann, and Y. LeCun, “Super-Resolution with Deep Convolutional Sufficient Statistics”, ICLR, 2016.

• V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines”, ICML, 2010.

• L. Deng & D. Yu, “Deep Learning: Methods and Applications”, Foundations and Trends in Signal Processing, vol. 7 no. 3-4, pp. 197–387, 2014.

• Y. Bengio, “Learning Deep Architectures for AI”, Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.

• Y. LeCun, Y. Bengio & G. Hinton, “Deep learning”, Nature, vol. 521, no. 7553, pp. 436–444, 2015.

• J. Schmidhuber, “Deep learning in neural networks: An overview”, Neural Networks, vol. 61, pp. 85–117, Jan. 2015.

• I. Goodfellow, Y. Bengio & A. Courville, “Deep learning”, Book in preparation for MIT Press, 2016.


FULL REFERENCES 4

• G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Math. Control Signals Systems, vol. 2, pp. 303–314, 1989.

• K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Netw., vol. 4, no. 2, pp. 251–257, 1991.

• A. R. Barron, “Approximation and estimation bounds for artificial neural networks”, Machine Learning, vol. 14, no. 1, pp. 115–133, Jan. 1994.

• G. F. Montúfar & J. Morton, “When does a mixture of products contain a product of mixtures”, SIAM Journal on Discrete Mathematics (SIDMA), vol. 29, no. 1, pp. 321-347, 2015.

• G. F. Montúfar, R. Pascanu, K. Cho & Y. Bengio, “On the number of linear regions of deep neural networks,” NIPS, 2014.

• N. Cohen, O. Sharir & A. Shashua, “Deep SimNets,” CVPR, 2016.

• N. Cohen, O. Sharir & A. Shashua, “On the Expressive Power of Deep Learning: A Tensor Analysis,” COLT, 2016.

• N. Cohen & A. Shashua, “Convolutional Rectifier Networks as Generalized Tensor Decompositions,” ICML, 2016.

• M. Telgarsky, “Benefits of depth in neural networks,” COLT, 2016.

• N. Cohen and A. Shashua, “Inductive Bias of Deep Convolutional Networks through Pooling Geometry,” arXiv abs/1605.06743, 2016.

• J. Bruna, Y. LeCun, & A. Szlam, “Learning stable group invariant representations with convolutional networks,” ICLR, 2013.

• Y-L. Boureau, J. Ponce & Y. LeCun, “Theoretical Analysis of Feature Pooling in Visual Recognition”, ICML, 2010.


FULL REFERENCES 5

• J. Bruna, A. Szlam, & Y. LeCun, “Signal recovery from lp pooling representations”, ICML, 2014.

• S. Soatto & A. Chiuso, “Visual Representations: Defining properties and deep approximation”, ICLR 2016.

• F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio, “Unsupervised learning of invariant representations in hierarchical architectures,” Theoretical Computer Science, vol. 663, no. C, pp. 112-121, Jun. 2016.

• J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 8, pp. 1872–1886, Aug 2013.

• T. Wiatowski and H. Bölcskei, “A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction,” arXiv abs/1512.06293, 2016.

• A. Saxe, J. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”, ICLR, 2014.

• Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high dimensional non-convex optimization,” NIPS, 2014.

• A. Choromanska, M. B. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.

• B. D. Haeffele and R. Vidal, “Global Optimality in Tensor Factorization, Deep Learning, and Beyond”, arXiv abs/1506.07540, 2015.

• S. Arora, A. Bhaskara, R. Ge, and T. Ma, “Provable bounds for learning some deep representations,” in Int. Conf. on Machine Learning (ICML), 2014, pp. 584–592.


FULL REFERENCES 6

• A. M. Bruckstein, D. L. Donoho & M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images”, SIAM Review, vol. 51, no. 1, pp. 34–81, 2009.

• G. Yu, G. Sapiro & S. Mallat, “Solving inverse problems with piecewise linear estimators: From Gaussian mixture models to structured sparsity”, IEEE Trans. on Image Processing, vol. 21, no. 5, pp. 2481 –2499, May 2012.

• N. Srebro & A. Shraibman, “Rank, trace-norm and max-norm,” COLT, 2005.

• E. Candès & B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.

• R. G. Baraniuk, V. Cevher & M. B. Wakin, "Low-Dimensional Models for Dimensionality Reduction and Signal Recovery: A Geometric Perspective," Proceedings of the IEEE, vol. 98, no. 6, pp. 959-971, 2010.

• Y. Plan and R. Vershynin, “Dimension reduction by random hyperplane tessellations,” Discrete and Computational Geometry, vol. 51, no. 2, pp. 438–461, 2014.

• Y. Plan and R. Vershynin, “Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach,” IEEE Trans. Inf. Theory, vol. 59, no. 1, pp. 482–494, Jan. 2013.

• R. Giryes, G. Sapiro and A. M. Bronstein, “Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?”, IEEE Transactions on Signal Processing, vol. 64, no. 13, pp. 3444-3457, Jul. 2016.

• A. Choromanska, K. Choromanski, M. Bojarski, T. Jebara, S. Kumar, Y. LeCun, “Binary embeddings with structured hashed projections”, ICML, 2016.

• J. Masci, M. M. Bronstein, A. M. Bronstein and J. Schmidhuber, “Multimodal Similarity-Preserving Hashing”, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 36, no. 4, pp. 824-830, April 2014.


FULL REFERENCES 7

• H. Lai, Y. Pan, Y. Liu & S. Yan, “Simultaneous Feature Learning and Hash Coding With Deep Neural Networks”, CVPR, 2015.

• A. Mahendran & A. Vedaldi, “Understanding deep image representations by inverting them,” CVPR, 2015.

• K. Simonyan & A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, ICLR, 2015.

• A. Krogh & J. A. Hertz, “A Simple Weight Decay Can Improve Generalization”, NIPS, 1992.

• P. Baldi & P. Sadowski, “Understanding dropout”, NIPS, 2013.

• N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, & R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

• L. Wan, M. Zeiler, S. Zhang, Y. LeCun & R. Fergus, “Regularization of Neural Networks using DropConnect”, ICML, 2013.

• S. Ioffe & C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ICML, 2015.

• M. Hardt, B. Recht & Y. Singer, “Train faster, generalize better: Stability of stochastic gradient descent”, arXiv, abs/1509.01240, 2016.

• B. Neyshabur, R. Salakhutdinov & N. Srebro, “Path-SGD: Path-normalized optimization in deep neural networks,” NIPS, 2015.

• S. Rifai, P. Vincent, X. Muller, X. Glorot, & Y. Bengio. “Contractive auto-encoders: explicit invariance during feature extraction,” ICML, 2011.


FULL REFERENCES 8

• T. Salimans & D. Kingma, “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks”, arXiv abs/1602.07868, 2016.

• S. Sun, W. Chen, L. Wang, & T.-Y. Liu, “Large margin deep neural networks: theory and algorithms”, AAAI, 2016.

• S. Shalev-Shwartz & S. Ben-David. “Understanding machine learning: from theory to algorithms”, Cambridge University Press, 2014.

• P. L. Bartlett & S. Mendelson, “Rademacher and Gaussian complexities: risk bounds and structural results”. The Journal of Machine Learning Research (JMLR), vol 3, pp. 463–482, 2002.

• B. Neyshabur, R. Tomioka, and N. Srebro, “Norm-based capacity control in neural networks,” COLT, 2015.

• J. Sokolic, R. Giryes, G. Sapiro, M. R. D. Rodrigues, “Margin Preservation of Deep Neural Networks”, arXiv, abs/1605.08254, 2016.

• H. Xu and S. Mannor, “Robustness and generalization,” Machine Learning, vol. 86, no. 3, pp. 391–423, 2012.

• J. Huang, Q. Qiu, G. Sapiro, R. Calderbank, “Discriminative Geometry-Aware Deep Transform”, ICCV, 2015.

• J. Huang, Q. Qiu, G. Sapiro, R. Calderbank, “Discriminative Robust Transformation Learning”, NIPS 2016.

• T. Blumensath & M.E. Davies, “Iterative hard thresholding for compressed sensing”, Appl. Comput. Harmon. Anal, vol. 27, no. 3, pp. 265 – 274, 2009.

• I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint”, Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.


FULL REFERENCES 9

• A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems”, SIAM J. Img. Sci., vol. 2, no. 1, pp. 183–202, Mar. 2009.

• K. Gregor & Y. LeCun, “Learning fast approximations of sparse coding”, ICML, 2010.

• P. Sprechmann, A. M. Bronstein & G. Sapiro, “Learning efficient sparse and low rank models”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1821–1833, Sept. 2015.

• T. Remez, O. Litany, & A. M. Bronstein, “A picture is worth a billion bits: Real-time image reconstruction from dense binary pixels”, ICCP, 2015.

• J. Tompson, K. Schlachter, P. Sprechmann & K. Perlin, “Accelerating Eulerian Fluid Simulation With Convolutional Networks”, arXiv, abs/1607.03597, 2016.

• S. Oymak, B. Recht, & M. Soltanolkotabi, “Sharp time–data tradeoffs for linear inverse problems”, arXiv, abs/1507.04793, 2016.

• R. Giryes, Y. C. Eldar, A. M. Bronstein, G. Sapiro, “Tradeoffs between Convergence Speed and Reconstruction Accuracy in Inverse Problems”, arXiv, abs/1605.09232, 2016.

• R.G. Baraniuk, V. Cevher, M.F. Duarte & C. Hegde, “Model-based compressive sensing”, IEEE Trans. Inf. Theory, vol. 56, no. 4, pp. 1982–2001, Apr. 2010.

• M. F. Duarte & R. G. Baraniuk, “Spectral compressive sensing”, Appl. Comput. Harmon. Anal., vol. 35, no. 1, pp. 111 – 129, 2013.

• J. Bruna & T. Moreau, “Adaptive Acceleration of Sparse Coding via Matrix Factorization”, arXiv abs/1609.00285, 2016.
