
Deep Learning as a Mathematician

Philipp Grohs

September 12th 2017


Syllabus

1 Why Mathematical Understanding?

2 Approximation Power

3 Stochastic Optimization


1 Why Mathematical Understanding?


Curiosity

"It is the guiding principle of many applied mathematicians that if something mathematical works really well, there must be a good underlying mathematical reason for it, and we ought to be able to understand it."

[Ingrid Daubechies, Big Data's Mathematical Mysteries, Quanta Magazine (2015)]


Improve Usability/Availability

∼ 800-page book explaining various ad-hoc tricks, which are necessary for good performance.


Be More Efficient

Electricity bill for one game of AlphaGo: ∼ 3000 USD


Be More Efficient

Siri's neural network is too large to run on an iPhone!


Today's Focus: Vanilla Neural Network Regression

Neural Network Hypothesis Class

Given d, L, N_1, ..., N_L and σ, define the associated hypothesis class

H_{[d,N_1,...,N_L],σ} := { A_L σ(A_{L−1} σ(··· σ(A_1(x)))) : A_ℓ : R^{N_{ℓ−1}} → R^{N_ℓ} affine linear } (with N_0 := d).

Simplest Regression/Classification Task

Given data z = ((x_i, y_i))_{i=1}^m ⊂ R^d × R^{N_L}, find the empirical regression function

f_z ∈ argmin_{f ∈ H_{[d,N_1,...,N_L],σ}} ∑_{i=1}^m L(f, x_i, y_i),

where L : C(R^d) × R^d × R^{N_L} → R_+ is the loss function (in least-squares problems we have L(f, x, y) = |f(x) − y|^2).
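To make these definitions concrete, here is a minimal NumPy sketch (not code from the talk) of a network in H_{[d,N_1,...,N_L],σ} and of the empirical least-squares risk; the choice σ = ReLU and the parameter layout are illustrative assumptions.

```python
# Minimal sketch: evaluate A_L sigma(A_{L-1} sigma(... sigma(A_1 x))) and the
# empirical least-squares risk sum_i |f(x_i) - y_i|^2. Illustrative only.
import numpy as np

def network(params, x, sigma=lambda t: np.maximum(t, 0.0)):   # sigma = ReLU (assumption)
    """params = [(W_1, b_1), ..., (W_L, b_L)] encodes the affine maps A_l."""
    *hidden, (W_L, b_L) = params
    for W, b in hidden:
        x = sigma(W @ x + b)        # affine map followed by the activation
    return W_L @ x + b_L            # last layer: affine map only

def empirical_risk(params, xs, ys):
    """Least-squares empirical risk over the data z = ((x_i, y_i))_{i=1}^m."""
    return sum(np.sum((network(params, x) - y) ** 2) for x, y in zip(xs, ys))
```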


Example: Handwritten Digits

MNIST database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/

Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ∼ R^{784}.

Every label is given as a 10-dimensional vector y ∈ R^{10} describing the 'probability' of each digit.


Given labeled training data (x_i, y_i)_{i=1}^m ⊂ R^{784} × R^{10}.

Fix the network topology, e.g., the number of layers (for example L = 3) and the numbers of neurons (N_1 = 20, N_2 = 20).

The learning goal is to find the empirical regression function f_z ∈ H_{[784,20,20,10],σ}.

Typically solved by stochastic first-order approximation methods.
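A hedged PyTorch sketch of this setup: the [784, 20, 20, 10] topology trained by plain SGD on a least-squares loss. The ReLU activation, the hyperparameters and the random placeholder tensors standing in for MNIST data are assumptions, not details from the talk.

```python
import torch
from torch import nn

model = nn.Sequential(                      # f_theta in H_{[784, 20, 20, 10], sigma}
    nn.Linear(784, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 10),
)
loss_fn = nn.MSELoss()                      # least-squares loss |f(x) - y|^2
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.rand(64, 784)                     # placeholder batch of flattened 28x28 images
y = torch.eye(10)[torch.randint(0, 10, (64,))]   # placeholder one-hot labels

for step in range(100):                     # stochastic first-order updates
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```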


How Can Mathematics Contribute?

1 Approximation Power of (Convolutional) Neural Networks

2 Convergence Properties of Stochastic Optimization Algorithms

3 Generalization of Neural Networks

4 Invariances and Discriminatory Properties


This Talk

1 Approximation Power of (Convolutional) Neural Networks

2 Convergence Properties of Stochastic Optimization Algorithms

3 Generalization of Neural Networks

4 Invariances and Discriminatory Properties


1. Approximation Power


Universal Approximation Theorem

Theorem [Cybenko (1989), Hornik (1991)]

Suppose that σ : R → R is continuous and not a polynomial, and fix d ≥ 1, L ≥ 2, N_L ≥ 1 in N and a compact subset K ⊂ R^d. Then for any continuous f : R^d → R^{N_L} and any ε > 0 there exist N_1, ..., N_{L−1} ∈ N and affine linear maps A_ℓ : R^{N_{ℓ−1}} → R^{N_ℓ}, 1 ≤ ℓ ≤ L, such that the neural network

Φ(x) = A_L σ(A_{L−1} σ(··· σ(A_1(x)))), x ∈ R^d,

approximates f to within accuracy ε, i.e.,

sup_{x∈K} |f(x) − Φ(x)| ≤ ε.

This does not imply any quantitative results (e.g., how many nodes are needed to achieve a desired accuracy?).


A Quantitative Universal Approximation Theorem

Theorem [Maiorov-Pinkus (1999)]

There exists an activation function σ : R → R that is smooth, monotone increasing and sigmoidal (lim_{t→∞} σ(t) = 1 and lim_{t→−∞} σ(t) = 0) with the following property: for any ε > 0, any d ≥ 1, any compact subset K ⊂ R^d and any continuous f : R^d → R^{N_L} there exist affine linear maps A_1 : R^d → R^{3d}, A_2 : R^{3d} → R^{6d+3}, A_3 : R^{6d+3} → R^{N_L} such that the neural network

Φ(x) = A_3 σ(A_2 σ(A_1(x))), x ∈ R^d,

approximates f to within accuracy ε, i.e.,

sup_{x∈K} |f(x) − Φ(x)| ≤ ε.

In other words, we can approximate any function up to any accuracy with a fixed number of coefficients?!


A Universal Architecture

Where is the catch?


A Meaningful Notion of Approximation

Definition

Let K ⊂ R^d be compact. Any C ⊂ L^2(K) that is relatively compact is called a regression problem class.

Definition

Let σ : R → R be an activation function and denote by NN_{M,σ,R} the set of neural networks with activation function σ and at most M coefficients, each bounded by R.

Definition [Bolcskei-G-Kutyniok-Petersen (2017)]

A regression problem class C has effective approximation rate γ if there exist a constant C > 0 and a polynomial π with

sup_{f∈C} inf_{Φ∈NN_{M,σ,π(M)}} ‖f − Φ‖_{L^2(K)} ≤ C · M^{−γ} for all M ∈ N.


Why is this Meaningful?

Effective approximation rate implies efficient storage!

Theorem ([Bolcskei-G-Kutyniok-Petersen (2017)] Informal Version)

If C has effective approximation rate γ, then every f ∈ C can be approximated to within error N^{−γ} by a neural network whose topology and quantized coefficients can be stored with C · N · log(N) bits.

Effective approximation rate correlates with the complexity of the regression problem class!

Theorem ([Bolcskei-G-Kutyniok-Petersen (2017)] Informal Version)

Let s(C) be the Kolmogorov entropy of C (a measure of complexity that can be computed). Then γ ≤ 1/s(C). In particular, this scaling between accuracy and complexity has to be obeyed by all learning algorithms!
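A small NumPy sketch (illustrative only) of the storage point above: uniformly quantizing a ReLU network's coefficients to a few bits each perturbs its output only slightly, so topology plus quantized coefficients suffice. The random network, the range R and the bit widths are made-up placeholders.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
rng = np.random.default_rng(4)

W1, b1 = rng.standard_normal((20, 5)), rng.standard_normal(20)   # placeholder network
W2, b2 = rng.standard_normal((1, 20)), rng.standard_normal(1)

def net(x, W1, b1, W2, b2):
    return W2 @ relu(W1 @ x + b1) + b2

def quantize(a, bits, R=3.0):
    """Round each coefficient to one of 2**bits levels in [-R, R]."""
    levels = 2 ** bits - 1
    return np.clip(np.round((a + R) / (2 * R) * levels), 0, levels) / levels * 2 * R - R

x = rng.standard_normal(5)
exact = net(x, W1, b1, W2, b2)
for bits in (4, 8, 12):
    q = [quantize(a, bits) for a in (W1, b1, W2, b2)]
    err = abs((exact - net(x, *q))[0])
    print(bits, err)        # the output error shrinks as the bit budget grows
```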


...or more formally...

Theorem [Bolcskei-G-Kutyniok-Petersen (2017)]

Consider any learning algorithm Learn : (0, 1) × C → NN (NN being the class of neural networks) which satisfies

sup_{F∈C} ‖F − Learn(ε, F)‖ ≤ ε,

with the weights of Learn(ε, F) growing at most polynomially in ε^{−1}. Then, for each γ < γ*(C) there exists a regression problem F ∈ C with

sup_{ε∈(0,1)} ε^γ · size(Learn(ε, F)) = ∞.

The coefficients in Pinkus' network are so large that they cannot be stored!


...we can finally ask a meaningful question

Definition [Bolcskei-G-Kutyniok-Petersen (2017)]

Neural networks are optimal for a regression problem class C if C has effective approximation rate γ for all γ < 1/s(C).

Questions

Characterize the regression problem classes for which neural networks are optimal!

Which architectures (deep, shallow, ...) are good for which regression problem classes?


What is Known

Theorem ([Bolcskei-G-Kutyniok-Petersen (2017)] Informal Version)

Let C be a ball in any classical approximation space (for example Sobolev, Besov, shearlet or kernel approximation spaces, or piecewise smooth functions on submanifolds). Then neural networks are optimal for C.

This means neural networks are as good as all classical 'linear' methods combined!

They are even better: the result remains true if the signal class is defined on a submanifold and/or is warped by a smooth diffeomorphism.


A Detour: Sparse Coding

Given a dictionary D = (ϕ_i)_{i∈N} ⊂ L^2(Γ), approximate every F ∈ C by optimally sparse linear combinations of D, i.e.

∑_{i∈J} c_i ϕ_i

with |J| as small as possible.

For most known C there exists a specific optimal dictionary that achieves the optimal tradeoff between sparsity (∼ code length!) and approximation error, as dictated by γ*(C).

Examples: textures ↔ Gabor frames (JPEG), point singularities ↔ wavelets (JPEG2000), line/hyperplane singularities ↔ ridgelets, curved/hypersurface singularities ↔ (α-)curvelets.
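A minimal NumPy sketch of the sparse-coding mechanics (not from the talk): best |J|-term approximation in an orthonormal dictionary by keeping the largest coefficients ⟨F, ϕ_i⟩. The random orthonormal basis below merely stands in for a wavelet- or curvelet-type dictionary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
D, _ = np.linalg.qr(rng.standard_normal((n, n)))   # columns phi_i: an orthonormal dictionary
F = rng.standard_normal(n)                         # placeholder signal to approximate

coeffs = D.T @ F                                   # c_i = <F, phi_i> by orthonormality
for k in (4, 16, 64):                              # sparsity levels |J|
    J = np.argsort(np.abs(coeffs))[-k:]            # indices of the k largest coefficients
    F_k = D[:, J] @ coeffs[J]                      # sum_{i in J} c_i phi_i
    print(k, np.linalg.norm(F - F_k))              # error decays as fast as F is compressible in D
```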


Transferring Optimality

Definition [Bolcskei-G-Kutyniok-Petersen (2017)]

A dictionary D = (ϕ_i)_{i∈N} is representable by neural networks if for all ε > 0 and i ∈ N there is Φ_{i,ε} ∈ NN with O(1) nonzero weights, growing at most polynomially in i · ε^{−1}, such that

‖ϕ_i − Φ_{i,ε}‖ ≤ ε.

Theorem [Bolcskei-G-Kutyniok-Petersen (2017)]

Suppose that the dictionary D is optimal for the class C and that D is representable by neural networks. Then neural networks are optimal for the regression class C.


Idea

Suppose that ∑_{i=1}^4 c_i ϕ_i is a sparse approximation of F in D.


[Figure: a network built from parallel subnetworks, each realizing one dictionary element (∼ ϕ_1, ∼ ϕ_2, ∼ ϕ_3, ∼ ϕ_4); combining their outputs with the weights c_1, c_2, c_3, c_4 yields a single network realizing ∼ ∑_{i=1}^4 c_i ϕ_i.]
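A hedged NumPy sketch of the construction the figure suggests: given shallow ReLU subnetworks Φ_i(x) = v_i · ReLU(W_i x + b_i) approximating dictionary elements ϕ_i, stacking their first layers and absorbing the coefficients c_i into the output layer yields one network realizing ∑_i c_i Φ_i. The subnetworks below are random placeholders, not actual dictionary atoms.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
rng = np.random.default_rng(1)
d, width = 3, 8                                     # input dimension, neurons per subnetwork

subnets = [(rng.standard_normal((width, d)),        # W_i
            rng.standard_normal(width),             # b_i
            rng.standard_normal(width))             # v_i (output weights)
           for _ in range(4)]
c = rng.standard_normal(4)                          # coefficients c_1, ..., c_4

# Merged network: stack all first layers; concatenate the rescaled output weights c_i * v_i.
W = np.vstack([Wi for Wi, _, _ in subnets])
b = np.concatenate([bi for _, bi, _ in subnets])
v = np.concatenate([ci * vi for ci, (_, _, vi) in zip(c, subnets)])

x = rng.standard_normal(d)
merged = v @ relu(W @ x + b)
direct = sum(ci * (vi @ relu(Wi @ x + bi)) for ci, (Wi, bi, vi) in zip(c, subnets))
print(np.isclose(merged, direct))                   # True: one network, same function
```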


The Good News

Theorem ([Bolcskei-G-Kutyniok-Petersen (2017)] informal version)

All known (affine) dictionaries are representable by (shallow) neural networks (under weak assumptions on the activation function).

Neural network regression is as powerful as regression with all known dictionaries combined, and in particular optimal for all corresponding problem classes!


Two Big Questions

Our approximation result does not require the approximating network to be very deep!

Big Question 1

What do we gain by using deep networks?

Our result implies that classical approximation results can be emulated by neural networks. However, this only partially explains their success in approximating high-dimensional regression/classification problems!

Big Question 2

Why are neural networks so good at approximating high-dimensional functions?


A Simple Example

g(x) =
  2x        for 0 < x < 1/2,
  2(1 − x)  for 1/2 ≤ x < 1,
  0         else.

g(x) = ReLU(2 · ReLU(x) − 4 · ReLU(x − 1/2))
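A quick NumPy check (illustrative) that the hat function g and its two-layer ReLU representation agree:

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def g_piecewise(x):
    return np.where(x < 0.5, 2 * x, 2 * (1 - x)) * ((x > 0) & (x < 1))

def g_relu(x):
    return relu(2 * relu(x) - 4 * relu(x - 0.5))

x = np.linspace(-1.0, 2.0, 1001)
print(np.allclose(g_piecewise(x), g_relu(x)))   # True on and outside [0, 1]
```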


A Simple Example

g ◦ g(x): a "deep" ReLU net


A Simple Example

g ◦ g ◦ g ◦ g ◦ g(x): a "deep" ReLU net

High-frequency oscillations can be represented by deep neural networks!

Exercise (see also [Telgarsky (2015)])

Shallow networks cannot represent high-frequency oscillations: one needs ≳ j layers to capture frequencies oscillating at scale 2^{−j}.
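A short NumPy sketch (illustrative) of the depth-versus-oscillation point: composing g with itself j times produces a sawtooth with 2^{j−1} teeth, i.e. oscillations at scale ∼ 2^{−j}, while each composition adds only O(1) layers.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
g = lambda x: relu(2 * relu(x) - 4 * relu(x - 0.5))   # the hat function from above

def g_composed(x, j):
    for _ in range(j):      # j compositions = a ReLU network of depth ~ 2j
        x = g(x)
    return x

x = np.linspace(0.0, 1.0, 100001)
for j in (1, 2, 5):
    y = g_composed(x, j)
    peaks = np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:]))   # count local maxima
    print(f"j = {j}: {peaks} local maxima")                  # 1, 2, 16 = 2**(j-1)
```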


Related Results

[Montufar (2014)] shows that the number of linear regions of ReLU networks grows exponentially for deep networks and only polynomially for shallow networks.

[Bianchini, Scarselli (2014)] showed analogous results for Betti numbers of level sets.

It seems that deep networks are better at approximating highly oscillating textures with fractal structure, but there is no precise characterization yet!


High-Dimensional PDEs

Consider for example the Black-Scholes equation, where we want to compute the price V(0, x) of an option depending on a financial portfolio x ∈ R^d subject to

V_t + div(A(x) · ∇V) + b(x) · ∇V − f(V, ∇V) = 0,   V(T, x) = g(x),

where g models the price of the option at terminal time T. Typical options (maximum call) are of the form

g(x) = max(max_{i=1,...,d} x_i − X, 0).


Deep BSDE Solver

[E, Han, Jentzen (2017)] solve high-dimensional (>100d) parabolic PDEs using deep neural networks, essentially using a vanilla TensorFlow implementation and achieving efficiency beyond the current state of the art!

Question

Why does this work so well? Can we establish a neural network approximation theory for solutions of PDEs?


A Hint

Observation

The maximum call option can be expressed by neural networks with ∼ log_2(d) layers and O(d) nodes!

Proof:

max(x_1, x_2, x_3, x_4) = max(max(x_1, x_2), max(x_3, x_4))

and

max(x, y) = x + ReLU(y − x).

Depth is Necessary!

One can show that high-dimensional options cannot be approximated well by shallow networks (related methods for circuits in [Hastad (1986)] and [Kane-Williams (2016)])!
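A minimal NumPy sketch of the observation above: max(x_1, ..., x_d) computed by a binary tree of pairwise maxima, each realized as max(a, b) = a + ReLU(b − a), so only ⌈log_2(d)⌉ ReLU layers are needed.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def tree_max(x):
    x = np.asarray(x, dtype=float)
    layers = 0
    while x.size > 1:
        if x.size % 2:                       # duplicate the last entry so pairs always form
            x = np.append(x, x[-1])
        a, b = x[0::2], x[1::2]
        x = a + relu(b - a)                  # one ReLU layer computes all pairwise maxima
        layers += 1
    return x[0], layers

x = np.random.default_rng(2).standard_normal(100)
m, depth = tree_max(x)
print(np.isclose(m, x.max()), depth)         # True, 7 (= ceil(log2(100)))
```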


Related Work

[Mhaskar, Liao, Poggio (2014)] consider compositional functions, for example of the form

f(x_1, ..., x_8) = h_3(h_{21}(h_{11}(x_1, x_2), h_{12}(x_3, x_4)), h_{22}(h_{13}(x_5, x_6), h_{14}(x_7, x_8))),

with h_{ij} smooth.


Summary

What we Know

Deep networks are better than shallow networks if the function to be approximated is

fractal, texture-like

high-dimensional with compositional structure.

Open Questions

Build general framework/theory!

Which high-dimensional problems have compositional structure?

PDE Regularity theory for neural networks?

How to approximate a large neural network by a smaller one?

How about convolutional networks?

Why do neural networks with SGD generalize despite their huge capacity?


2. Stochastic Optimization Algorithms

Vanilla SGD

Let $f_\theta$ be a neural network with parameters $\theta$.

Empirical risk minimization seeks the minimizer $\theta^*$ of

\[
\theta \mapsto \frac{1}{m}\sum_{i=1}^{m} |f_\theta(x_i) - y_i|^2.
\]

SGD computes updates as

\[
\theta_{n+1} := \theta_n - \nu_{n+1}\,\nabla_\theta\big(|f_\theta(x_{i_n}) - y_{i_n}|^2\big)\Big|_{\theta=\theta_n},
\]

where $\nu_{n+1}$ is the learning rate and the indices $i_n$ are chosen uniformly and independently.

In general nothing can be said about global convergence.
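
To make the update rule concrete, here is a minimal SGD sketch for a small one-hidden-layer network with squared loss; the architecture, data, and step-size schedule are illustrative assumptions and not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: m noisy samples of a smooth target.
m, d = 200, 2
X = rng.standard_normal((m, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(m)

# One-hidden-layer network f_theta(x) = A2 . tanh(A1 x + b1) + b2.
N = 16
A1 = rng.standard_normal((N, d)) / np.sqrt(d)
b1 = np.zeros(N)
A2 = rng.standard_normal(N) / np.sqrt(N)
b2 = 0.0

def f(x, A1, b1, A2, b2):
    return A2 @ np.tanh(A1 @ x + b1) + b2

# Vanilla SGD on the empirical risk (1/m) sum_i |f_theta(x_i) - y_i|^2:
# in step n pick one index i_n uniformly and descend along the gradient of that single term.
for n in range(5000):
    i = rng.integers(m)                   # i_n, chosen uniformly and independently
    nu = 0.1 / (1 + n) ** 0.75            # learning rate nu_{n+1} (illustrative schedule)
    h = np.tanh(A1 @ X[i] + b1)           # hidden activations
    r = f(X[i], A1, b1, A2, b2) - y[i]    # residual f_theta(x_{i_n}) - y_{i_n}
    gA2 = 2 * r * h                       # gradient of |r|^2 w.r.t. A2
    gb2 = 2 * r
    gh = 2 * r * A2 * (1 - h ** 2)        # back-propagate through tanh
    gA1 = np.outer(gh, X[i])
    gb1 = gh
    A1 -= nu * gA1; b1 -= nu * gb1; A2 -= nu * gA2; b2 -= nu * gb2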

What is Known

Strong Convergence

For SGD applied to a convex problem we have, for appropriate learning rates,

\[
\mathbb{E}\,|\theta_n - \theta^*| \lesssim \frac{1}{n^{1/2-\varepsilon}},
\]

and this cannot be improved.

Weak Convergence [G-Jentzen (2017)]

For SGD applied to a convex problem we have, for appropriate learning rates and every $C^2$ function $\psi$,

\[
\big|\mathbb{E}\,\psi(\theta_n) - \psi(\theta^*)\big| \lesssim \frac{1}{n^{1-\varepsilon}},
\]

and this cannot be improved.

Of course neural network ERM is highly non-convex!
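
The gap between the two rates can be seen numerically. Below is a minimal Monte Carlo sketch on a toy convex problem (the one-dimensional quadratic, the noise model, and the schedule $\nu_n = 1/(n+1)$ are illustrative assumptions): averaging $|\theta_n|$ over independent runs estimates the strong error, while the absolute value of the averaged $\theta_n$ estimates the weak error for $\psi(\theta) = \theta$; the former decays roughly like $n^{-1/2}$, the latter roughly like $n^{-1}$.

import numpy as np

rng = np.random.default_rng(1)

# Toy convex problem: F(theta) = theta^2 / 2, minimizer theta* = 0,
# stochastic gradient g(theta) = theta + standard normal noise.
K = 20000                   # independent SGD runs for the Monte Carlo averages
theta = np.ones(K)          # every run starts at theta_0 = 1

for n in range(1, 10001):
    nu = 1.0 / (n + 1)                      # learning rate ~ 1/n
    theta -= nu * (theta + rng.standard_normal(K))
    if n in (100, 1000, 10000):
        strong = np.mean(np.abs(theta))     # estimates E|theta_n - theta*|, ~ n^{-1/2}
        weak = abs(np.mean(theta))          # estimates |E psi(theta_n) - psi(theta*)| for psi(t) = t, ~ n^{-1}
        print(f"n = {n:6d}   strong ~ {strong:.4f}   weak ~ {weak:.5f}")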

Can we Leverage Convexity?

Non-convexity stems from the parameterization $\Pi(\theta) := f_\theta$.

The function $F(f) := \frac{1}{m}\sum_{i=1}^{m}(f(x_i) - y_i)^2$ is convex.

Now suppose for a moment that the set NN of neural networks over which we optimize is convex.

Let $\theta^*$ be a local minimum. Then (if $\Pi$ is nice) $f_{\theta^*}$ is a local minimum of the convex function $F$ over the convex set NN.

Consequently, $f_{\theta^*}$ is a global minimum of $F$! (The standard one-line argument is sketched after this slide.)

Of course the hypothesis class NN is not convex and we don't know anything about $\Pi$!

Question

Is every local minimum of the neural network ERM problem also a global minimum?
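
For completeness, here is the convexity argument used above, written out (a sketch that uses only convexity of $F$ and of the hypothetical set NN):

\[
F(f_{\theta^*}) \;\le\; F\big((1-t)\,f_{\theta^*} + t\,g\big) \;\le\; (1-t)\,F(f_{\theta^*}) + t\,F(g)
\qquad \text{for all } g \in \mathrm{NN} \text{ and all sufficiently small } t \in (0,1),
\]

where the first inequality holds because the segment stays in NN and $f_{\theta^*}$ is a local minimizer, and the second is convexity of $F$. Rearranging gives $F(f_{\theta^*}) \le F(g)$ for every $g \in \mathrm{NN}$, i.e. $f_{\theta^*}$ is a global minimizer.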

What is Known

Theorem [Kawaguchi (2016)]

The answer is ’yes’ for linear activation functions.

Observation

If NN is nearly convex (meaning that convex combinations of neural networks can be very well approximated by neural networks) then every local minimum is almost a global minimum.

Question

How well can convex combinations of neural networks be approximated by neural networks of the same size? Does "almost convexity" improve with the size?
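
One elementary observation behind this question: a convex combination of two one-hidden-layer networks of width N is always exactly representable by a single network of width 2N (stack the hidden units and scale the output weights), so the difficulty is keeping the width fixed. A minimal numerical sketch (architecture and data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(2)
d, N, t = 3, 8, 0.3     # input dimension, width, mixing weight (illustrative)

def net(x, A1, b1, A2, b2):
    # one-hidden-layer network x -> A2 . tanh(A1 x + b1) + b2
    return A2 @ np.tanh(A1 @ x + b1) + b2

# Two independent width-N networks f and g.
A1f, b1f, A2f, b2f = rng.standard_normal((N, d)), rng.standard_normal(N), rng.standard_normal(N), rng.standard_normal()
A1g, b1g, A2g, b2g = rng.standard_normal((N, d)), rng.standard_normal(N), rng.standard_normal(N), rng.standard_normal()

# Width-2N network representing (1-t)*f + t*g exactly.
A1h = np.vstack([A1f, A1g])                      # stack the hidden layers
b1h = np.concatenate([b1f, b1g])
A2h = np.concatenate([(1 - t) * A2f, t * A2g])   # scale the output weights
b2h = (1 - t) * b2f + t * b2g

x = rng.standard_normal(d)
mix = (1 - t) * net(x, A1f, b1f, A2f, b2f) + t * net(x, A1g, b1g, A2g, b2g)
big = net(x, A1h, b1h, A2h, b2h)
print(abs(mix - big))    # ~ 1e-16: the convex combination is itself a (wider) network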

Understanding the Parameterization Π

Question

How are the parameters θ and the function Π(θ) related?

Theorem [Fefferman (1992)]

For a sigmoidal activation function, the parameters (i.e., the architecture and the coefficients) are uniquely determined by fθ, up to trivial symmetries and for almost all networks.

The proof is deep and beautiful, but it is doubtful that it can be cast into a stable algorithm.
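
To make "trivial symmetries" concrete, here are two standard invariances for a one-hidden-layer network with an odd sigmoidal activation such as $\tanh$ (a sketch; the slide does not spell these out):

\[
\sum_{k=1}^{N} c_k\,\sigma(a_k \cdot x + b_k)
= \sum_{k=1}^{N} c_{\pi(k)}\,\sigma\big(a_{\pi(k)} \cdot x + b_{\pi(k)}\big)
\qquad \text{for every permutation } \pi \text{ of the hidden units,}
\]

\[
c_k\,\sigma(a_k \cdot x + b_k) = (-c_k)\,\sigma(-a_k \cdot x - b_k)
\qquad \text{whenever } \sigma \text{ is odd.}
\]

Any reconstruction of the parameters from fθ can therefore be unique only modulo such relabelings and sign flips.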

Summary

Deep Learning is a great field of research for mathematicians:

it is highly relevant, exciting and fun

its problems involve deep mathematics

most problems are completely open

it is interdisciplinary

Possible contributions include:

more informed design choices for network architectures

new algorithmic paradigms (for example optimal network compression or better optimization algorithms)

interaction between different fields will be crucial!

Thank You!

Questions?