


Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation

Arnulf Jentzen1 and Timo Welti2

1 Faculty of Mathematics and Computer Science, University of Münster, Münster, Germany; e-mail: ajentzen@uni-muenster.de

2 SAM, Department of Mathematics, ETH Zürich, Zürich, Switzerland; e-mail: twelti@twelti.org

March 4, 2020

Abstract

In spite of the accomplishments of deep learning based algorithms in numerous applications and very broad corresponding research interest, at the moment there is still no rigorous understanding of the reasons why such algorithms produce useful results in certain situations. A thorough mathematical analysis of deep learning based algorithms seems to be crucial in order to improve our understanding and to make their implementation more effective and efficient. In this article we provide a mathematically rigorous full error analysis of deep learning based empirical risk minimisation with quadratic loss function in the probabilistically strong sense, where the underlying deep neural networks are trained using stochastic gradient descent with random initialisation. The convergence speed we obtain is presumably far from optimal and suffers under the curse of dimensionality. To the best of our knowledge, we establish, however, the first full error analysis in the scientific literature for a deep learning based algorithm in the probabilistically strong sense and, moreover, the first full error analysis in the scientific literature for a deep learning based algorithm where stochastic gradient descent with random initialisation is the employed optimisation method.

Keywords: deep learning, deep neural networks, empirical risk minimisation, full error analysis, approximation, generalisation, optimisation, strong convergence, stochastic gradient descent, random initialisation


arXiv:2003.01291v1 [math.ST] 3 Mar 2020


Contents

1 Introduction

2 Basics on deep neural networks (DNNs)
2.1 Vectorised description of DNNs
2.2 Activation functions
2.3 Rectified DNNs

3 Analysis of the approximation error
3.1 A covering number estimate
3.2 Convergence rates for the approximation error

4 Analysis of the generalisation error
4.1 Monte Carlo estimates
4.2 Uniform strong error estimates for random fields
4.3 Strong convergence rates for the generalisation error

5 Analysis of the optimisation error
5.1 Properties of the gamma and the beta function
5.2 Strong convergence rates for the optimisation error

6 Analysis of the overall error
6.1 Overall error decomposition
6.2 Full strong error analysis for the training of DNNs
6.3 Full strong error analysis with optimisation via stochastic gradient descent


1 Introduction

Deep learning based algorithms have been applied extremely successfully to overcome fundamental challenges in many different areas, such as image recognition, natural language processing, game intelligence, autonomous driving, and computational advertising, just to name a few. In line with this, researchers from a wide range of different fields, including, for example, computer science, mathematics, chemistry, medicine, and finance, are investing significant efforts into studying such algorithms and employing them to tackle challenges arising in their fields. In spite of this broad research interest and the accomplishments of deep learning based algorithms in numerous applications, at the moment there is still no rigorous understanding of the reasons why such algorithms produce useful results in certain situations. Consequently, there is no rigorous way to predict, before actually implementing a deep learning based algorithm, in which situations it might perform reliably and in which situations it might fail. This necessitates in many cases a trial-and-error approach in order to move forward, which can cost a lot of time and resources. A thorough mathematical analysis of deep learning based algorithms (in scenarios where it is possible to formulate such an analysis) seems to be crucial in order to make progress on these issues. Moreover, such an analysis may lead to new insights that enable the design of more effective and efficient algorithms.

The aim of this article is to provide a mathematically rigorous full error analysis of deep learning based empirical risk minimisation with quadratic loss function in the probabilistically strong sense, where the underlying deep neural networks (DNNs) are trained using stochastic gradient descent (SGD) with random initialisation (cf. Theorem 1.1 below). For a brief illustration of deep learning based empirical risk minimisation with quadratic loss function, consider natural numbers $d, \mathbf{d} \in \mathbb{N}$, a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, random variables $X \colon \Omega \to [0,1]^d$ and $Y \colon \Omega \to [0,1]$, and a measurable function $\mathcal{E} \colon [0,1]^d \to [0,1]$ satisfying $\mathbb{P}$-a.s. that $\mathcal{E}(X) = \mathbb{E}[Y \mid X]$. The goal is to find a DNN with appropriate architecture and appropriate parameter vector $\theta \in \mathbb{R}^{\mathbf{d}}$ (collecting its weights and biases) such that its realisation $\mathcal{N}_\theta \colon \mathbb{R}^d \to \mathbb{R}$ approximates the target function $\mathcal{E}$ well in the sense that the error $\mathbb{E}[|\mathcal{N}_\theta(X) - \mathcal{E}(X)|^p] = \int_{[0,1]^d} |\mathcal{N}_\theta(x) - \mathcal{E}(x)|^p \, \mathbb{P}_X(\mathrm{d}x) \in [0,\infty)$ for some $p \in [1,\infty)$ is as small as possible. In other words, given $X$ we want $\mathcal{N}_\theta(X)$ to predict $Y$ as reliably as possible. Due to the well-known bias–variance decomposition (cf., e.g., Beck, Jentzen, & Kuckuck [10, Lemma 4.1]), for the case $p = 2$ minimising the error function $\mathbb{R}^{\mathbf{d}} \ni \theta \mapsto \mathbb{E}[|\mathcal{N}_\theta(X) - \mathcal{E}(X)|^2] \in [0,\infty)$ is equivalent to minimising the risk function $\mathbb{R}^{\mathbf{d}} \ni \theta \mapsto \mathbb{E}[|\mathcal{N}_\theta(X) - Y|^2] \in [0,\infty)$ (corresponding to a quadratic loss function). Since in practice the joint distribution of $X$ and $Y$ is typically not known, the risk function is replaced by an empirical risk function based on i.i.d. training samples of $(X, Y)$. This empirical risk is then approximatively minimised using an optimisation method such as SGD. As is often the case for deep learning based algorithms, the overall error arising from this procedure consists of the following three different parts (cf. [10, Lemma 4.3] and Proposition 6.1 below): (i) the approximation error (cf., e.g., [5, 6, 14, 21, 24, 37, 39, 54–58, 66, 75] and the references in the introductory paragraph in Section 3), which arises from approximating the target function $\mathcal{E}$ by the considered class of DNNs, (ii) the generalisation error (cf., e.g., [7, 10, 13, 23, 31–33, 52, 67, 87, 92]), which arises from replacing the true risk by the empirical risk, and (iii) the optimisation error (cf., e.g., [2, 4, 8, 10, 12, 18, 25, 26, 28, 29, 38, 60, 62, 63, 65, 88, 97, 98]), which arises from computing only an approximate minimiser using the selected optimisation method.
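The bias–variance decomposition invoked here admits a one-line verification (a standard identity, spelled out for convenience rather than taken from this article): since $\mathcal{E}(X) = \mathbb{E}[Y \mid X]$, the tower property makes the cross term vanish, so that for every $\theta$

```latex
\mathbb{E}\big[|\mathcal{N}_\theta(X) - Y|^2\big]
  = \mathbb{E}\big[|\mathcal{N}_\theta(X) - \mathcal{E}(X)|^2\big]
    + \mathbb{E}\big[|\mathcal{E}(X) - Y|^2\big],
\quad\text{since}\quad
\mathbb{E}\big[(\mathcal{N}_\theta(X) - \mathcal{E}(X))(\mathcal{E}(X) - Y)\big]
  = \mathbb{E}\big[(\mathcal{N}_\theta(X) - \mathcal{E}(X))\,
      \mathbb{E}[\mathcal{E}(X) - Y \mid X]\big] = 0 .
```

As the second summand on the right hand side does not depend on $\theta$, minimising the risk and minimising the error function over $\theta$ are indeed equivalent.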

In this work we derive strong convergence rates for the approximation error, the generalisation error, and the optimisation error separately and combine these findings to



establish strong convergence results for the overall error (cf. Subsections 6.2 and 6.3), as illustrated in Theorem 1.1 below. The convergence speed we obtain (cf. (4) in Theorem 1.1) is presumably far from optimal, suffers under the curse of dimensionality (cf., e.g., Bellman [11] and Novak & Woźniakowski [73, Chapter 1; 74, Chapter 9]), and is, as a consequence, very slow. To the best of our knowledge, Theorem 1.1 is, however, the first full error result in the scientific literature for a deep learning based algorithm in the probabilistically strong sense and, moreover, the first full error result in the scientific literature for a deep learning based algorithm where SGD with random initialisation is the employed optimisation method. We now present Theorem 1.1, the statement of which is entirely self-contained, before we add further explanations and intuitions for the mathematical objects that are introduced.

Theorem 1.1. Let $d, \mathbf{d}, \mathbf{L}, J, M, K, N \in \mathbb{N}$, $\gamma, L \in \mathbb{R}$, $c \in [\max\{2, L\}, \infty)$, $l = (l_0, \ldots, l_{\mathbf{L}}) \in \mathbb{N}^{\mathbf{L}+1}$, $\mathbf{N} \subseteq \{0, \ldots, N\}$, assume $0 \in \mathbf{N}$, $l_0 = d$, $l_{\mathbf{L}} = 1$, and $\mathbf{d} \ge \sum_{i=1}^{\mathbf{L}} l_i(l_{i-1}+1)$, for every $m, n \in \mathbb{N}$, $s \in \mathbb{N}_0$, $\theta = (\theta_1, \ldots, \theta_{\mathbf{d}}) \in \mathbb{R}^{\mathbf{d}}$ with $\mathbf{d} \ge s + mn + m$ let $\mathcal{A}^{\theta,s}_{m,n} \colon \mathbb{R}^n \to \mathbb{R}^m$ satisfy for all $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ that

\[
\mathcal{A}^{\theta,s}_{m,n}(x) =
\begin{pmatrix}
\theta_{s+1} & \theta_{s+2} & \cdots & \theta_{s+n} \\
\theta_{s+n+1} & \theta_{s+n+2} & \cdots & \theta_{s+2n} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{s+(m-1)n+1} & \theta_{s+(m-1)n+2} & \cdots & \theta_{s+mn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
+
\begin{pmatrix} \theta_{s+mn+1} \\ \theta_{s+mn+2} \\ \vdots \\ \theta_{s+mn+m} \end{pmatrix},
\tag{1}
\]

let $a_i \colon \mathbb{R}^{l_i} \to \mathbb{R}^{l_i}$, $i \in \{1, \ldots, \mathbf{L}\}$, satisfy for all $i \in \mathbb{N} \cap [0, \mathbf{L})$, $x = (x_1, \ldots, x_{l_i}) \in \mathbb{R}^{l_i}$ that $a_i(x) = (\max\{x_1, 0\}, \ldots, \max\{x_{l_i}, 0\})$, assume for all $x \in \mathbb{R}$ that $a_{\mathbf{L}}(x) = \max\{\min\{x, 1\}, 0\}$, for every $\theta \in \mathbb{R}^{\mathbf{d}}$ let $\mathcal{N}_\theta \colon \mathbb{R}^d \to \mathbb{R}$ satisfy $\mathcal{N}_\theta = a_{\mathbf{L}} \circ \mathcal{A}^{\theta,\, \sum_{i=1}^{\mathbf{L}-1} l_i(l_{i-1}+1)}_{l_{\mathbf{L}},\, l_{\mathbf{L}-1}} \circ a_{\mathbf{L}-1} \circ \mathcal{A}^{\theta,\, \sum_{i=1}^{\mathbf{L}-2} l_i(l_{i-1}+1)}_{l_{\mathbf{L}-1},\, l_{\mathbf{L}-2}} \circ \ldots \circ a_1 \circ \mathcal{A}^{\theta,0}_{l_1,\, l_0}$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X^{k,n}_j \colon \Omega \to [0,1]^d$, $k, n, j \in \mathbb{N}_0$, and $Y^{k,n}_j \colon \Omega \to [0,1]$, $k, n, j \in \mathbb{N}_0$, be functions, assume that $(X^{0,0}_j, Y^{0,0}_j)$, $j \in \mathbb{N}$, are i.i.d. random variables, let $\mathcal{E} \colon [0,1]^d \to [0,1]$ satisfy $\mathbb{P}$-a.s. that $\mathcal{E}(X^{0,0}_1) = \mathbb{E}[Y^{0,0}_1 \mid X^{0,0}_1]$, assume for all $x, y \in [0,1]^d$ that $|\mathcal{E}(x) - \mathcal{E}(y)| \le L \|x - y\|_1$, let $\Theta^{k,n} \colon \Omega \to \mathbb{R}^{\mathbf{d}}$, $k, n \in \mathbb{N}_0$, and $\mathbf{k} \colon \Omega \to (\mathbb{N}_0)^2$ be random variables, assume $(\bigcup_{k=1}^{\infty} \Theta^{k,0}(\Omega)) \subseteq [-c, c]^{\mathbf{d}}$, assume that $\Theta^{k,0}$, $k \in \mathbb{N}$, are i.i.d., assume that $\Theta^{1,0}$ is continuous uniformly distributed on $[-c, c]^{\mathbf{d}}$, let $\mathcal{R}^{k,n}_J \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to [0, \infty)$, $k, n, J \in \mathbb{N}_0$, and $\mathcal{G}^{k,n} \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to \mathbb{R}^{\mathbf{d}}$, $k, n \in \mathbb{N}$, satisfy for all $k, n \in \mathbb{N}$, $\omega \in \Omega$, $\theta \in \{\vartheta \in \mathbb{R}^{\mathbf{d}} \colon \mathcal{R}^{k,n}_J(\cdot, \omega) \colon \mathbb{R}^{\mathbf{d}} \to [0,\infty) \text{ is differentiable at } \vartheta\}$ that $\mathcal{G}^{k,n}(\theta, \omega) = (\nabla_\theta \mathcal{R}^{k,n}_J)(\theta, \omega)$, assume for all $k, n \in \mathbb{N}$ that $\Theta^{k,n} = \Theta^{k,n-1} - \gamma \mathcal{G}^{k,n}(\Theta^{k,n-1})$, and assume for all $k, n \in \mathbb{N}_0$, $J \in \mathbb{N}$, $\theta \in \mathbb{R}^{\mathbf{d}}$, $\omega \in \Omega$ that

\[
\mathcal{R}^{k,n}_J(\theta, \omega) = \frac{1}{J} \left[ \sum_{j=1}^{J} |\mathcal{N}_\theta(X^{k,n}_j(\omega)) - Y^{k,n}_j(\omega)|^2 \right]
\tag{2}
\]

and

\[
\mathbf{k}(\omega) \in \operatorname*{arg\,min}_{(l,m) \in \{1,\ldots,K\} \times \mathbf{N},\; \|\Theta^{l,m}(\omega)\|_\infty \le c} \mathcal{R}^{0,0}_M(\Theta^{l,m}(\omega), \omega).
\tag{3}
\]

Then

\[
\mathbb{E}\!\left[ \int_{[0,1]^d} |\mathcal{N}_{\Theta^{\mathbf{k}}}(x) - \mathcal{E}(x)| \, \mathbb{P}_{X^{0,0}_1}(\mathrm{d}x) \right]
\le \frac{d c^3}{[\min\{\mathbf{L}, l_1, \ldots, l_{\mathbf{L}-1}\}]^{1/d}}
+ \frac{c^3 \mathbf{L} (\|l\|_\infty + 1) \ln(eM)}{M^{1/4}}
+ \frac{L (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+1}}{K^{[(2\mathbf{L})^{-1}(\|l\|_\infty + 1)^{-2}]}}.
\tag{4}
\]

Recall that we denote for every $p \in [1,\infty]$ by $\|\cdot\|_p \colon (\bigcup_{n=1}^{\infty} \mathbb{R}^n) \to [0,\infty)$ the $p$-norm of vectors in $\bigcup_{n=1}^{\infty} \mathbb{R}^n$ (cf. Definition 3.1). In addition, note that the function $\Omega \times [0,1]^d \ni (\omega, x) \mapsto |\mathcal{N}_{\Theta^{\mathbf{k}(\omega)}(\omega)}(x) - \mathcal{E}(x)| \in [0,\infty)$ is measurable (cf. Lemma 6.2) and that the expression on the left hand side of (4) above is thus well-defined. Theorem 1.1 follows directly from Corollary 6.9 in Subsection 6.3, which, in turn, is a consequence of the main result of this article, Theorem 6.5 in Subsection 6.2.

In the following we provide additional explanations and intuitions for Theorem 1.1. For every $\theta \in \mathbb{R}^{\mathbf{d}}$ the functions $\mathcal{N}_\theta \colon \mathbb{R}^d \to \mathbb{R}$ are realisations of fully connected feedforward artificial neural networks with $\mathbf{L}+1$ layers consisting of an input layer of dimension $l_0 = d$, of $\mathbf{L}-1$ hidden layers of dimensions $l_1, \ldots, l_{\mathbf{L}-1}$, respectively, and of an output layer of dimension $l_{\mathbf{L}} = 1$ (cf. Definition 2.8). The weights and biases stored in the DNN parameter vector $\theta \in \mathbb{R}^{\mathbf{d}}$ determine the corresponding $\mathbf{L}$ affine linear transformations (cf. (1) above). As activation functions we employ the multidimensional versions $a_1, \ldots, a_{\mathbf{L}-1}$ (cf. Definition 2.3) of the rectifier function $\mathbb{R} \ni x \mapsto \max\{x, 0\} \in \mathbb{R}$ (cf. Definition 2.4) just in front of each of the hidden layers and the clipping function $a_{\mathbf{L}}$ (cf. Definition 2.6) just in front of the output layer. Furthermore, observe that we assume the target function $\mathcal{E} \colon [0,1]^d \to [0,1]$, the values of which we intend to approximately predict with the trained DNN, to be Lipschitz continuous with Lipschitz constant $L$. Moreover, for every $k, n \in \mathbb{N}_0$, $J \in \mathbb{N}$ the function $\mathcal{R}^{k,n}_J \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to [0,\infty)$ is the empirical risk based on the $J$ training samples $(X^{k,n}_j, Y^{k,n}_j)$, $j \in \{1, \ldots, J\}$ (cf. (2) above). Derived from the empirical risk, for every $k, n \in \mathbb{N}$ the function $\mathcal{G}^{k,n} \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to \mathbb{R}^{\mathbf{d}}$ is a (generalised) gradient of the empirical risk $\mathcal{R}^{k,n}_J$ with respect to its first argument, that is, with respect to the DNN parameter vector $\theta \in \mathbb{R}^{\mathbf{d}}$. These gradients are required in order to formulate the training dynamics of the (random) DNN parameter vectors $\Theta^{k,n} \in \mathbb{R}^{\mathbf{d}}$, $k \in \mathbb{N}$, $n \in \mathbb{N}_0$, given by the SGD optimisation method with learning rate $\gamma$. Note that the subscript $n \in \mathbb{N}_0$ of these SGD iterates (i.e., DNN parameter vectors) is the current training step number, whereas the subscript $k \in \mathbb{N}$ counts the number of times the SGD iteration has been started from scratch so far. Such a new start entails the corresponding initial DNN parameter vector $\Theta^{k,0} \in \mathbb{R}^{\mathbf{d}}$ to be drawn continuous uniformly from the hypercube $[-c, c]^{\mathbf{d}}$, in accordance with Xavier initialisation (cf. Glorot & Bengio [41]). The (random) double index $\mathbf{k} \in \mathbb{N} \times \mathbb{N}_0$ represents the final choice made for the DNN parameter vector $\Theta^{\mathbf{k}} \in \mathbb{R}^{\mathbf{d}}$ (cf. (4) above), concluding the training procedure, and is selected as follows. During training the empirical risk $\mathcal{R}^{0,0}_M$ has been calculated for the subset of the SGD iterates indexed by $\mathbf{N} \subseteq \{0, \ldots, N\}$ provided that they have not left the hypercube $[-c, c]^{\mathbf{d}}$ (cf. (3) above). After the SGD iteration has been started and finished $K$ times (with maximally $N$ training steps in each case) the final choice for the DNN parameter vector $\Theta^{\mathbf{k}} \in \mathbb{R}^{\mathbf{d}}$ is made among those SGD iterates for which the calculated empirical risk is minimal (cf. (3) above). Observe that we use mini-batches of size $J$ consisting, during SGD iteration number $k \in \{1, \ldots, K\}$ for training step number $n \in \{1, \ldots, N\}$, of the training samples $(X^{k,n}_j, Y^{k,n}_j)$, $j \in \{1, \ldots, J\}$, and that we reserve the $M$ training samples $(X^{0,0}_j, Y^{0,0}_j)$, $j \in \{1, \ldots, M\}$, for checking the value of the empirical risk $\mathcal{R}^{0,0}_M$. Regarding the conclusion of Theorem 1.1, note that the left hand side of (4) is the expectation of the overall $L^1$-error, that is, the expected $L^1$-distance between the trained DNN $\mathcal{N}_{\Theta^{\mathbf{k}}}$ and the target function $\mathcal{E}$. It is bounded from above by the right hand side of (4), which consists of the following three summands: (i) the first summand corresponds to the approximation error and converges to zero as the number of hidden layers $\mathbf{L}-1$ as well as the hidden layer dimensions $l_1, \ldots, l_{\mathbf{L}-1}$ increase to infinity, (ii) the second summand corresponds to the generalisation error and converges to zero as the number of training samples $M$ used for calculating the empirical risk increases to infinity, and (iii) the third summand corresponds to the optimisation error and converges to zero as the total number of times $K$ the SGD iteration has been started from scratch increases to infinity. We would like to point



out that the second summand (corresponding to the generalisation error) does not suffer under the curse of dimensionality with respect to any of the variables involved.
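The training and selection procedure described above can be sketched in code. The following is a minimal illustrative sketch under stated assumptions (plain NumPy, a toy one-parameter model, and placeholder hyperparameters; it is not the authors' implementation): $K$ restarts of SGD from uniform random initialisations on the hypercube, with the final parameter chosen as the recorded iterate of minimal empirical risk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_with_random_restarts(grad, dim, c, gamma, K, N, check_risk):
    """Sketch of the scheme in Theorem 1.1: K restarts of SGD, each started
    uniformly at random in [-c, c]^dim; among all recorded iterates that
    stayed inside the hypercube, return the one of minimal empirical risk
    (`check_risk` plays the role of the reserved-sample risk R^{0,0}_M)."""
    best_theta, best_risk = None, np.inf
    for _k in range(K):                        # restart counter k
        theta = rng.uniform(-c, c, size=dim)   # random initialisation
        for n in range(N + 1):                 # training-step counter n
            if np.max(np.abs(theta)) <= c:     # only iterates inside [-c, c]^dim
                r = check_risk(theta)
                if r < best_risk:
                    best_risk, best_theta = r, theta.copy()
            if n < N:
                theta = theta - gamma * grad(theta)  # SGD step, learning rate gamma
    return best_theta, best_risk

# Toy usage: model N_theta(x) = theta * x, target E(x) = 0.7 * x (noise-free).
xs = rng.uniform(0.0, 1.0, size=1000)          # reserved "checking" samples
ys = 0.7 * xs
def emp_risk(theta):                           # empirical quadratic risk on (xs, ys)
    return float(np.mean((theta[0] * xs - ys) ** 2))
def sgd_grad(theta):                           # gradient on a fresh mini-batch of size 16
    xb = rng.uniform(0.0, 1.0, size=16)
    return np.array([np.mean(2.0 * (theta[0] * xb - 0.7 * xb) * xb)])
theta_hat, r_hat = sgd_with_random_restarts(
    sgd_grad, dim=1, c=2.0, gamma=0.5, K=5, N=200, check_risk=emp_risk)
```

Note how the final selection step only uses the random initialisations and the recorded risks, not the SGD dynamics themselves; this mirrors why the analysis of the optimisation error can rely on the Minimum Monte Carlo method alone.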

The main result of this article, Theorem 6.5 in Subsection 6.2, covers, in comparison with Theorem 1.1, the more general cases where $L^p$-norms of the overall $L^2$-error instead of the expectation of the overall $L^1$-error are considered (cf. (167) in Theorem 6.5), where the training samples are not restricted to unit hypercubes, and where a general stochastic approximation algorithm (cf., e.g., Robbins & Monro [83]) with random initialisation is used for optimisation. Our convergence proof for the optimisation error relies, in fact, on the convergence of the Minimum Monte Carlo method (cf. Proposition 5.6 in Section 5) and thus only exploits random initialisation but not the specific dynamics of the employed optimisation method (cf. (155) in the proof of Proposition 6.3). In this regard, note that Theorem 1.1 above also includes the application of deterministic gradient descent instead of SGD for optimisation since we do not assume the samples used for gradient iterations to be i.i.d. Parts of our derivation of Theorem 1.1 and Theorem 6.5, respectively, are inspired by Beck, Jentzen, & Kuckuck [10], Berner, Grohs, & Jentzen [13], and Cucker & Smale [23].

This article is structured in the following way. Section 2 recalls some basic definitions related to DNNs and thereby introduces the corresponding notation we use in the subsequent parts of this article. In Section 3 we examine the approximation error and, in particular, establish a convergence result for the approximation of Lipschitz continuous functions by DNNs. The following section, Section 4, contains our strong convergence analysis of the generalisation error. In Section 5, in turn, we address the optimisation error and derive in connection with this strong convergence rates for the Minimum Monte Carlo method. Finally, we combine in Section 6 a decomposition of the overall error (cf. Subsection 6.1) with our results for the different error sources from Sections 3, 4, and 5 to prove strong convergence results for the overall error. The employed optimisation method is initially allowed to be a general stochastic approximation algorithm with random initialisation (cf. Subsection 6.2) and is afterwards specialised to the setting of SGD with random initialisation (cf. Subsection 6.3).

2 Basics on deep neural networks (DNNs)

In this section we present the mathematical description of DNNs which we use throughout the remainder of this article. It is a vectorised description in the sense that all the weights and biases associated to the DNN under consideration are collected in a single parameter vector $\theta \in \mathbb{R}^{\mathbf{d}}$ with $\mathbf{d} \in \mathbb{N}$ sufficiently large (cf. Definitions 2.2 and 2.8). The content of this section is taken from Beck, Jentzen, & Kuckuck [10, Section 2.1] and is based on well-known material from the scientific literature, see, e.g., Beck et al. [8], Beck, E, & Jentzen [9], Berner, Grohs, & Jentzen [13], E, Han, & Jentzen [30], Goodfellow, Bengio, & Courville [43], and Grohs et al. [46]. In particular, Definition 2.1 is [10, Definition 2.1] (cf., e.g., (25) in [9]), Definition 2.2 is [10, Definition 2.2] (cf., e.g., (26) in [9]), Definition 2.3 is [10, Definition 2.3] (cf., e.g., [46, Definition 2.2]), and Definitions 2.4, 2.5, 2.6, 2.7, and 2.8 are [10, Definitions 2.4, 2.5, 2.6, 2.7, and 2.8] (cf., e.g., [13, Setting 2.5] and [43, Section 6.3]).

2.1 Vectorised description of DNNs

Definition 2.1 (Affine function). Let $\mathbf{d}, m, n \in \mathbb{N}$, $s \in \mathbb{N}_0$, $\theta = (\theta_1, \theta_2, \ldots, \theta_{\mathbf{d}}) \in \mathbb{R}^{\mathbf{d}}$ satisfy $\mathbf{d} \ge s + mn + m$. Then we denote by $\mathcal{A}^{\theta,s}_{m,n} \colon \mathbb{R}^n \to \mathbb{R}^m$ the function which satisfies for all $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ that

\[
\mathcal{A}^{\theta,s}_{m,n}(x) =
\begin{pmatrix}
\theta_{s+1} & \theta_{s+2} & \cdots & \theta_{s+n} \\
\theta_{s+n+1} & \theta_{s+n+2} & \cdots & \theta_{s+2n} \\
\theta_{s+2n+1} & \theta_{s+2n+2} & \cdots & \theta_{s+3n} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{s+(m-1)n+1} & \theta_{s+(m-1)n+2} & \cdots & \theta_{s+mn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
+
\begin{pmatrix} \theta_{s+mn+1} \\ \theta_{s+mn+2} \\ \theta_{s+mn+3} \\ \vdots \\ \theta_{s+mn+m} \end{pmatrix}
\tag{5}
\]

\[
= \left( \left[ \sum_{i=1}^{n} \theta_{s+i} x_i \right] + \theta_{s+mn+1},\;
\left[ \sum_{i=1}^{n} \theta_{s+n+i} x_i \right] + \theta_{s+mn+2},\; \ldots,\;
\left[ \sum_{i=1}^{n} \theta_{s+(m-1)n+i} x_i \right] + \theta_{s+mn+m} \right).
\]

Definition 2.2 (Fully connected feedforward artificial neural network). Let $\mathbf{d}, L, l_0, l_1, \ldots, l_L \in \mathbb{N}$, $s \in \mathbb{N}_0$, $\theta \in \mathbb{R}^{\mathbf{d}}$ satisfy $\mathbf{d} \ge s + \sum_{i=1}^{L} l_i(l_{i-1}+1)$ and let $a_i \colon \mathbb{R}^{l_i} \to \mathbb{R}^{l_i}$, $i \in \{1, 2, \ldots, L\}$, be functions. Then we denote by $\mathcal{N}^{\theta,s,l_0}_{a_1,a_2,\ldots,a_L} \colon \mathbb{R}^{l_0} \to \mathbb{R}^{l_L}$ the function which satisfies for all $x \in \mathbb{R}^{l_0}$ that

\[
\big( \mathcal{N}^{\theta,s,l_0}_{a_1,a_2,\ldots,a_L} \big)(x)
= \big( a_L \circ \mathcal{A}^{\theta,\, s + \sum_{i=1}^{L-1} l_i(l_{i-1}+1)}_{l_L,\, l_{L-1}} \circ a_{L-1} \circ \mathcal{A}^{\theta,\, s + \sum_{i=1}^{L-2} l_i(l_{i-1}+1)}_{l_{L-1},\, l_{L-2}} \circ \ldots \circ a_2 \circ \mathcal{A}^{\theta,\, s + l_1(l_0+1)}_{l_2,\, l_1} \circ a_1 \circ \mathcal{A}^{\theta,s}_{l_1,\, l_0} \big)(x)
\tag{6}
\]

(cf. Definition 2.1).

2.2 Activation functions

Definition 2.3 (Multidimensional version). Let $d \in \mathbb{N}$ and let $a \colon \mathbb{R} \to \mathbb{R}$ be a function. Then we denote by $M_{a,d} \colon \mathbb{R}^d \to \mathbb{R}^d$ the function which satisfies for all $x = (x_1, x_2, \ldots, x_d) \in \mathbb{R}^d$ that

\[ M_{a,d}(x) = (a(x_1), a(x_2), \ldots, a(x_d)). \tag{7} \]

Definition 2.4 (Rectifier function). We denote by $r \colon \mathbb{R} \to \mathbb{R}$ the function which satisfies for all $x \in \mathbb{R}$ that

\[ r(x) = \max\{x, 0\}. \tag{8} \]

Definition 2.5 (Multidimensional rectifier function). Let $d \in \mathbb{N}$. Then we denote by $R_d \colon \mathbb{R}^d \to \mathbb{R}^d$ the function given by

\[ R_d = M_{r,d} \tag{9} \]

(cf. Definitions 2.3 and 2.4).

Definition 2.6 (Clipping function). Let $u \in [-\infty,\infty)$, $v \in (u,\infty]$. Then we denote by $c_{u,v} \colon \mathbb{R} \to \mathbb{R}$ the function which satisfies for all $x \in \mathbb{R}$ that

\[ c_{u,v}(x) = \max\{u, \min\{x, v\}\}. \tag{10} \]

Definition 2.7 (Multidimensional clipping function). Let $d \in \mathbb{N}$, $u \in [-\infty,\infty)$, $v \in (u,\infty]$. Then we denote by $C_{u,v,d} \colon \mathbb{R}^d \to \mathbb{R}^d$ the function given by

\[ C_{u,v,d} = M_{c_{u,v},d} \tag{11} \]

(cf. Definitions 2.3 and 2.6).



2.3 Rectified DNNs

Definition 2.8 (Rectified clipped DNN). Let $\mathbf{d}, L \in \mathbb{N}$, $u \in [-\infty,\infty)$, $v \in (u,\infty]$, $l = (l_0, l_1, \ldots, l_L) \in \mathbb{N}^{L+1}$, $\theta \in \mathbb{R}^{\mathbf{d}}$ satisfy $\mathbf{d} \ge \sum_{i=1}^{L} l_i(l_{i-1}+1)$. Then we denote by $\mathcal{N}^{\theta,l}_{u,v} \colon \mathbb{R}^{l_0} \to \mathbb{R}^{l_L}$ the function which satisfies for all $x \in \mathbb{R}^{l_0}$ that

\[
\mathcal{N}^{\theta,l}_{u,v}(x) =
\begin{cases}
\big( \mathcal{N}^{\theta,0,l_0}_{C_{u,v,l_L}} \big)(x) & \colon L = 1 \\[0.5em]
\big( \mathcal{N}^{\theta,0,l_0}_{R_{l_1}, R_{l_2}, \ldots, R_{l_{L-1}}, C_{u,v,l_L}} \big)(x) & \colon L > 1
\end{cases}
\tag{12}
\]

(cf. Definitions 2.2, 2.5, and 2.7).
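To make the vectorised description concrete, here is an illustrative NumPy sketch (a reimplementation for exposition, not code from the article) of the rectified clipped DNN $\mathcal{N}^{\theta,l}_{u,v}$: the flat parameter vector is consumed block by block, each block of $l_i(l_{i-1}+1)$ entries supplying one affine map as in Definition 2.1, with the rectifier in front of each hidden layer and the clipping function in front of the output layer.

```python
import numpy as np

def affine(theta, s, m, n, x):
    """A^{theta,s}_{m,n}(x): m*n weight entries theta[s:s+m*n] read row-wise,
    followed by the m bias entries theta[s+m*n:s+m*n+m] (cf. Definition 2.1)."""
    W = theta[s:s + m * n].reshape(m, n)
    b = theta[s + m * n:s + m * n + m]
    return W @ x + b

def rectified_clipped_dnn(theta, l, x, u=0.0, v=1.0):
    """N^{theta,l}_{u,v}(x) for layer dimensions l = (l_0, ..., l_L)."""
    s = 0
    for i in range(1, len(l)):
        x = affine(theta, s, l[i], l[i - 1], x)
        s += l[i] * (l[i - 1] + 1)        # advance offset by l_i (l_{i-1} + 1)
        if i < len(l) - 1:
            x = np.maximum(x, 0.0)        # rectifier r(y) = max{y, 0}
        else:
            x = np.clip(x, u, v)          # clipping function c_{u,v}
    return x

# l = (2, 3, 1) needs 3*(2+1) + 1*(3+1) = 13 parameters; with all weights
# zero and output bias 2, the output is clipped to v = 1.
theta = np.zeros(13)
theta[12] = 2.0
out = rectified_clipped_dnn(theta, (2, 3, 1), np.array([0.5, 0.5]))
```

With $u = 0$ and $v = 1$ this matches the activations employed in Theorem 1.1.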

3 Analysis of the approximation error

This section is devoted to establishing a convergence result for the approximation of Lipschitz continuous functions by DNNs (cf. Proposition 3.5). More precisely, Proposition 3.5 establishes that a Lipschitz continuous function defined on a $d$-dimensional hypercube for $d \in \mathbb{N}$ can be approximated by DNNs with convergence rate $1/d$ with respect to a parameter $A \in (0,\infty)$ that bounds the architecture size (that is, depth and width) of the approximating DNN from below. Key ingredients of the proof of Proposition 3.5 are Beck, Jentzen, & Kuckuck [10, Corollary 3.8] as well as the elementary covering number estimate in Lemma 3.3. In order to improve the accessibility of Lemma 3.3, we recall the definition of covering numbers associated to a metric space in Definition 3.2, which is [10, Definition 3.11]. Lemma 3.3 provides upper bounds for the covering numbers of hypercubes equipped with the metric induced by the $p$-norm (cf. Definition 3.1) for $p \in [1,\infty]$ and is a generalisation of Berner, Grohs, & Jentzen [13, Lemma 2.7] (cf. Cucker & Smale [23, Proposition 5] and [10, Proposition 3.12]). Furthermore, we present in Lemma 3.4 an elementary upper bound for the error arising when Lipschitz continuous functions defined on a hypercube are approximated by certain DNNs. Additional DNN approximation results can be found, e.g., in [3, 5, 6, 14–17, 19–21, 24, 27, 34–37, 39, 44–51, 53–59, 61, 64, 66, 68–72, 75–80, 82, 84–86, 89–91, 93, 95, 96] and the references therein.

3.1 A covering number estimate

Definition 3.1 ($p$-norm). We denote by $\|\cdot\|_p \colon (\bigcup_{d=1}^{\infty} \mathbb{R}^d) \to [0,\infty)$, $p \in [1,\infty]$, the functions which satisfy for all $p \in [1,\infty)$, $d \in \mathbb{N}$, $\theta = (\theta_1, \theta_2, \ldots, \theta_d) \in \mathbb{R}^d$ that

\[
\|\theta\|_p = \left( \sum_{i=1}^{d} |\theta_i|^p \right)^{1/p}
\qquad \text{and} \qquad
\|\theta\|_\infty = \max_{i \in \{1,2,\ldots,d\}} |\theta_i|.
\tag{13}
\]

Definition 3.2 (Covering number). Let $(E, \delta)$ be a metric space and let $r \in [0,\infty]$. Then we denote by $\mathcal{C}_{(E,\delta),r} \in \mathbb{N}_0 \cup \{\infty\}$ (we denote by $\mathcal{C}_{E,r} \in \mathbb{N}_0 \cup \{\infty\}$) the extended real number given by

\[
\mathcal{C}_{(E,\delta),r} = \inf\!\left( \left\{ n \in \mathbb{N}_0 \colon \left[ \exists\, A \subseteq E \colon \big( (|A| \le n) \wedge (\forall\, x \in E \colon \exists\, a \in A \colon \delta(a,x) \le r) \big) \right] \right\} \cup \{\infty\} \right).
\tag{14}
\]

Lemma 3.3. Let $d \in \mathbb{N}$, $a \in \mathbb{R}$, $b \in (a,\infty)$, $r \in (0,\infty)$, for every $p \in [1,\infty]$ let $\delta_p \colon ([a,b]^d) \times ([a,b]^d) \to [0,\infty)$ satisfy for all $x, y \in [a,b]^d$ that $\delta_p(x,y) = \|x-y\|_p$, and let $\lceil\cdot\rceil \colon [0,\infty) \to \mathbb{N}_0$ satisfy for all $x \in [0,\infty)$ that $\lceil x \rceil = \min([x,\infty) \cap \mathbb{N}_0)$ (cf. Definition 3.1). Then



(i) it holds for all $p \in [1,\infty)$ that

\[
\mathcal{C}_{([a,b]^d,\delta_p),r} \le \left( \left\lceil \tfrac{d^{1/p}(b-a)}{2r} \right\rceil \right)^{\!d}
\le
\begin{cases}
1 & \colon r \ge \tfrac{d(b-a)}{2} \\[0.5em]
\left( \tfrac{d(b-a)}{r} \right)^{\!d} & \colon r < \tfrac{d(b-a)}{2}
\end{cases}
\tag{15}
\]

and

(ii) it holds that

\[
\mathcal{C}_{([a,b]^d,\delta_\infty),r} \le \left( \left\lceil \tfrac{b-a}{2r} \right\rceil \right)^{\!d}
\le
\begin{cases}
1 & \colon r \ge \tfrac{b-a}{2} \\[0.5em]
\left( \tfrac{b-a}{r} \right)^{\!d} & \colon r < \tfrac{b-a}{2}
\end{cases}
\tag{16}
\]

(cf. Definition 3.2).

Proof of Lemma 3.3. Throughout this proof let $(N_p)_{p \in [1,\infty]} \subseteq \mathbb{N}$ satisfy for all $p \in [1,\infty)$ that

\[
N_p = \left\lceil \tfrac{d^{1/p}(b-a)}{2r} \right\rceil
\qquad \text{and} \qquad
N_\infty = \left\lceil \tfrac{b-a}{2r} \right\rceil,
\tag{17}
\]

for every $N \in \mathbb{N}$, $i \in \{1, 2, \ldots, N\}$ let $g_{N,i} \in [a,b]$ be given by $g_{N,i} = a + \frac{(i-1/2)(b-a)}{N}$, and for every $p \in [1,\infty]$ let $A_p \subseteq [a,b]^d$ be given by $A_p = \{g_{N_p,1}, g_{N_p,2}, \ldots, g_{N_p,N_p}\}^d$. Observe that it holds for all $N \in \mathbb{N}$, $i \in \{1, 2, \ldots, N\}$, $x \in [a + \frac{(i-1)(b-a)}{N}, g_{N,i}]$ that

\[
|x - g_{N,i}| = a + \tfrac{(i-1/2)(b-a)}{N} - x \le a + \tfrac{(i-1/2)(b-a)}{N} - \left( a + \tfrac{(i-1)(b-a)}{N} \right) = \tfrac{b-a}{2N}.
\tag{18}
\]

In addition, note that it holds for all $N \in \mathbb{N}$, $i \in \{1, 2, \ldots, N\}$, $x \in [g_{N,i}, a + \frac{i(b-a)}{N}]$ that

\[
|x - g_{N,i}| = x - \left( a + \tfrac{(i-1/2)(b-a)}{N} \right) \le a + \tfrac{i(b-a)}{N} - \left( a + \tfrac{(i-1/2)(b-a)}{N} \right) = \tfrac{b-a}{2N}.
\tag{19}
\]

Combining (18) and (19) implies for all $N \in \mathbb{N}$, $i \in \{1, 2, \ldots, N\}$, $x \in [a + \frac{(i-1)(b-a)}{N}, a + \frac{i(b-a)}{N}]$ that $|x - g_{N,i}| \le \frac{b-a}{2N}$. This proves that for every $N \in \mathbb{N}$, $x \in [a,b]$ there exists $y \in \{g_{N,1}, g_{N,2}, \ldots, g_{N,N}\}$ such that

\[
|x - y| \le \tfrac{b-a}{2N}.
\tag{20}
\]

This, in turn, establishes that for every $p \in [1,\infty)$, $x = (x_1, x_2, \ldots, x_d) \in [a,b]^d$ there exists $y = (y_1, y_2, \ldots, y_d) \in A_p$ such that

\[
\delta_p(x,y) = \|x-y\|_p = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}
\le \left( \sum_{i=1}^{d} \tfrac{(b-a)^p}{(2N_p)^p} \right)^{1/p}
= \tfrac{d^{1/p}(b-a)}{2N_p}
\le \tfrac{d^{1/p}(b-a)\, 2r}{2\, d^{1/p}(b-a)} = r.
\tag{21}
\]

Furthermore, again (20) shows that for every $x = (x_1, x_2, \ldots, x_d) \in [a,b]^d$ there exists $y = (y_1, y_2, \ldots, y_d) \in A_\infty$ such that

\[
\delta_\infty(x,y) = \|x-y\|_\infty = \max_{i \in \{1,2,\ldots,d\}} |x_i - y_i|
\le \tfrac{b-a}{2N_\infty}
\le \tfrac{(b-a)\, 2r}{2(b-a)} = r.
\tag{22}
\]

Note that (21), (17), and the fact that $\forall\, x \in [0,\infty) \colon \lceil x \rceil \le \mathbb{1}_{(0,1]}(x) + 2x\,\mathbb{1}_{(1,\infty)}(x) = \mathbb{1}_{(0,r]}(rx) + 2x\,\mathbb{1}_{(r,\infty)}(rx)$ yield for all $p \in [1,\infty)$ that

\[
\begin{split}
\mathcal{C}_{([a,b]^d,\delta_p),r} &\le |A_p| = (N_p)^d = \left( \left\lceil \tfrac{d^{1/p}(b-a)}{2r} \right\rceil \right)^{\!d} \le \left( \left\lceil \tfrac{d(b-a)}{2r} \right\rceil \right)^{\!d} \\
&\le \left( \mathbb{1}_{(0,r]}\!\left( \tfrac{d(b-a)}{2} \right) + \tfrac{2d(b-a)}{2r}\, \mathbb{1}_{(r,\infty)}\!\left( \tfrac{d(b-a)}{2} \right) \right)^{\!d}
= \mathbb{1}_{(0,r]}\!\left( \tfrac{d(b-a)}{2} \right) + \left( \tfrac{d(b-a)}{r} \right)^{\!d} \mathbb{1}_{(r,\infty)}\!\left( \tfrac{d(b-a)}{2} \right).
\end{split}
\tag{23}
\]

This proves (i). In addition, (22), (17), and again the fact that $\forall\, x \in [0,\infty) \colon \lceil x \rceil \le \mathbb{1}_{(0,r]}(rx) + 2x\,\mathbb{1}_{(r,\infty)}(rx)$ demonstrate that

\[
\mathcal{C}_{([a,b]^d,\delta_\infty),r} \le |A_\infty| = (N_\infty)^d = \left( \left\lceil \tfrac{b-a}{2r} \right\rceil \right)^{\!d}
\le \mathbb{1}_{(0,r]}\!\left( \tfrac{b-a}{2} \right) + \left( \tfrac{b-a}{r} \right)^{\!d} \mathbb{1}_{(r,\infty)}\!\left( \tfrac{b-a}{2} \right).
\tag{24}
\]

This implies (ii) and thus completes the proof of Lemma 3.3.
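The grid construction in this proof is easy to check numerically. The sketch below (an illustrative verification, not part of the article; the parameter values are arbitrary) builds the midpoint grid $A_\infty$ and confirms both the cardinality bound $\lceil (b-a)/(2r) \rceil^d$ from (16) and that every sampled point of $[a,b]^d$ lies within sup-distance $r$ of the grid.

```python
import itertools
import math

import numpy as np

def covering_grid_sup(a, b, d, r):
    """Midpoint grid realising C_{([a,b]^d, delta_inf), r} <= ceil((b-a)/(2r))^d."""
    N = max(1, math.ceil((b - a) / (2 * r)))
    pts_1d = [a + (i - 0.5) * (b - a) / N for i in range(1, N + 1)]  # g_{N,i}
    return np.array(list(itertools.product(pts_1d, repeat=d)))

rng = np.random.default_rng(1)
a, b, d, r = 0.0, 1.0, 3, 0.2
grid = covering_grid_sup(a, b, d, r)            # 3^3 = 27 points here
sample = rng.uniform(a, b, size=(1000, d))
# sup-distance from each sample point to its nearest grid point
dists = np.min(np.max(np.abs(sample[:, None, :] - grid[None, :, :]), axis=2), axis=1)
```

Each one-dimensional cell has half-width $(b-a)/(2N) \le r$, which is exactly the estimate (18)-(20) above.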



3.2 Convergence rates for the approximation error

Lemma 3.4. Let $d, \mathbf{d}, \mathbf{L} \in \mathbb{N}$, $L, a \in \mathbb{R}$, $b \in (a,\infty)$, $u \in [-\infty,\infty)$, $v \in (u,\infty]$, $l = (l_0, l_1, \ldots, l_{\mathbf{L}}) \in \mathbb{N}^{\mathbf{L}+1}$, assume $l_0 = d$, $l_{\mathbf{L}} = 1$, and $\mathbf{d} \ge \sum_{i=1}^{\mathbf{L}} l_i(l_{i-1}+1)$, and let $f \colon [a,b]^d \to ([u,v] \cap \mathbb{R})$ satisfy for all $x, y \in [a,b]^d$ that $|f(x) - f(y)| \le L\|x-y\|_1$ (cf. Definition 3.1). Then there exists $\vartheta \in \mathbb{R}^{\mathbf{d}}$ such that $\|\vartheta\|_\infty \le \sup_{x \in [a,b]^d} |f(x)|$ and

\[
\sup_{x \in [a,b]^d} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - f(x)| \le \tfrac{dL(b-a)}{2}
\tag{25}
\]

(cf. Definition 2.8).

Proof of Lemma 3.4. Throughout this proof let $\mathfrak{d} \in \mathbb{N}$ be given by $\mathfrak{d} = \sum_{i=1}^{\mathbf{L}} l_i(l_{i-1}+1)$, let $m = (m_1, m_2, \ldots, m_d) \in [a,b]^d$ satisfy for all $i \in \{1, 2, \ldots, d\}$ that $m_i = \frac{a+b}{2}$, and let $\vartheta = (\vartheta_1, \vartheta_2, \ldots, \vartheta_{\mathbf{d}}) \in \mathbb{R}^{\mathbf{d}}$ satisfy for all $i \in \{1, 2, \ldots, \mathbf{d}\} \setminus \{\mathfrak{d}\}$ that $\vartheta_i = 0$ and $\vartheta_{\mathfrak{d}} = f(m)$. Observe that the assumption that $l_{\mathbf{L}} = 1$ and the fact that $\forall\, i \in \{1, 2, \ldots, \mathfrak{d}-1\} \colon \vartheta_i = 0$ show for all $x = (x_1, x_2, \ldots, x_{l_{\mathbf{L}-1}}) \in \mathbb{R}^{l_{\mathbf{L}-1}}$ that

\[
\begin{split}
\mathcal{A}^{\vartheta,\, \sum_{j=1}^{\mathbf{L}-1} l_j(l_{j-1}+1)}_{1,\, l_{\mathbf{L}-1}}(x)
&= \left[ \sum_{i=1}^{l_{\mathbf{L}-1}} \vartheta_{[\sum_{j=1}^{\mathbf{L}-1} l_j(l_{j-1}+1)] + i}\, x_i \right] + \vartheta_{[\sum_{j=1}^{\mathbf{L}-1} l_j(l_{j-1}+1)] + l_{\mathbf{L}-1} + 1} \\
&= \left[ \sum_{i=1}^{l_{\mathbf{L}-1}} \vartheta_{[\sum_{j=1}^{\mathbf{L}} l_j(l_{j-1}+1)] - (l_{\mathbf{L}-1} - i + 1)}\, x_i \right] + \vartheta_{\sum_{j=1}^{\mathbf{L}} l_j(l_{j-1}+1)} \\
&= \left[ \sum_{i=1}^{l_{\mathbf{L}-1}} \vartheta_{\mathfrak{d} - (l_{\mathbf{L}-1} - i + 1)}\, x_i \right] + \vartheta_{\mathfrak{d}} = \vartheta_{\mathfrak{d}} = f(m)
\end{split}
\tag{26}
\]

(cf. Definition 2.1). Combining this with the fact that $f(m) \in [u,v]$ ensures for all $x \in \mathbb{R}^{l_{\mathbf{L}-1}}$ that

\[
\big( C_{u,v,l_{\mathbf{L}}} \circ \mathcal{A}^{\vartheta,\, \sum_{j=1}^{\mathbf{L}-1} l_j(l_{j-1}+1)}_{l_{\mathbf{L}},\, l_{\mathbf{L}-1}} \big)(x)
= \big( C_{u,v,1} \circ \mathcal{A}^{\vartheta,\, \sum_{j=1}^{\mathbf{L}-1} l_j(l_{j-1}+1)}_{1,\, l_{\mathbf{L}-1}} \big)(x)
= c_{u,v}(f(m))
= \max\{u, \min\{f(m), v\}\} = \max\{u, f(m)\} = f(m)
\tag{27}
\]

(cf. Definitions 2.6 and 2.7). This proves for all $x \in \mathbb{R}^d$ that

\[
\mathcal{N}^{\vartheta,l}_{u,v}(x) = f(m).
\tag{28}
\]

In addition, note that it holds for all $x \in [a, m_1]$, $y \in [m_1, b]$ that $|m_1 - x| = m_1 - x = \frac{a+b}{2} - x \le \frac{a+b}{2} - a = \frac{b-a}{2}$ and $|m_1 - y| = y - m_1 = y - \frac{a+b}{2} \le b - \frac{a+b}{2} = \frac{b-a}{2}$. The assumption that $\forall\, x, y \in [a,b]^d \colon |f(x) - f(y)| \le L\|x-y\|_1$ and (28) hence demonstrate for all $x = (x_1, x_2, \ldots, x_d) \in [a,b]^d$ that

\[
|\mathcal{N}^{\vartheta,l}_{u,v}(x) - f(x)| = |f(m) - f(x)| \le L\|m - x\|_1
= L \sum_{i=1}^{d} |m_i - x_i|
= L \sum_{i=1}^{d} |m_1 - x_i|
\le \sum_{i=1}^{d} \tfrac{L(b-a)}{2} = \tfrac{dL(b-a)}{2}.
\tag{29}
\]

This and the fact that $\|\vartheta\|_\infty = \max_{i \in \{1,2,\ldots,\mathbf{d}\}} |\vartheta_i| = |f(m)| \le \sup_{x \in [a,b]^d} |f(x)|$ complete the proof of Lemma 3.4.
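The constant network built in this proof can be tested directly: the midpoint predictor $f(m)$ satisfies $\sup_x |f(m) - f(x)| \le \frac{dL(b-a)}{2}$ whenever $f$ is $L$-Lipschitz with respect to $\|\cdot\|_1$. A small numerical check (illustrative only; the function $f$ and all parameter values are arbitrary choices, not from the article):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, d = 0.0, 1.0, 4
L = 1.0 / d                    # |f(x) - f(y)| <= (1/d) ||x - y||_1 for the f below
def f(x):
    return float(np.mean(np.abs(x - 0.3)))

m = np.full(d, (a + b) / 2)    # midpoint of the hypercube [a, b]^d
f_m = f(m)                     # the constant value realised by the DNN of Lemma 3.4
sample = rng.uniform(a, b, size=(5000, d))
errs = [abs(f_m - f(x)) for x in sample]
bound = d * L * (b - a) / 2    # right-hand side of (25); here 0.5
```

The observed maximal deviation over the sample always stays below the bound, as guaranteed by (29).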

Proposition 3.5. Let $d, \mathbf{d}, \mathbf{L} \in \mathbb{N}$, $A \in (0,\infty)$, $L, a \in \mathbb{R}$, $b \in (a,\infty)$, $u \in [-\infty,\infty)$, $v \in (u,\infty]$, $l = (l_0, l_1, \ldots, l_{\mathbf{L}}) \in \mathbb{N}^{\mathbf{L}+1}$, assume $\mathbf{L} \ge \frac{A \mathbb{1}_{(6^d,\infty)}(A)}{2d} + 1$, $l_0 = d$, $l_1 \ge A \mathbb{1}_{(6^d,\infty)}(A)$, $l_{\mathbf{L}} = 1$, and $\mathbf{d} \ge \sum_{i=1}^{\mathbf{L}} l_i(l_{i-1}+1)$, assume for all $i \in \{2, 3, \ldots\} \cap [0, \mathbf{L})$ that $l_i \ge \mathbb{1}_{(6^d,\infty)}(A) \max\{A/d - 2i + 3, 2\}$, and let $f \colon [a,b]^d \to ([u,v] \cap \mathbb{R})$ satisfy for all $x, y \in [a,b]^d$ that $|f(x) - f(y)| \le L\|x-y\|_1$ (cf. Definition 3.1). Then there exists $\vartheta \in \mathbb{R}^{\mathbf{d}}$ such that $\|\vartheta\|_\infty \le \max\{1, L, |a|, |b|, 2[\sup_{x \in [a,b]^d} |f(x)|]\}$ and

\[
\sup_{x \in [a,b]^d} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - f(x)| \le \tfrac{3dL(b-a)}{A^{1/d}}
\tag{30}
\]

(cf. Definition 2.8).

Proof of Proposition 3.5. Throughout this proof assume w.l.o.g. that $A > 6^d$ (cf. Lemma 3.4), let $N \in \mathbb{N}$ be given by
\[
N = \max\bigl\{n \in \mathbb{N} \colon n \le \bigl(\tfrac{A}{2d}\bigr)^{1/d}\bigr\}, \tag{31}
\]
let $r \in (0,\infty)$ be given by $r = \frac{d(b-a)}{2N}$, let $\delta \colon ([a,b]^d) \times ([a,b]^d) \to [0,\infty)$ satisfy for all $x, y \in [a,b]^d$ that $\delta(x,y) = \|x - y\|_1$, let $\mathscr{D} \subseteq [a,b]^d$ satisfy $|\mathscr{D}| = \max\{2, \mathcal{C}_{([a,b]^d,\delta),r}\}$ and
\[
\sup_{x \in [a,b]^d} \inf_{y \in \mathscr{D}} \delta(x,y) \le r \tag{32}
\]
(cf. Definition 3.2), and let $\lceil \cdot \rceil \colon [0,\infty) \to \mathbb{N}_0$ satisfy for all $x \in [0,\infty)$ that $\lceil x \rceil = \min([x,\infty) \cap \mathbb{N}_0)$. Note that it holds for all $d \in \mathbb{N}$ that
\[
2d \le 2 \cdot 2^{d-1} = 2^d. \tag{33}
\]
This implies that $3^d = \frac{6^d}{2^d} \le \frac{A}{2d}$. Equation (31) hence demonstrates that
\[
2 \le \tfrac{2}{3}\bigl(\tfrac{A}{2d}\bigr)^{1/d} = \bigl(\tfrac{A}{2d}\bigr)^{1/d} - \tfrac{1}{3}\bigl(\tfrac{A}{2d}\bigr)^{1/d} \le \bigl(\tfrac{A}{2d}\bigr)^{1/d} - 1 < N. \tag{34}
\]
This and (i) in Lemma 3.3 (with $\delta \leftarrow \delta$, $p \leftarrow 1$ in the notation of (i) in Lemma 3.3) establish that
\[
|\mathscr{D}| = \max\{2, \mathcal{C}_{([a,b]^d,\delta),r}\} \le \max\bigl\{2, \bigl(\bigl\lceil \tfrac{d(b-a)}{2r} \bigr\rceil\bigr)^d\bigr\} = \max\{2, (\lceil N \rceil)^d\} = N^d. \tag{35}
\]
Combining this with (31) proves that
\[
4 \le 2d|\mathscr{D}| \le 2dN^d \le 2d \cdot \tfrac{A}{2d} = A. \tag{36}
\]
The fact that $\mathbf{L} \ge A\mathbb{1}_{(6^d,\infty)}(A)/(2d) + 1 = \frac{A}{2d} + 1$ hence yields that $|\mathscr{D}| \le \frac{A}{2d} \le \mathbf{L} - 1$. This, (36), and the facts that $l_1 \ge A\mathbb{1}_{(6^d,\infty)}(A) = A$ and $\forall\, i \in \{2,3,\ldots\} \cap [0,\mathbf{L}) = \{2,3,\ldots,\mathbf{L}-1\} \colon l_i \ge \mathbb{1}_{(6^d,\infty)}(A)\max\{A/d - 2i + 3, 2\} = \max\{A/d - 2i + 3, 2\}$ imply for all $i \in \{2, 3, \ldots, |\mathscr{D}|\}$ that
\[
\mathbf{L} \ge |\mathscr{D}| + 1, \qquad l_1 \ge A \ge 2d|\mathscr{D}|, \qquad\text{and}\qquad l_i \ge \tfrac{A}{d} - 2i + 3 \ge 2|\mathscr{D}| - 2i + 3. \tag{37}
\]
In addition, the fact that $\forall\, i \in \{2,3,\ldots\} \cap [0,\mathbf{L}) \colon l_i \ge \max\{A/d - 2i + 3, 2\}$ ensures for all $i \in \mathbb{N} \cap (|\mathscr{D}|, \mathbf{L})$ that
\[
l_i \ge 2. \tag{38}
\]
Furthermore, observe that it holds for all $x = (x_1, x_2, \ldots, x_d), y = (y_1, y_2, \ldots, y_d) \in [a,b]^d$ that
\[
|f(x) - f(y)| \le L\|x - y\|_1 = L\biggl[\sum_{i=1}^{d} |x_i - y_i|\biggr]. \tag{39}
\]


This, the assumptions that $l_0 = d$, $l_{\mathbf{L}} = 1$, and $\mathfrak{d} \ge \sum_{i=1}^{\mathbf{L}} l_i(l_{i-1}+1)$, (37)–(38), and Beck, Jentzen, & Kuckuck [10, Corollary 3.8] (with $d \leftarrow d$, $\mathfrak{d} \leftarrow \mathfrak{d}$, $L \leftarrow L$, $\mathbf{L} \leftarrow \mathbf{L}$, $u \leftarrow u$, $v \leftarrow v$, $D \leftarrow [a,b]^d$, $f \leftarrow f$, $\mathscr{M} \leftarrow \mathscr{D}$, $l \leftarrow l$ in the notation of [10, Corollary 3.8]) show that there exists $\vartheta \in \mathbb{R}^{\mathfrak{d}}$ such that $\|\vartheta\|_\infty \le \max\{1, L, \sup_{x \in \mathscr{D}} \|x\|_\infty, 2[\sup_{x \in \mathscr{D}} |f(x)|]\}$ and
\[
\sup_{x \in [a,b]^d} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - f(x)| \le 2L\biggl[\sup_{x=(x_1,\ldots,x_d) \in [a,b]^d}\biggl(\inf_{y=(y_1,\ldots,y_d) \in \mathscr{D}} \sum_{i=1}^{d} |x_i - y_i|\biggr)\biggr] = 2L\Bigl[\sup_{x \in [a,b]^d} \inf_{y \in \mathscr{D}} \|x - y\|_1\Bigr] = 2L\Bigl[\sup_{x \in [a,b]^d} \inf_{y \in \mathscr{D}} \delta(x,y)\Bigr]. \tag{40}
\]
Note that this demonstrates that
\[
\|\vartheta\|_\infty \le \max\{1, L, |a|, |b|, 2[\sup_{x \in [a,b]^d} |f(x)|]\}. \tag{41}
\]
Moreover, (40) and (32)–(34) prove that
\[
\sup_{x \in [a,b]^d} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - f(x)| \le 2L\Bigl[\sup_{x \in [a,b]^d} \inf_{y \in \mathscr{D}} \delta(x,y)\Bigr] \le 2Lr = \frac{dL(b-a)}{N} \le \frac{dL(b-a)}{\frac{2}{3}\bigl(\frac{A}{2d}\bigr)^{1/d}} = \frac{(2d)^{1/d}\, 3dL(b-a)}{2A^{1/d}} \le \frac{3dL(b-a)}{A^{1/d}}. \tag{42}
\]
Combining this with (41) completes the proof of Proposition 3.5.
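The covering argument behind (40)–(42) can be illustrated numerically. The sketch below (all names and the test function are hypothetical, chosen only for illustration) covers $[a,b]^d$ with the $N^d$ sub-cube centres and checks that the nearest-centre approximation of a Lipschitz function achieves the uniform error $dL(b-a)/N$ used in the proof.

```python
import itertools

# Hedged illustration of the covering argument in the proof of Proposition 3.5:
# the N^d sub-cube centres form an r-covering of [a,b]^d w.r.t. ||.||_1 with
# r = d*(b-a)/(2*N), so a nearest-centre approximation of an L-Lipschitz f
# has uniform error at most 2*L*r = d*L*(b-a)/N.
a, b, d, L, N = 0.0, 1.0, 2, 1.0, 4

def f(x):  # hypothetical 1-Lipschitz test function w.r.t. ||.||_1
    return min(x)

centres = [tuple(a + (b - a) * (2 * k + 1) / (2 * N) for k in idx)
           for idx in itertools.product(range(N), repeat=d)]

def approx(x):  # value of f at the nearest covering centre (l^1 distance)
    nearest = min(centres, key=lambda z: sum(abs(zi - xi) for zi, xi in zip(z, x)))
    return f(nearest)

grid = [a + (b - a) * k / 20 for k in range(21)]
err = max(abs(approx(x) - f(x)) for x in itertools.product(grid, repeat=d))
assert err <= d * L * (b - a) / N
```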

4 Analysis of the generalisation error

In this section we consider the worst-case generalisation error arising in deep learning based empirical risk minimisation with quadratic loss function for DNNs with a fixed architecture and weights and biases bounded in size by a fixed constant (cf. Corollary 4.15 in Subsection 4.3). We prove that this worst-case generalisation error converges in the probabilistically strong sense with rate 1/2 (up to a logarithmic factor) with respect to the number of samples used for calculating the empirical risk and that the constant in the corresponding upper bound for the worst-case generalisation error scales favourably (i.e., only very moderately) in terms of depth and width of the DNNs employed, cf. (ii) in Corollary 4.15. Corollary 4.15 is a consequence of the main result of this section, Proposition 4.14 in Subsection 4.3, which provides a similar conclusion in a more general setting. The proofs of Proposition 4.14 and Corollary 4.15, respectively, rely on the tools developed in the two preceding subsections, Subsections 4.1 and 4.2.

On the one hand, Subsection 4.1 provides an essentially well-known estimate for the $L^p$-error of Monte Carlo-type approximations, cf. Corollary 4.5. Corollary 4.5 is a consequence of the well-known result stated here as Proposition 4.4, which, in turn, follows directly from, e.g., Cox et al. [22, Corollary 5.11] (with $M \leftarrow M$, $q \leftarrow 2$, $(E, \|\cdot\|_E) \leftarrow (\mathbb{R}^d, \|\cdot\|_2)$, $(\Omega, \mathcal{F}, \mathbb{P}) \leftarrow (\Omega, \mathcal{F}, \mathbb{P})$, $(\xi_j)_{j \in \{1,2,\ldots,M\}} \leftarrow (X_j)_{j \in \{1,2,\ldots,M\}}$, $p \leftarrow p$ in the notation of [22, Corollary 5.11] and Proposition 4.4, respectively). In the proof of Corollary 4.5 we also apply Lemma 4.3, which is Grohs et al. [45, Lemma 2.2]. In order to make the statements of Lemma 4.3 and Proposition 4.4 more accessible for the reader, we recall in Definition 4.1 (cf., e.g., [22, Definition 5.1]) the notion of a Rademacher family and in Definition 4.2 (cf., e.g., [22, Definition 5.4] or Gonon et al. [42, Definition 2.1]) the notion of the p-Kahane–Khintchine constant.

On the other hand, we derive in Subsection 4.2 uniform $L^p$-estimates for Lipschitz continuous random fields with a separable metric space as index set (cf. Lemmas 4.10 and 4.11 and Corollary 4.12). These estimates are uniform in the sense that the supremum over the index set is inside the expectation belonging to the $L^p$-norm, which is necessary since we intend to prove error bounds for the worst-case generalisation error, as illustrated above. One of the elementary but crucial arguments in our derivation of such uniform $L^p$-estimates is given in Lemma 4.9 (cf. Lemma 4.8). Roughly speaking, Lemma 4.9 illustrates how the $L^p$-norm of a supremum can be bounded from above by the supremum of certain $L^p$-norms, where the $L^p$-norms are integrating over a general measure space and where the suprema are taken over a general (bounded) separable metric space. Furthermore, the elementary and well-known Lemmas 4.6 and 4.7, respectively, follow immediately from Beck, Jentzen, & Kuckuck [10, (ii) in Lemma 3.13 and (ii) in Lemma 3.14] and ensure that the mathematical statements of Lemmas 4.8, 4.9, and 4.10 do indeed make sense.

The results in Subsections 4.2 and 4.3 are in parts inspired by [10, Subsection 3.2] and we refer, e.g., to [7, 13, 23, 31–33, 52, 67, 87, 92] and the references therein for further results on the generalisation error.

4.1 Monte Carlo estimates

Definition 4.1 (Rademacher family). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $J$ be a set. Then we say that $(r_j)_{j \in J}$ is a $\mathbb{P}$-Rademacher family if and only if it holds that $r_j \colon \Omega \to \{-1, 1\}$, $j \in J$, are independent random variables with $\forall\, j \in J \colon \mathbb{P}(r_j = 1) = \mathbb{P}(r_j = -1)$.

Definition 4.2 (p-Kahane–Khintchine constant). Let $p \in (0,\infty)$. Then we denote by $\mathfrak{K}_p \in (0,\infty]$ the extended real number given by
\[
\mathfrak{K}_p = \sup\left\{ c \in [0,\infty) \colon
\begin{array}{l}
\exists\ \mathbb{R}\text{-Banach space } (E, \|\cdot\|_E) \colon \exists\ \text{probability space } (\Omega, \mathcal{F}, \mathbb{P}) \colon \\
\exists\ \mathbb{P}\text{-Rademacher family } (r_j)_{j \in \mathbb{N}} \colon \exists\, k \in \mathbb{N} \colon \exists\, x_1, x_2, \ldots, x_k \in E \setminus \{0\} \colon \\
\bigl(\mathbb{E}\bigl[\bigl\|\textstyle\sum_{j=1}^{k} r_j x_j\bigr\|_E^p\bigr]\bigr)^{1/p} = c\,\bigl(\mathbb{E}\bigl[\bigl\|\textstyle\sum_{j=1}^{k} r_j x_j\bigr\|_E^2\bigr]\bigr)^{1/2}
\end{array}
\right\} \tag{43}
\]
(cf. Definition 4.1).

Lemma 4.3. It holds for all $p \in [2,\infty)$ that $\mathfrak{K}_p \le \sqrt{p-1} < \infty$ (cf. Definition 4.2).

Proposition 4.4. Let $d, M \in \mathbb{N}$, $p \in [2,\infty)$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X_j \colon \Omega \to \mathbb{R}^d$, $j \in \{1, 2, \ldots, M\}$, be independent random variables, and assume $\max_{j \in \{1,2,\ldots,M\}} \mathbb{E}[\|X_j\|_2] < \infty$ (cf. Definition 3.1). Then
\[
\biggl(\mathbb{E}\biggl[\biggl\|\biggl[\sum_{j=1}^{M} X_j\biggr] - \mathbb{E}\biggl[\sum_{j=1}^{M} X_j\biggr]\biggr\|_2^p\biggr]\biggr)^{1/p} \le 2\mathfrak{K}_p\biggl[\sum_{j=1}^{M} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{2/p}\biggr]^{1/2} \tag{44}
\]
(cf. Definition 4.2 and Lemma 4.3).

Corollary 4.5. Let $d, M \in \mathbb{N}$, $p \in [2,\infty)$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X_j \colon \Omega \to \mathbb{R}^d$, $j \in \{1, 2, \ldots, M\}$, be independent random variables, and assume $\max_{j \in \{1,2,\ldots,M\}} \mathbb{E}[\|X_j\|_2] < \infty$ (cf. Definition 3.1). Then
\[
\biggl(\mathbb{E}\biggl[\biggl\|\frac{1}{M}\biggl[\sum_{j=1}^{M} X_j\biggr] - \mathbb{E}\biggl[\frac{1}{M}\sum_{j=1}^{M} X_j\biggr]\biggr\|_2^p\biggr]\biggr)^{1/p} \le \frac{2\sqrt{p-1}}{\sqrt{M}}\Bigl[\max_{j \in \{1,2,\ldots,M\}} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{1/p}\Bigr]. \tag{45}
\]


Proof of Corollary 4.5. Observe that Proposition 4.4 and Lemma 4.3 imply that
\[
\begin{split}
\biggl(\mathbb{E}\biggl[\biggl\|\frac{1}{M}\biggl[\sum_{j=1}^{M} X_j\biggr] - \mathbb{E}\biggl[\frac{1}{M}\sum_{j=1}^{M} X_j\biggr]\biggr\|_2^p\biggr]\biggr)^{1/p}
&= \frac{1}{M}\biggl(\mathbb{E}\biggl[\biggl\|\biggl[\sum_{j=1}^{M} X_j\biggr] - \mathbb{E}\biggl[\sum_{j=1}^{M} X_j\biggr]\biggr\|_2^p\biggr]\biggr)^{1/p}
\le \frac{2\mathfrak{K}_p}{M}\biggl[\sum_{j=1}^{M} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{2/p}\biggr]^{1/2} \\
&\le \frac{2\mathfrak{K}_p}{M}\Bigl[M\Bigl(\max_{j \in \{1,2,\ldots,M\}} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{2/p}\Bigr)\Bigr]^{1/2} \\
&= \frac{2\mathfrak{K}_p}{\sqrt{M}}\Bigl[\max_{j \in \{1,2,\ldots,M\}} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{1/p}\Bigr]
\le \frac{2\sqrt{p-1}}{\sqrt{M}}\Bigl[\max_{j \in \{1,2,\ldots,M\}} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{1/p}\Bigr]
\end{split}
\tag{46}
\]
(cf. Definition 4.2). The proof of Corollary 4.5 is thus complete.
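A quick simulation (with hypothetical uniform samples, not from the paper) illustrates the Monte Carlo rate in Corollary 4.5: the $L^p$-error of the empirical mean stays below the bound $\frac{2\sqrt{p-1}}{\sqrt{M}}\max_j (\mathbb{E}\|X_j - \mathbb{E}[X_j]\|^p)^{1/p}$ and shrinks as $M$ grows.

```python
import random
import statistics

# Hedged numerical illustration of Corollary 4.5 for d = 1, p = 4, and X_j
# uniform on [0,1]; here (E|X - 1/2|^4)^(1/4) = (1/80)^(1/4).
random.seed(0)
p, runs = 4, 2000

def lp_error(M):  # estimate of (E|mean - E[X]|^p)^(1/p) over many repetitions
    errs = [abs(statistics.fmean(random.random() for _ in range(M)) - 0.5) ** p
            for _ in range(runs)]
    return statistics.fmean(errs) ** (1.0 / p)

def bound(M):  # right-hand side of (45) for this example
    return 2 * (p - 1) ** 0.5 / M ** 0.5 * (1.0 / 80) ** 0.25

e_small, e_large = lp_error(25), lp_error(400)
assert e_small <= bound(25) and e_large <= bound(400)
assert e_large < e_small  # the error decays as M grows
```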

4.2 Uniform strong error estimates for random fields

Lemma 4.6. Let $(E, \mathcal{E})$ be a separable topological space, assume $E \neq \emptyset$, let $(\Omega, \mathcal{F})$ be a measurable space, let $f_x \colon \Omega \to \mathbb{R}$, $x \in E$, be $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable functions, and assume for all $\omega \in \Omega$ that $E \ni x \mapsto f_x(\omega) \in \mathbb{R}$ is a continuous function. Then it holds that the function
\[
\Omega \ni \omega \mapsto \sup_{x \in E} f_x(\omega) \in \mathbb{R} \cup \{\infty\} \tag{47}
\]
is $\mathcal{F}/\mathcal{B}(\mathbb{R} \cup \{\infty\})$-measurable.

Lemma 4.7. Let $(E, \delta)$ be a separable metric space, assume $E \neq \emptyset$, let $L \in \mathbb{R}$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $Z_x \colon \Omega \to \mathbb{R}$, $x \in E$, be random variables, and assume for all $x, y \in E$ that $\mathbb{E}[|Z_x|] < \infty$ and $|Z_x - Z_y| \le L\delta(x,y)$. Then it holds that the function
\[
\Omega \ni \omega \mapsto \sup_{x \in E} |Z_x(\omega) - \mathbb{E}[Z_x]| \in [0,\infty] \tag{48}
\]
is $\mathcal{F}/\mathcal{B}([0,\infty])$-measurable.

Lemma 4.8. Let $(E, \delta)$ be a separable metric space, let $N \in \mathbb{N}$, $p, L, r_1, r_2, \ldots, r_N \in [0,\infty)$, $z_1, z_2, \ldots, z_N \in E$ satisfy $E \subseteq \bigcup_{i=1}^{N} \{x \in E \colon \delta(x, z_i) \le r_i\}$, let $(\Omega, \mathcal{F}, \mu)$ be a measure space, let $Z_x \colon \Omega \to \mathbb{R}$, $x \in E$, be $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable functions, and assume for all $\omega \in \Omega$, $x, y \in E$ that $|Z_x(\omega) - Z_y(\omega)| \le L\delta(x,y)$. Then
\[
\int_\Omega \sup_{x \in E} |Z_x(\omega)|^p\, \mu(d\omega) \le \sum_{i=1}^{N} \int_\Omega \bigl(Lr_i + |Z_{z_i}(\omega)|\bigr)^p\, \mu(d\omega) \tag{49}
\]
(cf. Lemma 4.6).

Proof of Lemma 4.8. Throughout this proof let $B_1, B_2, \ldots, B_N \subseteq E$ satisfy for all $i \in \{1, 2, \ldots, N\}$ that $B_i = \{x \in E \colon \delta(x, z_i) \le r_i\}$. Note that the fact that $E = \bigcup_{i=1}^{N} B_i$ shows for all $\omega \in \Omega$ that
\[
\sup_{x \in E} |Z_x(\omega)| = \sup_{x \in (\cup_{i=1}^{N} B_i)} |Z_x(\omega)| = \max_{i \in \{1,2,\ldots,N\}} \sup_{x \in B_i} |Z_x(\omega)|. \tag{50}
\]
This establishes that
\[
\int_\Omega \sup_{x \in E} |Z_x(\omega)|^p\, \mu(d\omega) = \int_\Omega \max_{i \in \{1,2,\ldots,N\}} \sup_{x \in B_i} |Z_x(\omega)|^p\, \mu(d\omega) \le \int_\Omega \sum_{i=1}^{N} \sup_{x \in B_i} |Z_x(\omega)|^p\, \mu(d\omega) = \sum_{i=1}^{N} \int_\Omega \sup_{x \in B_i} |Z_x(\omega)|^p\, \mu(d\omega). \tag{51}
\]
Furthermore, the assumption that $\forall\, \omega \in \Omega,\, x, y \in E \colon |Z_x(\omega) - Z_y(\omega)| \le L\delta(x,y)$ implies for all $\omega \in \Omega$, $i \in \{1,2,\ldots,N\}$, $x \in B_i$ that
\[
|Z_x(\omega)| = |Z_x(\omega) - Z_{z_i}(\omega) + Z_{z_i}(\omega)| \le |Z_x(\omega) - Z_{z_i}(\omega)| + |Z_{z_i}(\omega)| \le L\delta(x, z_i) + |Z_{z_i}(\omega)| \le Lr_i + |Z_{z_i}(\omega)|. \tag{52}
\]
Combining this with (51) proves that
\[
\int_\Omega \sup_{x \in E} |Z_x(\omega)|^p\, \mu(d\omega) \le \sum_{i=1}^{N} \int_\Omega \bigl(Lr_i + |Z_{z_i}(\omega)|\bigr)^p\, \mu(d\omega). \tag{53}
\]
The proof of Lemma 4.8 is thus complete.

Lemma 4.9. Let $p, L, r \in (0,\infty)$, let $(E, \delta)$ be a separable metric space, let $(\Omega, \mathcal{F}, \mu)$ be a measure space, assume $E \neq \emptyset$ and $\mu(\Omega) \neq 0$, let $Z_x \colon \Omega \to \mathbb{R}$, $x \in E$, be $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable functions, and assume for all $\omega \in \Omega$, $x, y \in E$ that $|Z_x(\omega) - Z_y(\omega)| \le L\delta(x,y)$. Then
\[
\int_\Omega \sup_{x \in E} |Z_x(\omega)|^p\, \mu(d\omega) \le \mathcal{C}_{(E,\delta),r}\Bigl[\sup_{x \in E} \int_\Omega \bigl(Lr + |Z_x(\omega)|\bigr)^p\, \mu(d\omega)\Bigr] \tag{54}
\]
(cf. Definition 3.2 and Lemma 4.6).

Proof of Lemma 4.9. Throughout this proof assume w.l.o.g. that $\mathcal{C}_{(E,\delta),r} < \infty$, let $N \in \mathbb{N}$ be given by $N = \mathcal{C}_{(E,\delta),r}$, and let $z_1, z_2, \ldots, z_N \in E$ satisfy $E \subseteq \bigcup_{i=1}^{N} \{x \in E \colon \delta(x, z_i) \le r\}$. Note that Lemma 4.8 (with $r_1 \leftarrow r$, $r_2 \leftarrow r$, \ldots, $r_N \leftarrow r$ in the notation of Lemma 4.8) establishes that
\[
\int_\Omega \sup_{x \in E} |Z_x(\omega)|^p\, \mu(d\omega) \le \sum_{i=1}^{N} \int_\Omega \bigl(Lr + |Z_{z_i}(\omega)|\bigr)^p\, \mu(d\omega) \le \sum_{i=1}^{N}\Bigl[\sup_{x \in E} \int_\Omega \bigl(Lr + |Z_x(\omega)|\bigr)^p\, \mu(d\omega)\Bigr] = N\Bigl[\sup_{x \in E} \int_\Omega \bigl(Lr + |Z_x(\omega)|\bigr)^p\, \mu(d\omega)\Bigr]. \tag{55}
\]
The proof of Lemma 4.9 is thus complete.
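The covering mechanism of Lemma 4.9 can be probed numerically. In the sketch below (all choices are hypothetical and for illustration only) the index set is $E = [0,1]$ with $\delta(x,y) = |x-y|$, the measure is realised as an empirical average over sample points, and $Z_x(\omega) = x\omega$, so that the Lipschitz constant is $L = 1$.

```python
import math
import random

# Hedged numerical check of Lemma 4.9 with E = [0,1], delta(x,y) = |x - y|,
# Z_x(w) = x*w (Lipschitz constant L = 1 since |w| <= 1), and the integral
# realised as an average over sample points w.
random.seed(1)
omegas = [random.uniform(-1, 1) for _ in range(500)]
p, L, r = 2.0, 1.0, 0.1
xs = [k / 200 for k in range(201)]               # dense grid standing in for E

lhs = sum(max(abs(x * w) for x in xs) ** p for w in omegas) / len(omegas)
cover = math.ceil(1 / (2 * r))                   # balls of radius r covering [0,1]
rhs = cover * max(sum((L * r + abs(x * w)) ** p for w in omegas) / len(omegas)
                  for x in xs)
assert lhs <= rhs
```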

Lemma 4.10. Let $p \in [1,\infty)$, $L, r \in (0,\infty)$, let $(E, \delta)$ be a separable metric space, assume $E \neq \emptyset$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $Z_x \colon \Omega \to \mathbb{R}$, $x \in E$, be random variables, and assume for all $x, y \in E$ that $\mathbb{E}[|Z_x|] < \infty$ and $|Z_x - Z_y| \le L\delta(x,y)$. Then
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} \le (\mathcal{C}_{(E,\delta),r})^{1/p}\Bigl[2Lr + \sup_{x \in E} \bigl(\mathbb{E}\bigl[|Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}\Bigr] \tag{56}
\]
(cf. Definition 3.2 and Lemma 4.7).

Proof of Lemma 4.10. Throughout this proof let $Y_x \colon \Omega \to \mathbb{R}$, $x \in E$, satisfy for all $x \in E$, $\omega \in \Omega$ that $Y_x(\omega) = Z_x(\omega) - \mathbb{E}[Z_x]$. Note that it holds for all $\omega \in \Omega$, $x, y \in E$ that
\[
|Y_x(\omega) - Y_y(\omega)| = |(Z_x(\omega) - \mathbb{E}[Z_x]) - (Z_y(\omega) - \mathbb{E}[Z_y])| \le |Z_x(\omega) - Z_y(\omega)| + |\mathbb{E}[Z_x] - \mathbb{E}[Z_y]| \le L\delta(x,y) + \mathbb{E}[|Z_x - Z_y|] \le 2L\delta(x,y). \tag{57}
\]
Combining this with Lemma 4.9 (with $L \leftarrow 2L$, $(\Omega, \mathcal{F}, \mu) \leftarrow (\Omega, \mathcal{F}, \mathbb{P})$, $(Z_x)_{x \in E} \leftarrow (Y_x)_{x \in E}$ in the notation of Lemma 4.9) implies that
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} = \bigl(\mathbb{E}\bigl[\sup_{x \in E} |Y_x|^p\bigr]\bigr)^{1/p} \le (\mathcal{C}_{(E,\delta),r})^{1/p}\Bigl[\sup_{x \in E}\bigl(\mathbb{E}\bigl[(2Lr + |Y_x|)^p\bigr]\bigr)^{1/p}\Bigr] \le (\mathcal{C}_{(E,\delta),r})^{1/p}\Bigl[2Lr + \sup_{x \in E}\bigl(\mathbb{E}\bigl[|Y_x|^p\bigr]\bigr)^{1/p}\Bigr] = (\mathcal{C}_{(E,\delta),r})^{1/p}\Bigl[2Lr + \sup_{x \in E}\bigl(\mathbb{E}\bigl[|Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}\Bigr]. \tag{58}
\]
The proof of Lemma 4.10 is thus complete.

Lemma 4.11. Let $M \in \mathbb{N}$, $p \in [2,\infty)$, $L, r \in (0,\infty)$, let $(E, \delta)$ be a separable metric space, assume $E \neq \emptyset$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, for every $x \in E$ let $Y_{x,j} \colon \Omega \to \mathbb{R}$, $j \in \{1, 2, \ldots, M\}$, be independent random variables, assume for all $x, y \in E$, $j \in \{1, 2, \ldots, M\}$ that $\mathbb{E}[|Y_{x,j}|] < \infty$ and $|Y_{x,j} - Y_{y,j}| \le L\delta(x,y)$, and let $Z_x \colon \Omega \to \mathbb{R}$, $x \in E$, satisfy for all $x \in E$ that
\[
Z_x = \frac{1}{M}\biggl[\sum_{j=1}^{M} Y_{x,j}\biggr]. \tag{59}
\]
Then

(i) it holds for all $x \in E$ that $\mathbb{E}[|Z_x|] < \infty$,

(ii) it holds that the function $\Omega \ni \omega \mapsto \sup_{x \in E} |Z_x(\omega) - \mathbb{E}[Z_x]| \in [0,\infty]$ is $\mathcal{F}/\mathcal{B}([0,\infty])$-measurable, and

(iii) it holds that
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} \le 2(\mathcal{C}_{(E,\delta),r})^{1/p}\biggl[Lr + \frac{\sqrt{p-1}}{\sqrt{M}}\biggl(\sup_{x \in E}\max_{j \in \{1,2,\ldots,M\}}\bigl(\mathbb{E}\bigl[|Y_{x,j} - \mathbb{E}[Y_{x,j}]|^p\bigr]\bigr)^{1/p}\biggr)\biggr] \tag{60}
\]
(cf. Definition 3.2).

Proof of Lemma 4.11. Note that the assumption that $\forall\, x \in E,\, j \in \{1,2,\ldots,M\} \colon \mathbb{E}[|Y_{x,j}|] < \infty$ implies for all $x \in E$ that
\[
\mathbb{E}[|Z_x|] = \mathbb{E}\biggl[\frac{1}{M}\biggl|\sum_{j=1}^{M} Y_{x,j}\biggr|\biggr] \le \frac{1}{M}\biggl[\sum_{j=1}^{M} \mathbb{E}[|Y_{x,j}|]\biggr] \le \max_{j \in \{1,2,\ldots,M\}} \mathbb{E}[|Y_{x,j}|] < \infty. \tag{61}
\]
This proves (i). Next observe that the assumption that $\forall\, x, y \in E,\, j \in \{1,2,\ldots,M\} \colon |Y_{x,j} - Y_{y,j}| \le L\delta(x,y)$ demonstrates for all $x, y \in E$ that
\[
|Z_x - Z_y| = \frac{1}{M}\biggl|\biggl[\sum_{j=1}^{M} Y_{x,j}\biggr] - \biggl[\sum_{j=1}^{M} Y_{y,j}\biggr]\biggr| \le \frac{1}{M}\biggl[\sum_{j=1}^{M} |Y_{x,j} - Y_{y,j}|\biggr] \le L\delta(x,y). \tag{62}
\]
Combining this with (i) and Lemma 4.7 establishes (ii). It thus remains to show (iii). For this note that (i), (62), and Lemma 4.10 yield that
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} \le (\mathcal{C}_{(E,\delta),r})^{1/p}\Bigl[2Lr + \sup_{x \in E}\bigl(\mathbb{E}\bigl[|Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}\Bigr]. \tag{63}
\]
Moreover, (61) and Corollary 4.5 (with $d \leftarrow 1$, $(X_j)_{j \in \{1,2,\ldots,M\}} \leftarrow (Y_{x,j})_{j \in \{1,2,\ldots,M\}}$ for $x \in E$ in the notation of Corollary 4.5) prove for all $x \in E$ that
\[
\bigl(\mathbb{E}\bigl[|Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} = \biggl(\mathbb{E}\biggl[\biggl|\frac{1}{M}\biggl[\sum_{j=1}^{M} Y_{x,j}\biggr] - \mathbb{E}\biggl[\frac{1}{M}\sum_{j=1}^{M} Y_{x,j}\biggr]\biggr|^p\biggr]\biggr)^{1/p} \le \frac{2\sqrt{p-1}}{\sqrt{M}}\Bigl[\max_{j \in \{1,2,\ldots,M\}}\bigl(\mathbb{E}\bigl[|Y_{x,j} - \mathbb{E}[Y_{x,j}]|^p\bigr]\bigr)^{1/p}\Bigr]. \tag{64}
\]
This and (63) imply that
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} \le (\mathcal{C}_{(E,\delta),r})^{1/p}\biggl[2Lr + \frac{2\sqrt{p-1}}{\sqrt{M}}\biggl(\sup_{x \in E}\max_{j \in \{1,2,\ldots,M\}}\bigl(\mathbb{E}\bigl[|Y_{x,j} - \mathbb{E}[Y_{x,j}]|^p\bigr]\bigr)^{1/p}\biggr)\biggr] = 2(\mathcal{C}_{(E,\delta),r})^{1/p}\biggl[Lr + \frac{\sqrt{p-1}}{\sqrt{M}}\biggl(\sup_{x \in E}\max_{j \in \{1,2,\ldots,M\}}\bigl(\mathbb{E}\bigl[|Y_{x,j} - \mathbb{E}[Y_{x,j}]|^p\bigr]\bigr)^{1/p}\biggr)\biggr]. \tag{65}
\]
The proof of Lemma 4.11 is thus complete.

Corollary 4.12. Let $M \in \mathbb{N}$, $p \in [2,\infty)$, $L, C \in (0,\infty)$, let $(E, \delta)$ be a separable metric space, assume $E \neq \emptyset$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, for every $x \in E$ let $Y_{x,j} \colon \Omega \to \mathbb{R}$, $j \in \{1, 2, \ldots, M\}$, be independent random variables, assume for all $x, y \in E$, $j \in \{1, 2, \ldots, M\}$ that $\mathbb{E}[|Y_{x,j}|] < \infty$ and $|Y_{x,j} - Y_{y,j}| \le L\delta(x,y)$, and let $Z_x \colon \Omega \to \mathbb{R}$, $x \in E$, satisfy for all $x \in E$ that
\[
Z_x = \frac{1}{M}\biggl[\sum_{j=1}^{M} Y_{x,j}\biggr]. \tag{66}
\]
Then

(i) it holds for all $x \in E$ that $\mathbb{E}[|Z_x|] < \infty$,

(ii) it holds that the function $\Omega \ni \omega \mapsto \sup_{x \in E} |Z_x(\omega) - \mathbb{E}[Z_x]| \in [0,\infty]$ is $\mathcal{F}/\mathcal{B}([0,\infty])$-measurable, and

(iii) it holds that
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} \le \frac{2\sqrt{p-1}}{\sqrt{M}}\bigl(\mathcal{C}_{(E,\delta),\frac{C\sqrt{p-1}}{L\sqrt{M}}}\bigr)^{1/p}\Bigl[C + \sup_{x \in E}\max_{j \in \{1,2,\ldots,M\}}\bigl(\mathbb{E}\bigl[|Y_{x,j} - \mathbb{E}[Y_{x,j}]|^p\bigr]\bigr)^{1/p}\Bigr] \tag{67}
\]
(cf. Definition 3.2).

Proof of Corollary 4.12. Note that Lemma 4.11 shows (i) and (ii). In addition, Lemma 4.11 (with $r \leftarrow \frac{C\sqrt{p-1}}{L\sqrt{M}}$ in the notation of Lemma 4.11) ensures that
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} \le 2\bigl(\mathcal{C}_{(E,\delta),\frac{C\sqrt{p-1}}{L\sqrt{M}}}\bigr)^{1/p}\biggl[\frac{LC\sqrt{p-1}}{L\sqrt{M}} + \frac{\sqrt{p-1}}{\sqrt{M}}\biggl(\sup_{x \in E}\max_{j \in \{1,2,\ldots,M\}}\bigl(\mathbb{E}\bigl[|Y_{x,j} - \mathbb{E}[Y_{x,j}]|^p\bigr]\bigr)^{1/p}\biggr)\biggr] = \frac{2\sqrt{p-1}}{\sqrt{M}}\bigl(\mathcal{C}_{(E,\delta),\frac{C\sqrt{p-1}}{L\sqrt{M}}}\bigr)^{1/p}\Bigl[C + \sup_{x \in E}\max_{j \in \{1,2,\ldots,M\}}\bigl(\mathbb{E}\bigl[|Y_{x,j} - \mathbb{E}[Y_{x,j}]|^p\bigr]\bigr)^{1/p}\Bigr]. \tag{68}
\]
This establishes (iii) and thus completes the proof of Corollary 4.12.


4.3 Strong convergence rates for the generalisation error

Lemma 4.13. Let $M \in \mathbb{N}$, $p \in [2,\infty)$, $L, C, b \in (0,\infty)$, let $(E, \delta)$ be a separable metric space, assume $E \neq \emptyset$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X_{x,j} \colon \Omega \to \mathbb{R}$, $j \in \{1, 2, \ldots, M\}$, $x \in E$, and $Y_j \colon \Omega \to \mathbb{R}$, $j \in \{1, 2, \ldots, M\}$, be functions, assume for every $x \in E$ that $(X_{x,j}, Y_j)$, $j \in \{1, 2, \ldots, M\}$, are i.i.d. random variables, assume for all $x, y \in E$, $j \in \{1, 2, \ldots, M\}$ that $|X_{x,j} - Y_j| \le b$ and $|X_{x,j} - X_{y,j}| \le L\delta(x,y)$, let $\mathcal{R} \colon E \to [0,\infty)$ satisfy for all $x \in E$ that $\mathcal{R}(x) = \mathbb{E}[|X_{x,1} - Y_1|^2]$, and let $\mathfrak{R} \colon E \times \Omega \to [0,\infty)$ satisfy for all $x \in E$, $\omega \in \Omega$ that
\[
\mathfrak{R}(x,\omega) = \frac{1}{M}\biggl[\sum_{j=1}^{M} |X_{x,j}(\omega) - Y_j(\omega)|^2\biggr]. \tag{69}
\]
Then

(i) it holds that the function $\Omega \ni \omega \mapsto \sup_{x \in E} |\mathfrak{R}(x,\omega) - \mathcal{R}(x)| \in [0,\infty]$ is $\mathcal{F}/\mathcal{B}([0,\infty])$-measurable and

(ii) it holds that
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |\mathfrak{R}(x) - \mathcal{R}(x)|^p\bigr]\bigr)^{1/p} \le \bigl(\mathcal{C}_{(E,\delta),\frac{Cb\sqrt{p-1}}{2L\sqrt{M}}}\bigr)^{1/p}\biggl[\frac{2(C+1)b^2\sqrt{p-1}}{\sqrt{M}}\biggr] \tag{70}
\]
(cf. Definition 3.2).

Proof of Lemma 4.13. Throughout this proof let $\mathcal{Y}_{x,j} \colon \Omega \to \mathbb{R}$, $j \in \{1, 2, \ldots, M\}$, $x \in E$, satisfy for all $x \in E$, $j \in \{1, 2, \ldots, M\}$ that $\mathcal{Y}_{x,j} = |X_{x,j} - Y_j|^2$. Note that the assumption that for every $x \in E$ it holds that $(X_{x,j}, Y_j)$, $j \in \{1, 2, \ldots, M\}$, are i.i.d. random variables ensures for all $x \in E$ that
\[
\mathbb{E}[\mathfrak{R}(x)] = \frac{1}{M}\biggl[\sum_{j=1}^{M} \mathbb{E}\bigl[|X_{x,j} - Y_j|^2\bigr]\biggr] = \frac{M\,\mathbb{E}\bigl[|X_{x,1} - Y_1|^2\bigr]}{M} = \mathcal{R}(x). \tag{71}
\]
Furthermore, the assumption that $\forall\, x \in E,\, j \in \{1,2,\ldots,M\} \colon |X_{x,j} - Y_j| \le b$ shows for all $x \in E$, $j \in \{1, 2, \ldots, M\}$ that
\[
\mathbb{E}[|\mathcal{Y}_{x,j}|] = \mathbb{E}[|X_{x,j} - Y_j|^2] \le b^2 < \infty, \tag{72}
\]
\[
\mathcal{Y}_{x,j} - \mathbb{E}[\mathcal{Y}_{x,j}] = |X_{x,j} - Y_j|^2 - \mathbb{E}\bigl[|X_{x,j} - Y_j|^2\bigr] \le |X_{x,j} - Y_j|^2 \le b^2, \tag{73}
\]
and
\[
\mathbb{E}[\mathcal{Y}_{x,j}] - \mathcal{Y}_{x,j} = \mathbb{E}\bigl[|X_{x,j} - Y_j|^2\bigr] - |X_{x,j} - Y_j|^2 \le \mathbb{E}\bigl[|X_{x,j} - Y_j|^2\bigr] \le b^2. \tag{74}
\]
Combining (72)–(74) implies for all $x \in E$, $j \in \{1, 2, \ldots, M\}$ that
\[
\bigl(\mathbb{E}\bigl[|\mathcal{Y}_{x,j} - \mathbb{E}[\mathcal{Y}_{x,j}]|^p\bigr]\bigr)^{1/p} \le \bigl(\mathbb{E}[b^{2p}]\bigr)^{1/p} = b^2. \tag{75}
\]
Moreover, note that the assumptions that $\forall\, x, y \in E,\, j \in \{1,2,\ldots,M\} \colon [\,|X_{x,j} - Y_j| \le b$ and $|X_{x,j} - X_{y,j}| \le L\delta(x,y)\,]$ and the fact that $\forall\, x_1, x_2, y \in \mathbb{R} \colon (x_1 - y)^2 - (x_2 - y)^2 = (x_1 - x_2)((x_1 - y) + (x_2 - y))$ establish for all $x, y \in E$, $j \in \{1, 2, \ldots, M\}$ that
\[
|\mathcal{Y}_{x,j} - \mathcal{Y}_{y,j}| = |(X_{x,j} - Y_j)^2 - (X_{y,j} - Y_j)^2| \le |X_{x,j} - X_{y,j}|\bigl(|X_{x,j} - Y_j| + |X_{y,j} - Y_j|\bigr) \le 2b|X_{x,j} - X_{y,j}| \le 2bL\delta(x,y). \tag{76}
\]
Combining this, (71), (72), and the fact that for every $x \in E$ it holds that $\mathcal{Y}_{x,j}$, $j \in \{1, 2, \ldots, M\}$, are independent random variables with Corollary 4.12 (with $L \leftarrow 2bL$, $C \leftarrow Cb^2$, $(Y_{x,j})_{x \in E,\, j \in \{1,2,\ldots,M\}} \leftarrow (\mathcal{Y}_{x,j})_{x \in E,\, j \in \{1,2,\ldots,M\}}$, $(Z_x)_{x \in E} \leftarrow (\Omega \ni \omega \mapsto \mathfrak{R}(x,\omega) \in \mathbb{R})_{x \in E}$ in the notation of Corollary 4.12) and (75) proves (i) and
\[
\begin{split}
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |\mathfrak{R}(x) - \mathcal{R}(x)|^p\bigr]\bigr)^{1/p}
&= \bigl(\mathbb{E}\bigl[\sup_{x \in E} |\mathfrak{R}(x) - \mathbb{E}[\mathfrak{R}(x)]|^p\bigr]\bigr)^{1/p} \\
&\le \frac{2\sqrt{p-1}}{\sqrt{M}}\bigl(\mathcal{C}_{(E,\delta),\frac{Cb^2\sqrt{p-1}}{2bL\sqrt{M}}}\bigr)^{1/p}\Bigl[Cb^2 + \sup_{x \in E}\max_{j \in \{1,2,\ldots,M\}}\bigl(\mathbb{E}\bigl[|\mathcal{Y}_{x,j} - \mathbb{E}[\mathcal{Y}_{x,j}]|^p\bigr]\bigr)^{1/p}\Bigr] \\
&\le \frac{2\sqrt{p-1}}{\sqrt{M}}\bigl(\mathcal{C}_{(E,\delta),\frac{Cb\sqrt{p-1}}{2L\sqrt{M}}}\bigr)^{1/p}\bigl[Cb^2 + b^2\bigr] = \bigl(\mathcal{C}_{(E,\delta),\frac{Cb\sqrt{p-1}}{2L\sqrt{M}}}\bigr)^{1/p}\biggl[\frac{2(C+1)b^2\sqrt{p-1}}{\sqrt{M}}\biggr].
\end{split}
\tag{77}
\]
This shows (ii) and thus completes the proof of Lemma 4.13.
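A hedged simulation in the spirit of Lemma 4.13 (the family, distribution, and grid below are hypothetical choices, not taken from the paper): for the constant-predictor family $X_{x,j} = x$ and $Y_j$ uniform on $[0,1]$, the expected worst-case gap between the empirical risk and the true risk $\mathcal{R}(x) = (x - \tfrac{1}{2})^2 + \tfrac{1}{12}$ shrinks as the sample size $M$ grows, in line with the $1/\sqrt{M}$ rate of (70).

```python
import random
import statistics

# Hedged simulation: worst-case gap between empirical risk and true risk over a
# Lipschitz-parametrised family decays as M grows (Lemma 4.13 gives 1/sqrt(M)).
random.seed(2)
thetas = [k / 20 for k in range(21)]      # grid standing in for the index set E

def true_risk(t):                          # R(t) = E[(t - Y)^2], Y uniform on [0,1]
    return (t - 0.5) ** 2 + 1.0 / 12

def worst_gap(M, runs=300):
    gaps = []
    for _ in range(runs):
        ys = [random.random() for _ in range(M)]
        gaps.append(max(abs(statistics.fmean((t - y) ** 2 for y in ys)
                            - true_risk(t)) for t in thetas))
    return statistics.fmean(gaps)

g_small, g_large = worst_gap(20), worst_gap(500)
assert g_large < g_small                   # gap shrinks with more samples
```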

Proposition 4.14. Let $d, \mathfrak{d}, M \in \mathbb{N}$, $L, b \in (0,\infty)$, $\alpha \in \mathbb{R}$, $\beta \in (\alpha,\infty)$, $D \subseteq \mathbb{R}^d$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X_j \colon \Omega \to D$, $j \in \{1, 2, \ldots, M\}$, and $Y_j \colon \Omega \to \mathbb{R}$, $j \in \{1, 2, \ldots, M\}$, be functions, assume that $(X_j, Y_j)$, $j \in \{1, 2, \ldots, M\}$, are i.i.d. random variables, let $f = (f_\theta)_{\theta \in [\alpha,\beta]^{\mathfrak{d}}} \colon [\alpha,\beta]^{\mathfrak{d}} \to C(D,\mathbb{R})$ be a function, assume for all $\theta, \vartheta \in [\alpha,\beta]^{\mathfrak{d}}$, $j \in \{1, 2, \ldots, M\}$, $x \in D$ that $|f_\theta(X_j) - Y_j| \le b$ and $|f_\theta(x) - f_\vartheta(x)| \le L\|\theta - \vartheta\|_\infty$, let $\mathcal{R} \colon [\alpha,\beta]^{\mathfrak{d}} \to [0,\infty)$ satisfy for all $\theta \in [\alpha,\beta]^{\mathfrak{d}}$ that $\mathcal{R}(\theta) = \mathbb{E}[|f_\theta(X_1) - Y_1|^2]$, and let $\mathfrak{R} \colon [\alpha,\beta]^{\mathfrak{d}} \times \Omega \to [0,\infty)$ satisfy for all $\theta \in [\alpha,\beta]^{\mathfrak{d}}$, $\omega \in \Omega$ that
\[
\mathfrak{R}(\theta,\omega) = \frac{1}{M}\biggl[\sum_{j=1}^{M} |f_\theta(X_j(\omega)) - Y_j(\omega)|^2\biggr] \tag{78}
\]
(cf. Definition 3.1). Then

(i) it holds that the function $\Omega \ni \omega \mapsto \sup_{\theta \in [\alpha,\beta]^{\mathfrak{d}}} |\mathfrak{R}(\theta,\omega) - \mathcal{R}(\theta)| \in [0,\infty]$ is $\mathcal{F}/\mathcal{B}([0,\infty])$-measurable and

(ii) it holds for all $p \in (0,\infty)$ that
\[
\begin{split}
\bigl(\mathbb{E}\bigl[\sup_{\theta \in [\alpha,\beta]^{\mathfrak{d}}} |\mathfrak{R}(\theta) - \mathcal{R}(\theta)|^p\bigr]\bigr)^{1/p}
&\le \inf_{C, \varepsilon \in (0,\infty)}\biggl[\frac{2(C+1)b^2\max\{1, [2\sqrt{M}L(\beta-\alpha)(Cb)^{-1}]^\varepsilon\}\sqrt{\max\{1, p, \mathfrak{d}/\varepsilon\}}}{\sqrt{M}}\biggr] \\
&\le \inf_{C \in (0,\infty)}\biggl[\frac{2(C+1)b^2\sqrt{e\max\{1, p, \mathfrak{d}\ln(4ML^2(\beta-\alpha)^2(Cb)^{-2})\}}}{\sqrt{M}}\biggr].
\end{split}
\tag{79}
\]

Proof of Proposition 4.14. Throughout this proof let $p \in (0,\infty)$, let $(\kappa_C)_{C \in (0,\infty)} \subseteq (0,\infty)$ satisfy for all $C \in (0,\infty)$ that $\kappa_C = \frac{2\sqrt{M}L(\beta-\alpha)}{Cb}$, let $X_{\theta,j} \colon \Omega \to \mathbb{R}$, $j \in \{1, 2, \ldots, M\}$, $\theta \in [\alpha,\beta]^{\mathfrak{d}}$, satisfy for all $\theta \in [\alpha,\beta]^{\mathfrak{d}}$, $j \in \{1, 2, \ldots, M\}$ that $X_{\theta,j} = f_\theta(X_j)$, and let $\delta \colon ([\alpha,\beta]^{\mathfrak{d}}) \times ([\alpha,\beta]^{\mathfrak{d}}) \to [0,\infty)$ satisfy for all $\theta, \vartheta \in [\alpha,\beta]^{\mathfrak{d}}$ that $\delta(\theta,\vartheta) = \|\theta - \vartheta\|_\infty$. First of all, note that the assumption that $\forall\, \theta \in [\alpha,\beta]^{\mathfrak{d}},\, j \in \{1,2,\ldots,M\} \colon |f_\theta(X_j) - Y_j| \le b$ implies for all $\theta \in [\alpha,\beta]^{\mathfrak{d}}$, $j \in \{1, 2, \ldots, M\}$ that
\[
|X_{\theta,j} - Y_j| = |f_\theta(X_j) - Y_j| \le b. \tag{80}
\]
In addition, the assumption that $\forall\, \theta, \vartheta \in [\alpha,\beta]^{\mathfrak{d}},\, x \in D \colon |f_\theta(x) - f_\vartheta(x)| \le L\|\theta - \vartheta\|_\infty$ ensures for all $\theta, \vartheta \in [\alpha,\beta]^{\mathfrak{d}}$, $j \in \{1, 2, \ldots, M\}$ that
\[
|X_{\theta,j} - X_{\vartheta,j}| = |f_\theta(X_j) - f_\vartheta(X_j)| \le \sup_{x \in D} |f_\theta(x) - f_\vartheta(x)| \le L\|\theta - \vartheta\|_\infty = L\delta(\theta,\vartheta). \tag{81}
\]
Combining this, (80), and the fact that for every $\theta \in [\alpha,\beta]^{\mathfrak{d}}$ it holds that $(X_{\theta,j}, Y_j)$, $j \in \{1, 2, \ldots, M\}$, are i.i.d. random variables with Lemma 4.13 (with $p \leftarrow q$, $C \leftarrow C$, $(E, \delta) \leftarrow ([\alpha,\beta]^{\mathfrak{d}}, \delta)$, $(X_{x,j})_{x \in E,\, j \in \{1,2,\ldots,M\}} \leftarrow (X_{\theta,j})_{\theta \in [\alpha,\beta]^{\mathfrak{d}},\, j \in \{1,2,\ldots,M\}}$ for $q \in [2,\infty)$, $C \in (0,\infty)$ in the notation of Lemma 4.13) demonstrates for all $C \in (0,\infty)$, $q \in [2,\infty)$ that the function $\Omega \ni \omega \mapsto \sup_{\theta \in [\alpha,\beta]^{\mathfrak{d}}} |\mathfrak{R}(\theta,\omega) - \mathcal{R}(\theta)| \in [0,\infty]$ is $\mathcal{F}/\mathcal{B}([0,\infty])$-measurable and
\[
\bigl(\mathbb{E}\bigl[\sup_{\theta \in [\alpha,\beta]^{\mathfrak{d}}} |\mathfrak{R}(\theta) - \mathcal{R}(\theta)|^q\bigr]\bigr)^{1/q} \le \bigl(\mathcal{C}_{([\alpha,\beta]^{\mathfrak{d}},\delta),\frac{Cb\sqrt{q-1}}{2L\sqrt{M}}}\bigr)^{1/q}\biggl[\frac{2(C+1)b^2\sqrt{q-1}}{\sqrt{M}}\biggr] \tag{82}
\]
(cf. Definition 3.2). This finishes the proof of (i). Next observe that (ii) in Lemma 3.3 (with $d \leftarrow \mathfrak{d}$, $a \leftarrow \alpha$, $b \leftarrow \beta$, $r \leftarrow r$ for $r \in (0,\infty)$ in the notation of Lemma 3.3) shows for all $r \in (0,\infty)$ that

(cf. Definition 3.2). This finishes the proof of (i). Next observe that (ii) in Lemma 3.3(with d ← d, a ← α, b ← β, r ← r for r ∈ (0,∞) in the notation of Lemma 3.3) showsfor all r ∈ (0,∞) that

C([α,β]d,δ),r ≤ 1[0,r]

(β−α

2

)+(β−αr

)d1(r,∞)

(β−α

2

)≤ max

1,(β−αr

)d(1[0,r]

(β−α

2

)+ 1(r,∞)

(β−α

2

))= max

1,(β−αr

)d.

(83)

This yields for all C ∈ (0,∞), q ∈ [2,∞) that(C

([α,β]d,δ),Cb√q−1

2L√M

)1/q≤ max

1,(

2(β−α)L√M

Cb√q−1

)dq

≤ max

1,(

2(β−α)L√M

Cb

)dq

= max

1, (κC)

dq

.

(84)

Jensen's inequality and (82) hence prove for all $C, \varepsilon \in (0,\infty)$ that
\[
\begin{split}
\bigl(\mathbb{E}\bigl[\sup_{\theta \in [\alpha,\beta]^{\mathfrak{d}}} |\mathfrak{R}(\theta) - \mathcal{R}(\theta)|^p\bigr]\bigr)^{1/p}
&\le \bigl(\mathbb{E}\bigl[\sup_{\theta \in [\alpha,\beta]^{\mathfrak{d}}} |\mathfrak{R}(\theta) - \mathcal{R}(\theta)|^{\max\{2,p,\mathfrak{d}/\varepsilon\}}\bigr]\bigr)^{\frac{1}{\max\{2,p,\mathfrak{d}/\varepsilon\}}} \\
&\le \max\bigl\{1, (\kappa_C)^{\frac{\mathfrak{d}}{\max\{2,p,\mathfrak{d}/\varepsilon\}}}\bigr\}\,\frac{2(C+1)b^2\sqrt{\max\{2,p,\mathfrak{d}/\varepsilon\}-1}}{\sqrt{M}} \\
&= \max\bigl\{1, (\kappa_C)^{\min\{\mathfrak{d}/2,\,\mathfrak{d}/p,\,\varepsilon\}}\bigr\}\,\frac{2(C+1)b^2\sqrt{\max\{1, p-1, \mathfrak{d}/\varepsilon - 1\}}}{\sqrt{M}} \\
&\le \frac{2(C+1)b^2\max\{1, (\kappa_C)^\varepsilon\}\sqrt{\max\{1, p, \mathfrak{d}/\varepsilon\}}}{\sqrt{M}}.
\end{split}
\tag{85}
\]
Next note that the fact that $\forall\, a \in (1,\infty) \colon a^{1/(2\ln(a))} = e^{\ln(a)/(2\ln(a))} = e^{1/2} = \sqrt{e} \ge 1$ ensures for all $C \in (0,\infty)$ with $\kappa_C > 1$ that
\[
\inf_{\varepsilon \in (0,\infty)}\biggl[\frac{2(C+1)b^2\max\{1, (\kappa_C)^\varepsilon\}\sqrt{\max\{1, p, \mathfrak{d}/\varepsilon\}}}{\sqrt{M}}\biggr] \le \frac{2(C+1)b^2\max\{1, (\kappa_C)^{1/(2\ln(\kappa_C))}\}\sqrt{\max\{1, p, 2\mathfrak{d}\ln(\kappa_C)\}}}{\sqrt{M}} = \frac{2(C+1)b^2\sqrt{e\max\{1, p, \mathfrak{d}\ln([\kappa_C]^2)\}}}{\sqrt{M}}. \tag{86}
\]


In addition, observe that it holds for all $C \in (0,\infty)$ with $\kappa_C \le 1$ that
\[
\inf_{\varepsilon \in (0,\infty)}\biggl[\frac{2(C+1)b^2\max\{1, (\kappa_C)^\varepsilon\}\sqrt{\max\{1, p, \mathfrak{d}/\varepsilon\}}}{\sqrt{M}}\biggr] \le \inf_{\varepsilon \in (0,\infty)}\biggl[\frac{2(C+1)b^2\sqrt{\max\{1, p, \mathfrak{d}/\varepsilon\}}}{\sqrt{M}}\biggr] \le \frac{2(C+1)b^2\sqrt{\max\{1, p\}}}{\sqrt{M}} \le \frac{2(C+1)b^2\sqrt{e\max\{1, p, \mathfrak{d}\ln([\kappa_C]^2)\}}}{\sqrt{M}}. \tag{87}
\]
Combining (85) with (86) and (87) demonstrates that
\[
\begin{split}
\bigl(\mathbb{E}\bigl[\sup_{\theta \in [\alpha,\beta]^{\mathfrak{d}}} |\mathfrak{R}(\theta) - \mathcal{R}(\theta)|^p\bigr]\bigr)^{1/p}
&\le \inf_{C, \varepsilon \in (0,\infty)}\biggl[\frac{2(C+1)b^2\max\{1, (\kappa_C)^\varepsilon\}\sqrt{\max\{1, p, \mathfrak{d}/\varepsilon\}}}{\sqrt{M}}\biggr] \\
&= \inf_{C, \varepsilon \in (0,\infty)}\biggl[\frac{2(C+1)b^2\max\{1, [2\sqrt{M}L(\beta-\alpha)(Cb)^{-1}]^\varepsilon\}\sqrt{\max\{1, p, \mathfrak{d}/\varepsilon\}}}{\sqrt{M}}\biggr] \\
&\le \inf_{C \in (0,\infty)}\biggl[\frac{2(C+1)b^2\sqrt{e\max\{1, p, \mathfrak{d}\ln([\kappa_C]^2)\}}}{\sqrt{M}}\biggr] \\
&= \inf_{C \in (0,\infty)}\biggl[\frac{2(C+1)b^2\sqrt{e\max\{1, p, \mathfrak{d}\ln(4ML^2(\beta-\alpha)^2(Cb)^{-2})\}}}{\sqrt{M}}\biggr].
\end{split}
\tag{88}
\]
This establishes (ii) and thus completes the proof of Proposition 4.14.
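The $\varepsilon$-optimisation step (86) can be verified numerically. The following sketch (with arbitrary illustrative values for $p$, $\mathfrak{d}$, and $\kappa$, not taken from the paper) checks that the choice $\varepsilon = \frac{1}{2\ln(\kappa_C)}$ turns the $\varepsilon$-dependent bound from (85) into the logarithmic bound appearing in (79).

```python
import math

# Hedged numerical check of the eps-optimisation step (86): for kappa > 1 the
# choice eps = 1/(2*ln(kappa)) gives kappa**eps = sqrt(e) and d/eps = d*ln(kappa^2),
# so max{1, kappa**eps} * sqrt(max{1, p, d/eps}) equals
# sqrt(e * max{1, p, d*ln(kappa^2)}).
p, d, kappa = 2.0, 10.0, 50.0          # illustrative values, not from the paper
eps = 1.0 / (2.0 * math.log(kappa))
lhs = max(1.0, kappa ** eps) * math.sqrt(max(1.0, p, d / eps))
rhs = math.sqrt(math.e * max(1.0, p, d * math.log(kappa ** 2)))
assert abs(lhs - rhs) < 1e-9
```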

Corollary 4.15. Let $d, \mathfrak{d}, \mathbf{L}, M \in \mathbb{N}$, $B, b \in [1,\infty)$, $u \in \mathbb{R}$, $v \in [u+1,\infty)$, $l = (l_0, l_1, \ldots, l_{\mathbf{L}}) \in \mathbb{N}^{\mathbf{L}+1}$, $D \subseteq [-b,b]^d$, assume $l_0 = d$, $l_{\mathbf{L}} = 1$, and $\mathfrak{d} \ge \sum_{i=1}^{\mathbf{L}} l_i(l_{i-1}+1)$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X_j \colon \Omega \to D$, $j \in \{1, 2, \ldots, M\}$, and $Y_j \colon \Omega \to [u,v]$, $j \in \{1, 2, \ldots, M\}$, be functions, assume that $(X_j, Y_j)$, $j \in \{1, 2, \ldots, M\}$, are i.i.d. random variables, let $\mathcal{R} \colon [-B,B]^{\mathfrak{d}} \to [0,\infty)$ satisfy for all $\theta \in [-B,B]^{\mathfrak{d}}$ that $\mathcal{R}(\theta) = \mathbb{E}[|\mathcal{N}^{\theta,l}_{u,v}(X_1) - Y_1|^2]$, and let $\mathfrak{R} \colon [-B,B]^{\mathfrak{d}} \times \Omega \to [0,\infty)$ satisfy for all $\theta \in [-B,B]^{\mathfrak{d}}$, $\omega \in \Omega$ that
\[
\mathfrak{R}(\theta,\omega) = \frac{1}{M}\biggl[\sum_{j=1}^{M} |\mathcal{N}^{\theta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)|^2\biggr] \tag{89}
\]
(cf. Definition 2.8). Then

(i) it holds that the function $\Omega \ni \omega \mapsto \sup_{\theta \in [-B,B]^{\mathfrak{d}}} |\mathfrak{R}(\theta,\omega) - \mathcal{R}(\theta)| \in [0,\infty]$ is $\mathcal{F}/\mathcal{B}([0,\infty])$-measurable and

(ii) it holds for all $p \in (0,\infty)$ that
\[
\bigl(\mathbb{E}\bigl[\sup_{\theta \in [-B,B]^{\mathfrak{d}}} |\mathfrak{R}(\theta) - \mathcal{R}(\theta)|^p\bigr]\bigr)^{1/p} \le \frac{9(v-u)^2\mathbf{L}(\|l\|_\infty+1)\sqrt{\max\{p, \ln(4(Mb)^{1/\mathbf{L}}(\|l\|_\infty+1)B)\}}}{\sqrt{M}} \le \frac{9(v-u)^2\mathbf{L}(\|l\|_\infty+1)^2\max\{p, \ln(3MBb)\}}{\sqrt{M}} \tag{90}
\]
(cf. Definition 3.1).


Proof of Corollary 4.15. Throughout this proof let $\mathbf{d} \in \mathbb{N}$ be given by $\mathbf{d} = \sum_{i=1}^{\mathbf{L}} l_i(l_{i-1}+1)$, let $\mathfrak{L} \in (0,\infty)$ be given by $\mathfrak{L} = b\mathbf{L}(\|l\|_\infty+1)^{\mathbf{L}}B^{\mathbf{L}-1}$, let $f = (f_\theta)_{\theta \in [-B,B]^{\mathbf{d}}} \colon [-B,B]^{\mathbf{d}} \to C(D,\mathbb{R})$ satisfy for all $\theta \in [-B,B]^{\mathbf{d}}$, $x \in D$ that $f_\theta(x) = \mathcal{N}^{\theta,l}_{u,v}(x)$, let $\overline{\mathcal{R}} \colon [-B,B]^{\mathbf{d}} \to [0,\infty)$ satisfy for all $\theta \in [-B,B]^{\mathbf{d}}$ that $\overline{\mathcal{R}}(\theta) = \mathbb{E}[|f_\theta(X_1) - Y_1|^2] = \mathbb{E}[|\mathcal{N}^{\theta,l}_{u,v}(X_1) - Y_1|^2]$, and let $\overline{\mathfrak{R}} \colon [-B,B]^{\mathbf{d}} \times \Omega \to [0,\infty)$ satisfy for all $\theta \in [-B,B]^{\mathbf{d}}$, $\omega \in \Omega$ that
\[
\overline{\mathfrak{R}}(\theta,\omega) = \frac{1}{M}\biggl[\sum_{j=1}^{M} |f_\theta(X_j(\omega)) - Y_j(\omega)|^2\biggr] = \frac{1}{M}\biggl[\sum_{j=1}^{M} |\mathcal{N}^{\theta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)|^2\biggr]. \tag{91}
\]
Note that the fact that $\forall\, \theta \in \mathbb{R}^{\mathbf{d}},\, x \in \mathbb{R}^d \colon \mathcal{N}^{\theta,l}_{u,v}(x) \in [u,v]$ and the assumption that $\forall\, j \in \{1,2,\ldots,M\} \colon Y_j(\Omega) \subseteq [u,v]$ imply for all $\theta \in [-B,B]^{\mathbf{d}}$, $j \in \{1, 2, \ldots, M\}$ that
\[
|f_\theta(X_j) - Y_j| = |\mathcal{N}^{\theta,l}_{u,v}(X_j) - Y_j| \le \sup_{y_1, y_2 \in [u,v]} |y_1 - y_2| = v - u. \tag{92}
\]
Moreover, the assumptions that $D \subseteq [-b,b]^d$, $l_0 = d$, and $l_{\mathbf{L}} = 1$, Beck, Jentzen, & Kuckuck [10, Corollary 2.37] (with $a \leftarrow -b$, $b \leftarrow b$, $u \leftarrow u$, $v \leftarrow v$, $d \leftarrow \mathbf{d}$, $L \leftarrow \mathbf{L}$, $l \leftarrow l$ in the notation of [10, Corollary 2.37]), and the assumptions that $b \ge 1$ and $B \ge 1$ ensure for all $\theta, \vartheta \in [-B,B]^{\mathbf{d}}$, $x \in D$ that
\[
|f_\theta(x) - f_\vartheta(x)| \le \sup_{y \in [-b,b]^d} |\mathcal{N}^{\theta,l}_{u,v}(y) - \mathcal{N}^{\vartheta,l}_{u,v}(y)| \le \mathbf{L}\max\{1, b\}(\|l\|_\infty+1)^{\mathbf{L}}(\max\{1, \|\theta\|_\infty, \|\vartheta\|_\infty\})^{\mathbf{L}-1}\|\theta - \vartheta\|_\infty \le b\mathbf{L}(\|l\|_\infty+1)^{\mathbf{L}}B^{\mathbf{L}-1}\|\theta - \vartheta\|_\infty = \mathfrak{L}\|\theta - \vartheta\|_\infty. \tag{93}
\]
Furthermore, the facts that $\mathfrak{d} \ge \mathbf{d}$ and $\forall\, \theta = (\theta_1, \theta_2, \ldots, \theta_{\mathfrak{d}}) \in \mathbb{R}^{\mathfrak{d}} \colon \mathcal{N}^{\theta,l}_{u,v} = \mathcal{N}^{(\theta_1,\theta_2,\ldots,\theta_{\mathbf{d}}),l}_{u,v}$ prove for all $\omega \in \Omega$ that
\[
\sup_{\theta \in [-B,B]^{\mathfrak{d}}} |\mathfrak{R}(\theta,\omega) - \mathcal{R}(\theta)| = \sup_{\theta \in [-B,B]^{\mathbf{d}}} |\overline{\mathfrak{R}}(\theta,\omega) - \overline{\mathcal{R}}(\theta)|. \tag{94}
\]

Next observe that (92), (93), Proposition 4.14 (with $\mathfrak{d} \leftarrow \mathbf{d}$, $L \leftarrow \mathfrak{L}$, $b \leftarrow v - u$, $\alpha \leftarrow -B$, $\beta \leftarrow B$, $\mathcal{R} \leftarrow \overline{\mathcal{R}}$, $\mathfrak{R} \leftarrow \overline{\mathfrak{R}}$ in the notation of Proposition 4.14), and the facts that $v - u \ge (u+1) - u = 1$ and $\mathbf{d} \le \mathbf{L}\|l\|_\infty(\|l\|_\infty+1) \le \mathbf{L}(\|l\|_\infty+1)^2$ demonstrate for all $p \in (0,\infty)$ that the function $\Omega \ni \omega \mapsto \sup_{\theta \in [-B,B]^{\mathbf{d}}} |\overline{\mathfrak{R}}(\theta,\omega) - \overline{\mathcal{R}}(\theta)| \in [0,\infty]$ is $\mathcal{F}/\mathcal{B}([0,\infty])$-measurable and
\[
\begin{split}
\bigl(\mathbb{E}\bigl[\sup_{\theta \in [-B,B]^{\mathbf{d}}} |\overline{\mathfrak{R}}(\theta) - \overline{\mathcal{R}}(\theta)|^p\bigr]\bigr)^{1/p}
&\le \inf_{C \in (0,\infty)}\biggl[\frac{2(C+1)(v-u)^2\sqrt{e\max\{1, p, \mathbf{d}\ln(4M\mathfrak{L}^2(2B)^2(C[v-u])^{-2})\}}}{\sqrt{M}}\biggr] \\
&\le \inf_{C \in (0,\infty)}\biggl[\frac{2(C+1)(v-u)^2\sqrt{e\max\{1, p, \mathbf{L}(\|l\|_\infty+1)^2\ln(2^4M\mathfrak{L}^2B^2C^{-2})\}}}{\sqrt{M}}\biggr].
\end{split}
\tag{95}
\]

This and (94) establish (i). In addition, combining (94)–(95) with the fact that $2^6\mathbf{L}^2 \le 2^6 \cdot 2^{2(\mathbf{L}-1)} = 2^{4+2\mathbf{L}} \le 2^{4\mathbf{L}+2\mathbf{L}} = 2^{6\mathbf{L}}$ and the facts that $3 \ge e$, $B \ge 1$, $\mathbf{L} \ge 1$, $M \ge 1$, and $b \ge 1$ shows for all $p \in (0,\infty)$ that
\[
\begin{split}
\bigl(\mathbb{E}\bigl[\sup_{\theta \in [-B,B]^{\mathfrak{d}}} |\mathfrak{R}(\theta) - \mathcal{R}(\theta)|^p\bigr]\bigr)^{1/p}
&= \bigl(\mathbb{E}\bigl[\sup_{\theta \in [-B,B]^{\mathbf{d}}} |\overline{\mathfrak{R}}(\theta) - \overline{\mathcal{R}}(\theta)|^p\bigr]\bigr)^{1/p} \\
&\le \frac{2(\tfrac{1}{2}+1)(v-u)^2\sqrt{e\max\{1, p, \mathbf{L}(\|l\|_\infty+1)^2\ln(2^4M\mathfrak{L}^2B^22^2)\}}}{\sqrt{M}} \\
&= \frac{3(v-u)^2\sqrt{e\max\{p, \mathbf{L}(\|l\|_\infty+1)^2\ln(2^6Mb^2\mathbf{L}^2(\|l\|_\infty+1)^{2\mathbf{L}}B^{2\mathbf{L}})\}}}{\sqrt{M}} \\
&\le \frac{3(v-u)^2\sqrt{e\max\{p, 3\mathbf{L}^2(\|l\|_\infty+1)^2\ln([2^{6\mathbf{L}}Mb^2(\|l\|_\infty+1)^{2\mathbf{L}}B^{2\mathbf{L}}]^{1/(3\mathbf{L})})\}}}{\sqrt{M}} \\
&\le \frac{3(v-u)^2\sqrt{3\max\{p, 3\mathbf{L}^2(\|l\|_\infty+1)^2\ln(2^2(Mb^2)^{1/(3\mathbf{L})}(\|l\|_\infty+1)B)\}}}{\sqrt{M}} \\
&\le \frac{9(v-u)^2\mathbf{L}(\|l\|_\infty+1)\sqrt{\max\{p, \ln(4(Mb)^{1/\mathbf{L}}(\|l\|_\infty+1)B)\}}}{\sqrt{M}}.
\end{split}
\tag{96}
\]

Furthermore, note that the fact that $\forall\, n \in \mathbb{N} \colon n \le 2^{n-1}$ and the fact that $\|l\|_\infty \ge 1$ imply that
\[
4(\|l\|_\infty + 1) \le 2^2 \cdot 2^{(\|l\|_\infty+1)-1} = 2^3 \cdot 2^{(\|l\|_\infty+1)-2} \le 3^2 \cdot 3^{(\|l\|_\infty+1)-2} = 3^{(\|l\|_\infty+1)}. \tag{97}
\]
This demonstrates for all $p \in (0,\infty)$ that
\[
\begin{split}
\frac{9(v-u)^2\mathbf{L}(\|l\|_\infty+1)\sqrt{\max\{p, \ln(4(Mb)^{1/\mathbf{L}}(\|l\|_\infty+1)B)\}}}{\sqrt{M}}
&\le \frac{9(v-u)^2\mathbf{L}(\|l\|_\infty+1)\sqrt{\max\{p, (\|l\|_\infty+1)\ln([3^{(\|l\|_\infty+1)}(Mb)^{1/\mathbf{L}}B]^{1/(\|l\|_\infty+1)})\}}}{\sqrt{M}} \\
&\le \frac{9(v-u)^2\mathbf{L}(\|l\|_\infty+1)^2\max\{p, \ln(3MBb)\}}{\sqrt{M}}.
\end{split}
\tag{98}
\]
Combining this with (96) shows (ii). The proof of Corollary 4.15 is thus complete.

5 Analysis of the optimisation error

The main result of this section, Proposition 5.6, establishes that the optimisation error of the Minimum Monte Carlo method applied to a Lipschitz continuous random field with a $d$-dimensional hypercube as index set, where $d \in \mathbb{N}$, converges in the probabilistically strong sense with rate $1/d$ with respect to the number of samples used, provided that the sample indices are continuous uniformly drawn from the index hypercube (cf. (ii) in Proposition 5.6). We refer to Beck, Jentzen, & Kuckuck [10, Lemmas 3.22–3.23] for analogous results for convergence in probability instead of strong convergence and to Beck et al. [8, Lemma 3.5] for a related result. Corollary 5.8 below specialises Proposition 5.6 to the case where the empirical risk from deep learning based empirical risk minimisation with quadratic loss function indexed by a hypercube of DNN parameter vectors plays the role of the random field under consideration. In the proof of Corollary 5.8 we make use of the elementary and well-known fact that this choice for the random field is indeed Lipschitz continuous, which is the assertion of Lemma 5.7. Further results on the optimisation error in the context of stochastic approximation can be found, e.g., in [2, 4, 12, 18, 25, 26, 28, 29, 38, 60, 62, 63, 65, 88, 97, 98] and the references therein.


The proof of the main result of this section, Proposition 5.6, crucially relies (cf. Lemma 5.5) on the complementary distribution function formula (cf., e.g., Elbrächter et al. [35, Lemma 2.2]) and the elementary estimate for the beta function given in Corollary 5.4. In order to prove Corollary 5.4, we first collect a few basic facts about the gamma and the beta function in the elementary and well-known Lemma 5.1 and derive from these in Proposition 5.3 further elementary and essentially well-known properties of the gamma function. In particular, the inequalities in (100) in Proposition 5.3 below are slightly reformulated versions of the well-known inequalities called Wendel's double inequality (cf. Wendel [94]) or Gautschi's double inequality (cf. Gautschi [40]); cf., e.g., Qi [81, Subsection 2.1 and Subsection 2.4].

5.1 Properties of the gamma and the beta function

Lemma 5.1. Let $\Gamma \colon (0,\infty) \to (0,\infty)$ satisfy for all $x \in (0,\infty)$ that $\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,dt$ and let $\mathrm{B} \colon (0,\infty)^2 \to (0,\infty)$ satisfy for all $x, y \in (0,\infty)$ that $\mathrm{B}(x,y) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt$. Then

(i) it holds for all $x \in (0,\infty)$ that $\Gamma(x+1) = x\Gamma(x)$,

(ii) it holds that $\Gamma(1) = \Gamma(2) = 1$, and

(iii) it holds for all $x, y \in (0,\infty)$ that $\mathrm{B}(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$.

Lemma 5.2. It holds for all $\alpha, x \in [0,1]$ that $(1-x)^\alpha \le 1 - \alpha x$.

Proof of Lemma 5.2. Note that the fact that for every $y \in [0,\infty)$ it holds that the function $[0,\infty) \ni z \mapsto y^z \in [0,\infty)$ is a convex function implies for all $\alpha, x \in [0,1]$ that
\[
(1-x)^\alpha = (1-x)^{\alpha \cdot 1 + (1-\alpha) \cdot 0} \le \alpha(1-x)^1 + (1-\alpha)(1-x)^0 = \alpha - \alpha x + 1 - \alpha = 1 - \alpha x. \tag{99}
\]
The proof of Lemma 5.2 is thus complete.
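A brute-force check of this inequality over a grid (illustrative only, with a small tolerance for floating-point rounding):

```python
# Hedged numerical check of Lemma 5.2: (1 - x)^alpha <= 1 - alpha*x for all
# alpha, x in [0,1], verified on a 51 x 51 grid.
pts = [k / 50 for k in range(51)]
violations = [(a, x) for a in pts for x in pts
              if (1 - x) ** a > 1 - a * x + 1e-12]
assert violations == []
```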

Proposition 5.3. Let $\Gamma \colon (0,\infty) \to (0,\infty)$ satisfy for all $x \in (0,\infty)$ that $\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,dt$ and let $z_\cdot \colon (0,\infty) \to \mathbb{N}_0$ satisfy for all $x \in (0,\infty)$ that $z_x = \max([0,x) \cap \mathbb{N}_0)$. Then

(i) it holds that $\Gamma \colon (0,\infty) \to (0,\infty)$ is a convex function,

(ii) it holds for all $x \in (0,\infty)$ that $\Gamma(x+1) = x\Gamma(x) \le x^{z_x} \le \max\{1, x^x\}$,

(iii) it holds for all $x \in (0,\infty)$, $\alpha \in [0,1]$ that
\[
(\max\{x + \alpha - 1, 0\})^\alpha \le \frac{x}{(x+\alpha)^{1-\alpha}} \le \frac{\Gamma(x+\alpha)}{\Gamma(x)} \le x^\alpha, \tag{100}
\]
and

(iv) it holds for all $x \in (0,\infty)$, $\alpha \in [0,\infty)$ that
\[
(\max\{x + \min\{\alpha - 1, 0\}, 0\})^\alpha \le \frac{\Gamma(x+\alpha)}{\Gamma(x)} \le (x + \max\{\alpha - 1, 0\})^\alpha. \tag{101}
\]


Proof of Proposition 5.3. First, observe that the fact that for every t ∈ (0,∞) it holds that the function ℝ ∋ x ↦ t^x ∈ (0,∞) is a convex function implies for all x, y ∈ (0,∞), α ∈ [0, 1] that

Γ(αx + (1−α)y) = ∫_0^∞ t^{αx+(1−α)y−1} e^{−t} dt = ∫_0^∞ t^{αx+(1−α)y} t^{−1} e^{−t} dt
≤ ∫_0^∞ (α t^x + (1−α) t^y) t^{−1} e^{−t} dt
= α ∫_0^∞ t^{x−1} e^{−t} dt + (1−α) ∫_0^∞ t^{y−1} e^{−t} dt
= αΓ(x) + (1−α)Γ(y).   (102)

This shows (i).

Second, note that (ii) in Lemma 5.1 and (i) establish for all α ∈ [0, 1] that

Γ(α+1) = Γ(α·2 + (1−α)·1) ≤ αΓ(2) + (1−α)Γ(1) = α + (1−α) = 1.   (103)

This yields for all x ∈ (0, 1] that

Γ(x+1) ≤ 1 = x^{z_x} = max{1, x^x}.   (104)

Induction, (i) in Lemma 5.1, and the fact that ∀ x ∈ (0,∞): x − z_x ∈ (0, 1] hence ensure for all x ∈ [1,∞) that

Γ(x+1) = [∏_{i=1}^{z_x} (x−i+1)] Γ(x − z_x + 1) ≤ x^{z_x} Γ(x − z_x + 1) ≤ x^{z_x} ≤ x^x = max{1, x^x}.   (105)

Combining this with again (i) in Lemma 5.1 and (104) establishes (ii).

Third, note that Hölder's inequality and (i) in Lemma 5.1 prove for all x ∈ (0,∞), α ∈ [0, 1] that

Γ(x+α) = ∫_0^∞ t^{x+α−1} e^{−t} dt = ∫_0^∞ t^{αx} e^{−αt} t^{(1−α)x−(1−α)} e^{−(1−α)t} dt
= ∫_0^∞ [t^x e^{−t}]^α [t^{x−1} e^{−t}]^{1−α} dt
≤ (∫_0^∞ t^x e^{−t} dt)^α (∫_0^∞ t^{x−1} e^{−t} dt)^{1−α}
= [Γ(x+1)]^α [Γ(x)]^{1−α} = x^α [Γ(x)]^α [Γ(x)]^{1−α} = x^α Γ(x).   (106)

This and again (i) in Lemma 5.1 demonstrate for all x ∈ (0,∞), α ∈ [0, 1] that

xΓ(x) = Γ(x+1) = Γ(x + α + (1−α)) ≤ (x+α)^{1−α} Γ(x+α).   (107)

Combining (106) and (107) yields for all x ∈ (0,∞), α ∈ [0, 1] that

x/(x+α)^{1−α} ≤ Γ(x+α)/Γ(x) ≤ x^α.   (108)

Furthermore, observe that (i) in Lemma 5.1 and (108) imply for all x ∈ (0,∞), α ∈ [0, 1] that

Γ(x+α)/Γ(x+1) = Γ(x+α)/(xΓ(x)) ≤ x^{α−1}.   (109)


This shows for all α ∈ [0, 1], x ∈ (α,∞) that

Γ(x)/Γ(x+(1−α)) = Γ((x−α)+α)/Γ((x−α)+1) ≤ (x−α)^{α−1} = 1/(x−α)^{1−α}.   (110)

This, in turn, ensures for all α ∈ [0, 1], x ∈ (1−α,∞) that

(x+α−1)^α = (x−(1−α))^α ≤ Γ(x+α)/Γ(x).   (111)

Next note that Lemma 5.2 proves for all x ∈ (0,∞), α ∈ [0, 1] that

(max{x+α−1, 0})^α = (x+α)^α (max{x+α−1, 0}/(x+α))^α = (x+α)^α (max{1 − 1/(x+α), 0})^α
≤ (x+α)^α (1 − α/(x+α)) = (x+α)^α (x/(x+α)) = x/(x+α)^{1−α}.   (112)

This and (108) establish (iii).

Fourth, we show (iv). For this let ⌊·⌋: [0,∞) → ℕ_0 satisfy for all x ∈ [0,∞) that ⌊x⌋ = max([0, x] ∩ ℕ_0). Observe that induction, (i) in Lemma 5.1, the fact that ∀ α ∈ [0,∞): α − ⌊α⌋ ∈ [0, 1), and (iii) demonstrate for all x ∈ (0,∞), α ∈ [0,∞) that

Γ(x+α)/Γ(x) = [∏_{i=1}^{⌊α⌋} (x+α−i)] Γ(x+α−⌊α⌋)/Γ(x)
≤ [∏_{i=1}^{⌊α⌋} (x+α−i)] x^{α−⌊α⌋}
≤ (x+α−1)^{⌊α⌋} x^{α−⌊α⌋}
≤ (x + max{α−1, 0})^{⌊α⌋} (x + max{α−1, 0})^{α−⌊α⌋}
= (x + max{α−1, 0})^α.   (113)

Furthermore, again the fact that ∀ α ∈ [0,∞): α − ⌊α⌋ ∈ [0, 1), (iii), induction, and (i) in Lemma 5.1 imply for all x ∈ (0,∞), α ∈ [0,∞) that

Γ(x+α)/Γ(x) = Γ(x + ⌊α⌋ + α − ⌊α⌋)/Γ(x)
≥ (max{x + ⌊α⌋ + α − ⌊α⌋ − 1, 0})^{α−⌊α⌋} [Γ(x+⌊α⌋)/Γ(x)]
= (max{x+α−1, 0})^{α−⌊α⌋} [∏_{i=1}^{⌊α⌋} (x+⌊α⌋−i)] Γ(x)/Γ(x)
≥ (max{x+α−1, 0})^{α−⌊α⌋} x^{⌊α⌋}
= (max{x+α−1, 0})^{α−⌊α⌋} (max{x, 0})^{⌊α⌋}
≥ (max{x + min{α−1, 0}, 0})^{α−⌊α⌋} (max{x + min{α−1, 0}, 0})^{⌊α⌋}
= (max{x + min{α−1, 0}, 0})^α.   (114)

Combining this with (113) shows (iv). The proof of Proposition 5.3 is thus complete.
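The double inequalities (100) and (101) can be probed numerically on a small grid; the following informal check (not a proof) uses `math.gamma`:

```python
import math

xs = [0.2, 0.7, 1.0, 2.3, 5.0]

# item (iii): (max{x+a-1, 0})^a <= x/(x+a)^(1-a) <= Gamma(x+a)/Gamma(x) <= x^a
for x in xs:
    for a in [0.0, 0.25, 0.5, 0.75, 1.0]:
        ratio = math.gamma(x + a) / math.gamma(x)
        assert max(x + a - 1.0, 0.0) ** a <= x / (x + a) ** (1.0 - a) * (1 + 1e-9)
        assert x / (x + a) ** (1.0 - a) <= ratio * (1 + 1e-9)
        assert ratio <= x ** a * (1 + 1e-9)

# item (iv): (max{x+min{a-1,0}, 0})^a <= Gamma(x+a)/Gamma(x) <= (x+max{a-1,0})^a
for x in xs:
    for a in [0.0, 0.5, 1.0, 1.7, 2.5, 4.0]:
        ratio = math.gamma(x + a) / math.gamma(x)
        assert max(x + min(a - 1.0, 0.0), 0.0) ** a <= ratio * (1 + 1e-9)
        assert ratio <= (x + max(a - 1.0, 0.0)) ** a * (1 + 1e-9)
```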


Corollary 5.4. Let B: (0,∞)^2 → (0,∞) satisfy for all x, y ∈ (0,∞) that B(x, y) = ∫_0^1 t^{x−1} (1−t)^{y−1} dt and let Γ: (0,∞) → (0,∞) satisfy for all x ∈ (0,∞) that Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt. Then it holds for all x, y ∈ (0,∞) with x + y > 1 that

Γ(x)/(y + max{x−1, 0})^x ≤ B(x, y) ≤ Γ(x)/(y + min{x−1, 0})^x ≤ max{1, x^x}/(x (y + min{x−1, 0})^x).   (115)

Proof of Corollary 5.4. Note that (iii) in Lemma 5.1 ensures for all x, y ∈ (0,∞) that

B(x, y) = Γ(x)Γ(y)/Γ(y+x).   (116)

In addition, observe that it holds for all x, y ∈ (0,∞) with x + y > 1 that y + min{x−1, 0} > 0. This and (iv) in Proposition 5.3 demonstrate for all x, y ∈ (0,∞) with x + y > 1 that

0 < (y + min{x−1, 0})^x ≤ Γ(y+x)/Γ(y) ≤ (y + max{x−1, 0})^x.   (117)

Combining this with (116) and (ii) in Proposition 5.3 shows for all x, y ∈ (0,∞) with x + y > 1 that

Γ(x)/(y + max{x−1, 0})^x ≤ B(x, y) ≤ Γ(x)/(y + min{x−1, 0})^x ≤ max{1, x^x}/(x (y + min{x−1, 0})^x).   (118)

The proof of Corollary 5.4 is thus complete.
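Similarly, the beta-function bounds (115) can be spot-checked (illustration only), computing B(x, y) exactly via item (iii) of Lemma 5.1:

```python
import math

def beta(x, y):
    # B(x, y) via item (iii) of Lemma 5.1
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

for x, y in [(0.5, 0.7), (1.5, 3.0), (2.5, 0.8), (4.0, 10.0)]:
    assert x + y > 1  # the hypothesis of Corollary 5.4
    lower = math.gamma(x) / (y + max(x - 1.0, 0.0)) ** x
    upper = math.gamma(x) / (y + min(x - 1.0, 0.0)) ** x
    cap = max(1.0, x ** x) / (x * (y + min(x - 1.0, 0.0)) ** x)
    assert lower <= beta(x, y) * (1 + 1e-9)
    assert beta(x, y) <= upper * (1 + 1e-9)
    assert upper <= cap * (1 + 1e-9)
```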

5.2 Strong convergence rates for the optimisation error

Lemma 5.5. Let K ∈ ℕ, p, L ∈ (0,∞), let (E, δ) be a metric space, let (Ω, F, ℙ) be a probability space, let R: E × Ω → ℝ be a (B(E) ⊗ F)/B(ℝ)-measurable function, assume for all x, y ∈ E, ω ∈ Ω that |R(x, ω) − R(y, ω)| ≤ L δ(x, y), and let X_k: Ω → E, k ∈ {1, 2, ..., K}, be i.i.d. random variables. Then it holds for all x ∈ E that

𝔼[min_{k ∈ {1, 2, ..., K}} |R(X_k) − R(x)|^p] ≤ L^p ∫_0^∞ [ℙ(δ(X_1, x) > ε^{1/p})]^K dε.   (119)

Proof of Lemma 5.5. Throughout this proof let x ∈ E and let Y: Ω → [0,∞) be the function which satisfies for all ω ∈ Ω that Y(ω) = min_{k ∈ {1, 2, ..., K}} [δ(X_k(ω), x)]^p. Observe that the fact that Y is a random variable, the assumption that ∀ x, y ∈ E, ω ∈ Ω: |R(x, ω) − R(y, ω)| ≤ L δ(x, y), and the complementary distribution function formula (see, e.g., Elbrachter et al. [35, Lemma 2.2]) demonstrate that

𝔼[min_{k ∈ {1, 2, ..., K}} |R(X_k) − R(x)|^p] ≤ L^p 𝔼[min_{k ∈ {1, 2, ..., K}} [δ(X_k, x)]^p] = L^p 𝔼[Y]
= L^p ∫_0^∞ y ℙ_Y(dy) = L^p ∫_0^∞ ℙ_Y((ε,∞)) dε
= L^p ∫_0^∞ ℙ(Y > ε) dε = L^p ∫_0^∞ ℙ(min_{k ∈ {1, 2, ..., K}} [δ(X_k, x)]^p > ε) dε.   (120)

Moreover, the assumption that X_k, k ∈ {1, 2, ..., K}, are i.i.d. random variables shows for all ε ∈ (0,∞) that

ℙ(min_{k ∈ {1, 2, ..., K}} [δ(X_k, x)]^p > ε) = ℙ(∀ k ∈ {1, 2, ..., K}: [δ(X_k, x)]^p > ε)
= ∏_{k=1}^K ℙ([δ(X_k, x)]^p > ε) = [ℙ([δ(X_1, x)]^p > ε)]^K = [ℙ(δ(X_1, x) > ε^{1/p})]^K.   (121)

Combining (120) with (121) proves (119). The proof of Lemma 5.5 is thus complete.


Proposition 5.6. Let d, K ∈ ℕ, L, α ∈ ℝ, β ∈ (α,∞), let (Ω, F, ℙ) be a probability space, let R: [α, β]^d × Ω → ℝ be a random field, assume for all θ, ϑ ∈ [α, β]^d, ω ∈ Ω that |R(θ, ω) − R(ϑ, ω)| ≤ L‖θ − ϑ‖_∞, let Θ_k: Ω → [α, β]^d, k ∈ {1, 2, ..., K}, be i.i.d. random variables, and assume that Θ_1 is continuous uniformly distributed on [α, β]^d (cf. Definition 3.1). Then

(i) it holds that R is a (B([α, β]^d) ⊗ F)/B(ℝ)-measurable function and

(ii) it holds for all θ ∈ [α, β]^d, p ∈ (0,∞) that

(𝔼[min_{k ∈ {1, 2, ..., K}} |R(Θ_k) − R(θ)|^p])^{1/p} ≤ L(β−α) max{1, (p/d)^{1/d}}/K^{1/d} ≤ L(β−α) max{1, p}/K^{1/d}.   (122)

Proof of Proposition 5.6. Throughout this proof assume w.l.o.g. that L > 0, let δ: ([α, β]^d) × ([α, β]^d) → [0,∞) satisfy for all θ, ϑ ∈ [α, β]^d that δ(θ, ϑ) = ‖θ − ϑ‖_∞, let B: (0,∞)^2 → (0,∞) satisfy for all x, y ∈ (0,∞) that B(x, y) = ∫_0^1 t^{x−1} (1−t)^{y−1} dt, and let Θ_{1,1}, Θ_{1,2}, ..., Θ_{1,d}: Ω → [α, β] satisfy Θ_1 = (Θ_{1,1}, Θ_{1,2}, ..., Θ_{1,d}). First of all, note that the assumption that ∀ θ, ϑ ∈ [α, β]^d, ω ∈ Ω: |R(θ, ω) − R(ϑ, ω)| ≤ L‖θ − ϑ‖_∞ ensures for all ω ∈ Ω that the function [α, β]^d ∋ θ ↦ R(θ, ω) ∈ ℝ is continuous. Combining this with the fact that ([α, β]^d, δ) is a separable metric space, the fact that for every θ ∈ [α, β]^d it holds that the function Ω ∋ ω ↦ R(θ, ω) ∈ ℝ is F/B(ℝ)-measurable, and, e.g., Aliprantis & Border [1, Lemma 4.51] (see also, e.g., Beck et al. [8, Lemma 2.4]) proves (i). Next observe that it holds for all θ ∈ [α, β], ε ∈ [0,∞) that

min{θ+ε, β} − max{θ−ε, α} = min{θ+ε, β} + min{ε−θ, −α}
= min{θ + ε + min{ε−θ, −α}, β + min{ε−θ, −α}}
= min{min{2ε, θ−α+ε}, min{β−θ+ε, β−α}}
≥ min{min{2ε, α−α+ε}, min{β−β+ε, β−α}}
= min{2ε, ε, ε, β−α} = min{ε, β−α}.   (123)

The assumption that Θ_1 is continuous uniformly distributed on [α, β]^d hence shows for all θ = (θ_1, θ_2, ..., θ_d) ∈ [α, β]^d, ε ∈ [0,∞) that

ℙ(‖Θ_1 − θ‖_∞ ≤ ε) = ℙ(max_{i ∈ {1, 2, ..., d}} |Θ_{1,i} − θ_i| ≤ ε)
= ℙ(∀ i ∈ {1, 2, ..., d}: −ε ≤ Θ_{1,i} − θ_i ≤ ε)
= ℙ(∀ i ∈ {1, 2, ..., d}: θ_i − ε ≤ Θ_{1,i} ≤ θ_i + ε)
= ℙ(∀ i ∈ {1, 2, ..., d}: max{θ_i − ε, α} ≤ Θ_{1,i} ≤ min{θ_i + ε, β})
= ℙ(Θ_1 ∈ [×_{i=1}^d [max{θ_i − ε, α}, min{θ_i + ε, β}]])
= (β−α)^{−d} ∏_{i=1}^d (min{θ_i + ε, β} − max{θ_i − ε, α})
≥ (β−α)^{−d} [min{ε, β−α}]^d = min{1, ε^d/(β−α)^d}.   (124)

Therefore, we obtain for all θ ∈ [α, β]^d, p ∈ (0,∞), ε ∈ [0,∞) that

ℙ(‖Θ_1 − θ‖_∞ > ε^{1/p}) = 1 − ℙ(‖Θ_1 − θ‖_∞ ≤ ε^{1/p}) ≤ 1 − min{1, ε^{d/p}/(β−α)^d} = max{0, 1 − ε^{d/p}/(β−α)^d}.   (125)


This, (i), the assumption that ∀ θ, ϑ ∈ [α, β]^d, ω ∈ Ω: |R(θ, ω) − R(ϑ, ω)| ≤ L‖θ − ϑ‖_∞, the assumption that Θ_k, k ∈ {1, 2, ..., K}, are i.i.d. random variables, and Lemma 5.5 (with (E, δ) ← ([α, β]^d, δ), (X_k)_{k ∈ {1, 2, ..., K}} ← (Θ_k)_{k ∈ {1, 2, ..., K}} in the notation of Lemma 5.5) establish for all θ ∈ [α, β]^d, p ∈ (0,∞) that

𝔼[min_{k ∈ {1, 2, ..., K}} |R(Θ_k) − R(θ)|^p] ≤ L^p ∫_0^∞ [ℙ(‖Θ_1 − θ‖_∞ > ε^{1/p})]^K dε
≤ L^p ∫_0^∞ [max{0, 1 − ε^{d/p}/(β−α)^d}]^K dε = L^p ∫_0^{(β−α)^p} (1 − ε^{d/p}/(β−α)^d)^K dε
= (p/d) L^p (β−α)^p ∫_0^1 t^{p/d−1} (1−t)^K dt = (p/d) L^p (β−α)^p ∫_0^1 t^{p/d−1} (1−t)^{K+1−1} dt
= (p/d) L^p (β−α)^p B(p/d, K+1).   (126)

Corollary 5.4 (with x ← p/d, y ← K+1 for p ∈ (0,∞) in the notation of (115) in Corollary 5.4) hence demonstrates for all θ ∈ [α, β]^d, p ∈ (0,∞) that

𝔼[min_{k ∈ {1, 2, ..., K}} |R(Θ_k) − R(θ)|^p] ≤ (p/d) L^p (β−α)^p max{1, (p/d)^{p/d}}/((p/d)(K + 1 + min{p/d − 1, 0})^{p/d}) ≤ L^p (β−α)^p max{1, (p/d)^{p/d}}/K^{p/d}.   (127)

This implies for all θ ∈ [α, β]^d, p ∈ (0,∞) that

(𝔼[min_{k ∈ {1, 2, ..., K}} |R(Θ_k) − R(θ)|^p])^{1/p} ≤ L(β−α) max{1, (p/d)^{1/d}}/K^{1/d} ≤ L(β−α) max{1, p}/K^{1/d}.   (128)

This shows (ii) and thus completes the proof of Proposition 5.6.
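The rate in (122) can be illustrated by a small Monte Carlo experiment. The sketch below uses the hypothetical Lipschitz risk R(θ) = L‖θ − θ*‖_∞ (deterministic in ω, which satisfies the assumptions of Proposition 5.6 trivially) and checks that the simulated L^p-error of the best of K uniform draws stays below the bound L(β−α) max{1, p}/K^{1/d}:

```python
import random

def mc_min_error(d, K, p, lip, alpha, beta, trials=2000, seed=0):
    # Monte Carlo estimate of (E[min_k |R(Theta_k) - R(theta*)|^p])^(1/p)
    # for R(theta) = lip * ||theta - theta*||_inf with theta* the midpoint
    # of [alpha, beta]^d (a hypothetical choice, for illustration only).
    rng = random.Random(seed)
    target = [(alpha + beta) / 2.0] * d
    acc = 0.0
    for _ in range(trials):
        best = float("inf")
        for _ in range(K):
            theta = [rng.uniform(alpha, beta) for _ in range(d)]
            best = min(best, lip * max(abs(theta[i] - target[i]) for i in range(d)))
        acc += best ** p
    return (acc / trials) ** (1.0 / p)

d, K, p, lip, alpha, beta = 2, 64, 2.0, 1.0, -1.0, 1.0
estimate = mc_min_error(d, K, p, lip, alpha, beta)
bound = lip * (beta - alpha) * max(1.0, p) / K ** (1.0 / d)
assert 0.0 < estimate <= bound  # the strong error bound (122) holds here
```

The simulated error is well below the bound, which is consistent with the estimate being non-optimal, as discussed in the abstract.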

Lemma 5.7. Let d, 𝔡, L, M ∈ ℕ, B, b ∈ [1,∞), u ∈ ℝ, v ∈ (u,∞), l = (l_0, l_1, ..., l_L) ∈ ℕ^{L+1}, D ⊆ [−b, b]^d, assume l_0 = d, l_L = 1, and 𝔡 ≥ ∑_{i=1}^L l_i(l_{i−1}+1), let Ω be a set, let X_j: Ω → D, j ∈ {1, 2, ..., M}, and Y_j: Ω → [u, v], j ∈ {1, 2, ..., M}, be functions, and let R: [−B, B]^𝔡 × Ω → [0,∞) satisfy for all θ ∈ [−B, B]^𝔡, ω ∈ Ω that

R(θ, ω) = (1/M) [∑_{j=1}^M |N^{θ,l}_{u,v}(X_j(ω)) − Y_j(ω)|^2]   (129)

(cf. Definition 2.8). Then it holds for all θ, ϑ ∈ [−B, B]^𝔡, ω ∈ Ω that

|R(θ, ω) − R(ϑ, ω)| ≤ 2(v−u) b L (‖l‖_∞ + 1)^L B^{L−1} ‖θ − ϑ‖_∞   (130)

(cf. Definition 3.1).

Proof of Lemma 5.7. Observe that the fact that ∀ x_1, x_2, y ∈ ℝ: (x_1 − y)^2 − (x_2 − y)^2 = (x_1 − x_2)((x_1 − y) + (x_2 − y)), the fact that ∀ θ ∈ ℝ^𝔡, x ∈ ℝ^d: N^{θ,l}_{u,v}(x) ∈ [u, v], and the assumption that ∀ j ∈ {1, 2, ..., M}, ω ∈ Ω: Y_j(ω) ∈ [u, v] prove for all θ, ϑ ∈ [−B, B]^𝔡, ω ∈ Ω that

|R(θ, ω) − R(ϑ, ω)|
= (1/M) |[∑_{j=1}^M |N^{θ,l}_{u,v}(X_j(ω)) − Y_j(ω)|^2] − [∑_{j=1}^M |N^{ϑ,l}_{u,v}(X_j(ω)) − Y_j(ω)|^2]|
≤ (1/M) [∑_{j=1}^M |[N^{θ,l}_{u,v}(X_j(ω)) − Y_j(ω)]^2 − [N^{ϑ,l}_{u,v}(X_j(ω)) − Y_j(ω)]^2|]
= (1/M) [∑_{j=1}^M (|N^{θ,l}_{u,v}(X_j(ω)) − N^{ϑ,l}_{u,v}(X_j(ω))| · |[N^{θ,l}_{u,v}(X_j(ω)) − Y_j(ω)] + [N^{ϑ,l}_{u,v}(X_j(ω)) − Y_j(ω)]|)]
≤ (2/M) [∑_{j=1}^M ([sup_{x ∈ D} |N^{θ,l}_{u,v}(x) − N^{ϑ,l}_{u,v}(x)|][sup_{y_1, y_2 ∈ [u,v]} |y_1 − y_2|])]
= 2(v−u) [sup_{x ∈ D} |N^{θ,l}_{u,v}(x) − N^{ϑ,l}_{u,v}(x)|].   (131)

In addition, combining the assumptions that D ⊆ [−b, b]^d, 𝔡 ≥ ∑_{i=1}^L l_i(l_{i−1}+1), l_0 = d, l_L = 1, b ≥ 1, and B ≥ 1 with Beck, Jentzen, & Kuckuck [10, Corollary 2.37] (with a ← −b, b ← b, u ← u, v ← v, d ← 𝔡, L ← L, l ← l in the notation of [10, Corollary 2.37]) shows for all θ, ϑ ∈ [−B, B]^𝔡 that

sup_{x ∈ D} |N^{θ,l}_{u,v}(x) − N^{ϑ,l}_{u,v}(x)| ≤ sup_{x ∈ [−b,b]^d} |N^{θ,l}_{u,v}(x) − N^{ϑ,l}_{u,v}(x)|
≤ L max{1, b} (‖l‖_∞ + 1)^L (max{1, ‖θ‖_∞, ‖ϑ‖_∞})^{L−1} ‖θ − ϑ‖_∞
≤ bL(‖l‖_∞ + 1)^L B^{L−1} ‖θ − ϑ‖_∞.   (132)

This and (131) imply for all θ, ϑ ∈ [−B, B]^𝔡, ω ∈ Ω that

|R(θ, ω) − R(ϑ, ω)| ≤ 2(v−u) b L (‖l‖_∞ + 1)^L B^{L−1} ‖θ − ϑ‖_∞.   (133)

The proof of Lemma 5.7 is thus complete.
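As an informal illustration of the Lipschitz estimate (130), the sketch below assumes that N^{θ,l}_{u,v} in Definition 2.8 (not reproduced in this section) is the realisation of a fully connected feed-forward network with ReLU activations on the hidden layers and scalar output clipped to [u, v], with θ read layer-wise (weights row-major, then biases); this reading of the definition is an assumption of the sketch. It then checks (130) on random parameter pairs for a small architecture:

```python
import random

def realise(theta, l, u, v, x):
    # Hypothetical reading of Definition 2.8: fully connected net, ReLU on
    # hidden layers, scalar output clipped to [u, v]; theta holds, for each
    # layer i, an l_i x l_{i-1} weight matrix (row-major) followed by l_i biases.
    offset, a = 0, list(x)
    for i in range(1, len(l)):
        rows, cols = l[i], l[i - 1]
        z = []
        for r in range(rows):
            w = theta[offset + r * cols : offset + (r + 1) * cols]
            bias = theta[offset + rows * cols + r]
            z.append(sum(wj * aj for wj, aj in zip(w, a)) + bias)
        offset += rows * cols + rows
        a = [max(s, 0.0) for s in z] if i < len(l) - 1 else z
    return min(max(a[0], u), v)

rng = random.Random(1)
l, u, v, B, b, M = (2, 3, 1), 0.0, 1.0, 1.0, 1.0, 20
dim = sum(l[i] * (l[i - 1] + 1) for i in range(1, len(l)))  # = 13 parameters
xs = [[rng.uniform(-b, b) for _ in range(l[0])] for _ in range(M)]
ys = [rng.uniform(u, v) for _ in range(M)]

def risk(th):  # empirical risk (129)
    return sum((realise(th, l, u, v, x) - y) ** 2 for x, y in zip(xs, ys)) / M

L = len(l) - 1
lip = 2 * (v - u) * b * L * (max(l) + 1) ** L * B ** (L - 1)  # constant in (130)
for _ in range(200):
    t1 = [rng.uniform(-B, B) for _ in range(dim)]
    t2 = [rng.uniform(-B, B) for _ in range(dim)]
    gap = max(abs(p1 - p2) for p1, p2 in zip(t1, t2))
    assert abs(risk(t1) - risk(t2)) <= lip * gap
```

Here ‖l‖_∞ = 3 and B = b = 1, so the constant in (130) equals 64; the bound is far from tight on random pairs, but the Lipschitz property is confirmed.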

Corollary 5.8. Let d, 𝔡, 𝕕, L, M, K ∈ ℕ, B, b ∈ [1,∞), u ∈ ℝ, v ∈ (u,∞), l = (l_0, l_1, ..., l_L) ∈ ℕ^{L+1}, D ⊆ [−b, b]^d, assume l_0 = d, l_L = 1, and 𝕕 ≥ 𝔡 = ∑_{i=1}^L l_i(l_{i−1}+1), let (Ω, F, ℙ) be a probability space, let Θ_k: Ω → [−B, B]^𝕕, k ∈ {1, 2, ..., K}, be i.i.d. random variables, assume that Θ_1 is continuous uniformly distributed on [−B, B]^𝕕, let X_j: Ω → D, j ∈ {1, 2, ..., M}, and Y_j: Ω → [u, v], j ∈ {1, 2, ..., M}, be random variables, and let R: [−B, B]^𝕕 × Ω → [0,∞) satisfy for all θ ∈ [−B, B]^𝕕, ω ∈ Ω that

R(θ, ω) = (1/M) [∑_{j=1}^M |N^{θ,l}_{u,v}(X_j(ω)) − Y_j(ω)|^2]   (134)

(cf. Definition 2.8). Then

(i) it holds that R is a (B([−B, B]^𝕕) ⊗ F)/B([0,∞))-measurable function and

(ii) it holds for all θ ∈ [−B, B]^𝕕, p ∈ (0,∞) that

(𝔼[min_{k ∈ {1, 2, ..., K}} |R(Θ_k) − R(θ)|^p])^{1/p}   (135)
≤ 4(v−u) b L (‖l‖_∞ + 1)^L B^L √(max{1, p/𝔡})/K^{1/𝔡} ≤ 4(v−u) b L (‖l‖_∞ + 1)^L B^L max{1, p}/K^{[L^{−1}(‖l‖_∞+1)^{−2}]}

(cf. Definition 3.1).


Proof of Corollary 5.8. Throughout this proof let 𝐋 ∈ ℝ be given by 𝐋 = 2(v−u) b L (‖l‖_∞ + 1)^L B^{L−1}, let P: [−B, B]^𝕕 → [−B, B]^𝔡 satisfy for all θ = (θ_1, θ_2, ..., θ_𝕕) ∈ [−B, B]^𝕕 that P(θ) = (θ_1, θ_2, ..., θ_𝔡), and let 𝐑: [−B, B]^𝔡 × Ω → ℝ satisfy for all θ ∈ [−B, B]^𝔡, ω ∈ Ω that

𝐑(θ, ω) = (1/M) [∑_{j=1}^M |N^{θ,l}_{u,v}(X_j(ω)) − Y_j(ω)|^2].   (136)

Note that the fact that ∀ θ ∈ [−B, B]^𝕕: N^{θ,l}_{u,v} = N^{P(θ),l}_{u,v} implies for all θ ∈ [−B, B]^𝕕, ω ∈ Ω that

R(θ, ω) = (1/M) [∑_{j=1}^M |N^{θ,l}_{u,v}(X_j(ω)) − Y_j(ω)|^2] = (1/M) [∑_{j=1}^M |N^{P(θ),l}_{u,v}(X_j(ω)) − Y_j(ω)|^2] = 𝐑(P(θ), ω).   (137)

Furthermore, Lemma 5.7 (with 𝔡 ← 𝔡, R ← ([−B, B]^𝔡 × Ω ∋ (θ, ω) ↦ 𝐑(θ, ω) ∈ [0,∞)) in the notation of Lemma 5.7) demonstrates for all θ, ϑ ∈ [−B, B]^𝔡, ω ∈ Ω that

|𝐑(θ, ω) − 𝐑(ϑ, ω)| ≤ 2(v−u) b L (‖l‖_∞ + 1)^L B^{L−1} ‖θ − ϑ‖_∞ = 𝐋‖θ − ϑ‖_∞.   (138)

Moreover, observe that the assumption that X_j, j ∈ {1, 2, ..., M}, and Y_j, j ∈ {1, 2, ..., M}, are random variables ensures that 𝐑: [−B, B]^𝔡 × Ω → ℝ is a random field. This, (138), the fact that P ∘ Θ_k: Ω → [−B, B]^𝔡, k ∈ {1, 2, ..., K}, are i.i.d. random variables, the fact that P ∘ Θ_1 is continuous uniformly distributed on [−B, B]^𝔡, and Proposition 5.6 (with d ← 𝔡, α ← −B, β ← B, R ← 𝐑, (Θ_k)_{k ∈ {1, 2, ..., K}} ← (P ∘ Θ_k)_{k ∈ {1, 2, ..., K}} in the notation of Proposition 5.6) prove for all θ ∈ [−B, B]^𝕕, p ∈ (0,∞) that 𝐑 is a (B([−B, B]^𝔡) ⊗ F)/B(ℝ)-measurable function and

(𝔼[min_{k ∈ {1, 2, ..., K}} |𝐑(P(Θ_k)) − 𝐑(P(θ))|^p])^{1/p} ≤ 𝐋(2B) max{1, (p/𝔡)^{1/𝔡}}/K^{1/𝔡} = 4(v−u) b L (‖l‖_∞ + 1)^L B^L max{1, (p/𝔡)^{1/𝔡}}/K^{1/𝔡}.   (139)

The fact that P is a B([−B, B]^𝕕)/B([−B, B]^𝔡)-measurable function and (137) hence show (i). In addition, (137), (139), and the fact that 2 ≤ 𝔡 = ∑_{i=1}^L l_i(l_{i−1}+1) ≤ L(‖l‖_∞ + 1)^2 yield for all θ ∈ [−B, B]^𝕕, p ∈ (0,∞) that

(𝔼[min_{k ∈ {1, 2, ..., K}} |R(Θ_k) − R(θ)|^p])^{1/p} = (𝔼[min_{k ∈ {1, 2, ..., K}} |𝐑(P(Θ_k)) − 𝐑(P(θ))|^p])^{1/p}   (140)
≤ 4(v−u) b L (‖l‖_∞ + 1)^L B^L √(max{1, p/𝔡})/K^{1/𝔡} ≤ 4(v−u) b L (‖l‖_∞ + 1)^L B^L max{1, p}/K^{[L^{−1}(‖l‖_∞+1)^{−2}]}.

This establishes (ii). The proof of Corollary 5.8 is thus complete.

6 Analysis of the overall error

In Subsection 6.2 below we present the main result of this article, Theorem 6.5, which provides an estimate for the overall L²-error arising in deep learning based empirical risk minimisation with quadratic loss function in the probabilistically strong sense and which covers the case where the underlying DNNs are trained using a general stochastic optimisation algorithm with random initialisation.


In order to prove Theorem 6.5, we require a link to combine the results from Sections 3, 4, and 5, which is given in Subsection 6.1 below. More specifically, Proposition 6.1 in Subsection 6.1 shows that the overall error can be decomposed into three different error sources: the approximation error (cf. Section 3), the worst-case generalisation error (cf. Section 4), and the optimisation error (cf. Section 5). Proposition 6.1 is a consequence of the well-known bias–variance decomposition (cf., e.g., Beck, Jentzen, & Kuckuck [10, Lemma 4.1] or Berner, Grohs, & Jentzen [13, Lemma 2.2]) and is very similar to [10, Lemma 4.3].

Thereafter, Subsection 6.2 is devoted to strong convergence results for deep learning based empirical risk minimisation with quadratic loss function where a general stochastic approximation algorithm with random initialisation is allowed to be the employed optimisation method. Apart from the main result (cf. Theorem 6.5), Subsection 6.2 also includes on the one hand Proposition 6.3, which combines the overall error decomposition (cf. Proposition 6.1) with our convergence result for the generalisation error (cf. Corollary 4.15 in Section 4) and our convergence result for the optimisation error (cf. Corollary 5.8 in Section 5), and on the other hand Corollary 6.6, which replaces the architecture parameter A ∈ (0,∞) in Theorem 6.5 (cf. Proposition 3.5) by the minimum of the depth parameter L ∈ ℕ and the hidden layer sizes l_1, l_2, ..., l_{L−1} ∈ ℕ of the trained DNN (cf. (178) below).

Finally, in Subsection 6.3 we present three more strong convergence results for the special case where SGD with random initialisation is the employed optimisation method. In particular, Corollary 6.7 specifies Corollary 6.6 to this special case, Corollary 6.8 provides a convergence estimate for the expectation of the L¹-distance between the trained DNN and the target function, and Corollary 6.9 reaches an analogous conclusion in a simplified setting.

6.1 Overall error decomposition

Proposition 6.1. Let d, 𝔡, L, M, K, N ∈ ℕ, B ∈ [0,∞), u ∈ ℝ, v ∈ (u,∞), l = (l_0, l_1, ..., l_L) ∈ ℕ^{L+1}, 𝐍 ⊆ {0, 1, ..., N}, D ⊆ ℝ^d, assume 0 ∈ 𝐍, l_0 = d, l_L = 1, and 𝔡 ≥ ∑_{i=1}^L l_i(l_{i−1}+1), let (Ω, F, ℙ) be a probability space, let X_j: Ω → D, j ∈ {1, 2, ..., M}, and Y_j: Ω → [u, v], j ∈ {1, 2, ..., M}, be random variables, let ℰ: D → [u, v] be a B(D)/B([u, v])-measurable function, assume that it holds ℙ-a.s. that ℰ(X_1) = 𝔼[Y_1|X_1], let Θ_{k,n}: Ω → ℝ^𝔡, k, n ∈ ℕ_0, satisfy (⋃_{k=1}^∞ Θ_{k,0}(Ω)) ⊆ [−B, B]^𝔡, let 𝓡: ℝ^𝔡 → [0,∞) satisfy for all θ ∈ ℝ^𝔡 that 𝓡(θ) = 𝔼[|N^{θ,l}_{u,v}(X_1) − Y_1|^2], and let ℜ: ℝ^𝔡 × Ω → [0,∞) and 𝐤: Ω → (ℕ_0)^2 satisfy for all θ ∈ ℝ^𝔡, ω ∈ Ω that

ℜ(θ, ω) = (1/M) [∑_{j=1}^M |N^{θ,l}_{u,v}(X_j(ω)) − Y_j(ω)|^2]   and   (141)

𝐤(ω) ∈ arg min_{(k,n) ∈ {1,2,...,K}×𝐍, ‖Θ_{k,n}(ω)‖_∞ ≤ B} ℜ(Θ_{k,n}(ω), ω)   (142)

(cf. Definitions 2.8 and 3.1). Then it holds for all ϑ ∈ [−B, B]^𝔡 that

∫_D |N^{Θ_𝐤,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx)
≤ [sup_{x ∈ D} |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2] + 2[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|] + min_{(k,n) ∈ {1,2,...,K}×𝐍, ‖Θ_{k,n}‖_∞ ≤ B} |ℜ(Θ_{k,n}) − ℜ(ϑ)|.   (143)

Proof of Proposition 6.1. Throughout this proof let 𝐑: L²(ℙ_{X_1}; ℝ) → [0,∞) satisfy for all f ∈ L²(ℙ_{X_1}; ℝ) that 𝐑(f) = 𝔼[|f(X_1) − Y_1|^2]. Observe that the assumption that ∀ ω ∈ Ω: Y_1(ω) ∈ [u, v] and the fact that ∀ θ ∈ ℝ^𝔡, x ∈ ℝ^d: N^{θ,l}_{u,v}(x) ∈ [u, v] ensure for all θ ∈ ℝ^𝔡 that 𝔼[|Y_1|^2] ≤ max{u^2, v^2} < ∞ and

∫_D |N^{θ,l}_{u,v}(x)|^2 ℙ_{X_1}(dx) = 𝔼[|N^{θ,l}_{u,v}(X_1)|^2] ≤ max{u^2, v^2} < ∞.   (144)

The bias–variance decomposition (cf., e.g., Beck, Jentzen, & Kuckuck [10, (iii) in Lemma 4.1] with (Ω, F, ℙ) ← (Ω, F, ℙ), (S, 𝒮) ← (D, B(D)), X ← X_1, Y ← (Ω ∋ ω ↦ Y_1(ω) ∈ ℝ), E ← 𝐑, f ← N^{θ,l}_{u,v}|_D, g ← N^{ϑ,l}_{u,v}|_D for θ, ϑ ∈ ℝ^𝔡 in the notation of [10, (iii) in Lemma 4.1]) hence proves for all θ, ϑ ∈ ℝ^𝔡 that

∫_D |N^{θ,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx) = 𝔼[|N^{θ,l}_{u,v}(X_1) − ℰ(X_1)|^2] = 𝔼[|N^{θ,l}_{u,v}(X_1) − 𝔼[Y_1|X_1]|^2]
= 𝔼[|N^{ϑ,l}_{u,v}(X_1) − 𝔼[Y_1|X_1]|^2] + 𝐑(N^{θ,l}_{u,v}|_D) − 𝐑(N^{ϑ,l}_{u,v}|_D)   (145)
= 𝔼[|N^{ϑ,l}_{u,v}(X_1) − ℰ(X_1)|^2] + 𝔼[|N^{θ,l}_{u,v}(X_1) − Y_1|^2] − 𝔼[|N^{ϑ,l}_{u,v}(X_1) − Y_1|^2]
= ∫_D |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx) + 𝓡(θ) − 𝓡(ϑ).

This implies for all θ, ϑ ∈ ℝ^𝔡 that

∫_D |N^{θ,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx)
= ∫_D |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx) − [ℜ(θ) − 𝓡(θ)] + [ℜ(ϑ) − 𝓡(ϑ)] + ℜ(θ) − ℜ(ϑ)
≤ ∫_D |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx) + |ℜ(θ) − 𝓡(θ)| + |ℜ(ϑ) − 𝓡(ϑ)| + ℜ(θ) − ℜ(ϑ)
≤ ∫_D |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx) + 2[max_{η ∈ {θ,ϑ}} |ℜ(η) − 𝓡(η)|] + ℜ(θ) − ℜ(ϑ).   (146)

Next note that the fact that ∀ ω ∈ Ω: ‖Θ_{𝐤(ω)}(ω)‖_∞ ≤ B ensures for all ω ∈ Ω that Θ_{𝐤(ω)}(ω) ∈ [−B, B]^𝔡. Combining (146) with (142) hence establishes for all ϑ ∈ [−B, B]^𝔡 that

∫_D |N^{Θ_𝐤,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx)
≤ ∫_D |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx) + 2[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|] + ℜ(Θ_𝐤) − ℜ(ϑ)
= ∫_D |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx) + 2[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|] + min_{(k,n) ∈ {1,2,...,K}×𝐍, ‖Θ_{k,n}‖_∞ ≤ B} [ℜ(Θ_{k,n}) − ℜ(ϑ)]
≤ [sup_{x ∈ D} |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2] + 2[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|] + min_{(k,n) ∈ {1,2,...,K}×𝐍, ‖Θ_{k,n}‖_∞ ≤ B} |ℜ(Θ_{k,n}) − ℜ(ϑ)|.   (147)

The proof of Proposition 6.1 is thus complete.
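Algorithmically, the selection rule (142) simply says: among all parameter snapshots Θ_{k,n} whose sup-norm is at most B, keep one minimising the empirical risk. A minimal sketch of this rule (hypothetical names, illustration only):

```python
def select(snapshots, empirical_risk, B):
    # Selection rule (142): among snapshots Theta_{k,n} with sup-norm <= B,
    # return an index (k, n) minimising the empirical risk.
    admissible = {
        kn: theta for kn, theta in snapshots.items()
        if max(abs(t) for t in theta) <= B
    }
    return min(admissible, key=lambda kn: empirical_risk(admissible[kn]))

# toy usage with a quadratic stand-in for the empirical risk
snaps = {(1, 0): [0.5, -0.2], (1, 3): [0.1, 0.1], (2, 0): [2.0, 0.0]}
risk = lambda th: sum(t * t for t in th)
assert select(snaps, risk, B=1.0) == (1, 3)  # (2, 0) violates the norm bound
```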

6.2 Full strong error analysis for the training of DNNs

Lemma 6.2. Let d, 𝔡, L ∈ ℕ, p ∈ [0,∞), u ∈ [−∞,∞), v ∈ (u,∞], l = (l_0, l_1, ..., l_L) ∈ ℕ^{L+1}, D ⊆ ℝ^d, assume l_0 = d, l_L = 1, and 𝔡 ≥ ∑_{i=1}^L l_i(l_{i−1}+1), let ℰ: D → ℝ be a B(D)/B(ℝ)-measurable function, let (Ω, F, ℙ) be a probability space, and let X: Ω → D, 𝐤: Ω → (ℕ_0)^2, and Θ_{k,n}: Ω → ℝ^𝔡, k, n ∈ ℕ_0, be random variables. Then

(i) it holds that the function ℝ^𝔡 × ℝ^d ∋ (θ, x) ↦ N^{θ,l}_{u,v}(x) ∈ ℝ is (B(ℝ^𝔡) ⊗ B(ℝ^d))/B(ℝ)-measurable,

(ii) it holds that the function Ω ∋ ω ↦ Θ_{𝐤(ω)}(ω) ∈ ℝ^𝔡 is F/B(ℝ^𝔡)-measurable, and

(iii) it holds that the function

Ω ∋ ω ↦ ∫_D |N^{Θ_{𝐤(ω)}(ω),l}_{u,v}(x) − ℰ(x)|^p ℙ_X(dx) ∈ [0,∞]   (148)

is F/B([0,∞])-measurable

(cf. Definition 2.8).

Proof of Lemma 6.2. First, observe that Beck, Jentzen, & Kuckuck [10, Corollary 2.37] (with a ← −‖x‖_∞, b ← ‖x‖_∞, u ← u, v ← v, d ← 𝔡, L ← L, l ← l for x ∈ ℝ^d in the notation of [10, Corollary 2.37]) demonstrates for all x ∈ ℝ^d, θ, ϑ ∈ ℝ^𝔡 that

|N^{θ,l}_{u,v}(x) − N^{ϑ,l}_{u,v}(x)| ≤ sup_{y ∈ [−‖x‖_∞, ‖x‖_∞]^{l_0}} |N^{θ,l}_{u,v}(y) − N^{ϑ,l}_{u,v}(y)|
≤ L max{1, ‖x‖_∞} (‖l‖_∞ + 1)^L (max{1, ‖θ‖_∞, ‖ϑ‖_∞})^{L−1} ‖θ − ϑ‖_∞   (149)

(cf. Definition 3.1). This implies for all x ∈ ℝ^d that the function

ℝ^𝔡 ∋ θ ↦ N^{θ,l}_{u,v}(x) ∈ ℝ   (150)

is continuous. In addition, the fact that ∀ θ ∈ ℝ^𝔡: N^{θ,l}_{u,v} ∈ C(ℝ^d, ℝ) ensures for all θ ∈ ℝ^𝔡 that the function ℝ^d ∋ x ↦ N^{θ,l}_{u,v}(x) ∈ ℝ is B(ℝ^d)/B(ℝ)-measurable. This, (150), the fact that (ℝ^𝔡, ‖·‖_∞|_{ℝ^𝔡}) is a separable normed ℝ-vector space, and, e.g., Aliprantis & Border [1, Lemma 4.51] (see also, e.g., Beck et al. [8, Lemma 2.4]) show (i).

Second, we prove (ii). For this let Ξ: Ω → ℝ^𝔡 satisfy for all ω ∈ Ω that Ξ(ω) = Θ_{𝐤(ω)}(ω). Observe that the assumption that Θ_{k,n}: Ω → ℝ^𝔡, k, n ∈ ℕ_0, and 𝐤: Ω → (ℕ_0)^2 are random variables establishes for all U ∈ B(ℝ^𝔡) that

Ξ^{−1}(U) = {ω ∈ Ω: Ξ(ω) ∈ U} = {ω ∈ Ω: Θ_{𝐤(ω)}(ω) ∈ U}
= {ω ∈ Ω: [∃ k, n ∈ ℕ_0: ([Θ_{k,n}(ω) ∈ U] ∧ [𝐤(ω) = (k, n)])]}
= ⋃_{k=0}^∞ ⋃_{n=0}^∞ ({ω ∈ Ω: Θ_{k,n}(ω) ∈ U} ∩ {ω ∈ Ω: 𝐤(ω) = (k, n)})
= ⋃_{k=0}^∞ ⋃_{n=0}^∞ ([(Θ_{k,n})^{−1}(U)] ∩ [𝐤^{−1}({(k, n)})]) ∈ F.   (151)

This implies (ii).

Third, note that (i)–(ii) yield that the function Ω × ℝ^d ∋ (ω, x) ↦ N^{Θ_{𝐤(ω)}(ω),l}_{u,v}(x) ∈ ℝ is (F ⊗ B(ℝ^d))/B(ℝ)-measurable. This and the assumption that ℰ: D → ℝ is B(D)/B(ℝ)-measurable demonstrate that the function Ω × D ∋ (ω, x) ↦ |N^{Θ_{𝐤(ω)}(ω),l}_{u,v}(x) − ℰ(x)|^p ∈ [0,∞) is (F ⊗ B(D))/B([0,∞))-measurable. Tonelli's theorem hence establishes (iii). The proof of Lemma 6.2 is thus complete.

Proposition 6.3. Let d, 𝔡, L, M, K, N ∈ ℕ, b, c ∈ [1,∞), B ∈ [c,∞), u ∈ ℝ, v ∈ (u,∞), l = (l_0, l_1, ..., l_L) ∈ ℕ^{L+1}, 𝐍 ⊆ {0, 1, ..., N}, D ⊆ [−b, b]^d, assume 0 ∈ 𝐍, l_0 = d, l_L = 1, and 𝔡 ≥ ∑_{i=1}^L l_i(l_{i−1}+1), let (Ω, F, ℙ) be a probability space, let X_j: Ω → D, j ∈ ℕ, and Y_j: Ω → [u, v], j ∈ ℕ, be functions, assume that (X_j, Y_j), j ∈ {1, 2, ..., M}, are i.i.d. random variables, let ℰ: D → [u, v] be a B(D)/B([u, v])-measurable function, assume that it holds ℙ-a.s. that ℰ(X_1) = 𝔼[Y_1|X_1], let Θ_{k,n}: Ω → ℝ^𝔡, k, n ∈ ℕ_0, and 𝐤: Ω → (ℕ_0)^2 be random variables, assume (⋃_{k=1}^∞ Θ_{k,0}(Ω)) ⊆ [−B, B]^𝔡, assume that Θ_{k,0}, k ∈ {1, 2, ..., K}, are i.i.d., assume that Θ_{1,0} is continuous uniformly distributed on [−c, c]^𝔡, and let ℜ: ℝ^𝔡 × Ω → [0,∞) satisfy for all θ ∈ ℝ^𝔡, ω ∈ Ω that

ℜ(θ, ω) = (1/M) [∑_{j=1}^M |N^{θ,l}_{u,v}(X_j(ω)) − Y_j(ω)|^2]   and   (152)

𝐤(ω) ∈ arg min_{(k,n) ∈ {1,2,...,K}×𝐍, ‖Θ_{k,n}(ω)‖_∞ ≤ B} ℜ(Θ_{k,n}(ω), ω)   (153)

(cf. Definitions 2.8 and 3.1). Then it holds for all p ∈ (0,∞) that

(𝔼[(∫_D |N^{Θ_𝐤,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx))^p])^{1/p}
≤ [inf_{θ ∈ [−c,c]^𝔡} sup_{x ∈ D} |N^{θ,l}_{u,v}(x) − ℰ(x)|^2] + 4(v−u) b L (‖l‖_∞ + 1)^L c^L max{1, p}/K^{[L^{−1}(‖l‖_∞+1)^{−2}]} + 18 max{1, (v−u)^2} L (‖l‖_∞ + 1)^2 max{p, ln(3MBb)}/√M
≤ [inf_{θ ∈ [−c,c]^𝔡} sup_{x ∈ D} |N^{θ,l}_{u,v}(x) − ℰ(x)|^2] + 20 max{1, (v−u)^2} b L (‖l‖_∞ + 1)^{L+1} B^L max{p, ln(3M)}/min{√M, K^{[L^{−1}(‖l‖_∞+1)^{−2}]}}   (154)

(cf. (iii) in Lemma 6.2).

Proof of Proposition 6.3. Throughout this proof let 𝓡: ℝ^𝔡 → [0,∞) satisfy for all θ ∈ ℝ^𝔡 that 𝓡(θ) = 𝔼[|N^{θ,l}_{u,v}(X_1) − Y_1|^2]. First of all, observe that the assumption that (⋃_{k=1}^∞ Θ_{k,0}(Ω)) ⊆ [−B, B]^𝔡, the assumption that 0 ∈ 𝐍, and Proposition 6.1 show for all ϑ ∈ [−B, B]^𝔡 that

∫_D |N^{Θ_𝐤,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx)
≤ [sup_{x ∈ D} |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2] + 2[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|] + min_{(k,n) ∈ {1,2,...,K}×𝐍, ‖Θ_{k,n}‖_∞ ≤ B} |ℜ(Θ_{k,n}) − ℜ(ϑ)|
≤ [sup_{x ∈ D} |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2] + 2[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|] + min_{k ∈ {1,2,...,K}, ‖Θ_{k,0}‖_∞ ≤ B} |ℜ(Θ_{k,0}) − ℜ(ϑ)|
= [sup_{x ∈ D} |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2] + 2[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|] + min_{k ∈ {1,2,...,K}} |ℜ(Θ_{k,0}) − ℜ(ϑ)|.   (155)

Minkowski's inequality hence establishes for all p ∈ [1,∞), ϑ ∈ [−c, c]^𝔡 ⊆ [−B, B]^𝔡 that

(𝔼[(∫_D |N^{Θ_𝐤,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx))^p])^{1/p}
≤ (𝔼[sup_{x ∈ D} |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^{2p}])^{1/p} + 2(𝔼[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|^p])^{1/p} + (𝔼[min_{k ∈ {1,2,...,K}} |ℜ(Θ_{k,0}) − ℜ(ϑ)|^p])^{1/p}
≤ [sup_{x ∈ D} |N^{ϑ,l}_{u,v}(x) − ℰ(x)|^2] + 2(𝔼[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|^p])^{1/p} + sup_{θ ∈ [−c,c]^𝔡} (𝔼[min_{k ∈ {1,2,...,K}} |ℜ(Θ_{k,0}) − ℜ(θ)|^p])^{1/p}   (156)

(cf. (i) in Corollary 4.15 and (i) in Corollary 5.8). Next note that Corollary 4.15 (with v ← max{u+1, v}, 𝓡 ← 𝓡|_{[−B,B]^𝔡}, ℜ ← ℜ|_{[−B,B]^𝔡×Ω} in the notation of Corollary 4.15) proves for all p ∈ (0,∞) that

(𝔼[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|^p])^{1/p} ≤ 9(max{u+1, v} − u)^2 L (‖l‖_∞ + 1)^2 max{p, ln(3MBb)}/√M
= 9 max{1, (v−u)^2} L (‖l‖_∞ + 1)^2 max{p, ln(3MBb)}/√M.   (157)

In addition, observe that Corollary 5.8 (with 𝕕 ← ∑_{i=1}^L l_i(l_{i−1}+1), B ← c, (Θ_k)_{k ∈ {1,2,...,K}} ← (Ω ∋ ω ↦ 1_{{Θ_{k,0} ∈ [−c,c]^𝔡}}(ω) Θ_{k,0}(ω) ∈ [−c, c]^𝔡)_{k ∈ {1,2,...,K}}, ℜ ← ℜ|_{[−c,c]^𝔡×Ω} in the notation of Corollary 5.8) implies for all p ∈ (0,∞) that

sup_{θ ∈ [−c,c]^𝔡} (𝔼[min_{k ∈ {1,2,...,K}} |ℜ(Θ_{k,0}) − ℜ(θ)|^p])^{1/p}
= sup_{θ ∈ [−c,c]^𝔡} (𝔼[min_{k ∈ {1,2,...,K}} |ℜ(1_{{Θ_{k,0} ∈ [−c,c]^𝔡}} Θ_{k,0}) − ℜ(θ)|^p])^{1/p}
≤ 4(v−u) b L (‖l‖_∞ + 1)^L c^L max{1, p}/K^{[L^{−1}(‖l‖_∞+1)^{−2}]}.   (158)

Combining this, (156), (157), and the fact that ln(3MBb) ≥ 1 with Jensen's inequality demonstrates for all p ∈ (0,∞) that

(𝔼[(∫_D |N^{Θ_𝐤,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx))^p])^{1/p}
≤ (𝔼[(∫_D |N^{Θ_𝐤,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx))^{max{1,p}}])^{1/max{1,p}}
≤ [inf_{θ ∈ [−c,c]^𝔡} sup_{x ∈ D} |N^{θ,l}_{u,v}(x) − ℰ(x)|^2] + sup_{θ ∈ [−c,c]^𝔡} (𝔼[min_{k ∈ {1,2,...,K}} |ℜ(Θ_{k,0}) − ℜ(θ)|^{max{1,p}}])^{1/max{1,p}} + 2(𝔼[sup_{θ ∈ [−B,B]^𝔡} |ℜ(θ) − 𝓡(θ)|^{max{1,p}}])^{1/max{1,p}}
≤ [inf_{θ ∈ [−c,c]^𝔡} sup_{x ∈ D} |N^{θ,l}_{u,v}(x) − ℰ(x)|^2] + 4(v−u) b L (‖l‖_∞ + 1)^L c^L max{1, p}/K^{[L^{−1}(‖l‖_∞+1)^{−2}]} + 18 max{1, (v−u)^2} L (‖l‖_∞ + 1)^2 max{p, ln(3MBb)}/√M.   (159)

Moreover, note that the fact that ∀ x ∈ [0,∞): x + 1 ≤ e^x ≤ 3^x and the facts that Bb ≥ 1 and M ≥ 1 ensure that

ln(3MBb) ≤ ln(3M 3^{Bb−1}) = ln(3^{Bb} M) = Bb ln([3^{Bb} M]^{1/(Bb)}) ≤ Bb ln(3M).   (160)

The facts that ‖l‖_∞ + 1 ≥ 2, B ≥ c ≥ 1, ln(3M) ≥ 1, b ≥ 1, and L ≥ 1 hence show for all p ∈ (0,∞) that

4(v−u) b L (‖l‖_∞ + 1)^L c^L max{1, p}/K^{[L^{−1}(‖l‖_∞+1)^{−2}]} + 18 max{1, (v−u)^2} L (‖l‖_∞ + 1)^2 max{p, ln(3MBb)}/√M
≤ 2(‖l‖_∞ + 1) max{1, (v−u)^2} b L (‖l‖_∞ + 1)^L B^L max{p, ln(3M)}/K^{[L^{−1}(‖l‖_∞+1)^{−2}]} + 18 max{1, (v−u)^2} b L (‖l‖_∞ + 1)^2 B max{p, ln(3M)}/√M
≤ 20 max{1, (v−u)^2} b L (‖l‖_∞ + 1)^{L+1} B^L max{p, ln(3M)}/min{√M, K^{[L^{−1}(‖l‖_∞+1)^{−2}]}}.   (161)


This and (159) complete the proof of Proposition 6.3.

Lemma 6.4. Let a, x, p ∈ (0,∞), M, c ∈ [1,∞), B ∈ [c,∞). Then

(i) it holds that a x^p ≤ exp(a^{1/p} p x/e) and

(ii) it holds that ln(3MBc) ≤ (23B/18) ln(eM).

Proof of Lemma 6.4. First, note that the fact that ∀ y ∈ ℝ: y + 1 ≤ e^y demonstrates that

a x^p = (a^{1/p} x)^p = [e(a^{1/p} x/e − 1 + 1)]^p ≤ [e exp(a^{1/p} x/e − 1)]^p = exp(a^{1/p} p x/e).   (162)

This proves (i).

Second, observe that (i) and the fact that 2√3/e ≤ 23/18 ensure that

3B^2 ≤ exp(√3 · 2B/e) = exp(2√3 B/e) ≤ exp(23B/18).   (163)

The facts that B ≥ c ≥ 1 and M ≥ 1 hence imply that

ln(3MBc) ≤ ln(3B^2 M) ≤ ln([eM]^{23B/18}) = (23B/18) ln(eM).   (164)

This establishes (ii). The proof of Lemma 6.4 is thus complete.
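Both elementary inequalities in Lemma 6.4 are easy to probe numerically (illustration only):

```python
import math

# (i): a * x^p <= exp(a^(1/p) * p * x / e)
for a in [0.5, 1.0, 3.0, 10.0]:
    for x in [0.1, 1.0, 2.7, 8.0]:
        for p in [0.5, 1.0, 2.0]:
            assert a * x ** p <= math.exp(a ** (1.0 / p) * p * x / math.e) * (1 + 1e-9)

# (ii): ln(3 M B c) <= (23 B / 18) * ln(e M) for M, c >= 1 and B >= c
for M in [1.0, 5.0, 1000.0]:
    for c in [1.0, 2.0]:
        for B in [c, 4.0, 50.0]:
            assert math.log(3 * M * B * c) <= 23 * B / 18 * math.log(math.e * M) + 1e-12
```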

Theorem 6.5. Let d, 𝔡, L, M, K, N ∈ ℕ, A ∈ (0,∞), 𝐋, a, u ∈ ℝ, b ∈ (a,∞), v ∈ (u,∞), c ∈ [max{1, 𝐋, |a|, |b|, 2|u|, 2|v|},∞), B ∈ [c,∞), l = (l_0, l_1, ..., l_L) ∈ ℕ^{L+1}, 𝐍 ⊆ {0, 1, ..., N}, assume 0 ∈ 𝐍, L ≥ A 1_{(6d,∞)}(A)/(2d) + 1, l_0 = d, l_1 ≥ A 1_{(6d,∞)}(A), l_L = 1, and 𝔡 ≥ ∑_{i=1}^L l_i(l_{i−1}+1), assume for all i ∈ {2, 3, ...} ∩ [0, L) that l_i ≥ 1_{(6d,∞)}(A) max{A/d − 2i + 3, 2}, let (Ω, F, ℙ) be a probability space, let X_j: Ω → [a, b]^d, j ∈ ℕ, and Y_j: Ω → [u, v], j ∈ ℕ, be functions, assume that (X_j, Y_j), j ∈ {1, 2, ..., M}, are i.i.d. random variables, let ℰ: [a, b]^d → [u, v] satisfy ℙ-a.s. that ℰ(X_1) = 𝔼[Y_1|X_1], assume for all x, y ∈ [a, b]^d that |ℰ(x) − ℰ(y)| ≤ 𝐋‖x − y‖_1, let Θ_{k,n}: Ω → ℝ^𝔡, k, n ∈ ℕ_0, and 𝐤: Ω → (ℕ_0)^2 be random variables, assume (⋃_{k=1}^∞ Θ_{k,0}(Ω)) ⊆ [−B, B]^𝔡, assume that Θ_{k,0}, k ∈ {1, 2, ..., K}, are i.i.d., assume that Θ_{1,0} is continuous uniformly distributed on [−c, c]^𝔡, and let ℜ: ℝ^𝔡 × Ω → [0,∞) satisfy for all θ ∈ ℝ^𝔡, ω ∈ Ω that

ℜ(θ, ω) = (1/M) [∑_{j=1}^M |N^{θ,l}_{u,v}(X_j(ω)) − Y_j(ω)|^2]   and   (165)

𝐤(ω) ∈ arg min_{(k,n) ∈ {1,2,...,K}×𝐍, ‖Θ_{k,n}(ω)‖_∞ ≤ B} ℜ(Θ_{k,n}(ω), ω)   (166)

(cf. Definitions 2.8 and 3.1). Then it holds for all p ∈ (0,∞) that

(𝔼[(∫_{[a,b]^d} |N^{Θ_𝐤,l}_{u,v}(x) − ℰ(x)|^2 ℙ_{X_1}(dx))^p])^{1/p}
≤ 9d^2 𝐋^2 (b−a)^2/A^{2/d} + 4(v−u) L (‖l‖_∞ + 1)^L c^{L+1} max{1, p}/K^{[L^{−1}(‖l‖_∞+1)^{−2}]} + 18 max{1, (v−u)^2} L (‖l‖_∞ + 1)^2 max{p, ln(3MBc)}/√M
≤ 36d^2 c^4/A^{2/d} + 4L(‖l‖_∞ + 1)^L c^{L+2} max{1, p}/K^{[L^{−1}(‖l‖_∞+1)^{−2}]} + 23B^3 L (‖l‖_∞ + 1)^2 max{p, ln(eM)}/√M   (167)

(cf. (iii) in Lemma 6.2).

37

Page 38: Overall error analysis for the training of deep neural ... · 1 Introduction Deep learning based algorithms have been applied extremely successfully to overcome fun-damental challenges

Proof of Theorem 6.5. First of all, note that the assumption that ∀x, y ∈ [a, b]d : |E(x)−E(y)| ≤ L‖x − y‖1 ensures that E : [a, b]d → [u, v] is a B([a, b]d)/B([u, v])-measurablefunction. The fact that max1, |a|, |b| ≤ c and Proposition 6.3 (with b← max1, |a|, |b|,D ← [a, b]d in the notation of Proposition 6.3) hence show for all p ∈ (0,∞) that(

E[(∫

[a,b]d|N Θk,l

u,v (x)− E(x)|2 PX1(dx))p ])1/p

≤[infθ∈[−c,c]d supx∈[a,b]d|N θ,l

u,v (x)− E(x)|2]

+4(v − u) max1, |a|, |b|L(‖l‖∞ + 1)LcL max1, p

K [L−1(‖l‖∞+1)−2]

+18 max1, (v − u)2L(‖l‖∞ + 1)2 maxp, ln(3MBmax1, |a|, |b|)√

M(168)

≤[infθ∈[−c,c]d supx∈[a,b]d|N θ,l

u,v (x)− E(x)|2]

+4(v − u)L(‖l‖∞ + 1)LcL+1 max1, p

K [L−1(‖l‖∞+1)−2]

+18 max1, (v − u)2L(‖l‖∞ + 1)2 maxp, ln(3MBc)√

M.

Furthermore, observe that Proposition 3.5 (with f ← E in the notation of Proposition 3.5)proves that there exists ϑ ∈ Rd such that ‖ϑ‖∞ ≤ max1, L, |a|, |b|, 2[supx∈[a,b]d |E(x)|]and

supx∈[a,b]d |N ϑ,lu,v (x)− E(x)| ≤ 3dL(b− a)

A1/d. (169)

The fact that ∀x ∈ [a, b]d : E(x) ∈ [u, v] hence implies that

‖ϑ‖∞ ≤ max1, L, |a|, |b|, 2|u|, 2|v| ≤ c. (170)

This and (169) demonstrate that

infθ∈[−c,c]d supx∈[a,b]d|N θ,lu,v (x)− E(x)|2

≤ supx∈[a,b]d |N ϑ,lu,v (x)− E(x)|2

≤[

3dL(b− a)

A1/d

]2

=9d2L2(b− a)2

A2/d.

(171)

Combining this with (168) establishes for all p ∈ (0,∞) that(E[(∫

[a,b]d|N Θk,l

u,v (x)− E(x)|2 PX1(dx))p ])1/p

≤ 9d2L2(b− a)2

A2/d+

4(v − u)L(‖l‖∞ + 1)LcL+1 max1, pK [L−1(‖l‖∞+1)−2]

+18 max1, (v − u)2L(‖l‖∞ + 1)2 maxp, ln(3MBc)√

M.

(172)

Moreover, note that the facts that $\max\{1,L,|a|,|b|\}\le c$ and $(b-a)^2\le(|a|+|b|)^2\le 2(a^2+b^2)$ yield that
\[
9L^2(b-a)^2\le 18c^2(a^2+b^2)\le 18c^2(c^2+c^2)=36c^4. \tag{173}
\]

In addition, the fact that $B\ge c\ge 1$, the fact that $M\ge 1$, and (ii) in Lemma 6.4 ensure that $\ln(3MBc)\le\frac{23B}{18}\ln(eM)$. This, (173), the fact that $(v-u)\le 2\max\{|u|,|v|\}=\max\{2|u|,2|v|\}\le c\le B$, and the fact that $B\ge 1$ prove for all $p\in(0,\infty)$ that

\[
\begin{split}
&\frac{9d^2L^2(b-a)^2}{A^{2/d}}
+\frac{4(v-u)L(\|l\|_\infty+1)^{L}c^{L+1}\max\{1,p\}}{K^{[L^{-1}(\|l\|_\infty+1)^{-2}]}}
+\frac{18\max\{1,(v-u)^2\}L(\|l\|_\infty+1)^2\max\{p,\ln(3MBc)\}}{\sqrt{M}}\\
&\le\frac{36d^2c^4}{A^{2/d}}
+\frac{4L(\|l\|_\infty+1)^{L}c^{L+2}\max\{1,p\}}{K^{[L^{-1}(\|l\|_\infty+1)^{-2}]}}
+\frac{23B^3L(\|l\|_\infty+1)^2\max\{p,\ln(eM)\}}{\sqrt{M}}.
\end{split}
\tag{174}
\]

Combining this with (172) shows (167). The proof of Theorem 6.5 is thus complete.
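The two elementary estimates used in the last step of this proof are easy to check numerically. The following sketch (with illustrative sample values for $L$, $a$, $b$, $c$, $B$, and $M$ that satisfy the standing assumptions; these values are our own assumptions, not taken from the theorem) verifies (173) and the logarithm bound from (ii) in Lemma 6.4 for a few sample sizes.

```python
import math

# Sanity check of the two elementary estimates used in the proof above,
# for illustrative (hypothetical) values of L, a, b, c, B and M.
L, a, b = 2.0, -1.0, 1.0                 # Lipschitz constant and interval
c = max(1.0, L, abs(a), abs(b))          # so that max{1, L, |a|, |b|} <= c
B = c                                    # B >= c >= 1
for M in (1, 10, 1000):
    # (173): 9 L^2 (b - a)^2 <= 36 c^4
    assert 9 * L**2 * (b - a) ** 2 <= 36 * c**4
    # (ii) in Lemma 6.4: ln(3MBc) <= (23 B / 18) ln(eM)
    assert math.log(3 * M * B * c) <= (23 * B / 18) * math.log(math.e * M)
print("estimates (173) and Lemma 6.4 (ii) hold for the sample values")
```

Note that for $M=1$ the logarithm bound is nearly sharp: $\ln(12)\approx 2.485$ versus $\tfrac{23\cdot 2}{18}\approx 2.556$.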

Corollary 6.6. Let $d,\mathfrak{d},L,M,K,N\in\mathbb{N}$, $L,a,u\in\mathbb{R}$, $b\in(a,\infty)$, $v\in(u,\infty)$, $c\in[\max\{1,L,|a|,|b|,2|u|,2|v|\},\infty)$, $B\in[c,\infty)$, $l=(l_0,l_1,\ldots,l_L)\in\mathbb{N}^{L+1}$, $\mathbf{N}\subseteq\{0,1,\ldots,N\}$, assume $0\in\mathbf{N}$, $l_0=d$, $l_L=1$, and $\mathfrak{d}\ge\sum_{i=1}^{L}l_i(l_{i-1}+1)$, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $X_j\colon\Omega\to[a,b]^d$, $j\in\mathbb{N}$, and $Y_j\colon\Omega\to[u,v]$, $j\in\mathbb{N}$, be functions, assume that $(X_j,Y_j)$, $j\in\{1,2,\ldots,M\}$, are i.i.d. random variables, let $\mathcal{E}\colon[a,b]^d\to[u,v]$ satisfy $\mathbb{P}$-a.s. that $\mathcal{E}(X_1)=\mathbb{E}[Y_1|X_1]$, assume for all $x,y\in[a,b]^d$ that $|\mathcal{E}(x)-\mathcal{E}(y)|\le L\|x-y\|_1$, let $\Theta_{k,n}\colon\Omega\to\mathbb{R}^{\mathfrak{d}}$, $k,n\in\mathbb{N}_0$, and $\mathbf{k}\colon\Omega\to(\mathbb{N}_0)^2$ be random variables, assume $(\bigcup_{k=1}^{\infty}\Theta_{k,0}(\Omega))\subseteq[-B,B]^{\mathfrak{d}}$, assume that $\Theta_{k,0}$, $k\in\{1,2,\ldots,K\}$, are i.i.d., assume that $\Theta_{1,0}$ is continuous uniformly distributed on $[-c,c]^{\mathfrak{d}}$, and let $\mathcal{R}\colon\mathbb{R}^{\mathfrak{d}}\times\Omega\to[0,\infty)$ satisfy for all $\theta\in\mathbb{R}^{\mathfrak{d}}$, $\omega\in\Omega$ that
\[
\mathcal{R}(\theta,\omega)=\frac{1}{M}\Biggl[\sum_{j=1}^{M}|\mathcal{N}^{\theta,l}_{u,v}(X_j(\omega))-Y_j(\omega)|^2\Biggr] \tag{175}
\]
and
\[
\mathbf{k}(\omega)\in\operatorname*{arg\,min}_{(k,n)\in\{1,2,\ldots,K\}\times\mathbf{N},\;\|\Theta_{k,n}(\omega)\|_\infty\le B}\mathcal{R}(\Theta_{k,n}(\omega),\omega) \tag{176}
\]
(cf. Definitions 2.8 and 3.1). Then it holds for all $p\in(0,\infty)$ that
\[
\begin{split}
&\biggl(\mathbb{E}\biggl[\biggl(\int_{[a,b]^d}|\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x)-\mathcal{E}(x)|^2\,\mathbb{P}_{X_1}(dx)\biggr)^{\!p/2}\,\biggr]\biggr)^{\!1/p}\\
&\le\frac{3dL(b-a)}{[\min(\{L\}\cup\{l_i\colon i\in\mathbb{N}\cap[0,L)\})]^{1/d}}
+\frac{2[(v-u)L(\|l\|_\infty+1)^{L}c^{L+1}\max\{1,p/2\}]^{1/2}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}\\
&\quad+\frac{3\max\{1,v-u\}(\|l\|_\infty+1)[L\max\{p,2\ln(3MBc)\}]^{1/2}}{M^{1/4}}\\
&\le\frac{6dc^2}{[\min(\{L\}\cup\{l_i\colon i\in\mathbb{N}\cap[0,L)\})]^{1/d}}
+\frac{2L(\|l\|_\infty+1)^{L}c^{L+1}\max\{1,p\}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}
+\frac{5B^2L(\|l\|_\infty+1)\max\{p,\ln(eM)\}}{M^{1/4}}
\end{split}
\tag{177}
\]
(cf. (iii) in Lemma 6.2).

Proof of Corollary 6.6. Throughout this proof let $A\in(0,\infty)$ be given by
\[
A=\min(\{L\}\cup\{l_i\colon i\in\mathbb{N}\cap[0,L)\}). \tag{178}
\]
Note that (178) ensures that
\[
L\ge A=A-1+1\ge(A-1)\mathbb{1}_{[2,\infty)}(A)+1
\ge\Bigl(A-\tfrac{A}{2}\Bigr)\mathbb{1}_{[2,\infty)}(A)+1
=\frac{A\,\mathbb{1}_{[2,\infty)}(A)}{2}+1\ge\frac{A\,\mathbb{1}_{(6d,\infty)}(A)}{2d}+1. \tag{179}
\]
Moreover, the assumption that $l_L=1$ and (178) imply that
\[
l_1=l_1\mathbb{1}_{\{1\}}(L)+l_1\mathbb{1}_{[2,\infty)}(L)\ge\mathbb{1}_{\{1\}}(L)+A\,\mathbb{1}_{[2,\infty)}(L)=A\ge A\,\mathbb{1}_{(6d,\infty)}(A). \tag{180}
\]


Moreover, again (178) shows for all $i\in\{2,3,\ldots\}\cap[0,L)$ that
\[
\begin{split}
l_i\ge A\ge A\,\mathbb{1}_{[2,\infty)}(A)&\ge\mathbb{1}_{[2,\infty)}(A)\max\{A-1,2\}=\mathbb{1}_{[2,\infty)}(A)\max\{A-4+3,2\}\\
&\ge\mathbb{1}_{[2,\infty)}(A)\max\{A-2i+3,2\}\ge\mathbb{1}_{(6d,\infty)}(A)\max\{A/d-2i+3,2\}.
\end{split}
\tag{181}
\]

Combining (179)–(181) and Theorem 6.5 (with $p\leftarrow p/2$ for $p\in(0,\infty)$ in the notation of Theorem 6.5) establishes for all $p\in(0,\infty)$ that
\[
\begin{split}
&\biggl(\mathbb{E}\biggl[\biggl(\int_{[a,b]^d}|\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x)-\mathcal{E}(x)|^2\,\mathbb{P}_{X_1}(dx)\biggr)^{\!p/2}\,\biggr]\biggr)^{\!2/p}\\
&\le\frac{9d^2L^2(b-a)^2}{A^{2/d}}
+\frac{4(v-u)L(\|l\|_\infty+1)^{L}c^{L+1}\max\{1,p/2\}}{K^{[L^{-1}(\|l\|_\infty+1)^{-2}]}}
+\frac{18\max\{1,(v-u)^2\}L(\|l\|_\infty+1)^2\max\{p/2,\ln(3MBc)\}}{\sqrt{M}}\\
&\le\frac{36d^2c^4}{A^{2/d}}
+\frac{4L(\|l\|_\infty+1)^{L}c^{L+2}\max\{1,p/2\}}{K^{[L^{-1}(\|l\|_\infty+1)^{-2}]}}
+\frac{23B^3L(\|l\|_\infty+1)^2\max\{p/2,\ln(eM)\}}{\sqrt{M}}.
\end{split}
\tag{182}
\]

This, (178), and the facts that $L\ge 1$, $c\ge 1$, $B\ge 1$, and $\ln(eM)\ge 1$ demonstrate for all $p\in(0,\infty)$ that
\[
\begin{split}
&\biggl(\mathbb{E}\biggl[\biggl(\int_{[a,b]^d}|\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x)-\mathcal{E}(x)|^2\,\mathbb{P}_{X_1}(dx)\biggr)^{\!p/2}\,\biggr]\biggr)^{\!1/p}\\
&\le\frac{3dL(b-a)}{[\min(\{L\}\cup\{l_i\colon i\in\mathbb{N}\cap[0,L)\})]^{1/d}}
+\frac{2[(v-u)L(\|l\|_\infty+1)^{L}c^{L+1}\max\{1,p/2\}]^{1/2}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}\\
&\quad+\frac{3\max\{1,v-u\}(\|l\|_\infty+1)[L\max\{p,2\ln(3MBc)\}]^{1/2}}{M^{1/4}}\\
&\le\frac{6dc^2}{[\min(\{L\}\cup\{l_i\colon i\in\mathbb{N}\cap[0,L)\})]^{1/d}}
+\frac{2[L(\|l\|_\infty+1)^{L}c^{L+2}\max\{1,p/2\}]^{1/2}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}
+\frac{5B^{3/2}[L(\|l\|_\infty+1)^2\max\{p/2,\ln(eM)\}]^{1/2}}{M^{1/4}}\\
&\le\frac{6dc^2}{[\min(\{L\}\cup\{l_i\colon i\in\mathbb{N}\cap[0,L)\})]^{1/d}}
+\frac{2L(\|l\|_\infty+1)^{L}c^{L+1}\max\{1,p\}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}
+\frac{5B^2L(\|l\|_\infty+1)\max\{p,\ln(eM)\}}{M^{1/4}}.
\end{split}
\tag{183}
\]

The proof of Corollary 6.6 is thus complete.
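In algorithmic terms, the minimisation problem (175)–(176) with the trivial snapshot set $\mathbf{N}=\{0\}$ amounts to drawing $K$ random initialisations and keeping the one with the smallest empirical risk. The following sketch illustrates this for a clipped one-hidden-layer ReLU network; all concrete values ($d$, the width, $M$, $K$, $c$, $B$) and the target $\mathcal{E}$ are illustrative assumptions, not data from the corollary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of (175)-(176) with snapshot set N = {0}: architecture
# l = (d, width, 1), clipped ReLU realisation, K random initialisations
# uniform on [-c, c]^dd, and selection by minimal empirical risk.
d, width, M, K = 2, 8, 200, 16
u, v, c, B = 0.0, 1.0, 2.0, 2.0
dd = width * (d + 1) + (width + 1)            # dd = sum_i l_i (l_{i-1} + 1)

def network(theta, x):
    """Clipped realisation N^{theta,l}_{u,v}(x) for l = (d, width, 1)."""
    W1 = theta[: width * d].reshape(width, d)
    b1 = theta[width * d : width * (d + 1)]
    W2 = theta[width * (d + 1) : width * (d + 2)]
    b2 = theta[-1]
    h = np.maximum(W1 @ x + b1, 0.0)          # ReLU hidden layer
    return float(np.clip(W2 @ h + b2, u, v))  # clamp the output to [u, v]

E = lambda x: 0.5 * (x[0] + x[1])             # Lipschitz target with values in [0, 1]
X = rng.uniform(0.0, 1.0, size=(M, d))        # i.i.d. samples X_1, ..., X_M
Y = np.array([E(x) for x in X])               # noiseless labels Y_j = E(X_j)

def risk(theta):                              # empirical risk as in (175)
    return float(np.mean([(network(theta, x) - y) ** 2 for x, y in zip(X, Y)]))

thetas = rng.uniform(-c, c, size=(K, dd))     # Theta_{k,0} ~ U([-c, c]^dd), i.i.d.
feasible = [th for th in thetas if np.max(np.abs(th)) <= B]
theta_star = min(feasible, key=risk)          # selection rule as in (176)
print("selected empirical risk:", risk(theta_star))
```

Since $c\le B$, every initialisation is feasible here; the sup-norm restriction in (176) only becomes active for parameters produced by later optimisation steps.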

6.3 Full strong error analysis for the training of DNNs with optimisation via stochastic gradient descent with random initialisation

Corollary 6.7. Let $d,\mathfrak{d},L,M,K,N\in\mathbb{N}$, $L,a,u\in\mathbb{R}$, $b\in(a,\infty)$, $v\in(u,\infty)$, $c\in[\max\{1,L,|a|,|b|,2|u|,2|v|\},\infty)$, $B\in[c,\infty)$, $l=(l_0,l_1,\ldots,l_L)\in\mathbb{N}^{L+1}$, $\mathbf{N}\subseteq\{0,1,\ldots,N\}$, $(J_n)_{n\in\mathbb{N}}\subseteq\mathbb{N}$, $(\gamma_n)_{n\in\mathbb{N}}\subseteq\mathbb{R}$, assume $0\in\mathbf{N}$, $l_0=d$, $l_L=1$, and $\mathfrak{d}\ge\sum_{i=1}^{L}l_i(l_{i-1}+1)$, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $X^{k,n}_j\colon\Omega\to[a,b]^d$, $k,n,j\in\mathbb{N}_0$, and $Y^{k,n}_j\colon\Omega\to[u,v]$, $k,n,j\in\mathbb{N}_0$, be functions, assume that $(X^{0,0}_j,Y^{0,0}_j)$, $j\in\{1,2,\ldots,M\}$, are i.i.d. random variables, let $\mathcal{E}\colon[a,b]^d\to[u,v]$ satisfy $\mathbb{P}$-a.s. that $\mathcal{E}(X^{0,0}_1)=\mathbb{E}[Y^{0,0}_1|X^{0,0}_1]$, assume for all $x,y\in[a,b]^d$ that $|\mathcal{E}(x)-\mathcal{E}(y)|\le L\|x-y\|_1$, let $\Theta_{k,n}\colon\Omega\to\mathbb{R}^{\mathfrak{d}}$, $k,n\in\mathbb{N}_0$, and $\mathbf{k}\colon\Omega\to(\mathbb{N}_0)^2$ be random variables, assume $(\bigcup_{k=1}^{\infty}\Theta_{k,0}(\Omega))\subseteq[-B,B]^{\mathfrak{d}}$, assume that $\Theta_{k,0}$, $k\in\{1,2,\ldots,K\}$, are i.i.d., assume that $\Theta_{1,0}$ is continuous uniformly distributed on $[-c,c]^{\mathfrak{d}}$, let $\mathcal{R}^{k,n}_J\colon\mathbb{R}^{\mathfrak{d}}\times\Omega\to[0,\infty)$, $k,n,J\in\mathbb{N}_0$, and $\mathcal{G}^{k,n}\colon\mathbb{R}^{\mathfrak{d}}\times\Omega\to\mathbb{R}^{\mathfrak{d}}$, $k,n\in\mathbb{N}$, satisfy for all $k,n\in\mathbb{N}$, $\omega\in\Omega$, $\theta\in\{\vartheta\in\mathbb{R}^{\mathfrak{d}}\colon(\mathcal{R}^{k,n}_{J_n}(\cdot,\omega)\colon\mathbb{R}^{\mathfrak{d}}\to[0,\infty)$ is differentiable at $\vartheta)\}$ that $\mathcal{G}^{k,n}(\theta,\omega)=(\nabla_\theta\mathcal{R}^{k,n}_{J_n})(\theta,\omega)$, assume for all $k,n\in\mathbb{N}$ that $\Theta_{k,n}=\Theta_{k,n-1}-\gamma_n\mathcal{G}^{k,n}(\Theta_{k,n-1})$, and assume for all $k,n\in\mathbb{N}_0$, $J\in\mathbb{N}$, $\theta\in\mathbb{R}^{\mathfrak{d}}$, $\omega\in\Omega$ that
\[
\mathcal{R}^{k,n}_J(\theta,\omega)=\frac{1}{J}\Biggl[\sum_{j=1}^{J}|\mathcal{N}^{\theta,l}_{u,v}(X^{k,n}_j(\omega))-Y^{k,n}_j(\omega)|^2\Biggr] \tag{184}
\]
and
\[
\mathbf{k}(\omega)\in\operatorname*{arg\,min}_{(l,m)\in\{1,2,\ldots,K\}\times\mathbf{N},\;\|\Theta_{l,m}(\omega)\|_\infty\le B}\mathcal{R}^{0,0}_M(\Theta_{l,m}(\omega),\omega) \tag{185}
\]
(cf. Definitions 2.8 and 3.1). Then it holds for all $p\in(0,\infty)$ that
\[
\begin{split}
&\biggl(\mathbb{E}\biggl[\biggl(\int_{[a,b]^d}|\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x)-\mathcal{E}(x)|^2\,\mathbb{P}_{X^{0,0}_1}(dx)\biggr)^{\!p/2}\,\biggr]\biggr)^{\!1/p}\\
&\le\frac{3dL(b-a)}{[\min(\{L\}\cup\{l_i\colon i\in\mathbb{N}\cap[0,L)\})]^{1/d}}
+\frac{2[(v-u)L(\|l\|_\infty+1)^{L}c^{L+1}\max\{1,p/2\}]^{1/2}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}\\
&\quad+\frac{3\max\{1,v-u\}(\|l\|_\infty+1)[L\max\{p,2\ln(3MBc)\}]^{1/2}}{M^{1/4}}\\
&\le\frac{6dc^2}{[\min(\{L\}\cup\{l_i\colon i\in\mathbb{N}\cap[0,L)\})]^{1/d}}
+\frac{2L(\|l\|_\infty+1)^{L}c^{L+1}\max\{1,p\}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}
+\frac{5B^2L(\|l\|_\infty+1)\max\{p,\ln(eM)\}}{M^{1/4}}
\end{split}
\tag{186}
\]
(cf. (iii) in Lemma 6.2).

Proof of Corollary 6.7. Observe that Corollary 6.6 (with $(X_j)_{j\in\mathbb{N}}\leftarrow(X^{0,0}_j)_{j\in\mathbb{N}}$, $(Y_j)_{j\in\mathbb{N}}\leftarrow(Y^{0,0}_j)_{j\in\mathbb{N}}$, $\mathcal{R}\leftarrow\mathcal{R}^{0,0}_M$ in the notation of Corollary 6.6) shows (186). The proof of Corollary 6.7 is thus complete.
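As a concrete illustration of the training scheme in Corollary 6.7, the following sketch runs $K$ independent SGD trainings from uniform random initialisations, uses the snapshot set $\mathbf{N}=\{0,N\}$ (initialisation and final iterate), and applies the selection rule (185). All sizes, the constant learning rate, and the target function are hypothetical choices of ours; for simplicity the gradient steps use the unclipped network output, whereas the selection risk uses the clipped realisation as in (184).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes for the SGD scheme of Corollary 6.7: K independent runs,
# N gradient steps per run, minibatches of size J, constant learning rate.
d, width = 1, 4
K, N, J, gamma = 8, 200, 16, 0.05
u, v, c, B, M = 0.0, 1.0, 2.0, 2.0, 256
E = lambda x: np.clip(np.sin(3.0 * x), u, v)     # Lipschitz target (illustrative)

def forward(params, x):
    """Unclipped realisation for l = (1, width, 1)."""
    W1, b1, W2, b2 = params
    return float(W2 @ np.maximum(W1 * x + b1, 0.0) + b2)

def sgd_step(params, xs, ys, lr):
    """One step Theta_{k,n} = Theta_{k,n-1} - gamma_n G^{k,n}(Theta_{k,n-1})
    for the minibatch risk (1/J) sum_j |N(x_j) - y_j|^2 (manual backprop)."""
    W1, b1, W2, b2 = params
    gW1 = np.zeros_like(W1); gb1 = np.zeros_like(b1)
    gW2 = np.zeros_like(W2); gb2 = 0.0
    for x, y in zip(xs, ys):
        pre = W1 * x + b1                        # hidden pre-activations
        h = np.maximum(pre, 0.0)
        r = 2.0 * (float(W2 @ h + b2) - y) / len(xs)
        gW2 += r * h; gb2 += r
        dh = r * W2 * (pre > 0)                  # backprop through the ReLU
        gW1 += dh * x; gb1 += dh
    return (W1 - lr * gW1, b1 - lr * gb1, W2 - lr * gW2, b2 - lr * gb2)

X_sel = rng.uniform(0.0, 1.0, M)                 # data for the selection risk R^{0,0}_M
Y_sel = E(X_sel)
def sel_risk(params):
    return float(np.mean([(np.clip(forward(params, x), u, v) - y) ** 2
                          for x, y in zip(X_sel, Y_sel)]))

snapshots = []                                   # candidates Theta_{k,n}, n in {0, N}
for k in range(K):
    theta0 = rng.uniform(-c, c, 3 * width + 1)   # Theta_{k,0} ~ U([-c, c]^dd)
    params = (theta0[:width], theta0[width:2 * width],
              theta0[2 * width:3 * width], float(theta0[-1]))
    snapshots.append(params)
    for n in range(N):
        xb = rng.uniform(0.0, 1.0, J)            # fresh minibatch X^{k,n}_j
        params = sgd_step(params, xb, E(xb), gamma)
    snapshots.append(params)

def sup_norm(p):
    return max(float(np.max(np.abs(np.concatenate(p[:3])))), abs(p[3]))

feasible = [p for p in snapshots if sup_norm(p) <= B]   # selection rule (185)
best = min(feasible, key=sel_risk)
print("selected empirical risk:", sel_risk(best))
```

Because $0\in\mathbf{N}$ and the initialisations lie in $[-c,c]^{\mathfrak{d}}\subseteq[-B,B]^{\mathfrak{d}}$, the feasible set is never empty even if some SGD runs leave the cube $[-B,B]^{\mathfrak{d}}$.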

Corollary 6.8. Let $d,\mathfrak{d},L,M,K,N\in\mathbb{N}$, $L,a,u\in\mathbb{R}$, $b\in(a,\infty)$, $v\in(u,\infty)$, $c\in[\max\{1,L,|a|,|b|,2|u|,2|v|\},\infty)$, $B\in[c,\infty)$, $l=(l_0,l_1,\ldots,l_L)\in\mathbb{N}^{L+1}$, $\mathbf{N}\subseteq\{0,1,\ldots,N\}$, $(J_n)_{n\in\mathbb{N}}\subseteq\mathbb{N}$, $(\gamma_n)_{n\in\mathbb{N}}\subseteq\mathbb{R}$, assume $0\in\mathbf{N}$, $l_0=d$, $l_L=1$, and $\mathfrak{d}\ge\sum_{i=1}^{L}l_i(l_{i-1}+1)$, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $X^{k,n}_j\colon\Omega\to[a,b]^d$, $k,n,j\in\mathbb{N}_0$, and $Y^{k,n}_j\colon\Omega\to[u,v]$, $k,n,j\in\mathbb{N}_0$, be functions, assume that $(X^{0,0}_j,Y^{0,0}_j)$, $j\in\{1,2,\ldots,M\}$, are i.i.d. random variables, let $\mathcal{E}\colon[a,b]^d\to[u,v]$ satisfy $\mathbb{P}$-a.s. that $\mathcal{E}(X^{0,0}_1)=\mathbb{E}[Y^{0,0}_1|X^{0,0}_1]$, assume for all $x,y\in[a,b]^d$ that $|\mathcal{E}(x)-\mathcal{E}(y)|\le L\|x-y\|_1$, let $\Theta_{k,n}\colon\Omega\to\mathbb{R}^{\mathfrak{d}}$, $k,n\in\mathbb{N}_0$, and $\mathbf{k}\colon\Omega\to(\mathbb{N}_0)^2$ be random variables, assume $(\bigcup_{k=1}^{\infty}\Theta_{k,0}(\Omega))\subseteq[-B,B]^{\mathfrak{d}}$, assume that $\Theta_{k,0}$, $k\in\{1,2,\ldots,K\}$, are i.i.d., assume that $\Theta_{1,0}$ is continuous uniformly distributed on $[-c,c]^{\mathfrak{d}}$, let $\mathcal{R}^{k,n}_J\colon\mathbb{R}^{\mathfrak{d}}\times\Omega\to[0,\infty)$, $k,n,J\in\mathbb{N}_0$, and $\mathcal{G}^{k,n}\colon\mathbb{R}^{\mathfrak{d}}\times\Omega\to\mathbb{R}^{\mathfrak{d}}$, $k,n\in\mathbb{N}$, satisfy for all $k,n\in\mathbb{N}$, $\omega\in\Omega$, $\theta\in\{\vartheta\in\mathbb{R}^{\mathfrak{d}}\colon(\mathcal{R}^{k,n}_{J_n}(\cdot,\omega)\colon\mathbb{R}^{\mathfrak{d}}\to[0,\infty)$ is differentiable at $\vartheta)\}$ that $\mathcal{G}^{k,n}(\theta,\omega)=(\nabla_\theta\mathcal{R}^{k,n}_{J_n})(\theta,\omega)$, assume for all $k,n\in\mathbb{N}$ that $\Theta_{k,n}=\Theta_{k,n-1}-\gamma_n\mathcal{G}^{k,n}(\Theta_{k,n-1})$, and assume for all $k,n\in\mathbb{N}_0$, $J\in\mathbb{N}$, $\theta\in\mathbb{R}^{\mathfrak{d}}$, $\omega\in\Omega$ that
\[
\mathcal{R}^{k,n}_J(\theta,\omega)=\frac{1}{J}\Biggl[\sum_{j=1}^{J}|\mathcal{N}^{\theta,l}_{u,v}(X^{k,n}_j(\omega))-Y^{k,n}_j(\omega)|^2\Biggr] \tag{187}
\]
and
\[
\mathbf{k}(\omega)\in\operatorname*{arg\,min}_{(l,m)\in\{1,2,\ldots,K\}\times\mathbf{N},\;\|\Theta_{l,m}(\omega)\|_\infty\le B}\mathcal{R}^{0,0}_M(\Theta_{l,m}(\omega),\omega) \tag{188}
\]
(cf. Definitions 2.8 and 3.1). Then
\[
\begin{split}
&\mathbb{E}\biggl[\int_{[a,b]^d}|\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x)-\mathcal{E}(x)|\,\mathbb{P}_{X^{0,0}_1}(dx)\biggr]\\
&\le\frac{2[(v-u)L(\|l\|_\infty+1)^{L}c^{L+1}]^{1/2}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}
+\frac{3dL(b-a)}{[\min\{L,l_1,l_2,\ldots,l_{L-1}\}]^{1/d}}
+\frac{3\max\{1,v-u\}(\|l\|_\infty+1)[2L\ln(3MBc)]^{1/2}}{M^{1/4}}\\
&\le\frac{6dc^2}{[\min\{L,l_1,l_2,\ldots,l_{L-1}\}]^{1/d}}
+\frac{5B^2L(\|l\|_\infty+1)\ln(eM)}{M^{1/4}}
+\frac{2L(\|l\|_\infty+1)^{L}c^{L+1}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}
\end{split}
\tag{189}
\]
(cf. (iii) in Lemma 6.2).

Proof of Corollary 6.8. Note that Jensen's inequality implies that
\[
\mathbb{E}\biggl[\int_{[a,b]^d}|\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x)-\mathcal{E}(x)|\,\mathbb{P}_{X^{0,0}_1}(dx)\biggr]
\le\mathbb{E}\biggl[\biggl(\int_{[a,b]^d}|\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x)-\mathcal{E}(x)|^2\,\mathbb{P}_{X^{0,0}_1}(dx)\biggr)^{\!1/2}\,\biggr]. \tag{190}
\]
This and Corollary 6.7 (with $p\leftarrow 1$ in the notation of Corollary 6.7) complete the proof of Corollary 6.8.
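The Jensen step (190) is the familiar fact that on a probability space the $L^1$ norm is dominated by the $L^2$ norm. A quick Monte Carlo check illustrates it; the error profile $f$ below is an arbitrary hypothetical stand-in for $x\mapsto|\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x)-\mathcal{E}(x)|$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Empirical check of the Jensen step (190): for a probability measure mu,
# int |f| dmu <= (int |f|^2 dmu)^(1/2).  The profile f is a hypothetical
# stand-in for the pointwise approximation error of the trained network.
f = np.abs(rng.normal(size=100_000))
l1 = float(np.mean(f))                 # empirical L^1 norm
l2 = float(np.sqrt(np.mean(f ** 2)))   # empirical L^2 norm
assert l1 <= l2
print(f"L1 = {l1:.4f} <= L2 = {l2:.4f}")
```

The inequality holds for the empirical means themselves (it is an instance of the Cauchy-Schwarz inequality), not only in the limit of infinitely many samples.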

Corollary 6.9. Let $d,\mathfrak{d},L,M,K,N\in\mathbb{N}$, $L\in\mathbb{R}$, $c\in[\max\{2,L\},\infty)$, $B\in[c,\infty)$, $l=(l_0,l_1,\ldots,l_L)\in\mathbb{N}^{L+1}$, $\mathbf{N}\subseteq\{0,1,\ldots,N\}$, $(J_n)_{n\in\mathbb{N}}\subseteq\mathbb{N}$, $(\gamma_n)_{n\in\mathbb{N}}\subseteq\mathbb{R}$, assume $0\in\mathbf{N}$, $l_0=d$, $l_L=1$, and $\mathfrak{d}\ge\sum_{i=1}^{L}l_i(l_{i-1}+1)$, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $X^{k,n}_j\colon\Omega\to[0,1]^d$, $k,n,j\in\mathbb{N}_0$, and $Y^{k,n}_j\colon\Omega\to[0,1]$, $k,n,j\in\mathbb{N}_0$, be functions, assume that $(X^{0,0}_j,Y^{0,0}_j)$, $j\in\{1,2,\ldots,M\}$, are i.i.d. random variables, let $\mathcal{E}\colon[0,1]^d\to[0,1]$ satisfy $\mathbb{P}$-a.s. that $\mathcal{E}(X^{0,0}_1)=\mathbb{E}[Y^{0,0}_1|X^{0,0}_1]$, assume for all $x,y\in[0,1]^d$ that $|\mathcal{E}(x)-\mathcal{E}(y)|\le L\|x-y\|_1$, let $\Theta_{k,n}\colon\Omega\to\mathbb{R}^{\mathfrak{d}}$, $k,n\in\mathbb{N}_0$, and $\mathbf{k}\colon\Omega\to(\mathbb{N}_0)^2$ be random variables, assume $(\bigcup_{k=1}^{\infty}\Theta_{k,0}(\Omega))\subseteq[-B,B]^{\mathfrak{d}}$, assume that $\Theta_{k,0}$, $k\in\{1,2,\ldots,K\}$, are i.i.d., assume that $\Theta_{1,0}$ is continuous uniformly distributed on $[-c,c]^{\mathfrak{d}}$, let $\mathcal{R}^{k,n}_J\colon\mathbb{R}^{\mathfrak{d}}\times\Omega\to[0,\infty)$, $k,n,J\in\mathbb{N}_0$, and $\mathcal{G}^{k,n}\colon\mathbb{R}^{\mathfrak{d}}\times\Omega\to\mathbb{R}^{\mathfrak{d}}$, $k,n\in\mathbb{N}$, satisfy for all $k,n\in\mathbb{N}$, $\omega\in\Omega$, $\theta\in\{\vartheta\in\mathbb{R}^{\mathfrak{d}}\colon(\mathcal{R}^{k,n}_{J_n}(\cdot,\omega)\colon\mathbb{R}^{\mathfrak{d}}\to[0,\infty)$ is differentiable at $\vartheta)\}$ that $\mathcal{G}^{k,n}(\theta,\omega)=(\nabla_\theta\mathcal{R}^{k,n}_{J_n})(\theta,\omega)$, assume for all $k,n\in\mathbb{N}$ that $\Theta_{k,n}=\Theta_{k,n-1}-\gamma_n\mathcal{G}^{k,n}(\Theta_{k,n-1})$, and assume for all $k,n\in\mathbb{N}_0$, $J\in\mathbb{N}$, $\theta\in\mathbb{R}^{\mathfrak{d}}$, $\omega\in\Omega$ that
\[
\mathcal{R}^{k,n}_J(\theta,\omega)=\frac{1}{J}\Biggl[\sum_{j=1}^{J}|\mathcal{N}^{\theta,l}_{0,1}(X^{k,n}_j(\omega))-Y^{k,n}_j(\omega)|^2\Biggr] \tag{191}
\]
and
\[
\mathbf{k}(\omega)\in\operatorname*{arg\,min}_{(l,m)\in\{1,2,\ldots,K\}\times\mathbf{N},\;\|\Theta_{l,m}(\omega)\|_\infty\le B}\mathcal{R}^{0,0}_M(\Theta_{l,m}(\omega),\omega) \tag{192}
\]
(cf. Definitions 2.8 and 3.1). Then
\[
\begin{split}
&\mathbb{E}\biggl[\int_{[0,1]^d}|\mathcal{N}^{\Theta_{\mathbf{k}},l}_{0,1}(x)-\mathcal{E}(x)|\,\mathbb{P}_{X^{0,0}_1}(dx)\biggr]\\
&\le\frac{3dL}{[\min\{L,l_1,l_2,\ldots,l_{L-1}\}]^{1/d}}
+\frac{3(\|l\|_\infty+1)[2L\ln(3MBc)]^{1/2}}{M^{1/4}}
+\frac{2[L(\|l\|_\infty+1)^{L}c^{L+1}]^{1/2}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}\\
&\le\frac{dc^3}{[\min\{L,l_1,l_2,\ldots,l_{L-1}\}]^{1/d}}
+\frac{B^3L(\|l\|_\infty+1)\ln(eM)}{M^{1/4}}
+\frac{L(\|l\|_\infty+1)^{L}c^{L+1}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}
\end{split}
\tag{193}
\]
(cf. (iii) in Lemma 6.2).

Proof of Corollary 6.9. Observe that Corollary 6.8 (with $a\leftarrow 0$, $u\leftarrow 0$, $b\leftarrow 1$, $v\leftarrow 1$ in the notation of Corollary 6.8), the facts that $B\ge c\ge\max\{2,L\}$ and $M\ge 1$, and (ii) in Lemma 6.4 show that
\[
\begin{split}
&\mathbb{E}\biggl[\int_{[0,1]^d}|\mathcal{N}^{\Theta_{\mathbf{k}},l}_{0,1}(x)-\mathcal{E}(x)|\,\mathbb{P}_{X^{0,0}_1}(dx)\biggr]\\
&\le\frac{3dL}{[\min\{L,l_1,l_2,\ldots,l_{L-1}\}]^{1/d}}
+\frac{3(\|l\|_\infty+1)[2L\ln(3MBc)]^{1/2}}{M^{1/4}}
+\frac{2[L(\|l\|_\infty+1)^{L}c^{L+1}]^{1/2}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}\\
&\le\frac{dc^3}{[\min\{L,l_1,l_2,\ldots,l_{L-1}\}]^{1/d}}
+\frac{(\|l\|_\infty+1)[23BL\ln(eM)]^{1/2}}{M^{1/4}}
+\frac{[L(\|l\|_\infty+1)^{L}c^{2L+2}]^{1/2}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}\\
&\le\frac{dc^3}{[\min\{L,l_1,l_2,\ldots,l_{L-1}\}]^{1/d}}
+\frac{B^3L(\|l\|_\infty+1)\ln(eM)}{M^{1/4}}
+\frac{L(\|l\|_\infty+1)^{L}c^{L+1}}{K^{[(2L)^{-1}(\|l\|_\infty+1)^{-2}]}}.
\end{split}
\tag{194}
\]

The proof of Corollary 6.9 is thus complete.
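To get a feeling for the magnitudes in (193), the following sketch evaluates the three terms of the first bound: the approximation term, the generalisation term, and the optimisation term. The depth and the Lipschitz constant are both set to $L=2$, so the distinction between the two roles of the letter $L$ does not matter here; all other numbers are illustrative assumptions of ours.

```python
import math

# The three terms of the first bound in (193), evaluated for hypothetical
# parameter choices.  Depth and Lipschitz constant are both L = 2 here.
def bound_terms(d, widths, M, K, L=2, c=2.0, B=2.0):
    l_inf = max([d] + widths + [1])                       # ||l||_infinity
    approx = 3 * d * L / min([L] + widths) ** (1.0 / d)   # network-size term
    gen = (3 * (l_inf + 1) * math.sqrt(2 * L * math.log(3 * M * B * c))
           / M ** 0.25)                                   # sample-size term
    opt = (2 * math.sqrt(L * (l_inf + 1) ** L * c ** (L + 1))
           / K ** (1.0 / (2 * L * (l_inf + 1) ** 2)))     # random-restart term
    return approx, gen, opt

for M, K in [(10**4, 10**2), (10**8, 10**6), (10**12, 10**10)]:
    a_, g_, o_ = bound_terms(d=2, widths=[32], M=M, K=K)
    print(f"M=1e{int(math.log10(M))}, K=1e{int(math.log10(K))}: "
          f"approx={a_:.2f}, gen={g_:.3f}, opt={o_:.3f}")
```

Note how slowly the optimisation term decays: its exponent $(2L)^{-1}(\|l\|_\infty+1)^{-2}$ is tiny for wide networks, which is consistent with the remark in the abstract that the obtained convergence speed is presumably far from optimal and suffers under the curse of dimensionality.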

Acknowledgements

This work has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy EXC 2044-390685587, Mathematics Münster: Dynamics-Geometry-Structure, by the Swiss National Science Foundation (SNSF) under the project "Deep artificial neural network approximations for stochastic partial differential equations: Algorithms and convergence proofs" (project number 184220), and through the ETH Research Grant ETH-47 15-2 "Mild stochastic calculus and numerical approximations for nonlinear stochastic evolution equations with Lévy noise".

References

[1] Aliprantis, C. D., and Border, K. C. Infinite dimensional analysis. Third. Ahitchhiker’s guide. Springer, Berlin, 2006, xxii+703. isbn: 978-3-540-32696-0. url:https://doi.org/10.1007/3-540-29587-9.

[2] Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-Grained Analysis ofOptimization and Generalization for Overparameterized Two-Layer Neural Net-works. In: Proceedings of the 36th International Conference on Machine Learning.Ed. by Chaudhuri, K., and Salakhutdinov, R. Vol. 97. Proceedings of MachineLearning Research. PMLR, Long Beach, California, USA, June 2019, 322–332. url:http://proceedings.mlr.press/v97/arora19a.html.

[3] Bach, F. Breaking the curse of dimensionality with convex neural networks. J.Mach. Learn. Res. 18 (2017), 53 pages. issn: 1532-4435.

[4] Bach, F., and Moulines, E. Non-strongly-convex smooth stochastic approxima-tion with convergence rate O(1/n). In: Advances in Neural Information ProcessingSystems 26. Ed. by Burges, C. J. C., Bottou, L., Welling, M., Ghahramani,Z., and Weinberger, K. Q. Curran Associates, Inc., 2013, 773–781. url: http://papers.nips.cc/paper/4900-non-strongly-convex-smooth-stochastic-

approximation-with-convergence-rate-o1n.pdf.

[5] Barron, A. R. Universal approximation bounds for superpositions of a sigmoidalfunction. IEEE Trans. Inform. Theory 39, 3 (1993), 930–945. issn: 0018-9448. url:https://doi.org/10.1109/18.256500.

43

Page 44: Overall error analysis for the training of deep neural ... · 1 Introduction Deep learning based algorithms have been applied extremely successfully to overcome fun-damental challenges

[6] Barron, A. R. Approximation and estimation bounds for artificial neural net-works. Machine Learning 14, 1 (1994), 115–133. url: https://doi.org/10.1007/BF00993164.

[7] Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher com-plexities. Ann. Statist. 33, 4 (2005), 1497–1537. issn: 0090-5364. url: https://doi.org/10.1214/009053605000000282.

[8] Beck, C., Becker, S., Grohs, P., Jaafari, N., and Jentzen, A. Solvingstochastic differential equations and Kolmogorov equations by means of deep learn-ing. ArXiv e-prints (2018), 56 pages. arXiv: 1806.00421 [math.NA].

[9] Beck, C., E, W., and Jentzen, A. Machine Learning Approximation Algorithmsfor High-Dimensional Fully Nonlinear Partial Differential Equations and Second-order Backward Stochastic Differential Equations. J. Nonlinear Sci. 29, 4 (2019),1563–1619. issn: 0938-8974. url: https://doi.org/10.1007/s00332-018-9525-3.

[10] Beck, C., Jentzen, A., and Kuckuck, B. Full error analysis for the train-ing of deep neural networks. ArXiv e-prints (2019), 53 pages. arXiv: 1910.00121[math.NA].

[11] Bellman, R. Dynamic programming. Princeton University Press, Princeton, N.J., 1957, xxv+342.

[12] Bercu, B., and Fort, J.-C. Generic Stochastic Gradient Methods. In: WileyEncyclopedia of Operations Research and Management Science. American CancerSociety, 2013, 1–8. isbn: 9780470400531. url: https://doi.org/10.1002/9780470400531.eorms1068.

[13] Berner, J., Grohs, P., and Jentzen, A. Analysis of the generalization error:Empirical risk minimization over deep artificial neural networks overcomes the curseof dimensionality in the numerical approximation of Black-Scholes partial differen-tial equations. ArXiv e-prints (2018), 30 pages. arXiv: 1809.03062 [cs.LG].

[14] Blum, E. K., and Li, L. K. Approximation theory and feedforward networks.Neural Networks 4, 4 (1991), 511–515. issn: 0893-6080. url: https://doi.org/10.1016/0893-6080(91)90047-9.

[15] Bolcskei, H., Grohs, P., Kutyniok, G., and Petersen, P. Optimal approx-imation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1,1 (2019), 8–45. issn: 2577-0187. url: https://doi.org/10.1137/18M118709X.

[16] Burger, M., and Neubauer, A. Error bounds for approximation with neuralnetworks. J. Approx. Theory 112, 2 (2001), 235–250. issn: 0021-9045. url: https://doi.org/10.1006/jath.2001.3613.

[17] Candes, E. J. Ridgelets: theory and applications. PhD thesis. Department ofStatistics, Stanford University, 1998.

[18] Chau, N. H., Moulines, E., Rasonyi, M., Sabanis, S., and Zhang, Y. Onstochastic gradient Langevin dynamics with dependent data streams: the fully non-convex case. ArXiv e-prints (2019), 27 pages. arXiv: 1905.13142 [math.ST].

[19] Chen, T., and Chen, H. Approximation capability to functions of several vari-ables, nonlinear functionals, and operators by radial basis function neural networks.IEEE Trans. Neural Netw. 6, 4 (1995), 904–910. issn: 1941-0093. url: https:

//doi.org/10.1109/72.392252.

44

Page 45: Overall error analysis for the training of deep neural ... · 1 Introduction Deep learning based algorithms have been applied extremely successfully to overcome fun-damental challenges

[20] Cheridito, P., Jentzen, A., and Rossmannek, F. Efficient approximation ofhigh-dimensional functions with deep neural networks. ArXiv e-prints (2019), 19pages. arXiv: 1912.04310 [math.NA].

[21] Chui, C. K., Li, X., and Mhaskar, H. N. Neural networks for localized ap-proximation. Math. Comp. 63, 208 (1994), 607–623. issn: 0025-5718. url: https://doi.org/10.2307/2153285.

[22] Cox, S., Hutzenthaler, M., Jentzen, A., van Neerven, J., and Welti, T.Convergence in Holder norms with applications to Monte Carlo methods in infinitedimensions. ArXiv e-prints (2016), 50 pages. arXiv: 1605.00856 [math.NA]. Toappear in IMA J. Num. Anal.

[23] Cucker, F., and Smale, S. On the mathematical foundations of learning. Bull.Amer. Math. Soc. (N.S.) 39, 1 (2002), 1–49. issn: 0273-0979. url: https://doi.org/10.1090/S0273-0979-01-00923-5.

[24] Cybenko, G. Approximation by superpositions of a sigmoidal function. Math.Control Signals Systems 2, 4 (1989), 303–314. issn: 0932-4194. url: https://doi.org/10.1007/BF02551274.

[25] Dereich, S., and Kassing, S. Central limit theorems for stochastic gradientdescent with averaging for stable manifolds. ArXiv e-prints (2019), 42 pages. arXiv:1912.09187 [math.PR].

[26] Dereich, S., and Muller-Gronbach, T. General multilevel adaptations forstochastic approximation algorithms of Robbins-Monro and Polyak-Ruppert type.Numer. Math. 142, 2 (2019), 279–328. issn: 0029-599X. url: https://doi.org/10.1007/s00211-019-01024-y.

[27] DeVore, R. A., Oskolkov, K. I., and Petrushev, P. P. Approximation byfeed-forward neural networks. In: The heritage of P. L. Chebyshev: a Festschrift inhonor of the 70th birthday of T. J. Rivlin. Vol. 4. Baltzer Science Publishers BV,Amsterdam, 1997, 261–287.

[28] Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient Descent ProvablyOptimizes Over-parameterized Neural Networks. ArXiv e-prints (2018), 19 pages.arXiv: 1810.02054 [cs.LG].

[29] Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient Descent Finds GlobalMinima of Deep Neural Networks. In: Proceedings of the 36th International Con-ference on Machine Learning. Ed. by Chaudhuri, K., and Salakhutdinov, R.Vol. 97. Proceedings of Machine Learning Research. PMLR, Long Beach, California,USA, June 2019, 1675–1685. url: http://proceedings.mlr.press/v97/du19c.html.

[30] E, W., Han, J., and Jentzen, A. Deep learning-based numerical methods forhigh-dimensional parabolic partial differential equations and backward stochasticdifferential equations. Commun. Math. Stat. 5, 4 (2017), 349–380. issn: 2194-6701.url: https://doi.org/10.1007/s40304-017-0117-6.

[31] E, W., Ma, C., and Wang, Q. A Priori Estimates of the Population Risk forResidual Networks. ArXiv e-prints (2019), 19 pages. arXiv: 1903.02154 [cs.LG].

[32] E, W., Ma, C., and Wu, L. A priori estimates of the population risk for two-layer neural networks. Comm. Math. Sci. 17, 5 (2019), 1407–1425. url: https:

//doi.org/10.4310/CMS.2019.v17.n5.a11.

45

Page 46: Overall error analysis for the training of deep neural ... · 1 Introduction Deep learning based algorithms have been applied extremely successfully to overcome fun-damental challenges

[33] E, W., Ma, C., and Wu, L. A comparative analysis of optimization and gener-alization properties of two-layer neural network and random feature models undergradient descent dynamics. Sci. China Math. (2020). url: https://doi.org/10.1007/s11425-019-1628-5.

[34] E, W., and Wang, Q. Exponential convergence of the deep neural network ap-proximation for analytic functions. Sci. China Math. 61 (2018), 1733–1740. url:https://doi.org/10.1007/s11425-018-9387-x.

[35] Elbrachter, D., Grohs, P., Jentzen, A., and Schwab, C. DNN ExpressionRate Analysis of High-dimensional PDEs: Application to Option Pricing. ArXive-prints (2018), 50 pages. arXiv: 1809.07669 [math.FA].

[36] Eldan, R., and Shamir, O. The Power of Depth for Feedforward Neural Net-works. In: 29th Annual Conference on Learning Theory. Ed. by Feldman, V.,Rakhlin, A., and Shamir, O. Vol. 49. Proceedings of Machine Learning Re-search. PMLR, Columbia University, New York, New York, USA, June 2016, 907–940. url: http://proceedings.mlr.press/v49/eldan16.pdf.

[37] Ellacott, S. W. Aspects of the numerical analysis of neural networks. In: Actanumerica, 1994. Acta Numer. Cambridge University Press, Cambridge, 1994, 145–202. url: https://doi.org/10.1017/S0962492900002439.

[38] Fehrman, B., Gess, B., and Jentzen, A. Convergence rates for the stochasticgradient descent method for non-convex objective functions. ArXiv e-prints (2019),59 pages. arXiv: 1904.01517 [math.NA].

[39] Funahashi, K.-I. On the approximate realization of continuous mappings by neu-ral networks. Neural Networks 2, 3 (1989), 183–192. issn: 0893-6080. url: https://doi.org/10.1016/0893-6080(89)90003-8.

[40] Gautschi, W. Some elementary inequalities relating to the gamma and incompletegamma function. J. Math. and Phys. 38 (1959), 77–81. issn: 0097-1421. url: https://doi.org/10.1002/sapm195938177.

[41] Glorot, X., and Bengio, Y. Understanding the difficulty of training deep feed-forward neural networks. In: Proceedings of the Thirteenth International Conferenceon Artificial Intelligence and Statistics. Ed. by Teh, Y. W., and Titterington,M. Vol. 9. Proceedings of Machine Learning Research. PMLR, Chia Laguna Resort,Sardinia, Italy, May 2010, 249–256. url: http://proceedings.mlr.press/v9/glorot10a.html.

[42] Gonon, L., Grohs, P., Jentzen, A., Kofler, D., and Siska, D. Uniform errorestimates for artificial neural network approximations for heat equations. ArXiv e-prints (2019), 70 pages. arXiv: 1911.09647 [math.NA].

[43] Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. AdaptiveComputation and Machine Learning. MIT Press, Cambridge, MA, 2016, xxii+775.isbn: 978-0-262-03561-3.

[44] Gribonval, R., Kutyniok, G., Nielsen, M., and Voigtlaender, F. Approx-imation spaces of deep neural networks. ArXiv e-prints (2019), 63 pages. arXiv:1905.01208 [math.FA].

46

Page 47: Overall error analysis for the training of deep neural ... · 1 Introduction Deep learning based algorithms have been applied extremely successfully to overcome fun-damental challenges

[45] Grohs, P., Hornung, F., Jentzen, A., and von Wurstemberger, P. Aproof that artificial neural networks overcome the curse of dimensionality in thenumerical approximation of Black-Scholes partial differential equations. ArXiv e-prints (2018), 124 pages. arXiv: 1809.02362 [math.NA]. To appear in Mem. Amer.Math. Soc.

[46] Grohs, P., Hornung, F., Jentzen, A., and Zimmermann, P. Space-time errorestimates for deep neural network approximations for differential equations. ArXive-prints (2019), 86 pages. arXiv: 1908.03833 [math.NA].

[47] Grohs, P., Jentzen, A., and Salimova, D. Deep neural network approxima-tions for Monte Carlo algorithms. ArXiv e-prints (2019), 45 pages. arXiv: 1908.10828 [math.NA].

[48] Grohs, P., Perekrestenko, D., Elbrachter, D., and Bolcskei, H. DeepNeural Network Approximation Theory. ArXiv e-prints (2019), 60 pages. arXiv:1901.02220 [cs.LG].

[49] Guhring, I., Kutyniok, G., and Petersen, P. Error bounds for approxima-tions with deep ReLU neural networks in W s,p norms. ArXiv e-prints (2019), 42pages. arXiv: 1902.07896 [math.FA].

[50] Guliyev, N. J., and Ismailov, V. E. Approximation capability of two hiddenlayer feedforward neural networks with fixed weights. Neurocomputing 316 (2018),262–269. issn: 0925-2312. url: https://doi.org/10.1016/j.neucom.2018.07.075.

[51] Guliyev, N. J., and Ismailov, V. E. On the approximation by single hiddenlayer feedforward neural networks with fixed weights. Neural Networks 98 (2018),296–304. issn: 0893-6080. url: https://doi.org/10.1016/j.neunet.2017.12.007.

[52] Gyorfi, L., Kohler, M., Krzyzak, A., and Walk, H. A distribution-freetheory of nonparametric regression. Springer Series in Statistics. Springer-Verlag,New York, 2002, xvi+647. isbn: 0-387-95441-4. url: https://doi.org/10.1007/b97848.

[53] Han, J., and Long, J. Convergence of the Deep BSDE Method for CoupledFBSDEs. ArXiv e-prints (2018), 27 pages. arXiv: 1811.01165 [math.PR].

[54] Hartman, E. J., Keeler, J. D., and Kowalski, J. M. Layered Neural Networkswith Gaussian Hidden Units as Universal Approximations. Neural Comput. 2, 2(1990), 210–215. url: https://doi.org/10.1162/neco.1990.2.2.210.

[55] Hornik, K. Approximation capabilities of multilayer feedforward networks. NeuralNetworks 4, 2 (1991), 251–257. issn: 0893-6080. url: https://doi.org/10.1016/0893-6080(91)90009-T.

[56] Hornik, K. Some new results on neural network approximation. Neural Networks6, 8 (1993), 1069–1072. issn: 0893-6080. url: https://doi.org/10.1016/S0893-6080(09)80018-X.

[57] Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networksare universal approximators. Neural Networks 2, 5 (1989), 359–366. issn: 0893-6080.url: https://doi.org/10.1016/0893-6080(89)90020-8.

47

Page 48: Overall error analysis for the training of deep neural ... · 1 Introduction Deep learning based algorithms have been applied extremely successfully to overcome fun-damental challenges

[58] Hornik, K., Stinchcombe, M., and White, H. Universal approximation of anunknown mapping and its derivatives using multilayer feedforward networks. NeuralNetworks 3, 5 (1990), 551–560. issn: 0893-6080. url: https://doi.org/10.1016/0893-6080(90)90005-6.

[59] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. A proof thatrectified deep neural networks overcome the curse of dimensionality in the numer-ical approximation of semilinear heat equations. ArXiv e-prints (2019), 29 pages.arXiv: 1901.10854 [math.NA]. To appear in SN Partial Differential Equations andApplications.

[60] Jentzen, A., Kuckuck, B., Neufeld, A., and von Wurstemberger, P.Strong error analysis for stochastic gradient descent optimization algorithms. ArXive-prints (2018), 75 pages. arXiv: 1801.09324 [math.NA]. Revision requested fromIMA J. Numer. Anal.

[61] Jentzen, A., Salimova, D., and Welti, T. A proof that deep artificial neuralnetworks overcome the curse of dimensionality in the numerical approximation ofKolmogorov partial differential equations with constant diffusion and nonlinear driftcoefficients. ArXiv e-prints (2018), 48 pages. arXiv: 1809.07321 [math.NA].

[62] Jentzen, A., and von Wurstemberger, P. Lower error bounds for the stochas-tic gradient descent optimization algorithm: Sharp convergence rates for slowly andfast decaying learning rates. J. Complexity 57 (2020), 101438. issn: 0885-064X. url:https://doi.org/10.1016/j.jco.2019.101438.

[63] Karimi, B., Miasojedow, B., Moulines, E., and Wai, H.-T. Non-asymptoticAnalysis of Biased Stochastic Approximation Scheme. ArXiv e-prints (2019), 32pages. arXiv: 1902.00629 [stat.ML].

[64] Kutyniok, G., Petersen, P., Raslan, M., and Schneider, R. A TheoreticalAnalysis of Deep Neural Networks and Parametric PDEs. ArXiv e-prints (2019), 43pages. arXiv: 1904.00377 [math.NA].

[65] Lei, Y., Hu, T., Li, G., and Tang, K. Stochastic Gradient Descent for NonconvexLearning without Bounded Gradient Assumptions. ArXiv e-prints (2019), 7 pages.arXiv: 1902.00908 [cs.LG].

[66] Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. Multilayer feedforwardnetworks with a nonpolynomial activation function can approximate any function.Neural Networks 6, 6 (1993), 861–867. issn: 0893-6080. url: https://doi.org/10.1016/S0893-6080(05)80131-5.

[67] Massart, P. Concentration inequalities and model selection. Vol. 1896. LectureNotes in Mathematics. Lectures from the 33rd Summer School on Probability The-ory held in Saint-Flour, July 6–23, 2003. Springer, Berlin, 2007, xiv+337. isbn:978-3-540-48497-4. url: https://doi.org/10.1007/978-3-540-48503-2.

[68] Mhaskar, H. N. Neural Networks for Optimal Approximation of Smooth andAnalytic Functions. Neural Comput. 8, 1 (1996), 164–177. url: https://doi.org/10.1162/neco.1996.8.1.164.

[69] Mhaskar, H. N., and Micchelli, C. A. Degree of approximation by neural andtranslation networks with a single hidden layer. Adv. in Appl. Math. 16, 2 (1995),151–183. issn: 0196-8858. url: https://doi.org/10.1006/aama.1995.1008.

48

Page 49: Overall error analysis for the training of deep neural ... · 1 Introduction Deep learning based algorithms have been applied extremely successfully to overcome fun-damental challenges

[70] Mhaskar, H. N., and Poggio, T. Deep vs. shallow networks: an approximationtheory perspective. Anal. Appl. (Singap.) 14, 6 (2016), 829–848. issn: 0219-5305.url: https://doi.org/10.1142/S0219530516400042.

[71] Montanelli, H., and Du, Q. New error bounds for deep ReLU networks usingsparse grids. SIAM J. Math. Data Sci. 1, 1 (2019), 78–92. url: https://doi.org/10.1137/18M1189336.

[72] Nguyen-Thien, T., and Tran-Cong, T. Approximation of functions and theirderivatives: A neural network implementation with applications. Appl. Math. Model.23, 9 (1999), 687–704. issn: 0307-904X. url: https://doi.org/10.1016/S0307-904X(99)00006-2.

[73] Novak, E., and Wozniakowski, H. Tractability of multivariate problems. Vol.1: Linear information. Vol. 6. EMS Tracts in Mathematics. European MathematicalSociety (EMS), Zurich, 2008, xii+384. isbn: 978-3-03719-026-5. url: https://doi.org/10.4171/026.

[74] Novak, E., and Wozniakowski, H. Tractability of multivariate problems. Vol-ume II: Standard information for functionals. Vol. 12. EMS Tracts in Mathematics.European Mathematical Society (EMS), Zurich, 2010, xviii+657. isbn: 978-3-03719-084-5. url: https://doi.org/10.4171/084.

[75] Park, J., and Sandberg, I. W. Universal Approximation Using Radial-Basis-Function Networks. Neural Comput. 3, 2 (1991), 246–257. url: https://doi.org/10.1162/neco.1991.3.2.246.

[76] Perekrestenko, D., Grohs, P., Elbrachter, D., and Bolcskei, H. Theuniversal approximation power of finite-width deep ReLU networks. ArXiv e-prints(2018), 16 pages. arXiv: 1806.01528 [cs.LG].

[77] Petersen, P., Raslan, M., and Voigtlaender, F. Topological properties ofthe set of functions generated by neural networks of fixed size. ArXiv e-prints (2018),51 pages. arXiv: 1806.08459 [math.GN].

[78] Petersen, P., and Voigtlaender, F. Equivalence of approximation by convo-lutional neural networks and fully-connected networks. ArXiv e-prints (2018), 10pages. arXiv: 1809.00973 [math.FA].

[79] Petersen, P., and Voigtlaender, F. Optimal approximation of piecewisesmooth functions using deep ReLU neural networks. Neural Networks 108 (2018),296–330. issn: 0893-6080. url: https://doi.org/10.1016/j.neunet.2018.08.019.

[80] Pinkus, A. Approximation theory of the MLP model in neural networks. In: Actanumerica, 1999. Vol. 8. Acta Numer. Cambridge University Press, Cambridge, 1999,143–195. url: https://doi.org/10.1017/S0962492900002919.

[81] Qi, F. Bounds for the ratio of two gamma functions. J. Inequal. Appl. (2010), Art.ID 493058, 84. issn: 1025-5834. url: https://doi.org/10.1155/2010/493058.

[82] Reisinger, C., and Zhang, Y. Rectified deep neural networks overcome the curseof dimensionality for nonsmooth value functions in zero-sum games of nonlinear stiffsystems. ArXiv e-prints (2019), 34 pages. arXiv: 1903.06652 [math.NA].

[83] Robbins, H., and Monro, S. A stochastic approximation method. Ann. Math.Statistics 22 (1951), 400–407. issn: 0003-4851. url: https://doi.org/10.1214/aoms/1177729586.

49

Page 50: Overall error analysis for the training of deep neural ... · 1 Introduction Deep learning based algorithms have been applied extremely successfully to overcome fun-damental challenges

[84] Schmitt, M. Lower Bounds on the Complexity of Approximating Continuous Functions by Sigmoidal Neural Networks. In: Advances in Neural Information Processing Systems 12. Ed. by Solla, S. A., Leen, T. K., and Müller, K. MIT Press, 2000, 328–334. url: http://papers.nips.cc/paper/1692-lower-bounds-on-the-complexity-of-approximating-continuous-functions-by-sigmoidal-neural-networks.pdf.

[85] Schwab, C., and Zech, J. Deep learning in high dimension: neural network expression rates for generalized polynomial chaos expansions in UQ. Anal. Appl. (Singap.) 17, 1 (2019), 19–55. issn: 0219-5305. url: https://doi.org/10.1142/S0219530518500203.

[86] Shaham, U., Cloninger, A., and Coifman, R. R. Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal. 44, 3 (2018), 537–557. issn: 1063-5203. url: https://doi.org/10.1016/j.acha.2016.04.003.

[87] Shalev-Shwartz, S., and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge, 2014. url: https://doi.org/10.1017/CBO9781107298019.

[88] Shamir, O. Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks. In: Proceedings of the Thirty-Second Conference on Learning Theory. Ed. by Beygelzimer, A., and Hsu, D. Vol. 99. Proceedings of Machine Learning Research. PMLR, Phoenix, USA, June 2019, 2691–2713. url: http://proceedings.mlr.press/v99/shamir19a.html.

[89] Shen, Z., Yang, H., and Zhang, S. Deep Network Approximation Characterized by Number of Neurons. ArXiv e-prints (2019), 36 pages. arXiv: 1906.05497 [math.NA].

[90] Shen, Z., Yang, H., and Zhang, S. Nonlinear approximation via compositions. Neural Networks 119 (2019), 74–84. issn: 0893-6080. url: https://doi.org/10.1016/j.neunet.2019.07.011.

[91] Sirignano, J., and Spiliopoulos, K. DGM: A deep learning algorithm for solving partial differential equations. J. Comput. Phys. 375 (2018), 1339–1364. issn: 0021-9991. url: https://doi.org/10.1016/j.jcp.2018.08.029.

[92] van de Geer, S. A. Applications of empirical process theory. Vol. 6. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2000, xii+286. isbn: 0-521-65002-X.

[93] Voigtlaender, F., and Petersen, P. Approximation in Lp(µ) with deep ReLU neural networks. ArXiv e-prints (2019), 4 pages. arXiv: 1904.04789 [math.FA].

[94] Wendel, J. G. Note on the gamma function. Amer. Math. Monthly 55 (1948), 563–564. issn: 0002-9890. url: https://doi.org/10.2307/2304460.

[95] Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks 94 (2017), 103–114. issn: 0893-6080. url: https://doi.org/10.1016/j.neunet.2017.07.002.

[96] Yarotsky, D. Universal approximations of invariant maps by neural networks. ArXiv e-prints (2018), 64 pages. arXiv: 1804.10306 [cs.NE].



[97] Zhang, G., Martens, J., and Grosse, R. B. Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks. In: Advances in Neural Information Processing Systems 32. Ed. by Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. Curran Associates, Inc., 2019, 8082–8093. url: http://papers.nips.cc/paper/9020-fast-convergence-of-natural-gradient-descent-for-over-parameterized-neural-networks.pdf.

[98] Zou, D., Cao, Y., Zhou, D., and Gu, Q. Gradient descent optimizes over-parameterized deep ReLU networks. Mach. Learn. (2019). url: https://doi.org/10.1007/s10994-019-05839-6.
