implicit deep learning - arxiv · network in order to solve a (relaxed) maxsat problem. in [16],...

Implicit Deep Learning∗

Laurent El Ghaoui† , Fangda Gu† , Bertrand Travacca‡ , Armin Askari† , and Alicia Y. Tsai†

Abstract. Implicit deep learning prediction rules generalize the recursive rules of feedforward neural networks.Such rules are based on the solution of a fixed-point equation involving a single vector of hiddenfeatures, which is thus only implicitly defined. The implicit framework greatly simplifies the notationof deep learning, and opens up many new possibilities, in terms of novel architectures and algorithms,robustness analysis and design, interpretability, sparsity, and network architecture optimization.

Key words. Deep learning, implicit models, Perron-Frobenius theory, robustness, adversarial attacks.

AMS subject classifications. 690C26, 49M99, 65K10, 62M45, 26B10

φ

(A BC D

)y(u) u

z x

Figure 1. A block-diagram view of an implicit model.

1. Introduction.

1.1. Implicit prediction rules. In this paper, we consider a new class of deep learningmodels that are based on implicit prediction rules. Such rules are not obtained via a recursiveprocedure through several layers, as in current neural networks. Instead, they are based onsolving a fixed-point equation in some single “state” vector x ∈ Rn. Precisely, for a giveninput vector u, the predicted vector is

y(u) = Cx+Du [prediction equation](1.1a)

x = φ(Ax+Bu) [equilibrium equation](1.1b)

where φ : Rn → Rn is a nonlinear vector map (the “activation” map), and matrices A,B,C,Dcontain model parameters. Figure 1 provides a block-diagram view of an implicit model, tobe read from right to left, so as to be consistent with matrix-vector multiplication rules.

We can think of the vector x ∈ Rn as a “state” corresponding to n “hidden” features thatare extracted from the inputs, based on the so-called equilibrium equation (1.1b). In general,

∗Submitted to the editors, August 6, 2020.Funding: This work was funded in part by sumup.ai, the National Science Foundation, the Pacific Earthquake

Engineering Research Center, and Total S.A.†EECS and IEOR departments, UC Berkeley ([email protected]).‡CEE department, UC Berkeley.

1

arX

iv:1

908.

0631

5v4

[cs

.LG

] 6

Aug

202

0

sumup.ai

mailto:[email protected]

2 EL GHAOUI ET AL.

that equation cannot be solved in closed-form, and the model above provides x only implicitly.This equation is not necessarily well-posed, in the sense that it may not admit a solution, letalone a unique one; we discuss this important issue of well-posedness in section 2.

For notational simplicity only, our rule does not contain any bias terms; we can easilyaccount for those by considering the vector (u, 1) instead of u, thereby increasing the columndimension of B by one.

Perhaps surprisingly, as seen in section 3, the implicit framework includes most currentneural network architectures as special cases. Implicit models are a much wider class: theypresent much more capacity, as measured by the number of parameters for a given dimensionof the hidden features; also, they allow for cycles in the network, which is not permitted underthe current paradigm of deep networks.

Implicit rules open up the possibility of using novel architectures and prediction rules fordeep learning, which are not based on any notion of “network” or “layers”, as is classically un-derstood. In addition, they allow one to consider rigorous approaches to challenging problemsin deep learning, ranging from robustness analysis, sparsity and interpretability, and featureselection.

1.2. Contributions and paper outline. Our contributions in this paper, and its outline,are as follows.

• Well-posedness and composition (section 2): In contrast with standard deep networks,implicit models may not be well-posed, in the sense that the equilibrium equationmay have no or multiple solutions. We establish rigorous and numerically tractableconditions for implicit rules to be well-posed. These conditions are then used in thetraining problem, guaranteeing the well-posedness of the learned prediction rule. Wealso discuss the composition of implicit models, via cascade connections for example.• Implicit models of neural networks (section 3): We provide details on how to represent

a wide variety of neural networks as implicit models, building on the composition rulesof section 2.• Robustness analysis (section 4): We describe how to analyze the robustness properties

of a given implicit model, deriving bounds on the state under input perturbations,and generating adversarial attacks. We also discuss which penalties to include intothe training problem so as to encourage robustness of the learned rule.• Interpretability, sparsity, compression and deep feature selection (section 5): Here we

focus on finding appropriate penalties to use in order to improve properties such asmodel sparsity, or obtain feature selection. We also discuss the impact of model errors.• Training problem: formulations and algorithms (section 6): Informed by our previous

findings, we finally discuss the corresponding training problem. Following the work of[24] and [34], we represent activation functions using so-called Fenchel divergences, inorder to relax the training problem into a more tractable form. We discuss several al-gorithms, including stochastic projected gradients, Frank-Wolfe, and block-coordinatedescent.

Finally, section 7 provides a few experiments supporting the theory put forth in this paper.Our final section 8 is devoted to prior work and references.

IMPLICIT DEEP LEARNING 3

1.3. Notation. For a matrix U , |U | (resp. U+) denotes the matrix with the absolutevalues (resp. positive part) of the entries of U . For a vector v, we denote by diag(v) thediagonal matrix formed with the entries of v; for a square matrix V , diag(V ) is the vectorformed with the diagonal elements of V . The notation 1 refers to the vector of ones, with sizeinferred from context. The Hadamard (componentwise) product between two n-vectors x, yis denoted x� y. We use sk(z) to denote the sum of the largest k entries of a vector z. For amatrix A, and integers p, q ≥ 1, we define the induced norm

‖A‖p→q = maxξ‖Aξ‖q : ‖ξ‖p ≤ 1.

The case when p = q = ∞ corresponds to the l∞-induced norm of A, also known as itsmax-row-sum norm:

‖A‖∞ := maxi

∑j

|Aij |.

We denote the set {1, · · · , L} compactly as [L]. For a n-vector partitioned into L blocks,z = (z1, . . . , zL), with zl ∈ Rnl , l ∈ [L], with n1 + . . .+nL = n, we denote by η(z) the L-vectorof norms:

η(z) := (‖z1‖p1 , . . . , ‖zL‖pL)>.(1.2)

Finally, any square, non-negative matrix M admits a real eigenvalue that is larger than themodulus of any other eigenvalue; this non-negative eigenvalue is the so-called Perron-Frobeniuseigenvalue [39], and is denoted λpf(M).

2. Well-Posedness and Composition.

2.1. Assumptions on the activation map. We restrict our attention to activation mapsφ that obey a “Blockwise LIPschitz” (BLIP) continuity condition. This condition is satisfiedfor most popular activation maps, and arises naturally when “composing” implicit models(see subsection 2.4). Precisely, we assume that:

1. Blockwise: the map φ acts in a block-wise fashion, that is, there exist a partition ofn: n = n1 + . . . + nL such that for every vector partitioned into the correspondingblocks: z = (z1, . . . , zL) with zl ∈ Rnl , l ∈ [L], we have φ(z) = (φ1(z1), . . . , φL(zL)) forappropriate maps φl : Rnl → Rnl , l ∈ [L].

2. Lipschitz: For every l ∈ [L], the maps φl are Lipschitz-continuous with constant γl > 0with respect to the lpl-norm for some integer pl ≥ 1:

∀ u, v ∈ Rnl : ‖φl(u)− φl(v)‖pl ≤ γl‖u− v‖pl .

In the remainder of the paper, we refer to such maps with the acronym BLIP, omitting thedependence on the underlying structure information (integers nl, pl, γl, l ∈ [L]). We shallconsider a special case, referred to a COmponentwise Non-Expansive (CONE) maps, whennl = 1, γl = 1, l ∈ [L]. Such CONE maps satisfy

∀ u, v ∈ Rn : |φ(u)− φ(v)| ≤ |u− v|,(2.1)

4 EL GHAOUI ET AL.

with inequality and absolute value taken componentwise. Examples of CONE maps includethe ReLU (defined as φ(·) = max(0, ·)) and its “leaky” variants, tanh, sigmoid, each ap-plied componentwise to a vector input. Our model also allows for maps that do not operatecomponentwise, such as the softmax function, which operates on a n-vector z as:

z → SoftMax(z) :=

(ezi∑i ezj

)i∈[n]

,(2.2)

The softmax map is 1-Lipschitz-continuous with respect to the l1-norm [21].

2.2. Well-posed matrices. We consider the prediction rule (1.1a) with input point u ∈ Rpand predicted output vector y(u) ∈ Rq. The equilibrium equation (1.1b) does not necessarilyhave a well-defined, unique solution x, as Figure 2 illustrates in a scalar case. In orderto prevent this, we assume that the n × n matrix A satisfies the following well-posednessproperty.

Definition 2.1 (Well-posedness property). The n×n matrix A is said to be well-posed for φ(in short, A ∈WP(φ)) if, for any n-vector b, the equation in x ∈ Rn:

(2.3) x = φ(Ax+ b)

has a unique solution.

Figure 2. Left: equation x = (ax + b)+ has two or no solutions, depending on the sign of b. Right: fora < 0, solution is unique for every b.

There are many classes of matrices that satisfy the well-posedness property. As seen next,strictly upper-triangular matrices are well-posed with respect to any activation map that actscomponentwise; such a class arises when modeling feedforward neural networks as implicitmodels, as seen in subsection 3.2.

2.3. Tractable sufficient conditions for well-posedness. Our goal now is to understandhow we can constrain A to have the well-posedness property, in a numerically tractable way.

A sufficient condition is based on the contraction mapping theorem. The following resultfocuses on the case when φ is componentwise non-expansive (CONE), as defined in subsec-tion 2.1.


Theorem 2.2 (PF sufficient condition for well-posedness with CONE activation). Assumethat φ is a componentwise non-expansive (CONE) map, as defined in subsection 2.1. Then, Ais well-posed with respect to φ if λpf(|A|) < 1, in which case, for any n-vector b, the solutionto the equation (2.3) can be computed via the fixed-point iteration

(2.4) x(0) = 0, x(t+ 1) = φ(Ax(t) + b), t = 0, 1, 2, . . . .

The theorem is a direct consequence of the global contraction mapping theorem [44, p.83].We provide a proof in Appendix A. A few remarks are in order.

Remark 2.3. The fixed-point iteration (2.4) has linear convergence; each iteration is amatrix-vector product, hence the complexity is comparable to that of a forward pass througha network of similar size.

Remark 2.4. The PF condition λpf(|A|) < 1 is not convex in A, but the convex condition‖A‖∞ < 1, is sufficient, in light of the bound ‖A‖∞ ≥ λpf(|A|).

Remark 2.5. The PF condition of Theorem 2.2 is conservative. For example, a triangularmatrix A is well-posed with respect to the ReLU and if only if diag(A) < 1, a consequence ofthe upcoming Theorem 2.8. The corresponding equilibrium equation can then be solved viathe backward recursion

xn =(bn)+

1−Ann, xi =

1

1−Aii(bi +

∑j>i

Aijxj)+, i = n− 1, . . . , 1.

Such a matrix does not necessarily satisfy the PF condition; we can have in particular A11 <−1, which implies λpf(|A|) > 1.

Remark 2.6. The well-posedness property is invariant under row and column permutation,provided φ acts componentwise. Precisely, if A is well-posed with respect to a componentwiseCONE map φ, then for any n × n permutation matrix P , PAP> is well-posed with respectto φ. The PF sufficient condition is also invariant under row and column permutations.

In some contexts, the map φ does not satisfy the componentwise non-expansiveness con-dition, but the weaker blockwise Lipschitz continuity (BLIP) defined in subsection 2.1. Theprevious theorem can be extended to this case, as follows. We partition the A matrix accord-ing to the tuple (n1, . . . , nL), into blocks Aij ∈ Rni×nj , 1 ≤ i, j ≤ L, and define a L×L matrixof induced norms, with elements for l, h ∈ [L] given by

(N(A))ij := ‖A‖pj→pi = maxξ‖Aξ‖pi : ‖ξ‖pj ≤ 1.(2.5)

Theorem 2.7 (PF sufficient condition for well-posedness for BLIP activation). Assume thatφ satisfies the BLIP condition, as defined in subsection 2.1. Then, A is well-posed with respectto φ if

λpf(ΓN(A)) < 1,

where Γ := diag(γ), with γ the vector of Lipschitz constants, and N(·) is the matrix of inducednorms defined in (2.5). In this case, for any n-vector b, the solution to the equation (2.3) canbe computed via the fixed-point iteration (2.4).

Proof. See Appendix B.

6 EL GHAOUI ET AL.

2.4. Composition of implicit models. Implicit models can be easily composed via matrixalgebra. Sometimes, the connection preserves well-posedness, thanks to the following result.

Theorem 2.8 (Well-posedness of block-triangular matrices, componentwise activation). As-sume that the activation map φ acts componentwise. The upper block-triangular matrix

A :=

(A11 A12

0 A22

)with Aii ∈ Rni×ni, i = 1, 2, is well-posed with respect to φ if and only if its the diagonal blocksA11, A22 are.

Proof. See Appendix C.

This result establishes the fact stated previously, that when φ is the ReLU, an upper-triangularmatrix A ∈ WP(φ) if and only if diag(A) < 1. A similar result holds with the lower block-triangular matrix

A :=

(A11 0A21 A22

),

where A12 ∈ Rn1×n2 is arbitrary. It is possible to extend this result to activation maps φ thatsatisfy the block Lipschitz continuity (BLIP) condition, in which case we need to assume thatthe partition of A into blocks is consistent with that of φ. As seen later, this feature arisesnaturally when composing implicit models from well-posed blocks.

Theorem 2.9 (Well-posedness of block-triangular matrices, blockwise activation). Assumethat the matrix A can be written as

A :=

(A11 A12

0 A22

)with Aii ∈ Rni×ni, i = 1, 2, and φ acts blockwise accordingly, in the sense that there exist twomaps φ1, φ2 such that φ((z1, z2)) = (φ1(z1), φ2(z2)) for every zi ∈ Rni, i = 1, 2. Then A iswell-posed with respect to φ if and only if for i = 1, 2, Aii is well-posed with respect to φi.

(A2 B2

C2 D2

)

φ2

y2(u1)

z2 x2 (A1 B1

C1 D1

)

φ1

u2 = y1(u1)u1

z1 x1

Figure 3. Cascade connection of two implicit models.

Using the above results, we can preserve well-posedness of implicit models via composi-tion. For example, given two models with matrix parameters (Ai, Bi, Ci, Di) and activationfunctions φi, i = 1, 2, we can consider a “cascaded” prediction rule:

y2 = C2x2 +D2u2 where u2 = y1 = C1x1 +D1u1, where xi = φi(Aixi +Biui), i = 1, 2.


The above rule can be represented as (1.1a), with x = (x2, x1), φ((z2, z1)) = (φ2(z2), φ1(z1))and (

A B

C D

)=

A2 B2C1 B2D1

0 A1 B1

C2 D2C1 D2D1

.

Due to Theorem 2.9, the cascaded rule is well-posed for the componentwise map with valuesφ(z1, z2) = (φ1(z1), φ2(z2)) if and only if each rule is.

A similar result holds if we put two or more well-posed models in parallel, and do a(weighted) sum the outputs. With the above notation, setting y(u1, u2) = y1(u1) + y2(u2)leads to a new implicit model that is also well-posed. Other possible connections includeconcatenation: y(u) = (y1(u), y2(u)), and affine transformations (a special case of cascadeconnection where one of the systems has no activation). We leave the details to the reader.

In both cascade and parallel connections, the triangular structure of the matrix A of thecomposed system ensures that the PF sufficient condition for well-posedness is satisfied forthe composed system if and only if it holds for each sub-system.

Multiplicative connections are in general are not Lipschitz-continuous, unless the inputsare bounded. Precisely, consider two activation maps φi that are Lipschitz-continuous withconstant γi and are bounded, with |φi(v)| ≤ ci for every v, i = 1, 2; then, the multiplicativemap

(u1, u2) ∈ R2 → φ(u) = φ1(u1)φ2(u2)

is Lipschitz-continuous with respect to the l1-norm, with constant γ := max{c2γ1, c1γ2}. Suchconnections arise in the context of attention units in neural networks, which use (bounded)activation maps such as tanh.

Figure 4. Feedback connection of two implicit models.

Finally, feedback connections are also possible. Consider two well-posed implicit systems:

yi = Cixi +Diui, xi = φi(Aixi +Bui), i = 1, 2.

8 EL GHAOUI ET AL.

Now let us connect them in a feedback connection: the combined system is described bythe implicit rule (1.1), where u1 = u + y2, u2 = y1 = y. The feedback system is also animplicit model of the form (1.1), with appropriate matrices (A,B,C,D), and activation mapacting blockwise: φ(z1, z2) = (φ1(z1), φ2(z2)) and state (x1, x2). In the simplified case whenD1 = D2 = 0, the feedback connection has the model matrix

(A B

C D

)=

A1 B1C2 B1

B2C1 A2 0

C1 0 0

.

Note that the connection is not necessarily well-posed when, even if A1 ∈ WP(φ1), A2 ∈WP(φ2).

2.5. Scaling implicit models. Assume that the activation map φ is componentwise non-expansive (CONE) and positively homogeneous, as is the ReLU, or its leaky version. Consideran implicit model of the form (1.1), and assume it satisfies the PF sufficient condition for well-posedness of Theorem 2.2: λpf(|A|) ≤ κ, where 0 ≤ κ < 1 is given. Then, there is anotherimplicit model with the same activation map φ, matrices (A′, B′, C ′, D′), which has the sameprediction rule (i.e. y′(u) = y(u), ∀u), and satisfies ‖A′‖∞ < 1.

This result is a direct consequence of the following expression of the PF eigenvalue as anoptimally scaled l∞-norm, known as the Collatz-Wielandt formula, see [39, p. 666]:

(2.6) λpf(|A|) = infS‖SAS−1‖∞ : S = diag(s), s > 0.

When the eigenvalue is simple, the optimal scaling vector is positive: s > 0, and the newmodel matrices are obtained by diagonal scaling:

(2.7)

(A′ B′

C ′ D′

)=

(SAS−1 SB

CS−1 D′

),

where S = diag(s), with s > 0 a Perron-Frobenius eigenvector of |A|.In a training problem, this result allows us to consider the convex constraint ‖A‖∞ < 1

in lieu of its Perron-Frobenius eigenvalue counterpart. This result also allows us to rescaleany given implicit model, such as one derived from deep neural networks, so that the normcondition is satisfied; we will exploit this in our robustness analyses section 4.

3. Implicit models of deep neural networks. A large number of deep neural networkscan be modeled as implicit models. Our goal here is to show how to build a well-posed implicitmodel for a given neural network, assuming the activation map satisfies the componentwisenon-expansiveness (CONE), or the more general block Lipschitz-continuity (BLIP) condition,as detailed in 2.1. Thanks to the composition rules of subsection 2.4, it suffices to modelindividual layers, since a neural network is just a cascade connection of such layers. The blockLipschitz structure then emerges naturally as the result of composing the layers.

We will find that the resulting models always have a strictly (block) upper triangularmatrix A, which automatically implies that these models are well-posed; in fact the equilibriumequation can be simply solved via backwards substitution. In turn, models with strictly


upper triangular structure also naturally satisfy the PF sufficient condition for well-posedness:for example, in the case of a componentwise non-expansive map φ, the matrix |A| arisingin Theorem 2.2 is also strictly upper triangular, and therefore all of its eigenvalues are zero.A similar result also holds for the case when φ is block Lipschitz continuous, as definedin subsection 2.1.

3.1. Basic operations.Activation at the output. It is common to have an activation at the output level, as in the

rule

(3.1) y(u) = φo(Cx+Du), x = φ(Ax+Bu),

with φo the output activation function. We can represent this rule as in (1.1), by introducinga new state variable: with φ := (φo, φ),(

zx

)= φ

((0 C0 A

)(zx

)+

(DB

)u

), y = z.

The rule is well-posed with respect to φ if and only if the matrix A is, with respect to φ.Bias terms and affine rules. The affine rule

y = Cx+Du+ d, x = φ(Ax+Bu+ b),

where d ∈ Rq, b ∈ Rn, is handled by simply appending a 1 at the end of the input vector u.A related transformation is useful with activation functions φ, such as the sigmoid, that

do not satisfy φ(0) = 0. Consider the rule

y = Cx+Du, x = φ(Ax+Bu).

Defining φ(·) := φ(·) − φ(0) and using the state vector x := x + φ(0), we may represent theabove rule as

y = Cx+(D −Cφ(0)

)(u1

), x = φ(Ax+

(B −Aφ(0)

)(u1

)).

Again, the rule is well-posed if and only if the matrix A is.Batch normalization. Batch normalization consists in including in the prediction rule a

normalization step using some estimates of input mean u and variance σ > 0 that are basedon a batch of training data. The parameters u, σ are given at test time. This step is a simpleaffine rule:

y = Du+ d, where [D, d] := diag(σ)−1[I,−u].

3.2. Fully connected feedforward neural networks. Consider the following predictionrule, with L > 1 fully connected layers:

(3.2) y(u) = WLxL, xl+1 = φl(Wlxl), x0 = u.

10 EL GHAOUI ET AL.

Here Wl ∈ Rnl+1×nl and φl : Rnl+1 → Rnl+1 , l = 1, . . . , L, are given weight matrices andactivation maps, respectively. We can express the above rule as (1.1a), with x = (xL, . . . , x1),and

(3.3)

(A B

C D

)=

0 WL−1 . . . 0 0

0. . .

......

. . . W1 00 W0

WL 0 . . . 0 0

,

and with an appropriately defined blockwise activation function φ, defined as operating on ann-vector z = (zL, . . . , z1) as φ(z) = (φL(zL), . . . , φ1(z1)). Due to the strictly upper triangularstructure of A, the system is well-posed, irrespective of A; in fact the equilibrium equationx = φ(Ax + Bu) is easily solved via backward block substitution, which corresponds to asimple forward pass through the network.

3.3. Convolutional layers and max-pooling. A single convolutional layer can be repre-sented as a linear map: y = Du, where u is the input, D is a matrix that represents the(linear) convolution operator, with a “constant-along-diagonals”, Toeplitz-like structure. Forexample a 2D convolution with a 2D kernel K takes a 3 × 3 matrix U and produces a 2 × 2matrix Y . With

U =

u1 u2 u3u4 u5 u6u7 u8 u9

, K =

(k1 k2k3 k4

),

the convolution can be represented as y = Du, with y, u vectors formed by stacking the rowsof U, Y together, and

D =

k1 k2 0 k3 k4 0 0 0 00 k1 k2 0 k3 k4 0 0 00 0 0 k1 k2 0 k3 k4 00 0 0 0 k1 k2 0 k3 k4

.

Often, a convolutional layer is combined with a max-pooling operation. The latter forms a

Figure 5. A max-pooling operation: the smaller image contains the maximal pixel values of each colored area.

down-sample of an image, which is a smaller image that contains the largest pixel values of


specific sub-areas of the original image. Such an operation can be represented as

yj = max1≤i≤h

(Bju)i, j ∈ [q].

In the above, p = qh, and the matrices Bj ∈ Rh×p, j ∈ [q], are used to select specific sub-areasof the original image. In the example of Figure 5, the number of pixels selected in each areais h = 4, the output dimension is q = 4, that of the input is p = qh = 16; vectorizing imagesrow by row:

B1

B2

B3

B4

= diag(M,M) ∈ R16×16, M :=

I2 0 0 00 0 I2 00 I2 0 00 0 0 I2

∈ R8×8,

where I2 is the 2× 2 identity matrix.Define the mapping φ : Rn → Rn, where n = p, as follows. For a p-vector z decomposed

in q blocks (z1, . . . , zq), we set φ(z1, . . . , zq) = (max(z1), . . . ,max(zq), 0, . . . , 0). (The paddedzeroes are necessary in order to make sure the input and output dimensions of φ are the same).We obtain the implicit model

y = Cφ(Bu) = Cx, where x := φ(Bu).

Here C is used to select the top q elements,

C =(Iq 0 . . . 0

), B :=

(B>1 . . . B>q

)>.

The Lipschitz constant of the max-pooling activation map φ, with respect to the l∞-norm, is1, hence it verifies the BLIP condition of subsection 2.1.

3.4. Residual nets. A building block in residual networks involves the relationship illus-trated in Figure 6. Mathematically, a residual block combines two linear operations, withnon-linearities in the middle and the end, and adds the input to the resulting output:

y = φ2(u+W2φ1(W1u)),

where W1,W2 are weight matrices of appropriate size. The above is a special case of theimplicit model (1.1): defining the blockwise map φ(z2, z1) = (φ2(z2), φ1(z1)),(

x2x1

)= φ

((0 W2

0 0

)(x2x1

)+

(IW1

)u

), y = x2.

Figure 7 displays the model matrix A for a 20-layer residual network. Convolutionallayers appear as matrix blocks with Toeplitz (constant along diagonal) structure; residualunit correspond to the straight lines on top of the blocks. The network only uses the ReLUmap, except for the last layer, which uses a softmax map.

12 EL GHAOUI ET AL.

Figure 6. Building block of residual networks. Figure 7. The matrix A for a 20-layer residualnetwork. Nonzero elements are colored in blue.

3.5. Recurrent units. Recurrent neural nets (RNNs) can be represented in an “unrolled”form as shown in [28], which is the perspective we will consider here. The input to an RNNblock is a sequence of vectors {u1, · · · , uT }, where for every t ∈ [T ], ut ∈ Rp. At each timestep, the network takes in a input ut and the previous hidden state ht−1 to produce the nexthidden state ht; the hidden state ht defines the state space or “memory” of the network. Therule can be written as,

ht = φH(WHht−1 +WIut), yt = φO(WOht), t = 1, . . . , T,(3.4)

Then, equations (3.4) can be expressed as an implicit model (3.1), with x = (hT , . . . , h0),u = (uT , . . . , u1), and weight matrices

(A B

C 0

)=

0 WH · · · 0 0 WI · · · 0 0

0 0 WH...

... 0 WI...

.... . .

. . . 0. . .

. . . 00 WH 0 WI

WO 0 · · · 0 0 0 · · · · · · 0

(3.5)

where A,B are strictly upper block triangular and share the same block diagonal sub-matrices,WH and WI respectively, shared across all hidden states and inputs.

Note that the above approach leads to matrices A,B,C,D with a special structure; duringtraining and test time, it is possible to exploit that structure in order to avoid “unrolling” therecurrent layers.

3.6. Multiplicative units: LSTM, attention mechanisms and variants. Some deep learn-ing network architectures use multiplicative units. As mentioned in subsection 2.4, multiplica-tion between variables in not Lipschitz-continuous in general, unless the inputs are bounded.Luckily, most of the multiplicative units used in practice involve bounded inputs.

In the case of Long Short-Term Memory (LSTM) [55] or gated recurrent units (GRUs)[15], a basic building block involves the product of two input variables after each one passes


through a bounded non-linearity. Precisely, the output takes the form

y = φ(u) := φ1(u1)φ2(u2),

where u1, u2 ∈ R, and φ1, φ2 are both bounded (scalar) non-linearities.Attention models [7] use componentwise vector multiplication, usually involving a softmax

operation (2.2):y = φ(u) = φ1(u1)� SoftMax(u2),

where φ1 is a bounded, componentwise Lipschitz-continuous activation map, such as the sig-moid; and W is a matrix of weights. The map φ is then Lipschitz-continuous with respect tothe l1-norm. As shown in Section 2.2, we can formulate these multiplicative units as well-posedimplicit models.

4. Robustness. In this section, our goal is to analyze the robustness properties of a givenimplicit model (1.1a). We seek to bound the state, output and loss function, under unknown-but-bounded inputs. This robustness analysis task is of interest in itself for diagnosis orfor generating adversarial attacks. It will also inform our choice of appropriate penaltiesor constraints in the training problem. We assume that the activation map is BlockwiseLipschitz, and that the matrix A of the implicit model satisfies the sufficient conditions forwell-posedness outlined in Theorem 2.7.

4.1. Input uncertainty models. We assume that the input vector is uncertain, and onlyknown to belong to a given set U ⊆ Rp. Our results apply to a wide variety of sets U ; we willfocus on the following two cases.

A first case corresponds to a box:

(4.1) Ubox :={u ∈ Rp : |u− u0| ≤ σu

}.

Here, the p-vector σu > 0 is a measure of componentwise uncertainty affecting each datapoint, while u0 corresponds to a vector of “nominal” inputs. The following variant limits thenumber of changes in the vector u:

(4.2) Ucard :={u ∈ Rp : |u− u0| ≤ σu, Card(u− u0) ≤ k

},

where Card denotes the cardinality (number of non-zero components) in its vector argument,and k < p is a given integer.

4.2. Box bounds on the state vector. Assume first that φ is a CONE map, and thatthe input is only known to belong to the box set Ubox (4.1). We seek to find componentwisebounds on the state vector x, of the form |x− x0| ≤ σx, with x and x0 the unique solution toξ = φ(Aξ +Bu) and ξ = φ(Aξ +Bu0) respectively, and σx > 0. We have

|x− x0| = |φ(Ax+Bu)− φ(Ax0 +Bu0)|≤ |A(x− x0) +B(u− u0)| ≤ |A||x− x0|+ |B(u− u0)|,

which shows in particular that

‖x− x0‖∞ ≤ ‖A‖∞‖x− x0‖∞ + ‖B(u− u0)‖∞,

14 EL GHAOUI ET AL.

hence, provided ‖A‖∞ < 1, we have:

(4.3) ‖x− x0‖∞ ≤‖|B|σu‖∞1− ‖A‖∞

.

With the cardinality constrained uncertainty set Ucard (4.2), we obtain

(4.4) ‖x− x0‖∞ ≤δ

1− ‖A‖∞, δ := max

1≤i≤nsk(σu � |B|>ei),

with ei the i-th unit vector in Rn, and sk the sum of the top k entries in a vector.We may refine the analysis above, with a “box” (componentwise) bound.

Theorem 4.1 (Box bound on state, CONE map). Assuming that φ is componentwise non-expansive, and λpf(|A|) < 1. then, I − |A| is invertible, and

(4.5) |x− x0| ≤ σx := (I − |A|)−1|B|σu.

Proof. See Appendix D.

Note that the box bound can be computed via (I − |A|)σx = |B|σu, as the limit point of thefixed-point iteration

σx(0) = 0, σx(t+ 1) = |A|σx(t) + |B|σu, t = 0, 1, 2, . . . ,

which converges since λpf(|A|) < 1.The case when φ is block Lipschitz (BLIP) involves the matrix of norms N(A) defined

in (2.5), as well as a related matrix defined for the input matrix B. Decomposing B intoblocks B = (Bli)l∈[L],i∈[p], with Bli ∈ Rnl , l ∈ [L], we define the L× p matrix of norms

N(B) := (‖Bli‖pl)l∈[L], i∈[p].(4.6)

Theorem 4.2 (Box bound on the vector norms of the state, BLIP map). Assuming that φ isBLIP, and the corresponding sufficient well-posedness condition of Theorem 2.7 is satisfied.Define Γ := diag(γ), where γ is the vector of Lipschitz constants. Then, I − ΓN(A) isinvertible, and

(4.7) η(x− x0) ≤ (I − ΓN(A))−1ΓN(B)σu,

where the vector of norms function η(·) is defined in (1.2).


4.3. Bounds on the output and the sensitivity matrix. The above allows us to analyzethe effect of effect of input noise on the output vector y. Let us assume that the functionφ satisfies the CONE (componentwise non-expansiveness) condition (2.1). In addition, weassume that the stronger condition for well-posedness, ‖A‖∞ < 1, is satisfied. (we can alwaysrescale the model so as to ensure that property, provided λpf(|A|) < 1.) For the implicitprediction rule (1.1), we then have

∀ u, u0 : ‖y(u)− y(u0)‖∞ ≤ κ‖u− u0‖∞,


where

(4.8) κ :=‖B‖∞‖C‖∞1− ‖A‖∞

+ ‖D‖∞.

The above shows that the prediction rule is Lipschitz-continuous, with a constant boundedabove by κ. This result motivates the use of the ‖ · ‖∞ norm as a penalty on model matricesA,B,C,D, for example via a convex penalty that bounds the Lipschitz constant,

(4.9) κ ≤ P (A,B,C,D) :=1

2

‖B‖2∞ + ‖C‖2∞1− ‖A‖∞

+ ‖D‖∞.

We can refine this analysis with the following theorem.

Theorem 4.3 (Box bound on the output, CONE map). Assuming that φ is componentwisenon-expansive, and λpf(|A|) < 1. Then, I − |A| is invertible, and

(4.10) ∀ u, u0 : |y(u)− y(u0)| ≤ S|u− u0|,

where the (non-negative) matrix

S := |C|(I − |A|)−1|B|+ |D|

is a “sensitivity matrix” of the implicit model with a CONE map.

Figure 8. Sensitivity matrix for a 3-layer neural network with q = 10 outputs and n = 1094 states.

The sensitivity matrix is a way to visualize the input-output properties of a given model.Figure 8 shows the sensitivity matrix for a given neural network for classification that hasq = 10 classes, with each row represented as an image. We observe that some classes havehigher sensitivity values than others, which can be used to focus an attack on the model.

Our refined analysis suggests a “natural” penalty to use during the training phase in orderto improve robustness, namely ‖S‖∞.

As before, the approach can be generalized to the case when φ is block Lipschitz (BLIP).Decomposing C into column blocks C = (C1, . . . , CL), with Cl ∈ Rq×nl , l ∈ [L], we define thematrix of (dual) norms

N(C) := (‖Cil‖p∗l )l∈[L], i∈[q],

where p∗l := 1/(1 − 1/pl), l ∈ [L]. Also recall the corresponding matrix of norms related toB (4.6).

16 EL GHAOUI ET AL.

Theorem 4.4 (Box bound on the output, BLIP map). Assuming that φ is a BLIP map, andthat the sufficient condition for well-posedness λpf(ΓN(A)) < 1 is satisfied, with Γ := diag(γ)containing the Lipschitz constants associated with φ. Then, I − ΓN(A) is invertible, and

(4.11) ∀ u, u0 : |y(u)− y(u0)| ≤ S|u− u0|,

where the (non-negative) q × p matrix

S := N(C)(I − ΓN(A))−1ΓN(B) + |D|

is a “block sensitivity matrix” of the implicit model with a BLIP map.


4.4. Worst-case loss function. In this section, we assume that the map φ satisfies theCONE property (2.1). We seek to find bounds on the loss function evaluated between agiven “target” y ∈ Rp and a prediction y, using the “box” bounds on the state and outputof subsection 4.2 and subsection 4.3.

Consider the case of squared Euclidean loss function, that is:

L(y, y) = ‖y − y‖22.

The worst-case loss can be computed based on the box bound (4.10), as

Lwc(y, y0) := max|y−y0|≤σy

L(y, y) = ‖|y − y0|+ σy‖22.

We can also consider the cross entropy loss function, that is:

L(y, y) = log(

q∑i=1

eyi)− y>y,

where y ≥ 0, 1>y = 1 is given.Assume that D = 0 for simplicity, so that y0 := Cx0 is the nominal output. Using the

norm bound (4.3), we have ‖x − x0‖∞ ≤ ρ, for an appropriate ρ > 0. The correspondingworst-case loss is:

Lwc(y, y0) := maxδx : ‖δx‖∞≤ρ

L(y, y0 + Cδx)

= maxδx : ‖δx‖∞≤ρ

log(

q∑i=1

ey0i +c

>i δx)− y>(y0 + Cδx),

with c>i the i-th row of C. The above may be hard to compute, but we can work with abound, based on evaluating the maximum above for each of the two terms independently. Weobtain

Lwc(y, y0) ≤ log(

q∑i=1

ey0i +ρ‖ci‖1)− y>y0 + ρ‖C>y‖1.


Since y ≥ 0 and 1>y = 1, we have ‖C>y‖1 ≤ ‖C‖∞, and we obtain the bound

L(y, y0) ≤ Lwc(y, y0) ≤ L(y, y0) + 2ρ‖C‖∞,

which again involves the norm ‖C‖∞ as a penalty. We can refine the analysis, based on thebox bound (4.10):

Lwc(y, y0) ≤ log(

q∑i=1

ey0i +σy,i)− y>y0 + y>σy.

4.5. LP relaxation for CONE maps. The previous bounds does not provide a direct wayto generate an adversarial attack, that is, a feasible point u ∈ U that has maximum impacton the state. In some cases it may be possible to refine our previous box bounds via anLP relaxation, which has the advantage of suggesting a specific adversarial attack. Here werestrict our attention to the ReLU activation: φ(z) = max(z, 0) = z+, which is a CONE map.

We consider the problem

(4.12) p∗ := maxx,u∈U

∑i∈[n]

fi(xi) : x = z+, z = Ax+Bu, |x− x0| ≤ σx,

where fi’s are arbitrary functions. Setting fi(ξ) = (ξ − x0i )2, i ∈ [n], leads to the problem offinding the largest discrepancy (measured in l2-norm) between x and x0; setting fi(ξ) = −ξ,i ∈ [n], finds the minimum l1-norm state. By construction, our bound can only improve onthe previous state bound, in the sense that

p∗ ≤∑i∈[n]

max|α|≤1

fi(x0i + ασx,i).

Our result is expressed for general sets U , based on the support function σU with values forb ∈ Rp given by

(4.13) σU (b) := maxu∈U

b>u.

Note that this function depends only on the convex hull of the set U . In the case of the boxset Ubox given in (4.1), there is a convenient closed-form expression:

σUbox(b) = b>u0 + σ>u |b|.

Likewise for the cardinality-constrained set Ucard defined in (4.2), we have

sUk(b) = maxu∈Ucard

b>u = b>u0 + sk(σu � |b|),

where sk is the sum of the top k entries of its vector argument—a convex function.The only coupling constraint in (4.12) is the affine equation, which suggests the following

relaxation.

18 EL GHAOUI ET AL.

Theorem 4.5 (LP bound on the state). An upper bound on the objective of problem (4.12)is given by

p∗ ≤ p := minλ

σU (B>λ) +∑i∈[n]

gi(λi, (A>λ)i),

where sU is the support function defined in (4.13), and gi, i ∈ [n], are the convex functions

gi : (α, β) ∈ R2 → gi(α, β) := maxζ : |ζ+−x0i |≤σx,i

fi(ζ+)− αζ + βζ+, i ∈ [n].

If the functions gi, i ∈ [n] are closed, we have the bidual expression

p := maxx, u∈U

−∑i∈[n]

g∗i (−(Ax+Bu)i, xi),

where g∗i is the conjuguate of gi, i ∈ [n].

Proof. See Appendix E.

Consider for example fi(ξ) = ciξ, i ∈ [n], where c ∈ Rn is given: the problem is

p∗ = maxx, u∈U

c>x : x = z+, z = Ax+Bu, |x− x0| ≤ σx.

Let x := (x0−σx)+, x := x0+σx, so that the constraint x ≥ 0, |x−x0| ≤ σx writes x ≤ x ≤ x.We have

gi(α, β) = maxζ

(β + ci)ζ+ − αζ : xi ≤ ζ+ ≤ xi

= maxζ+,ζ−

α(ζ− − ζ+) + (β + ci)ζ+ : ζ± ≥ 0, xi ≤ ζ+ ≤ xi

= max(xi(β − α+ ci), xi(β − α+ ci)) if α ≤ 0, +∞ otherwise.

This leads to the dual

p∗ ≤ p := minλ≥0

σU (B>λ) +∑i∈[n]

max(xi((A>λ)i + ci − λi), xi((A>λ)i + ci − λi).

For every i ∈ [n], the function gi is closed, with conjugate given by

g∗i (v, w) = maxα≤0, β

vα+ wβ −max(xi(β − α+ ci), xi(β − α+ ci))

= minτ : |τ |≤1

maxα≤0, β

vα+ wβ − τxi(β − α+ ci)− (1− τ)xi(β − α+ ci)

= w if − v ≤ w ∈ [xi, xi], +∞ otherwise.

We obtain a “natural” relaxation:

p∗ ≤ p = maxx, u∈U

c>x : x ≥ Ax+Bu, x ≥ 0, |x− x0| ≤ σx.


We consider a variant of the previous problem, where we seek a sparse adversarial attack:the cardinality of changes in the input is constrained in the set Ucard.

p∗ = maxx, u∈Ucard

c>x : x = z+, z = Ax+Bu, |x− x0| ≤ σx,

where Ucard is the set (4.2), with k ∈ [p] given. The relaxation expresses as

p∗ ≤ p := minλ

λ>Bu0 + sk(σu � |B>λ|) +∑i∈[n]

max(xi((A>λ)i + ci − λi), xi((A>λ)i + ci − λi).

The bidual writes

p∗ ≤ p = maxx, u

c>x : x ≥ Ax+Bu, x ≥ 0, |x− x0| ≤ σx,‖diag(σu)−1(u− u0)‖1 ≤ k, |u− u0| ≤ σu.

where diag(·) is a diagonal matrix formed with the entries of its vector argument. The aboveprovides a heuristic for a sparse adversarial attack, based on any vector u that is optimal forthe above.

4.6. SDP relaxations for maximal state discrepancy with ReLU activation. Assumeagain that φ is the ReLU, and consider the box uncertainty set Ubox (4.1). Our goal is to findthe largest state discrepancy, measured in the Euclidean norm:

p∗ = maxx, u∈Ubox

‖x− x0‖22 : x = φ(Ax+Bu), |x− x0| ≤ σx,(4.14)

Based on a quadratic representation of φ, we can arrive at a SDP relaxation of (4.14). Wefirst use the fact that for any two n-dimensional vectors x, z:

(4.15) x = φ(z)⇐⇒ x ≥ 0, x ≥ z, x� x ≤ x� z,

where � is the Hadamard (component-wise) product. Next, we use the fact that any boundthe form l ≤ z − z0 ≤ u where z, l, u are vectors can be rewritten as

l ≤ z − z0 ≤ u⇐⇒ (z − z0 − l)� (z − z0 − u) ≤ 0(4.16)

⇐⇒ z � z ≤ (2z0 + l + u)� z − (l + z0)� (u+ z0)(4.17)

These two reformulations allow us to represent (4.14) as a non-convex QCQP:

p∗ = maxx,u‖x− x0‖22(4.18)

s.t. x ≥ 0, x ≥ Ax+Bu, x� x ≤ x� (Ax+Bu)

x� x ≤ (2x0)� x− (−σx + x0)� (σx + x0)

u� u ≤ (2u0)� u− (−σu + u0)� (σu + u0)

We can derive an upper bound on the value of this QP by solving the Lagrange dual of theproblem, leading to a semidefinite program. However, it is a well known result of quadratic

20 EL GHAOUI ET AL.

optimization (for example see Appendix B of [10]) that the bidual of a non-convex QCQPsimply becomes the canonical rank relaxed version of (4.14) when we redefine the objectivein terms of a new variable Z = zz>, where z = [x, u, 1]>. Making this substitution andrelaxing the equality constraint into a semidefinite constraint, the bidual of (4.14) becomes

p∗∗ = maxX,x,u

Tr(X[1, 1])− 2x>x0 + ‖x0‖22

(4.19)

s.t.

Xxu

x> u> 1

� 0,

x ≥ 0, x ≥ Ax+Bu, diag(X[1, 1]) ≤ diag(AX[1, 1]) + diag(BX[2, 1]),

diag(X[2, 2]) ≤ (2u0) ◦ u− (−σu + u0) ◦ (σu + u0),

diag(X[1, 1]) ≤ (2x0) ◦ x− (−σx + x0) ◦ (σx + x0),

where used a rank relaxation involvingxu1

xu1

> =

xx> xu> xux> uu> ux> u> 1

=

Xxu

x> u> 1

By construction, (4.19) is convex. One immediate reason why we wish to solve the bidualas opposed to the Lagrange dual of (4.14) is that we are able to get a candidate state xand attack u, which can be viewed, respectively, as the perturbed state resulting from anadversarial attack on our network. This attack can be retrieved from the bidual in differentways. The first one would be to look at the optimal u and the second would be to do a rankone decomposition of X and extract u from it.

5. Sparsity and Model Compression. In this section, we examine the role of sparsity andlow-rank structure in implicit deep learning, specifically in the model parameter matrix

M :=

(A BC D

).

We assume that the activation map φ is a CONE map, as defined in subsection 2.1. Most ofour results can be generalized to BLIP maps.

Since a CONE map acts componentwise, the prediction rule (1.1a) is invariant underpermutations of the state vector, in the sense that, for any n × n permutation matrix, thematrix diag(P, I)M diag(P>, I) represents the same prediction rule as M given above.

Various kinds of sparsity of M can be encouraged in the training problem, with appropriatepenalties. For example, we can use penalties that encourage many elements in M to be zero;the advantage of such “element-wise” sparsity is, of course, computational, since sparsity inmatrices A,B,C,D will allow for computational speedups at test time. Another interestingkind of sparsity is rank sparsity, which refers to the case when model matrices are low-rank.


Next, we examine the benefits of row- (or, column-) sparsity, which refers to the fact thatentire rows (or, columns) of a matrix are zero. Note that column sparsity in a matrix N canbe encouraged with a penalty in the training problem, of the form

P(N) =∑i

‖Nei‖α

where α > 1. Row sparsity can be handled via P(N>).

5.1. Deep feature selection. We may use the implicit model to select features. Any zerocolumn in the matrix (B>, D>)> means that the corresponding element in an input vectordoes not play any role in the prediction rule. We may thus use a column-norm penalty in thetraining problem, in order to encourage such a sparsity pattern:

(5.1) P(B,D) =

p∑j=1

∥∥∥∥(BD)ej

∥∥∥∥α

,

with α > 1.

5.2. Dimension reduction via row- and column-sparsity. Assume that the matrix A isrow-sparse. Without loss of generality, using permutation invariance, we can assume that Mwrites

M =

A11 A12 B1

0 0 B2

C1 C2 D

,

where A11 is square of order n1 < n. We can then decompose x accordingly, as x = (x1, x2)with x1 ∈ Rn1 , and the above implies x2 = φ(B2u). The prediction rule for an input u ∈ Rpthen writes

y(u) = C1x1 +Du, x1 = φ(A11x1 +A12φ(B2u) +B1u).

The rule only involves x1 as a true hidden feature vector. In fact, the row sparsity of A allowsfor a computational speedup, as we simply need to solve a fixed-point equation for the vectorwith reduced dimensions, x1.

Further assume that (A,B) is row-sparse. Again without loss of generality we may putM in the above form, with B2 = 0. Then the prediction rule can be written

y(u) = C1x1 +Du, x1 = φ(A11x1 +B1u).

This means that the dimension of the state variable can be fully reduced, to n1 < n. Thus,row sparsity of (A,B) allows for a reduction in the dimension of the prediction rule.

Assume now that the matrix A is column-sparse. Without loss of generality, using per-mutation invariance, we can assume that M writes

M =

A11 0 B1

A21 0 B2

C1 C2 D

,

22 EL GHAOUI ET AL.

where A11 is square of order n1 < n. We can then decompose x accordingly, as x = (x1, x2)with x1 ∈ Rn1 . The above implies that the prediction rule for an input u ∈ Rp writes

y(u) = C1x1 + C2x2 +Du, x1 = φ(A11x1 +B1u), x2 = φ(A21x1 +B2u).

Thus, column-sparsity allows for a computational speedup, since x2 can be directly expressedas closed-form function of x1.

Now assume that (A>, C>)> is column-sparse. Again without loss of generality we mayput M in the above form, with C2 = 0. We obtain that the prediction rule does not need x2at all, so that the computation of the latter vector can be entirely avoided. This means thatthe dimension of the state variable can be fully reduced, to n1 < n. Thus, column sparsity of(A>, C>)> allows for a reduction in the dimension of the prediction rule.

To summarize, row or column sparsity of A allows for a computational speedup; if thecorresponding rows of B (resp. columns of C) are zero, then the prediction rule involves onlya vector of reduced dimensions.

5.3. Rank sparsity. Assume that the matrix A is rank k � n, and that a correspondingfactorization is known: A = LR>, with L,R ∈ Rn×k. In this case, for any p-vector u, theequilibrium equation x = φ(Ax + Bu) can be written as x = φ(Lz + Bu), where z := R>x.Hence, we can obtain a prediction for a given input u via the solution of a low-dimensionalfixed-point equation in z ∈ Rk:

z = R>φ(Lz +Bu).

It can be shown that, when φ is a CONE map, the above rule is well-posed if λpf(|R|T |L|) < 1.Once a solution z is found, we simply set the prediction to be y(u) = Cφ(Lz +Bu) +Du.

At test time, if we use fixed-point iterations to obtain our predictions, then the computa-tional savings brought about by the low-rank representation of A can be substantial, with aper-iteration cost going from O(n2), to O(kn) if we use the above.

5.4. Model error analysis. The above suggests to replace a given model matrix A with alow-rank or sparse approximation, denoted A0. The resulting state error can be then bounded,as follows. Assume that |A − A0| ≤ E, where E ≥ 0 is a known upper bound on thecomponentwise error in A. The following theorem, proved in Appendix F, provides relativeerror bounds on the state, provided the perturbed system satisfies the well-posedness conditionλpf(|A|) < 1. As before, we denote by x0, x the (unique) solutions to the unperturbed andperturbed equilibrium equations ξ = φ(A0ξ +Bu) and ξ = φ(Aξ +Bu), respectively.

Theorem 5.1 (Effect of errors in A). Assuming that φ is a CONE map, and that λpf(|A0 +E|) < 1. Then:

(5.2) |x− x0| ≤ (I − (|A0 + E|))−1Ex0.

Proof. See Appendix F.

6. Training Implicit Models.


6.1. Setup. We are now given an input data matrix U = [u1, . . . , um] ∈ Rp×m and re-sponse matrix Y = [y1, . . . , ym] ∈ Rq×m, and seek to fit a model of the form (1.1a), with Awell-posed with respect to φ, which we assume to be a BLIP map. We note that the rule (1.1a),when applied to a collection of inputs (ui)1≤i≤m, can be written in matrix form, as

Y (U) = CX +DU, where X = φ(AX +BU).

where U = [u1, . . . , um] ∈ Rp×m, and Y (U) = [y(u1), . . . , y(um)] ∈ Rq×m.We consider a training problem of the form

(6.1) minA,B,C,D,X

L(Y,CX +DU) + P(A,B,C,D) : X = φ(AX +BU), A ∈WP(φ).

In the above, L is a loss function, assumed to be convex in its second argument, and P is aconvex penalty function, which can be used to enforce a given (linear) structure (such as, Astrictly upper block triangular) on the parameters, and/or encourage their sparsity.

In practice, we replace the well-posedness condition by the sufficient condition of The-orem 2.7. As argued in subsection 2.5, the latter can be further replaced without loss ofgenerality with the easier-to-handle (convex) constraint; in the case of CONE maps, thiscondition is ‖A‖∞ ≤ κ, where κ ∈ (0, 1) is a hyper-parameter.

Examples of loss functions. For regression tasks, we may use the squared Euclidean loss:for Y, Y ∈ Rq×m,

L(Y, Y ) :=1

2‖Y − Y ‖2F .

For multi-class classification, a popular loss is a combination of negative cross-entropy withthe soft-max: for two q-vectors y, y, with y ≥ 0, y>1 = 1, we define

L(y, y) = −y> log

(ey∑qi=1 e

yi

)= log(

q∑i=1

eyi)− y>y.

We can extend the definition to matrices, by summing the contribution to all columns, eachcorresponding to a data point: for Y,Z ∈ Rq×m,

(6.2) L(Y, Y ) =

m∑j=1

log

(q∑i=1

eYij

)−

m∑j=1

q∑i=1

Yij Yij = log(1> exp(Y ))1−TrY >Y ,

where both the log and the exponential functions apply componentwise.Examples of penalty functions. Via an appropriate definition of P, we can make sure that

the model matrix A is well-posed with respect to φ, either imposing an upper triangularstructure for A, or via an l∞-norm constraint ‖A‖∞ < 1, or a Perron-Frobenius eigenvalueconstraint. Note that in the case of the CONE maps such as the ReLU, due to scale invarianceseen in subsection 2.5, we can always replace a Perron-Frobenius eigenvalue constraint with al∞-norm constraint.

Beyond well-posedness, the penalty can be used to encourage desired properties of themodel. For robustness, the convex penalty (4.9) can be used, provided we also enforce‖A‖∞ < 1. We may also use an l∞-norm bound on the sensitivity matrices S appearingin Theorem 4.3 or Theorem 4.4. For feature selection, an appropriate penalty may involvethe block norms (5.1); sparsity of the model matrices can be similarly handled with ordinaryl1-norm penalties on the elements of the model matrix M = (A,B,C,D).

24 EL GHAOUI ET AL.

Fenchel divergence formulation. Problem (6.1) can be equivalently written

minA,B,C,D,X

L(Y,CX +DU) + P(A,B,C,D) : Fφ(X,AX +BU) ≤ 0, A ∈WP(φ),

where Fφ is the so-called Fenchel divergence adapted to φ [24], applied column-wise to matrixinputs. In the case of the ReLU activation, for two given vectors x, z of the same size, we have

(6.3) Fφ(x, z) :=1

2x� x+

1

2z+ � z+ − x� z if x ≥ 0,

with � the componentwise multiplication. We can then define Fφ(X,Z) with X,Z two matrixinputs having the same number of columns, by summing over these columns. As seen in [24],a large number of popular activation maps can be expressed similarly.

We may also consider a relaxed form:

minA,B,C,D,X

L(Y,CX +DU) + P(A,B,C,D) + λ>Fφ(X,AX +BU), A ∈WP(φ),

where λ > 0 is a n-vector parameter.

6.2. Gradient Methods. Assuming that φ is a CONE map for simplicity, we consider theproblem

(6.4) minA,B,C,D,X

L(Y,CX +DU) + P(A,B,C,D) : X = φ(AX +BU), ‖A‖∞ ≤ κ,

where κ < 1 is given.We can solve (6.4) using (stochastic) projected gradient descent. The basic idea is to

update the model parameters M = (A,B,C,D) using stochastic gradients, by differentiatingthrough the equilibrium equation. It turns out that this differentiation requires solving anequilibrium equations involving a matrix variable, which can be very efficiently solved viafixed-point iterations. More precisely, the main effort in computing the gradient for a mini-batch of size say b consists in solving matrix equations in a n× b matrix V :

V = Ψ ◦(A>V + C>L

),

where the columns of L contains gradients of the loss with respect to y, and Ψ is a matrixwhose columns contain the derivatives of the activation map, both evaluated at a specifictraining point. Due to the fact taht A satisfies the PF sufficient condition for well-posednesswith respect to φ, the equation above has a unique solution; the matrix V can be computedas the limit point of the convergent fixed-point iterations

(6.5) V (t+ 1) = Ψ ◦(A>V (t) + C>L

), t = 0, 1, 2, . . .

The proof of the following result is given in Appendix G.

Theorem 6.1 (Computing gradients). Consider an implicit model (φ, A,B,C,D), with φa differentiable BLIP map. If A is well-posed with respect to φ, the gradients with respect tothe model parameter matrices A,B,C and D exist, and are uniquely given via the solution ofa fixed-point equation, as detailed in Appendix G.


In order to handle the well-posedness constraint ‖A‖∞ ≤ κ, the projected gradient methodrequires a projection at each step. This step corresponds to a sub-problem of the form:

(6.6) minA‖A−A0‖F : ‖A‖∞ ≤ κ,

with matrix A0 given. The above problem is decomposable across rows, and can be efficientlysolved using (vectorized) bisection, as detailed in Appendix H.

In some cases, it is desirable to obtain a (block) sparse model matrix M . One way toefficiently achieve sparsity is the use of a conditional gradient (Frank-Wolfe) algorithm thattypically only adds a few non-zeros elements to the model matrix M at each step. Whenupdating A for example, considering B,C,D fixed for simplicity, the method requires solvinga problem of the form

maxA

TrG>A : ‖A‖∞ ≤ κ,

where G is a gradient matrix. The problem can be solved in closed-form.

6.3. Block-coordinate descent methods. Block-coordinate descent methods use cyclicupdates, optimizing one matrix variable at a time, fixing all the other variables. These methodsare particularly interesting when the updates require solving convex problems. For instance,considering the training problem (6.4), given X, the problem is convex in the model matrixM . Then, given the model matrix M , finding X consists in a feasibility problem that can besolved with fixed-point iterations.

In the case of the Fenchel divergence problem relaxation, all the updates involve solvingconvex problems, as shown in [24], as the Fenchel divergence Fφ, for most of activation maps,is bi-convex in its two arguments—meaning that given the first argument fixed, it is convex inthe second and vice-versa. We refer to [24, 51] for more on Fenchel divergences in the contextof implicit deep learning and neural networks. We illustrate and give more detailed examplesof such methods in the next section.

7. Numerical experiments.

7.1. Learning real nonlinear functions via regression. We start by illustrating the abilityof the gradient method, as presented in subsection 6.2, to learn the parameters of the implicitmodel, with well-posedness enforced via a max-row-sum norm constraint, ‖A‖∞ ≤ 0.5. Weaim at learning a real function, as an example we focus on

f(u) = 5 cos(πu) exp(−1

2|u|),

We select the input ui, i ∈ {1, · · · ,m} uniformly at random between −5 and 5 with m = 200;we add a random noise to the output, y(u) = f(u) + w with w taken uniformly at randombetween −1 and 1, hence the standard deviation for y(u) is 1/

√(3) ' 0.57. We consider an

implicit model of order n = 75. We learn the parameters of the model by doing only twosuccessive block updates: first, we update (A,B) using stochastic projected gradient descent,the gradient being obtained with the implicit chain rule described in Appendix G. The RMSEacross iterations for this block-update is shown in Figure 10. After this first update weachieve a RMSE of 1.77. Second, we update (C,D) using linear regression. After this update

26 EL GHAOUI ET AL.

Figure 9. Implicit prediction y(u) comparison with f(u)

we achieve a RMSE of 0.56. For comparison purposes, we also train a neural network with 3hidden layers of width n/3 = 25 using ADAM, mini-batches, and a tuned learning rate. Werun Adam until convergence. We get a RMSE = 0.65, which is slightly above that of ourimplicit model. This first simple experiment shows the ability to fit nonlinear functions withimplicit models as illustrated in Figure 9.

Figure 10. RMSE across projected gradient iterations for the (A,B) block update

7.2. Comparison with neural networks. In that section we compare the performance ofimplicit models with that of neural networks. Experiments on both synthetic datasets and realdatasets are conducted. The experiment results show that implicit model has the potential ofmatching or exceeding the performance of neural networks in various settings. To simplify theexperiments, we do not apply any specific network structure or regularization during training.When it comes to training neural networks we will always use ADAM with a grid-searchtuned step-size. Similarly, we will always use stochastic projected gradient descent as detailedin Appendix G. The choice of the number of hidden features n for implicit models is always


aligned with the neural network architecture for fair comparison as explained in subsection 3.2.

7.2.1. Synthetic datasets. We first consider two synthetic classification datasets: onegenerated from a neural network and an other from an implicit model. For each dataset,we then aim at fitting both an implicit model and a neural network. Each data point inthe datasets contains an input u ∈ R5 (p = 5) and output y ∈ R2 (q = 2). The dataset isgenerated as follows.

We chose the input datapoints ui ∈ R5 i.i.d. by sampling each entry independentlyuniformly between −1 and 1. The output data yi ∈ R2 is the one-hot representation of theargmax of the output from the generating model, whose parameters are chosen uniformly atrandom between −1 and 1.

• For a neural network generating model, we us a fully connected 3-layer (5-5-5-2) feed-forward neural network with ReLU activation.• For an implicit model, we consider an implicit model with n = 10 to match the number

of hidden nodes in the neural network model. To ensure well-posedness, after samplingmatrix the A, we proceed with a projection on the unit-sized infinity norm ball. (SeeAppendix H).

For each dataset, we consider 20 training and test datapoints. The size of datasets arekept small to maintain the over-parameterized settings where the number of parameters islarger than the number of data points. Deep learning falls under the such regime [9]. Werun 5 separate experiments for both the neural network and implicit model generated datasetwith the training model having the same hyperparameters and initialization as the generatingmodels.

0 20 40 60 80 100epochs

0.0

0.2

0.4

0.6

0.8

1.0

accu

racy

implicit model test accuracyneural network test accuracy

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

loss

implicit model test lossneural network test loss

Figure 11. Performance comparison on a syn-thetic dataset generated from a neural network. Av-erage best accuracy, implicit: 0.85, neural network:0.76. The curves are generated from 5 the differentruns with the lines marked as mean and region markedas the standard deviation

0 20 40 60 80 100epochs

0.0

0.2

0.4

0.6

0.8

1.0

accu

racy


0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

loss


Figure 12. Performance comparison on a syn-thetic dataset generated from an implicit model. Av-erage best accuracy, implicit: 0.85, neural networks:0.74. The curves are generated from 5 different runswith the lines marked as mean and region marked asthe standard deviation over the runs.

We find that the implicit model outperforms neural networks in both synthetic exper-

28 EL GHAOUI ET AL.

iments. This may be explained by the increased modeling capacity of the implicit model,given similar parameter size, with respect to its neural network counterpart as mentionedin subsection 1.1.

7.2.2. MNIST dataset. In the digit classification dataset MNIST, the input data pointsare 28 × 28 pixels images of hand written digits, the output is the corresponding digit label.For training purposes, each image is reshaped into 784 dimensional vectors and normalizedbefore training. There are m = 5× 104 training data points and 104 testing data points.

The architecture we use for the neural network as a reference is a three-layer feedforwardneural network (784-60-40-10) with ReLU activation. For the implicit model, we set n = 100.We choose a batchsize of 100 for both algorithms. The accuracy with respect to iterations isshown in Figure 13. We observe that the accuracy of the implicit model matches that of itsneural network counterpart.

7.2.3. German Traffic Sign Recognition Benchmark. In the German Traffic Sign Recog-nition Benchmark (GTSRB) [49], the input data points are 32× 32 images with rgb channelsof traffic signs consisting of 43 classes. Each image is turned into gray-scale before being re-shaped into a 1024 dimensional vector and re-scaled to be between 0 and 1. There are 34,799training data points and 12,630 testing data points.

For reference, we use a (1024-300-100-43) feedforward neural network with ReLU activa-tions. We choose n = 400 for the implicit model. We choose a batchsize of 100. The accuracywith respect to iterations is shown in Figure 14.

0 2 4 6 8 10epochs

0.93

0.94

0.95

0.96

0.97

0.98

accu

racy


0.10

0.15

0.20

0.25

0.30

loss


Figure 13. Performance comparison on MNIST.Average best accuracy, implicit: 0.976, neural net-works: 0.972. The curves are generated from 5 dif-ferent runs with the lines marked as mean and regionmarked as the standard deviation over the runs.

0 2 4 6 8 10epochs

0.5

0.6

0.7

0.8

0.9

accu

racy


0.4

0.8

1.2

1.6

2.0

loss


Figure 14. Performance comparison on GTSRB.Average best accuracy, implicit: 0.874, neural net-works: 0.859. The curves are generated from 5 dif-ferent runs with the lines marked as mean and regionmarked as the standard deviation over the runs.

Similar to what we observed with synthetic datasets, the implicit model is capable ofmatching and even outperform classical neural networks.

7.3. Adversarial attack.


7.3.1. Attack via the sensitivity matrix. Our analysis above highlights the use of sensi-tivity matrix as a measure for robustness. In this section, we show how sensitivity matrix canbe used to generate effective attacks on two public datasets, MNIST and CIFAR-10. We com-pare our method against commonly used gradient-based attacks [22, 41]. In this experiment,we consider two models: 1) feed-forward network trained on the MNIST dataset (98% cleanaccuracy) and 2) ResNet-20 [27] trained on the CIFAR-10 dataset (92% clean accuracy).

Figure 15. Left: sensitivity values of a feed-forward network for the class “digit 0” in MNIST. Right:sensitivity values of a ResNet-20 model for the class “airplane” in CIFAR-10. Brighter colors correspond tohigher sensitivity when perturbed.

Figure 15 shows the sensitivity values for a particular class on MNIST and CIFAR-10dataset. The sensitivity values are obtain from the implicit representation of the feed-forwardnetwork and ResNet-20. The 784 and 3072 input dimensions are arranged to correspond tothe 28× 28 (grey scale) and 32× 32 (color) image pixel alignment. Brighter colors correspondto features with a higher impact on the output when perturbed. As a result, the sensitivitymatrix can be used to generate adversarial attacks.

We compare our method with commonly used gradient-based attacks. Precisely, for agiven function F (prediction rule) learned by a deep neural network, a benign sample u ∈ Rpand the target y associated with u, we compute the gradient of the function F with respectto the given sample u, ∇uF (u, y). We then take the absolute value of the gradient as anindication of which input features an adversary should perturb, similar to the saliency maptechnique [47, 41]. The absolute value of the gradient can be seen as a “local” version of thesensitivity matrix; however, unlike the gradient, the sensitivity matrix does not depend onthe input data, making it a more general measurement of robustness for any given model.

Table 1 presents the experimental results of an adversarial attack using the sensitivitymatrix and the absolute value of gradient on MNIST and CIFAR-10. For sensitivity matrixattack, we start from perturbing the input features that have the highest values according tothe sensitivity matrix. For gradient-based attack, we do the same according to the absolutevalue of the gradient. We perturb the input features into small random values. Our exper-iments show that the sensitivity matrix attack is as effective as the gradient-based attack,while being very simple to implement.

Interestingly, our attack does not rely on any input samples that the gradient-based attackneeds. An adversary with the model parameters could easily craft adversarial samples using

30 EL GHAOUI ET AL.

the sensitivity matrix. In the absence of access to the model parameters, an adversary can relyon the principle of tranferrability [35] and train a surrogate model to obtain the sensitivitymatrix. Figure 16 displays adversarial images generated using the sensitivity matrix. Aninteresting case is to use the sensitivity matrix to generate a sparse attack as seen in Figure 16.

Table 1Experimental results of attack success rate against percentage of perturbed inputs on MNIST and CIFAR-10

(10 000 samples from test set).

Sensitivity matrix attack Gradient-based attack% of perturbed inputs MNIST CIFAR-10 MNIST CIFAR-10

0.1% 1.01% 3.04% 2.42% 1.75%1% 13.41% 10.16% 26.92% 6.66%10% 70.67% 36.21% 74.90% 33.18%20% 89.82% 57.01% 87.10% 52.57%30% 90.22% 67.45% 89.82% 66.59%

Figure 16. Top: adversarial samples from MNIST. On the left are dense attacks with small perturbationsand on the right are sparse attacks with random perturbations (perturbed pixels are marked as red). Bottom:example sparse attack on CIFAR-10. The left ones are cleaned images, the middle ones are perturbed images,and the right ones mark the perturbed pixels in red for higher visibility.

7.3.2. Attack with LP relaxation for CONE maps. Although one can use sensitivitymatrix to generate an effective adversarial example, one may wish to perform a more sophis-ticated attack by exploiting the weakness of an individual data point. This can be done byconsidering the LP relaxation (Theorem 4.5), which has the advantage of generating a specificadversarial example for a given input data. The experiment in this section is again on MNISTand CIFAR-10 images. Here, the problem outlined in (4.12) is solved by LP relaxation, withthe function fi(ξ) = (ξ − x0i )2. The optimization problem then finds a perturbed image thatleads to the largest discrepancy between the perturbed state x and the nominal state x0. Fig-ure 17 shows five example images. The clean images are shown in the top row. The perturbed


images generated by the LP relaxation, as shown in the bottom row, appear quite similarto the naked eye; however, the model fails to predict these images correctly. The examplesshown here are all correctly predicted by the model when no perturbation is added.

Our framework also allows for sparse adversarial attack by adding a cardinality constraint.Figure 18 shows three examples of perturbed images under non-sparse and sparse attack.Images on the left are the results of non-sparse attack and those on the right are the resultsof sparse attack. The model fails to predict the label correctly under both conditions. Theseresults illustrate how the implicit prediction rules can be used to generate powerful adversarialattacks. It is also useful for adversarial training as a large amount of adversarial examplescan be generated using the technique and be added back to the training data.

Figure 17. Example attack on CIFAR dataset. Top: clean data. Bottom: perturbed data.

Figure 18. Example attack on MNIST dataset. Left: non-sparse attack. Right: sparse attack.

8. Prior Work.

8.1. In implicit deep learning. Recent works have considered versions of implicit models,and demonstrated their potential in deep learning. In particular, recent ground-breaking workby Kolter and collaborators [8, 29] demonstrated success of an entirely implicit framework,which they call Deep Equilibrium Models, for the task of sequence modeling. Paper [14] usesimplicit methods to solve and construct a general class of models known as neural ordinarydifferential equations, while [17] uses implicit models to construct a differentiable physicsengine that enables gradient-based learning and high sample efficiency. Furthermore, manypapers explore the concept of integrating implicit models with modern deep learning methodsin a variety of ways. For example, [52] show promise in integrating logical structures into deeplearning by incorporating a semidefinite programming (SDP) layer into a network in order to

32 EL GHAOUI ET AL.

solve a (relaxed) MAXSAT problem; see also [53]. In [1] the authors propose to include a modelpredictive control as a differentiable policy class for deep reinforcement learning, which canbe seen as a kind of implicit architecture. In [2] the authors introduced implicit layers wherethe activation is the solution of some quadratic programming problem; in [18], the authorsincorporate stochastic optimization formulation for end-to-end learning task, in which themodel is trained by differentiating the solution of a stochastic programming problem.

8.2. In lifted models. In implicit learning, there is usually no way to express the statevariable in closed-form, which makes the task of computing gradients with respect to modelparameters challenging. Thus, a natural idea in implicit learning is to keep the state vectoras a variable in the training problem, resulting in a higher-dimensional (or, “lifted”) expres-sion of the training problem. The idea of lifting the dimension of the training problem in(non-implicit) deep learning by introducing “state” variables has been studied in a variety ofworks; a non-extensive list includes [50], [5], [24], [56], [57], [12] and [34]. Lifted models aretrained using block coordinate descent methods, Alternating Direction Method of Multipliers(ADMM) or iterative, non-gradient based methods. In this work, we introduce a novel aspectof lifted models, namely the possibility of defining a prediction rule implicitly.

8.3. In robustness analysis. The issue of robustness in deep learning is generating quitea bit of attention, due to the fact that many deep learning models suffer from the lack ofrobustness. Prior relevant work have demonstrated that deep learning models are vulnerableto adversarial attacks [22, 30, 41, 31]. The vulnerability issue of deep learning models havemotivated the study of corresponding defense strategies [37, 42, 43, 23, 46, 54, 16]. However,many of the defense strategies are later shown to be ineffective [6, 11], suggesting the needsfor the theoretical understanding of robustness evaluations for deep learning model. In thiswork, we formalize the robustness analysis of deep learning via the lens of the implicit model.A large number of deep learning architectures can be modeled using implicit prediction rules,making our robustness evaluation a versatile analysis tool.

In [43], the authors arrive at a similar formulation (which we discuss below) but do notexplore the quality of the attacks that can be extracted by solving (4.19).

The SDP-based formulation given in subsection 4.6 parallels the formulation in [43] wherethe authors represent interval sets (l ≤ x ≤ u) as the quadratic constraint x◦x ≤ (l+u)◦x−l◦uand work on the rank-relaxed version of their problem to arrive at a SDP. Upon furtherinspection, our formulation is similar to, and therefore extends, the formulation proposed in[43] for feed-forward neural networks. Indeed, for a neural network, A is block strictly uppertriangular and (I − |A|)−1 has a closed form expression; in the case of three layers we have:

A =

0 W2 00 0 W1

0 0 0

, B =

00W0

,leading to

(I − |A|)−1|B| =

I |W2| |W2||W1|0 I |W1|0 0 I

00|W0|

=

|W2||W1||W0||W1||W0||W0|

.


The difference between our proposed SDP and the SDP in [43] is how the input-uncertaintyconstraints are dealt with at every layer. In [43], the authors strengthen their formulationby reducing the size of the uncertainty set at each layer. That is, they propagate the inputuncertainty and bound the size of the uncertainty at each layer of the neural network (seeSections 5.1 and 5.2 of [43]). As written, (4.19) does not include these constraints and hencecan be tightened by adding extra box-constraints as is done in the aforementioned reference.This allows to extend, to implicit models, the results that were derived for traditional feed-forward networks.

8.4. In sparsity, compression and deep feature selection. Sparsity and compression,which are well understood in classical settings, have found their place in deep learning andare an active branch of research. Early work in pruning dates back to as early as the 90s[33, 26] and has since gained interest. In [48], the authors showed that by randomly droppingunits (i.e. increasing the sparsity level of the network or compressing the network) reducesoverfitting and improved the generalization performance of networks. Recently, more sophis-ticated ways of pruning networks have been proposed, in an effort to reduce the overall size ofthe model, while retaining or accepting a modest decrease in accuracy: a non-extensive list ofworks include [58, 40, 25, 45, 3, 32, 13, 36, 20]. Sparsity and model compression is very closelytied with other techniques to reduce model size. This includes, but is not limited to, low-rankmatrix factorization, group sparsity regularization, quantization techniques and low precisionnetworks. While this sub-branch of research is far from being complete, sparsity and com-pression offer promise in the way of low memory footprints and generalization performancefor deep networks.

9. Concluding Remarks. We conclude with a few perspectives for future research onimplicit models.

Figure 19. Some cousins of implicit models: LTI systems (bottom left) and uncertain systems (bottom right).

9.1. Cousins of implicit models. Implicit models rely on a representation of the predictionrule where the linear operations are clearly separated from the (parameter-free) nonlinear ones,leading to the block-diagram in Figure 1.

This idea is strongly reminiscent of earlier representations arising in systems and controltheory. Perhaps the most famous example lies with linear time-invariant (LTI) systems, where

34 EL GHAOUI ET AL.

the idea is essentially equivalent to the state-space representation of transfer functions [4].In that context, the block-diagram representation involves an integrator (in continuous ordiscrete time) in place of the static nonlinear map φ. The idea is also connected to the linear-fractional representation (or, transformation, LFT) arising in uncertain systems [19], wherenow the linear operations actually represent a dynamic system (that is, the matrix M is anLTI system itself), and the map φ is replaced by an uncertain matrix of parameters or LTIsystem.

These connections raise the prospect of a unified theory tackling deep networks inside theloop of a dynamical system, where such networks can be composed (via feedback connections)with LTI or more generally uncertain systems. With the machinery of Linear Matrix In-equalities or the more general Integral Quadratic Constraints framework [38], one can developrigorous analyses performed on systems with deep networks in the loop, so that bounds on,say, stability margins can be computed.

9.2. Minimal representations. The prediction rule u→ y(u) provided by a given implicitmodel can be in general obtained by many alternative models—scaling the model matricesvia the diagonal scaling given in (2.6) illustrates this non-unicity. An interesting theoreticalstudy would focus on obtaining representations that are minimal from the point of viewof size, precisely the order n, which would extend the well-established theory of minimalrepresentations of LTI systems.

Acknowledgments. We would like to acknowledge useful comments and suggestions byMert Pilanci and Zico Kolter.

REFERENCES

[1] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter, Differentiable mpc for end-to-endplanning and control, in Advances in Neural Information Processing Systems, 2018, pp. 8289–8300.

[2] B. Amos and J. Z. Kolter, Optnet: Differentiable optimization as a layer in neural networks, inProceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org, 2017,pp. 136–145.

[3] S. Anwar, K. Hwang, and W. Sung, Structured pruning of deep convolutional neural networks, ACMJournal on Emerging Technologies in Computing Systems (JETC), 13 (2017), pp. 1–18.

[4] J. D. Aplevich, The essentials of linear state-space systems, Wiley New York, 2000.[5] A. Askari, G. Negiar, R. Sambharya, and L. E. Ghaoui, Lifted neural networks, arXiv preprint

arXiv:1805.01532, (2018).[6] A. Athalye, N. Carlini, and D. A. Wagner, Obfuscated gradients give a false sense of security:

Circumventing defenses to adversarial examples, in Proceedings of the 35th International Conferenceon Machine Learning, ICML 2018, Stockholmsmassan, Stockholm, Sweden, July 10-15, 2018, vol. 80of Proceedings of Machine Learning Research, PMLR, 2018, pp. 274–283.

[7] D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align andtranslate, arXiv preprint arXiv:1409.0473, (2014).

[8] S. Bai, J. Z. Kolter, and V. Koltun, Deep equilibrium models. Preprint submitted, 2019.[9] M. Belkin, D. Hsu, S. Ma, and S. Mandal, Reconciling modern machine-learning practice and the clas-

sical bias–variance trade-off, Proceedings of the National Academy of Sciences, 116 (2019), pp. 15849–15854.

[10] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge university press, 2004.[11] N. Carlini and D. A. Wagner, Adversarial examples are not easily detected: Bypassing ten detec-

tion methods, in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security,


AISec@CCS 2017, Dallas, TX, USA, November 3, 2017, ACM, 2017, pp. 3–14.[12] M. Carreira-Perpinan and W. Wang, Distributed optimization of deeply nested systems, in Proceed-

ings of the Seventeenth International Conference on Artificial Intelligence and Statistics, S. Kaski andJ. Corander, eds., vol. 33 of Proceedings of Machine Learning Research, Reykjavik, Iceland, 22–25Apr 2014, PMLR, pp. 10–19, http://proceedings.mlr.press/v33/carreira-perpinan14.html.

[13] S. Changpinyo, M. Sandler, and A. Zhmoginov, The power of sparsity in convolutional neuralnetworks, arXiv preprint arXiv:1702.06257, (2017).

[14] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, Neural ordinary differential equa-tions, in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle,K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., Curran Associates, Inc., 2018, pp. 6571–6583,http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf.

[15] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, andY. Bengio, Learning phrase representations using rnn encoder-decoder for statistical machine trans-lation, arXiv preprint arXiv:1406.1078, (2014).

[16] J. M. Cohen, E. Rosenfeld, and J. Z. Kolter, Certified adversarial robustness via randomized smooth-ing, 2019, https://arxiv.org/abs/arXiv:1902.02918.

[17] F. de Avila Belbute-Peres, K. Smith, K. Allen, J. Tenenbaum, and J. Z. Kolter, End-to-end differentiable physics for learning and control, in Advances in Neural Information Pro-cessing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, andR. Garnett, eds., Curran Associates, Inc., 2018, pp. 7178–7189, http://papers.nips.cc/paper/7948-end-to-end-differentiable-physics-for-learning-and-control.pdf.

[18] P. Donti, B. Amos, and J. Z. Kolter, Task-based end-to-end model learning in stochastic optimization,in Advances in Neural Information Processing Systems, 2017, pp. 5484–5494.

[19] G. E. Dullerud and F. Paganini, A course in robust control theory: a convex approach, vol. 36,Springer Science & Business Media, 2013.

[20] U. Evci, F. Pedregosa, A. Gomez, and E. Elsen, The difficulty of training sparse neural networks,arXiv preprint arXiv:1906.10732, (2019).

[21] B. Gao and L. Pavel, On the properties of the softmax function with application in game theory andreinforcement learning, arXiv preprint arXiv:1704.00805, (2017).

[22] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and harnessing adversarial examples, in3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May7-9, 2015, Conference Track Proceedings, 2015, http://arxiv.org/abs/1412.6572.

[23] S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. A.Mann, and P. Kohli, On the effectiveness of interval bound propagation for training verifiably robustmodels, CoRR, abs/1810.12715 (2018), http://arxiv.org/abs/1810.12715, https://arxiv.org/abs/1810.12715.

[24] F. Gu, A. Askari, and L. El Ghaoui, Fenchel lifted networks: A Lagrange relaxation of neural networktraining, arXiv preprint arXiv:1811.08039, (2018).

[25] S. Han, H. Mao, and W. J. Dally, Deep compression: Compressing deep neural networks with pruning,trained quantization and huffman coding, arXiv preprint arXiv:1510.00149, (2015).

[26] B. Hassibi, D. G. Stork, and G. J. Wolff, Optimal brain surgeon and general network pruning, inIEEE international conference on neural networks, IEEE, 1993, pp. 293–299.

[27] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, 2016 IEEEConference on Computer Vision and Pattern Recognition (CVPR), (2016), pp. 770–778.

[28] K. Kawakami, Supervised sequence labelling with recurrent neural networks, Ph. D. thesis, (2008).[29] J. Kolter, Personal communication with A. Askari. Aug. 2019.[30] A. Kurakin, I. J. Goodfellow, and S. Bengio, Adversarial examples in the physical world, in

5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings, OpenReview.net, 2017, https://openreview.net/forum?id=HJGU3Rodl.

[31] A. Kurakin, I. J. Goodfellow, and S. Bengio, Adversarial machine learning at scale, 2017, https://arxiv.org/abs/1611.01236.

[32] V. Lebedev and V. Lempitsky, Fast convnets using group-wise brain damage, in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2554–2564.

http://proceedings.mlr.press/v33/carreira-perpinan14.html

http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf

https://arxiv.org/abs/arXiv:1902.02918

http://papers.nips.cc/paper/7948-end-to-end-differentiable-physics-for-learning-and-control.pdf

http://papers.nips.cc/paper/7948-end-to-end-differentiable-physics-for-learning-and-control.pdf

http://arxiv.org/abs/1412.6572


https://arxiv.org/abs/1810.12715


https://openreview.net/forum?id=HJGU3Rodl

https://openreview.net/forum?id=HJGU3Rodl



36 EL GHAOUI ET AL.

[33] Y. LeCun, J. S. Denker, and S. A. Solla, Optimal brain damage, in Advances in neural informationprocessing systems, 1990, pp. 598–605.

[34] J. Li, C. Fang, and Z. Lin, Lifted proximal operator machines, in Proceedings of the AAAI Conferenceon Artificial Intelligence, vol. 33, 2019, pp. 4181–4188.

[35] Y. Liu, X. Chen, C. Liu, and D. Song, Delving into transferable adversarial examples and black-boxattacks, in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France,April 24-26, 2017, Conference Track Proceedings, OpenReview.net, 2017, https://openreview.net/forum?id=Sys6GJqxl.

[36] C. Louizos, M. Welling, and D. P. Kingma, Learning sparse neural networks through l 0 regulariza-tion, arXiv preprint arXiv:1712.01312, (2017).

[37] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, Towards deep learning modelsresistant to adversarial attacks, in International Conference on Learning Representations, 2018, https://openreview.net/forum?id=rJzIBfZAb.

[38] A. Megretski and A. Rantzer, System analysis via integral quadratic constraints, IEEE Transactionson Automatic Control, 42 (1997), pp. 819–830.

[39] C. D. Meyer, Matrix analysis and applied linear algebra, vol. 71, Siam, 2000.[40] S. Narang, E. Elsen, G. Diamos, and S. Sengupta, Exploring sparsity in recurrent neural networks,

arXiv preprint arXiv:1704.05119, (2017).[41] N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, The limitations

of deep learning in adversarial settings, in IEEE European Symposium on Security and Privacy,EuroS&P 2016, Saarbrucken, Germany, March 21-24, 2016, IEEE, 2016, pp. 372–387, https://doi.org/10.1109/EuroSP.2016.36, https://doi.org/10.1109/EuroSP.2016.36.

[42] N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami, Distillation as a defense to adversarialperturbations against deep neural networks, in IEEE Symposium on Security and Privacy, SP 2016,San Jose, CA, USA, May 22-26, 2016, IEEE Computer Society, 2016, pp. 582–597, https://doi.org/10.1109/SP.2016.41, https://doi.org/10.1109/SP.2016.41.

[43] A. Raghunathan, J. Steinhardt, and P. S. Liang, Semidefinite relaxations for certifying robustnessto adversarial examples, in Advances in Neural Information Processing Systems, 2018, pp. 10877–10887.

[44] S. Sastry, Nonlinear systems: analysis, stability, and control, vol. 10, Springer Science & Business Media,2013.

[45] A. See, M.-T. Luong, and C. D. Manning, Compression of neural machine translation models viapruning, arXiv preprint arXiv:1606.09274, (2016).

[46] U. Shaham, Y. Yamada, and S. Negahban, Understanding adversarial training: Increasing local stabil-ity of neural nets through robust optimization, (2015), https://doi.org/10.1016/j.neucom.2018.04.027,https://arxiv.org/abs/arXiv:1511.05432.

[47] K. Simonyan, A. Vedaldi, and A. Zisserman, Deep inside convolutional networks: Visualising imageclassification models and saliency maps, in 2nd International Conference on Learning Representations,ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, Y. Bengio andY. LeCun, eds., 2014, http://arxiv.org/abs/1312.6034.

[48] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: asimple way to prevent neural networks from overfitting, The journal of machine learning research, 15(2014), pp. 1929–1958.

[49] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, The german traffic sign recognition benchmark:a multi-class classification competition, in The 2011 international joint conference on neural networks,IEEE, 2011, pp. 1453–1460.

[50] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein, Training neural networkswithout gradients: A scalable ADMM approach, in International conference on machine learning, 2016,pp. 2722–2731.

[51] B. Travacca, S. Moura, and L. El Ghaoui, Implicit optimization: Models and methods [to be pub-lished, cdc 2020].

[52] P.-W. Wang, P. Donti, B. Wilder, and Z. Kolter, SATNet: Bridging deep learning and logicalreasoning using a differentiable satisfiability solver, in Proceedings of the 36th International Con-ference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., vol. 97 of Proceedings of

https://openreview.net/forum?id=Sys6GJqxl

https://openreview.net/forum?id=Sys6GJqxl

https://openreview.net/forum?id=rJzIBfZAb

https://openreview.net/forum?id=rJzIBfZAb

https://doi.org/10.1109/EuroSP.2016.36



https://doi.org/10.1109/SP.2016.41

https://doi.org/10.1109/SP.2016.41

https://doi.org/10.1109/SP.2016.41

https://doi.org/10.1016/j.neucom.2018.04.027




Machine Learning Research, Long Beach, California, USA, 09–15 Jun 2019, PMLR, pp. 6545–6554,http://proceedings.mlr.press/v97/wang19e.html.

[53] P.-W. Wang, P. L. Donti, B. Wilder, and Z. Kolter, Satnet: Bridging deep learning and logicalreasoning using a differentiable satisfiability solver, arXiv preprint arXiv:1905.12149, (2019).

[54] E. Wong and J. Z. Kolter, Provable defenses against adversarial examples via the convex outer ad-versarial polytope, 2017, https://arxiv.org/abs/arXiv:1711.00851.

[55] Y. Yu, X. Si, C. Hu, and J. Zhang, A review of recurrent neural networks: Lstm cells and networkarchitectures, Neural computation, 31 (2019), pp. 1235–1270.

[56] J. Zeng, T. T.-K. Lau, S. Lin, and Y. Yao, Global convergence of block coordinate descent in deeplearning, arXiv preprint arXiv:1803.00225, (2018).

[57] Z. Zhang and M. Brand, Convergent block coordinate descent for training Tikhonov regularized deepneural networks, in Proceedings of the 31st International Conference on Neural Information ProcessingSystems, NIPS’17, USA, 2017, Curran Associates Inc., pp. 1719–1728, http://dl.acm.org/citation.cfm?id=3294771.3294935.

[58] M. Zhu and S. Gupta, To prune, or not to prune: exploring the efficacy of pruning for model compres-sion, arXiv preprint arXiv:1710.01878, (2017).

http://proceedings.mlr.press/v97/wang19e.html


http://dl.acm.org/citation.cfm?id=3294771.3294935

http://dl.acm.org/citation.cfm?id=3294771.3294935

38 EL GHAOUI ET AL.

Appendix A. Proof of Theorem 2.2. Let b ∈ Rn. We first prove the existence of asolution ξ ∈ Rn to the equation ξ = φ(Aξ + b). Consider the Picard iteration (2.4). We havefor every t ≥ 1:

|x(t+ 1)− x(t)| = |φ(Ax(t) + b)− φ(Ax(t− 1) + b)| ≤ |A||x(t)− x(t− 1)|,

which implies that for every t, τ ≥ 0:

|x(t+ τ)− x(t)| ≤t+τ∑k=t

|A|k|x(1)− x(0)| ≤ |A|tτ∑k=0

|A|k|x(1)− x(0)| ≤ |A|tw,

where

w :=

+∞∑k=0

|A|k|x(1)− x(0)| = (I − |A|)−1|x(1)− x(0)|.

Here we exploited the fact that, due to λpf(|A|) < 1, I − |A| is invertible, and the seriesabove converges. Since limt→0 |A|t = 0, we obtain that x(t) is a Cauchy sequence, hence ithas a limit point, x∞. By continuity of φ we further obtain that x∞ = φ(Ax∞ + b), whichestablishes the existence of a solution.

To prove unicity, consider x1, x2 ∈ Rn+ two solutions to the equation. Using the hypothesesin the theorem, we have, for any k ≥ 1:

|x1 − x2| ≤ |A||x1 − x2| ≤ |A|k|x1 − x2|.

The fact that |A|k → 0 as k → +∞ then establishes unicity.

Appendix B. Proof of Theorem 2.7. We follow the steps taken in the proof of The-orem 2.2. Let b ∈ Rn. Our first step is to establish that for the Picard iteration (2.4), wehave

η(x(t+ 1)− x(t)) ≤Mη(x(t+ 1)− x(t))

for an appropriate non-negative matrix M ≥ 0. Here, η is a vector of norms, as definedin (1.2). For every l ∈ [L], t ≥ 1:

[η(x(t+ 1)− x(t))]l = ‖[φ(A(x(t) + b)− φ(Ax(t− 1) + b)]l‖pl= ‖φl((A(x(t) + b)l)− φl((Ax(t− 1) + b)l)‖pl≤ γl‖[A(x(t)− x(t− 1))]l‖pl= γl‖

∑h∈[L]

Alh(x(t)− x(t− 1))h‖pl

≤ γl∑h∈[L]

‖A‖pl→ph‖xh(t+ 1)− xh(t)‖ph = γl[Mη(x(t+ 1)− x(t))]l,

which establishes the desired bound, with M := diag(γ)N(A), where γ is the vector ofLipschitz constants, and N(·) is the L× L matrix of induced norms (2.5).


Assume now that λpf(M) < 1, as posited in the Theorem. Then I −M is invertible. Weproceed to prove existence by showing that the sequence of Picard iterates is Cauchy: forevery t, τ ≥ 0,

η(x(t+ τ)− x(t)) ≤t+τ∑k=t

Mkη(x(1)− x(0)) ≤M tτ∑k=0

Mkη(x(1)− x(0)) ≤M tw,

where

w :=

+∞∑k=0

Mkη(x(1)− x(0)) = (I −M)−1η(x(1)− x(0)).

The above proves the existence.To prove unicity, consider x1, x2 ∈ Rn+ two solutions to the equation. Using the hypotheses

in the theorem, we have, for any k ≥ 1:

η(x1 − x2) ≤Mη(x1 − x2) ≤Mkη(x1 − x2).

The fact that Mk → 0 as k → +∞ then establishes unicity.

Appendix C. Proof of Theorem 2.8. Express the equation x = φ(Ax+ b) as

x1 = φ(A11x1 +A12x2 + b1), x2 = φ(A22x2 + b2),

where b = (b1, b2), x = (x1, x2), with bi ∈ Rni , xi ∈ Rni , i = 1, 2. Here, since φ actscomponentwise, we use the same notation φ in the two equations.

Now assume that A11 and A22 are well-posed with respect to φ. Since A22 is well-posed forφ, the second equation has a unique solution x∗2; plugging x2 = x∗2 into the second equation,and using the well-posedness of A11, we see that the first equation has a unique solution inx1, hence A is well-posed.

To prove the converse direction, assume that A is well-posed. The second equation abovemust have a unique solution x∗2, irrespective to the choice of b2, hence A22 must be well-posed.To prove that A11 must be well-posed too, set b2 = 0, b1 arbitrary, leading to the system

x1 = φ(A11x1 +A12x2 + b1), x2 = φ(A22x2).

Since A22 is well-posed for φ, there is a unique solution x∗2 to the second equation; the firstequation then reads x1 = φ(A11x1 + b1 + A12x

∗2). It must have a unique solution for any b1,

hence A11 is well-posed.

Appendix D. Proofs of Theorem 4.1, Theorem 4.2 and Theorem 4.4. First let usprove Theorem 4.1. Due to the non-expansivity of the CONE map φ:

|x− x0| ≤ |A||x− x0|+ b,(D.1)

where b := |B|σu ≥ 0. Since λpf(|A|) < 1, the matrix I − |A| is invertible and its inverse iscomponentwise non-negative; for any n-vector b ≥ 0, we have

σ := (I − |A|)−1b =∑k≥0|A|kb ≥ 0.

40 EL GHAOUI ET AL.

Further, any n-vector σ such that σ ≤ |A|σ + b, we have, for every integer N :

σ ≤ |A|Nσ +N−1∑k=0

|A|kb ≤ |A|Nσ + σ.

Since λpf(|A|) < 1, we have |A|N → 0 as N → +∞, which yields σ ≤ σ. Applying this toσ = |x− x0| yields the desired result.

Now turn to the proof of Theorem 4.2. For every l ∈ [L]:

[η(x− x0)]l ≤ ‖[φ(A(x− x0) +B(u− u0)b)]l‖pl≤ γl‖

∑h∈[L]

Alh(x− x0)h‖pl + γl‖∑i∈[p]

Bli(u− u0)i‖pl

≤ γl[N(A)η(x− x0))]l + γl∑i∈[p]

‖Bli‖pl |u− u0|i

≤ γl[N(A)η(x− x0))]l + γl[N(B)|u− u0|]l,

which establishes the desired bound.The result of Theorem 4.4 is obtained similarly. For given i ∈ [q], we have

|y(u)− y(u0)|i ≤ |∑l∈[L]

Cil(x− x0)l|+ (|D||u− u0|)i

≤∑l∈[L]

‖Cil‖p∗l η(x− x0)l + (|D||u− u0|)i.

Appendix E. Proof of Theorem 4.5. We have

p∗ ≤ p := minλ

maxx, u∈U

∑i∈[n]

fi(xi) + λ>(Ax+Bu− z) : x = z+, |x− x0| ≤ σx

= minλ

maxu∈U

λ>Bu+ maxz : |z+−x0|≤σx

∑i∈[n]

(fi(z+i ) + (A>λ)iz

+i − λizi)

= minλ, µ=A>λ

(maxu∈U

λ>Bu

)+∑i∈[n]

gi(λi, µi),

which establishes the first part of the theorem. If we further assume that the functions gi,i ∈ [n] are closed, strong duality holds, so that

p = minλ, µ

maxx, u∈U

λ>Bu+ x>(A>λ− µ) +∑i∈[n]

gi(λi, µi)

= maxx, u∈CoU

−∑i∈[n]

g∗i (−(Ax+Bu)i, xi)

where g∗i is the conjugate of gi, i ∈ [n].


Appendix F. Proof of Theorem 5.1. We have

|x− x0| ≤ |Ax−Ax0| = |A0(x− x0) + E(x− x0) + Ex0| ≤ |A0 + E||x− x0|+ |E||x0|.

Applying a technique similar to that employed in the proof of Theorem 4.1, we obtain thedesired relative error bounds.

Appendix G. Gradient Equations. In this section, we detail gradient computationswith respect to the model parameter matrix M = (A,B,C,D), in the context of the trainingproblem 6.4. We assume that the map φ is BLIP with parameters nl, pl, γl, l ∈ [L], and thatA satisfies the corresponding PF well-posedness condition with respect to φ; we also assumethat the latter is differentiable.

For simplicity, we first consider mini-batches of size 1 (that is, X = x ∈ Rn and U = u ∈Rp), and we define y = Cx+Du, z = Ax+Bu. We wish to calculate

∇ML =

(∇AL ∇BL∇CL ∇DL

).

The difficult part are the terms ∇AL and ∇BL due to the presence of the equilibrium equalityconstraint. We deal with this via implicit differentiation. In order to establish the existenceof gradients we will need the following lemma.

Lemma G.1. Assume that the map φ is a BLIP map with parameters nl, pl, γl, l ∈ [L],

and that A is well-posed with respect to φ. Define the block-diagonal matrix Φ := ∂φ(z)∂z , then

the equation in the n × n matrix G: G = Φ(AG + I) has a unique solution, which can becomputed as the limit point of the recursion

(G.1) G(t+ 1) = Φ(AG(t) + I), t = 0, 1, 2, . . .

Proof. The result follows from the fact that if φ is BLIP with parameters nl, pl, γl, l ∈ [L],then Φ = diag(Φ1, . . . ,ΦL), where ‖Φl‖nl

< 1, l ∈ [L] and Theorem 2.7.

Calculating ∇AL. For a given index pair (j, k), we have

∂L∂Ajk

=∂L∂z· ∂Ax+Bu

∂Ajk=∂L∂z·∂∑

hAlhxh∂Ajk

=∂L∂zejxk =

[∇zL x>

]jk.

We calculate ∇zL via implicit differentiation:

∇zL =

(∂L∂x· ∂x∂z

)>,∂L∂x

=∂L∂y· ∂Cx+Du

∂x,

∂x

∂z=∂φ(z)

∂z+∂φ(Ax+Bu)

∂x· ∂x∂z

= (I − ΦA)−1Φ,

where Φ := ∂φ(z)∂z is a block diagonal matrix. Thanks to Lemma G.1, the matrix G :=

(I−ΦA)−1Φ exists and can be obtained via fixed-point iterations G.1. Note that the gradientof the loss function ∇yL can be easily computed, and we have

(G.2) ∇zL =(C(I − ΦA)−1Φ

)>∇yL = (CG)>∇yL.

42 EL GHAOUI ET AL.

Calculating the other gradients. The other gradients are easily computed, as seen next,where (j, k) denotes a generic index pair. From the above it follows that

∂L∂Bjk


∂Bjk=[∇zL u>

]jk

=⇒ ∇BL = ∇zL u>,

where ∇zL is given in G.2. Likewise,

∂L∂Cjk


∂Cjk=[∇yL x>

]jk

=⇒ ∇CL = ∇yL x>,

and

∂L∂Djk


∂Djk=[∇yL u>

]jk

=⇒ ∇DL = ∇yL u>.

Finally, the gradient with respect to input u is obtained as

∂L∂uj


∂uj+∂L∂y· ∂Cx+Du

∂uj=∂L∂z·Bej +

∂L∂y·Dej

=[B>∇zL+D>∇yL

]j

=⇒ ∇uL = B>∇zL+D>∇yL,

where ∇zL is given in G.2.To summarize, we have obtained the following closed form evaluation of gradients for

(A,B,C,D):

∇ML =

(∇AL ∇BL∇CL ∇DL

)=

(∇zL∇yL

)(xu

)>,

where ∇yL can be easily computed and ∇zL is given by the following equation:

∇zL = (CG)>∇yL, G = (I − ΦA)−1Φ = Φ(AG+ I).

Let v := ∇zL, we have

v =(C(I − ΦA)−1Φ

)>∇yL,which is equivalent to

(G.3) v = Φ(A>v + C>∇yL

).

Again (G.3) can be computed via the recursion (6.5). In practice we use (G.3) to compute∇zLwithout explicitly forming G via (G.1) where each iteration would involve a matrix-matrixproduct instead of a matrix-vector product in (G.3).


Mini-batch calculations. We may efficiently compute the gradient corresponding to a wholemini-batch of size greater than 1. In the case of componentwise maps φ, this is based onexpressing the equation (G.3) as

(G.4) V = Ψ ◦(A>V + C>L

),

where columns of L contains gradients of the loss with respect to y, and Ψ is a matrix thatcontains the derivatives of the activation, one column corresponding to the (diagonal) elementsof Φ, where Φ is defined in the previous section.

Appendix H. Projection on l∞ Matrix Norm Ball. We address problem (6.6), which wewrite as

p∗ := minA

1

2‖A−A0‖2F :

∑j∈[n]

|Aij | ≤ κ, i ∈ [n].

where A0 ∈ Rn×n is given. The problem is decomposable across the rows of the matricesinvolved, leading to n sub-problems of the form

mina

1

2‖a− a0i ‖22 : ‖a‖1 ≤ κ,

which a0i ∈ Rn the i-th row of A0.The problem cannot be solved in closed form, but a bisection method can be applied to

the dual:p∗ = max

λ≥0−κλ+

∑i∈[n]

si(λ),

where, for λ ≥ 0 given:

si(λ) := minξ

1

2(ξ − a0i )2 + λ|ξ|, i ∈ [n].

A subgradient of the objective is

gi(λ) := −κ+∑i∈[n]

max(|a0i | − λ, 0), i ∈ [n].

Observe that p∗ ≥ 0, hence at optimum:

0 ≤ λ ≤ 1

κ

∑i∈[n]

s(λ, a0i ) ≤ λmax :=1

2κ‖a0i ‖22.

The bisection can be initialized with the interval λ ∈ [0, λmax].Returning to the original problem (6.6), we see that all the iterations can be expressed

in a “vectorized” form, where updates for the different rows of A are done in parallel. Thedual variables corresponding to each row are collected in a vector λ ∈ Rn. We initialize thebisection with a vector interval [λl, λu], with λl = 0, λui = 1

2‖a0i ‖22/κ, i ∈ [n]. We update the

current vector interval as follows:

44 EL GHAOUI ET AL.

1. Set λ = (λl + λu)/2.2. Form a vector g(λ) containing the sub-gradients corresponding to each row, evaluated

at λi, i ∈ [n]:g(λ) = −κ1 + (|A0| − λ1T )T+1.

3. For every i ∈ [n], reset λui = λi if gi(λ) > 0, λli = λi if gi(λ) ≤ 0.

implicit deep learning - arxiv · network in order to solve a (relaxed) maxsat problem. in [16],...

Documents