
Algorithm Unrolling: Interpretable, Efficient Deep Learning for Signal and Image Processing

Vishal Monga, Senior Member, IEEE, Yuelong Li, Member, IEEE, and Yonina C. Eldar, Fellow, IEEE

Abstract—Deep neural networks provide unprecedented performance gains in many real world problems in signal and image processing. Despite these gains, future development and practical deployment of deep networks is hindered by their black-box nature, i.e., lack of interpretability, and by the need for very large training sets. An emerging technique called algorithm unrolling or unfolding offers promise in eliminating these issues by providing a concrete and systematic connection between iterative algorithms that are used widely in signal processing and deep neural networks. Unrolling methods were first proposed to develop fast neural network approximations for sparse coding. More recently, this direction has attracted enormous attention and is rapidly growing both in theoretic investigations and practical applications. The growing popularity of unrolled deep networks is due in part to their potential in developing efficient, high-performance and yet interpretable network architectures from reasonable size training sets. In this article, we review algorithm unrolling for signal and image processing. We extensively cover popular techniques for algorithm unrolling in various domains of signal and image processing including imaging, vision and recognition, and speech processing. By reviewing previous works, we reveal the connections between iterative algorithms and neural networks and present recent theoretical results. Finally, we provide a discussion on current limitations of unrolling and suggest possible future research directions.

Index Terms—Deep learning, neural networks, algorithm unrolling, unfolding, image reconstruction, interpretable networks.

I. INTRODUCTION

The past decade has witnessed a deep learning revolution. Availability of large-scale training datasets often facilitated by internet content, accessibility to powerful computational resources thanks to breakthroughs in microelectronics, and advances in neural network research such as the development of effective network architectures and efficient training algorithms have resulted in unprecedented successes of deep learning in innumerable applications of computer vision, pattern recognition and speech processing. For instance, deep learning has provided significant accuracy gains in image recognition, one of the core tasks in computer vision. Groundbreaking performance improvements have been demonstrated via AlexNet [1], and lower classification errors than human-level performance [2] were reported on the ImageNet dataset [3].

V. Monga is with the Department of Electrical Engineering, The Pennsylvania State University, University Park, PA 16802 USA. Email: [email protected]
Y. Li is with Sony US Research Center, San Jose, CA 95112 USA. Email: [email protected]
Y. C. Eldar is with the Weizmann Institute of Science, Rehovot, Israel. Email: [email protected]

In the realm of signal processing, learning based approaches provide an interesting algorithmic alternative to traditional model based analytic methods. In contrast to conventional iterative approaches, where the models and priors are typically handcrafted by analyzing the physical processes, deep learning approaches attempt to automatically discover model information and incorporate it by optimizing network parameters learned from real world training samples. Modern neural networks typically adopt a hierarchical architecture composed of many layers and comprise a large number of parameters (potentially millions), and are thus capable of learning complicated mappings that are difficult to design explicitly. When training data is sufficient, this adaptivity enables deep networks to often overcome model inadequacies, especially when the underlying physical scenario is hard to characterize precisely.

Another advantage of deep networks is that during inference, processing through the network layers can be executed very fast. Many modern computational platforms are highly optimized towards special operations such as convolutions, so that inference via deep networks is usually quite computationally efficient. In addition, the number of layers in a deep network is typically much smaller than the number of iterations required in an iterative algorithm. Therefore, deep learning methods have emerged to offer desirable computational benefits over the state-of-the-art in many areas of signal processing, imaging and vision. Their popularity has reached new heights with widespread availability of the supporting software infrastructure required for their implementation.

That said, a vast majority of deep learning techniques are purely data-driven and the underlying structures are difficult to interpret. Previous works largely apply general network architectures (some of which will be covered in Section II-A) towards different problems, and learn certain underlying mappings, such as classification and regression functions, completely through end-to-end training. It is therefore hard to discover what is learned inside the networks by examining the network parameters, which are usually of high dimensionality, and what the roles of individual parameters are. Interpretability is, of course, an important concern both in theory and practice. It is usually the key to conceptual understanding and advancing the frontiers of knowledge. Moreover, in areas such as medical applications and autonomous driving, it is crucial to identify the limitations and potential failure cases of designed systems, and here interpretability plays a fundamental role. Thus, lack of interpretability can be a serious limitation of conventional deep learning methods in contrast with model-based techniques with iterative algorithmic solutions, which are used widely in signal processing.

An issue that frequently arises together with interpretability is generalizability. It is well known that the practical success of deep learning is sometimes overly dependent on the quantity and quality of training data available. In scenarios where abundant high-quality training samples are unavailable, such as medical imaging [4], [5] and 3D reconstruction [6], the performance of deep networks may degrade significantly, and sometimes may even underperform traditional approaches. This phenomenon, formally called overfitting in machine learning terminology, is largely due to the employment of generic neural networks that are associated with a huge number of parameters. Without exploiting domain knowledge explicitly beyond time and/or space invariance, such networks are highly under-regularized and may overfit severely even under heavy data augmentation.

To some extent, this problem has recently been addressed via the design of domain enriched or prior information guided deep networks [7], [8], [9]. In such cases, the network architecture is designed to contain special layers that are domain specific, such as the transform layer in [8]. In other cases, prior structure of the expected output is exploited [8], [9] to develop training robustness. An excellent tutorial article covering these issues is [10]. Despite these achievements, transferring domain knowledge to network parameters can be a nontrivial task and effective priors may be hard to design. More importantly, the underlying network architectures in these approaches largely remain consistent with conventional neural networks. Therefore, the pursuit of interpretable, generalizable and high performance deep architectures for signal processing problems remains a highly important open challenge.

In the seminal work of Gregor and LeCun [11], a promising technique called algorithm unrolling (or unfolding) was developed that has helped connect iterative algorithms, such as those for sparse coding, to neural network architectures. Following this, the past few years have seen a surge of efforts that unroll iterative algorithms for many significant problems in signal and image processing: examples include (but are not limited to) compressive sensing [12], deconvolution [13] and variational techniques for image processing [14]. Figure 1 provides a high-level illustration of this framework. Specifically, each iteration of the algorithm step is represented as one layer of the network. Concatenating these layers forms a deep neural network. Passing through the network is equivalent to executing the iterative algorithm a finite number of times. In addition, the algorithm parameters (such as the model parameters and regularization coefficients) transfer to the network parameters. The network may be trained using back-propagation, resulting in model parameters that are learned from real world training sets. In this way, the trained network can be naturally interpreted as a parameter optimized algorithm, effectively overcoming the lack of interpretability in most conventional neural networks.

Traditional iterative algorithms generally entail significantly fewer parameters compared with popular neural networks. Therefore, the unrolled networks are highly parameter efficient and require less training data. In addition, the unrolled networks naturally inherit prior structures and domain knowledge rather than learn them from intensive training data. Consequently, the unrolled networks tend to generalize better than generic networks, and can be computationally faster as long as each algorithmic iteration (or the corresponding layer) is not overly expensive.

In this article, we review the foundations of algorithm unrolling. Our goal is to provide readers with guidance on how to utilize unrolling to build efficient and interpretable neural networks in solving signal and image processing problems. After providing a tutorial on how to unroll iterative algorithms into deep networks, we extensively review selected applications of algorithm unrolling in a wide variety of signal and image processing domains. We also review general theoretical studies that shed light on convergence properties of these networks, although further analysis is an important problem for future research. In addition, we clarify the connections between general classes of traditional iterative algorithms and deep neural networks established through algorithm unrolling. We contrast algorithm unrolling with alternative approaches and discuss their strengths and limitations. Finally, we discuss current limitations and open challenges, and suggest future research directions.

II. GENERATING INTERPRETABLE NETWORKS THROUGH ALGORITHM UNROLLING

We begin by describing algorithm unrolling. To motivate the unrolling approach, we commence with a brief review of conventional neural network architectures in Section II-A. We next discuss the first unrolling technique for sparse coding in Section II-B. We elaborate on general forms of unrolling in Section II-C.

A. Conventional Neural Networks

In early neural network research the Multi-Layer Perceptron (MLP) was a popular choice. This architecture can be motivated either biologically, by mimicking the human recognition system, or algorithmically, by generalizing the perceptron algorithm to multiple layers. A diagram illustration of an MLP is provided in Fig. 2a. The network is constructed through recursive linear and nonlinear operations, which are called layers. The units those operations act upon are called neurons, in analogy to the neurons in the human nervous system. The first and last layers are called the input layer and output layer, respectively.

A salient property of this network is that each neuron is fully connected to every neuron in the previous layer, except for the input layer. The layers are thus commonly called Fully-Connected layers. Analytically, in the $l$-th layer the relationship between the neurons $x_j^l$ and $x_i^{l+1}$ is expressed as

$$x_i^{l+1} = \sigma\left(\sum_j W_{ij}^{l+1} x_j^l + b_i^{l+1}\right), \qquad (1)$$

where $\mathbf{W}^{l+1}$ and $\mathbf{b}^{l+1}$ are the weights and biases, respectively, and $\sigma$ is a nonlinear activation function. We omit drawing activation functions and biases in Fig. 2a for brevity.

Fig. 1. A high-level overview of algorithm unrolling: given an iterative algorithm (left), a corresponding deep network (right) can be generated by cascading its iterations h. The iteration step h (left) is executed a number of times, resulting in the network layers h^1, h^2, . . . (right). Each iteration h depends on algorithm parameters θ, which are transferred into network parameters θ^1, θ^2, . . . . Instead of determining these parameters through cross-validation or analytical derivations, we learn θ^1, θ^2, . . . from training datasets through end-to-end training. In this way, the resulting network could achieve better performance than the original iterative algorithm. In addition, the network layers naturally inherit interpretability from the iteration procedure. The learnable parameters are colored in blue.

Popular choices of activation functions include the logistic function and the hyperbolic tangent function. In recent years, these have been superseded by the Rectified Linear Unit (ReLU) [15], defined by

$$\mathrm{ReLU}(x) = \max\{x, 0\}.$$

The W's and b's are generally trainable parameters that are learned from datasets through training, during which back-propagation [16] is often employed for gradient computation.
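To make (1) concrete, here is a minimal NumPy sketch of a fully-connected forward pass; the ReLU choice of σ, the layer sizes and all function names are illustrative assumptions, not taken from the article.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max{x, 0}, applied elementwise.
    return np.maximum(x, 0.0)

def mlp_forward(x, weights, biases):
    # Repeatedly apply Eq. (1): x^{l+1} = sigma(W^{l+1} x^l + b^{l+1}).
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)
    return x

# Tiny 3-4-2 example with random parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(mlp_forward(rng.standard_normal(3), weights, biases))
```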

Nowadays, MLPs are rarely seen in practical imaging and vision applications. The fully-connected nature of MLPs contributes to a rapid increase in their parameters, making them difficult to train. To address this limitation, LeCun [17] proposed to restrict neuron connections to local neighbors only and let weights be shared across different spatial locations. The linear operations then become convolutions (or correlations in a strict sense) and thus the networks employing such localizing structures are generally called Convolutional Neural Networks (CNN). A visual illustration of a CNN can be seen in Fig. 2b. With significantly reduced parameter dimensionality, training deeper networks becomes feasible. While CNNs were first applied to digit recognition, their translation invariance is a desirable property in many computer vision tasks. CNNs thus have become an extremely popular and indispensable architecture in imaging and vision, and outperform traditional approaches by a large margin in many domains. Today they continue to exhibit the best performance in many applications.

In applications such as speech recognition and video processing, where data arrives sequentially, Recurrent Neural Networks (RNN) [18] are a popular choice. RNNs explicitly model the data dependence across different time steps in the sequence and scale well to sequences with varying lengths. A visual depiction of an RNN is provided in Fig. 2c. Given the previous hidden state $\mathbf{s}^{l-1}$ and input variable $\mathbf{x}^l$, the next hidden state $\mathbf{s}^l$ is computed as

$$\mathbf{s}^l = \sigma_1\left(\mathbf{W}\mathbf{s}^{l-1} + \mathbf{U}\mathbf{x}^l + \mathbf{b}_1\right),$$

while the output variable $\mathbf{o}^l$ is generated by

$$\mathbf{o}^l = \sigma_2\left(\mathbf{V}\mathbf{s}^l + \mathbf{b}_2\right).$$

Here $\mathbf{U}, \mathbf{V}, \mathbf{W}, \mathbf{b}_1, \mathbf{b}_2$ are trainable network parameters and $\sigma_1, \sigma_2$ are activation functions. We again omit the activation functions and biases in Fig. 2c. In contrast to MLPs and CNNs, where the layer operations are applied recursively in a hierarchical representation fashion, RNNs apply the recursive operations as the time step evolves. A distinctive property of RNNs is that the parameters $\mathbf{U}, \mathbf{V}, \mathbf{W}$ are shared across all the time steps, rather than varying from layer to layer. Training RNNs can thus be difficult as the gradients of the parameters may either explode or vanish.
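The recurrence above can be sketched in a few lines; tanh and the identity are assumed choices for σ1 and σ2, and the names are ours.

```python
import numpy as np

def rnn_forward(xs, U, V, W, b1, b2, s0):
    # s^l = sigma1(W s^{l-1} + U x^l + b1), o^l = sigma2(V s^l + b2),
    # with U, V, W shared across all time steps.
    s, outputs = s0, []
    for x in xs:
        s = np.tanh(W @ s + U @ x + b1)   # sigma1 = tanh (assumed)
        outputs.append(V @ s + b2)        # sigma2 = identity (assumed)
    return outputs, s
```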

B. Unrolling Sparse Coding Algorithms into Deep Networks

The earliest work in algorithm unrolling dates back to Gregor et al.'s paper on improving the computational efficiency of sparse coding algorithms through end-to-end training [11]. In particular, they discussed how to improve the efficiency of the Iterative Shrinkage and Thresholding Algorithm (ISTA), one of the most popular approaches in sparse coding. The crux of this work is summarized in Fig. 3 and detailed in the box on Learned ISTA. Each iteration of ISTA comprises one linear operation followed by a non-linear soft-thresholding operation, which mimics the ReLU activation function. A diagram representation of one iteration step reveals its resemblance to a single network layer.

Fig. 2. Conventional neural network architectures that are popular in signal/image processing and computer vision applications: (a) Multi-Layer Perceptron (MLP) where all the neurons are fully connected; (b) Convolutional Neural Network (CNN) where the neurons are sparsely connected and the weight matrices $\mathbf{W}^l$, $l = 1, 2, \ldots, L$ effectively become convolution operators; (c) Recurrent Neural Network (RNN) where the inputs $\mathbf{x}^l$, $l = 1, 2, \ldots, L$ are fed in sequentially and the parameters $\mathbf{U}, \mathbf{V}, \mathbf{W}$ are shared across different time steps.

Thus, one can form a deep network by mapping each iteration to a network layer and stacking the layers together, which is equivalent to executing an ISTA iteration multiple times. Because the same parameters are shared across different layers, the resulting network resembles an RNN in terms of architecture.

After unrolling ISTA into a network, the network is trained using training samples through back-propagation. The learned network is dubbed Learned ISTA (LISTA). It turns out that significant computational benefits can be obtained by learning from real data. For instance, Gregor et al. [11] experimentally verified that the learned network reaches a specific performance level around 20 times faster than accelerated ISTA. Consequently, the sparse coding problem can be solved efficiently by passing through a compact LISTA network. From a theoretical perspective, recent studies [19], [20] have characterized the linear convergence rate of LISTA and further verified its computational advantages in a rigorous and quantitative manner. A more detailed exposition and discussion of related theoretical studies will be provided in Section IV-D.

In addition to ISTA, Gregor et al. discussed unrolling and optimizing another sparse coding method, the Coordinate Descent (CoD) algorithm [21]. The technique and implications of unrolled CoD are largely similar to LISTA.

C. Algorithm Unrolling in General

Although Gregor et al.'s work [11] focused on improving the computational efficiency of sparse coding, the same techniques can be applied to general iterative algorithms. An illustration is given in Fig. 4. In general, the algorithm repetitively performs certain analytic operations, which we represent abstractly as the h function. Similar to LISTA, we unroll the algorithm into a deep network by mapping each iteration into a single network layer and stacking a finite number of layers together. Each iteration step of the algorithm contains parameters, such as the model parameters or the regularization coefficients, which we denote by vectors θ^l, l = 0, . . . , L − 1. Through unrolling, the parameters θ^l correspond to those of the deep network, and can be optimized towards real-world scenarios by training the network end-to-end using real datasets.

While the motivation of LISTA was computational savings, proper use of algorithm unrolling can also lead to dramatically improved performance in practical applications. For instance, we can employ back-propagation to obtain coefficients of filters [13] and dictionaries [22] that are hard to design either analytically or even by handcrafting. In addition, custom modifications may be employed in the unrolled network [12]. As a particular example, in LISTA (see the box on Learned ISTA), the matrices W_t, W_e may be learned in each iteration, so that they are no longer held fixed throughout the network. By allowing a slight departure from the original iterative algorithms [11], [23] and extending the representation capacity, the performance of the unrolled networks may be boosted significantly.

Compared with conventional generic neural networks, unrolled networks generally contain significantly fewer parameters, as they encode domain knowledge through unrolling. In addition, their structures are more specifically tailored towards target applications. These benefits not only ensure higher efficiency, but also provide better generalizability, especially under limited training schemes [12]. More concrete examples will be presented and discussed in Section III.
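The general recipe of Fig. 4 can be sketched in a few lines: map one iteration h(·; θ) to a layer and stack L layers, each carrying its own parameters θ^l. The specific h below (a gradient step followed by a ReLU) and all names are illustrative assumptions only.

```python
import numpy as np

def unrolled_network(x0, y, thetas, h):
    # Feed-forward pass of an unrolled algorithm: x^{l+1} = h(x^l; theta^l),
    # where each entry of `thetas` holds the parameters of one layer.
    x = x0
    for theta in thetas:
        x = h(x, y, theta)
    return x

# Illustrative iteration: gradient step on 0.5*||y - A x||^2 followed by ReLU,
# with a per-layer step size playing the role of the learnable theta^l.
A = np.array([[1.0, 0.5], [0.2, 1.0]])
h = lambda x, y, step: np.maximum(x - step * A.T @ (A @ x - y), 0.0)
x_hat = unrolled_network(np.zeros(2), np.array([1.0, 2.0]), thetas=[0.5, 0.3, 0.2], h=h)
print(x_hat)
```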

III. UNROLLING IN SIGNAL AND IMAGE PROCESSING PROBLEMS

Algorithm unrolling has been applied to diverse application areas in the past few years. Table I summarizes representative methods and their topics of focus in different domains. Evidently, research in algorithm unrolling is growing and influencing a variety of high impact real-world problems and research areas. As discussed in Section II, an essential element of each unrolling approach is the underlying iterative algorithm it starts from, which we also specify in Table I.


Learned ISTA

The pursuit of parsimonious representation of signals has been a problem of enduring interest in signal processing. One of the most common quantitative manifestations of this is the well-known sparse coding problem [24]. Given an input vector $\mathbf{y} \in \mathbb{R}^n$ and an over-complete dictionary $\mathbf{W} \in \mathbb{R}^{n \times m}$ with $m > n$, sparse coding refers to the pursuit of a sparse representation of $\mathbf{y}$ using $\mathbf{W}$. In other words, we seek a sparse code $\mathbf{x} \in \mathbb{R}^m$ such that $\mathbf{y} \approx \mathbf{W}\mathbf{x}$ while encouraging as many coefficients in $\mathbf{x}$ to be zero (or small in magnitude) as possible. A well-known approach to determine $\mathbf{x}$ is to solve the following convex optimization problem:

$$\min_{\mathbf{x} \in \mathbb{R}^m} \frac{1}{2}\|\mathbf{y} - \mathbf{W}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1, \qquad (2)$$

where $\lambda > 0$ is a regularization parameter that controls the sparseness of the solution. A popular method for solving (2) is the family of Iterative Shrinkage and Thresholding Algorithms (ISTA) [25]. In its simplest form, ISTA performs the following iterations:

$$\mathbf{x}^{l+1} = \mathcal{S}_\lambda\left\{\left(\mathbf{I} - \frac{1}{\mu}\mathbf{W}^T\mathbf{W}\right)\mathbf{x}^l + \frac{1}{\mu}\mathbf{W}^T\mathbf{y}\right\}, \quad l = 0, 1, \ldots \qquad (3)$$

where $\mathbf{I} \in \mathbb{R}^{m \times m}$ is the identity matrix, $\mathcal{S}_\lambda(\cdot)$ is the soft-thresholding operator defined elementwise as

$$\mathcal{S}_\lambda(x) = \mathrm{sign}(x) \cdot \max\{|x| - \lambda, 0\}, \qquad (4)$$

and $\mu > 0$ is a parameter that controls the step size. Basically, each ISTA iteration is a gradient step on $\frac{1}{2}\|\mathbf{y} - \mathbf{W}\mathbf{x}\|_2^2$ followed by soft-thresholding, the proximal operator of the $\ell_1$ penalty.

As depicted in Fig. 3, the iteration (3) can be recast into a single network layer. This layer comprises a series of analytic operations (matrix-vector multiplication, summation, soft-thresholding), which is of the same nature as a neural network. Executing ISTA $L$ times can be interpreted as cascading $L$ such layers together, which essentially forms an $L$-layer deep network. In the unrolled network an implicit substitution of parameters has been made: $\mathbf{W}_t = \mathbf{I} - \frac{1}{\mu}\mathbf{W}^T\mathbf{W}$ and $\mathbf{W}_e = \frac{1}{\mu}\mathbf{W}^T$. While these substitutions generalize the original parametrization and expand the representation power of the unrolled network, recent theoretical studies [20] suggest that they may be inconsequential in an asymptotic sense, as the optimal network parameters admit a weight coupling scheme asymptotically.

The unrolled network is trained using real datasets to optimize the parameters $\mathbf{W}_t$, $\mathbf{W}_e$ and $\lambda$. The learned version, LISTA, may achieve higher efficiency compared to ISTA. It is also useful when $\mathbf{W}$ is not known exactly. Training is performed using a sequence of vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N \in \mathbb{R}^n$ and their corresponding ground-truth sparse codes $\mathbf{x}_1^*, \mathbf{x}_2^*, \ldots, \mathbf{x}_N^*$.

By feeding each $\mathbf{y}_n$, $n = 1, \ldots, N$ into the network, we retrieve its output $\mathbf{x}_n(\mathbf{y}_n; \mathbf{W}_t, \mathbf{W}_e, \lambda)$ as the predicted sparse code for $\mathbf{y}_n$. Comparing it with the ground-truth sparse code $\mathbf{x}_n^*$, the network training loss function is formed as

$$\ell(\mathbf{W}_t, \mathbf{W}_e, \lambda) = \frac{1}{N}\sum_{n=1}^{N}\left\|\mathbf{x}_n(\mathbf{y}_n; \mathbf{W}_t, \mathbf{W}_e, \lambda) - \mathbf{x}_n^*\right\|_2^2, \qquad (5)$$

and the network is trained through loss minimization, using popular gradient-based learning techniques such as stochastic gradient descent [16], to learn $\mathbf{W}_t$, $\mathbf{W}_e$ and $\lambda$. It has been shown empirically that the number of layers $L$ in (trained) LISTA can be an order of magnitude smaller than the number of iterations required for ISTA [11] to achieve convergence for a new observed input.
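The following NumPy sketch assembles the unrolled iteration (3) with the substitutions W_t = I − (1/µ)WᵀW and W_e = (1/µ)Wᵀ. It only shows the forward pass; in LISTA, W_t, W_e and λ would additionally be trained by minimizing (5) with a gradient-based optimizer, which is omitted here. All names and sizes are illustrative.

```python
import numpy as np

def soft_threshold(x, lam):
    # S_lambda(x) = sign(x) * max{|x| - lambda, 0}, Eq. (4).
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lista_forward(y, W_t, W_e, lam, num_layers):
    # L-layer forward pass: x^{l+1} = S_lambda(W_t x^l + W_e y), x^0 = 0.
    x = np.zeros(W_t.shape[0])
    for _ in range(num_layers):
        x = soft_threshold(W_t @ x + W_e @ y, lam)
    return x

# ISTA-style initialization of the layer parameters from a dictionary W.
rng = np.random.default_rng(0)
W = rng.standard_normal((20, 50))            # n = 20, m = 50 (over-complete)
mu = np.linalg.norm(W, 2) ** 2               # step-size parameter (Lipschitz constant)
W_t = np.eye(50) - (1.0 / mu) * W.T @ W
W_e = (1.0 / mu) * W.T

x_true = np.sign(rng.standard_normal(50)) * (rng.random(50) < 0.1)  # sparse code
x_hat = lista_forward(W @ x_true, W_t, W_e, lam=0.05, num_layers=16)
```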

A. Applications in Computational Imaging

Computational imaging is a broad area covering a wide range of interesting topics, such as computational photography, hyperspectral imaging, and compressive imaging, to name a few. The key to success in many computational imaging areas frequently hinges on solving an inverse problem. Model-based inversion has long been a popular approach. Examples of model-based methods include parsimonious representations such as sparse coding and low-rank matrix pursuit, variational methods and conditional random fields. The employment of model-based techniques gives rise to many iterative approaches, as closed-form solutions are rarely available. The fertile ground of these iterative algorithms in turn provides a solid foundation and offers many opportunities for algorithm unrolling.

Single image super-resolution is an important topic in computational imaging that focuses on improving the spatial resolution of a single degraded image. In addition to offering images of improved visual quality, super-resolution also aids diagnosis in medical applications and promises to improve the performance of recognition systems. Compared with naive bicubic interpolation, there exists large room for performance improvement by exploiting natural image structures, such as learning dictionaries to encode local image structures into sparse codes [36]. Significant research effort has been devoted towards structure-aware approaches. Wang et al. [22] applied LISTA (which we discussed in Section II-B) to patches extracted from the input image, and recombined the predicted high-resolution patches to form a predicted high-resolution image. A diagram depiction of the entire network architecture, dubbed Sparse Coding based Network (SCN), is provided in Fig. 5. Trainable parameters of SCN include those of LISTA (W_t, W_e, λ) and the dictionary D. By integrating the patch extraction and recombination layers into the network, the whole network resembles a CNN as it also performs patch-by-patch processing.


[Fig. 3, left panel: ISTA pseudo-code — for l = 0, 1, . . . , L−1: x^{l+1} = S_λ((I − (1/µ)WᵀW)x^l + (1/µ)Wᵀy); middle: a single network layer with W_t = I − (1/µ)WᵀW and W_e = (1/µ)Wᵀ; right: the unrolled deep network obtained by stacking such layers.]

Fig. 3. Illustration of LISTA: one iteration of ISTA executes a linear and a non-linear operation and thus can be recast into a network layer; by stacking the layers together a deep network is formed. The network is subsequently trained using paired inputs and outputs by back-propagation to optimize the parameters W_e, W_t and λ. The trained network, dubbed LISTA, is computationally more efficient compared with the original ISTA. The trainable parameters in the network are colored in blue. For details, see the box on Learned ISTA. In practice, W_e, W_t, λ may vary in each layer.

[Fig. 4, left panel: generic iterative algorithm — for l = 0, 1, . . . , L−1: x^{l+1} ← h(x^l; θ^l); right panel: the unrolled deep network x^0 → h(·; θ^0) → · · · → h(·; θ^{L−1}) → x^L.]

Fig. 4. Illustration of the general idea of algorithm unrolling: starting with an abstract iterative algorithm, we map one iteration (described as the function h parametrized by θ^l, l = 0, . . . , L−1) into a single network layer, and stack a finite number of layers together to form a deep network. Feeding the data forward through an L-layer network is equivalent to executing the iteration L times (finite truncation). The parameters θ^l, l = 0, 1, . . . , L−1 are learned from real datasets by training the network end-to-end to optimize the performance. They can either be shared across different layers or vary from layer to layer. The trainable parameters are colored in blue.

[Fig. 5 pipeline: input image y → patch extraction → LISTA sub-network → sparse code α → recovered patch z = Dα → patch recombination → recovered image x.]

Fig. 5. Illustration of the SCN [22] architecture: the patches extracted from the input low-resolution image y are fed into a LISTA sub-network to estimate the associated sparse codes α, and then high-resolution patches are reconstructed through a linear layer. The predicted high-resolution image x is formed by putting these patches into their corresponding spatial locations. The whole network is trained by forming low-resolution and high-resolution image pairs, using the standard stochastic gradient descent algorithm. The high-resolution dictionary D (colored in blue) and the LISTA parameters are trainable from real datasets.

Fig. 6. Sample experimental results from [22] for visual comparison on single image super-resolution. Groundtruth images are in the top row. The second to the bottom rows include results from [34], [35] and [22], respectively. These include a state-of-the-art iterative algorithm as well as a deep learning technique. Note that the magnified portions show that SCN better recovers sharp edges and spatial details.

[Fig. 7 pipeline: blurred image y → Stage 1 (feature extraction → kernel estimation → image estimation) → Stage 2 → Stage 3 → recovered image x.]

Fig. 7. Illustration of the network architecture in [28]: the network is formed by concatenating multiple stages of essential blind image deblurring modules. Stages 2 and 3 repeat the same operations as stage 1, with different trainable parameters. From a conceptual standpoint, each stage imitates one iteration of a typical blind image deblurring algorithm. The training data can be formed by synthetically blurring sharp images to obtain their blurred versions.


TABLE I. SUMMARY OF RECENT METHODS EMPLOYING ALGORITHM UNROLLING IN PRACTICAL SIGNAL PROCESSING AND IMAGING APPLICATIONS.

Reference | Year | Application domain | Topics | Underlying Iterative Algorithms
Hershey et al. [26] | 2014 | Speech processing | Single channel source separation | Non-negative matrix factorization
Wang et al. [22] | 2015 | Computational imaging | Image super-resolution | Coupled sparse coding with iterative shrinkage and thresholding
Zheng et al. [27] | 2015 | Vision and recognition | Semantic image segmentation | Conditional random field with mean-field iteration
Schuler et al. [28] | 2016 | Computational imaging | Blind image deblurring | Alternating minimization
Chen et al. [14] | 2017 | Computational imaging | Image denoising, JPEG deblocking | Nonlinear diffusion
Jin et al. [23] | 2017 | Medical imaging | Sparse-view X-ray computed tomography | Iterative shrinkage and thresholding
Liu et al. [29] | 2018 | Vision and recognition | Semantic image segmentation | Conditional random field with mean-field iteration
Solomon et al. [30] | 2018 | Medical imaging | Clutter suppression | Generalized ISTA for robust principal component analysis
Ding et al. [31] | 2018 | Computational imaging | Rain removal | Alternating direction method of multipliers
Wang et al. [32] | 2018 | Speech processing | Source separation | Multiple input spectrogram inversion
Yang et al. [12] | 2019 | Medical imaging | Magnetic resonance imaging, compressive imaging | Alternating direction method of multipliers
Li et al. [33] | 2019 | Computational imaging | Blind image deblurring | Half quadratic splitting

The network is trained by pairing low-resolution and high-resolution images and minimizing the Mean-Square-Error (MSE) loss¹. In addition to higher visual quality and a PSNR gain of 0.3dB to 1.6dB over the state-of-the-art, the network is faster to train and has a reduced number of parameters. Fig. 6 provides sample visual results from SCN and several state-of-the-art techniques.
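As a rough illustration of the SCN pipeline in Fig. 5, the sketch below extracts overlapping patches, runs a LISTA-style sub-network on each, reconstructs patches as z = Dα and recombines them by overlap averaging. It works at a single resolution and omits the feature extraction and upsampling details of [22]; all names, patch sizes and parameters are assumptions.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lista_patch(y, W_t, W_e, lam, num_layers=8):
    # LISTA sub-network applied to one vectorized patch (Section II-B).
    x = np.zeros(W_t.shape[0])
    for _ in range(num_layers):
        x = soft_threshold(W_t @ x + W_e @ y, lam)
    return x

def scn_like_pipeline(img, W_t, W_e, D, lam, patch=6, stride=3):
    # Patch extraction -> sparse code alpha -> patch z = D alpha -> recombination.
    # Assumes a float-valued image; uncovered border pixels remain zero.
    out = np.zeros_like(img)
    weight = np.zeros_like(img)
    for i in range(0, img.shape[0] - patch + 1, stride):
        for j in range(0, img.shape[1] - patch + 1, stride):
            y = img[i:i + patch, j:j + patch].ravel()
            alpha = lista_patch(y, W_t, W_e, lam)
            z = (D @ alpha).reshape(patch, patch)
            out[i:i + patch, j:j + patch] += z
            weight[i:i + patch, j:j + patch] += 1.0
    return out / np.maximum(weight, 1e-12)
```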

Another important application focusing on improving the quality of degraded images is blind image deblurring. Given a sharp image blurred by an unknown function, which is usually called the blur kernel or Point Spread Function, the goal is to jointly estimate both the blur kernel and the underlying sharp image. There is a wide range of approaches in the literature to blind image deblurring: the blur kernel can be of different forms, such as Gaussian, defocusing and motion. In addition, the blur kernel can be either spatially uniform or non-uniform.

Blind image deblurring is a challenging topic because the blur kernel is generally of low-pass nature, rendering the problem highly ill-posed. Most existing approaches rely on extracting stable features (such as salient edges in natural images) to reliably estimate the blur kernels. The sharp image can be retrieved subsequently based on the estimated kernels. Schuler et al. [28] review many existing algorithms and note that they essentially iterate over three modules: 1) feature extraction, 2) kernel estimation and 3) image recovery.

Therefore, a network can be built by unrolling and concatenating several layers of these modules, as depicted in Fig. 7. More specifically, the feature extraction module is modeled as a few layers of convolutions and rectifier operations, which basically mimics a small CNN, while both the kernel and image estimation modules are modeled as least-squares operations. To train the network, the blur kernels are simulated by sampling from Gaussian processes, and the blurred images are synthetically created by blurring each sharp image through a 2D discrete convolution.

¹The terminology "MSE loss" refers to the empirical squared loss, which is an estimate of the statistical MSE. We stick to this term for ease of exposition and consistency with the literature.

Recently, Li et al. [13], [33] developed an unrolling approach for blind image deblurring by enforcing sparsity constraints over filtered domains and then unrolling the half-quadratic splitting algorithm for solving the resulting optimization problem. The network is called Deep Unrolling for Blind Deblurring (DUBLID) and is detailed in the DUBLID box. As the DUBLID box reveals, custom modifications are made in the unrolled network to integrate domain knowledge that enhances the deblurring pursuit. The authors also derive custom back-propagation rules analytically to facilitate network training. Experimentally, the network offers significant performance gains and requires far fewer parameters and less inference time compared with both traditional iterative algorithms and modern neural network approaches. An example of experimental comparisons is provided in Fig. 9.

B. Applications in Medical Imaging

Medical imaging is a broad area that generally focuses on applying image processing and pattern recognition techniques to aid clinical analysis and disease diagnosis. Interesting topics in medical imaging include Magnetic Resonance Imaging (MRI), Computed Tomography (CT) imaging and Ultrasound (US) imaging, to name a few. Just like computational imaging, medical imaging is an area enriched with many interesting inverse problems, and model-based approaches such as sparse coding play a critical role in solving these problems. In practice, data collection can be quite expensive and painstaking for the patients, and therefore it is difficult to collect abundant samples to train conventional deep networks. Interpretability is also an important concern. Therefore, algorithm unrolling has great potential in this context.


Deep Unrolling for Blind Deblurring (DUBLID)

The spatially invariant blurring process can be represented as a discrete convolution:

$$\mathbf{y} = \mathbf{k} \ast \mathbf{x} + \mathbf{n}, \qquad (6)$$

where $\mathbf{y}$ is the blurred image, $\mathbf{x}$ is the latent sharp image, $\mathbf{k}$ is the unknown blur kernel, and $\mathbf{n}$ is Gaussian random noise. A popular class of image deblurring algorithms performs Total-Variation (TV) minimization, which solves the following optimization problem:

$$\min_{\mathbf{k}, \mathbf{g}_1, \mathbf{g}_2} \frac{1}{2}\left(\|D_x\mathbf{y} - \mathbf{k} \ast \mathbf{g}_1\|_2^2 + \|D_y\mathbf{y} - \mathbf{k} \ast \mathbf{g}_2\|_2^2\right) + \lambda_1\|\mathbf{g}_1\|_1 + \lambda_2\|\mathbf{g}_2\|_1 + \frac{\epsilon}{2}\|\mathbf{k}\|_2^2, \quad \text{subject to } \|\mathbf{k}\|_1 = 1,\ \mathbf{k} \geq 0, \qquad (7)$$

where $D_x\mathbf{y}$, $D_y\mathbf{y}$ are the partial derivatives of $\mathbf{y}$ in the horizontal and vertical directions respectively, and $\lambda_1, \lambda_2, \epsilon$ are positive regularization coefficients. Upon convergence, the variables $\mathbf{g}_1$ and $\mathbf{g}_2$ are estimates of the sharp image gradients in the $x$ and $y$ directions, respectively.

In [13], [33], (7) was generalized by realizing that $D_x$ and $D_y$ are computed using linear filters, which can be generalized into a set of $C$ filters $\{\mathbf{f}_i\}_{i=1}^C$:

$$\min_{\mathbf{k}, \{\mathbf{g}_i\}_{i=1}^C} \sum_{i=1}^{C}\left(\frac{1}{2}\|\mathbf{f}_i \ast \mathbf{y} - \mathbf{k} \ast \mathbf{g}_i\|_2^2 + \lambda_i\|\mathbf{g}_i\|_1\right) + \frac{\epsilon}{2}\|\mathbf{k}\|_2^2, \quad \text{subject to } \|\mathbf{k}\|_1 = 1,\ \mathbf{k} \geq 0. \qquad (8)$$

An efficient optimization algorithm for solving (8) is the Half-Quadratic Splitting Algorithm, which alternately minimizes the surrogate problem

$$\min_{\mathbf{k}, \{\mathbf{g}_i, \mathbf{z}_i\}_{i=1}^C} \sum_{i=1}^{C}\left(\frac{1}{2}\|\mathbf{f}_i \ast \mathbf{y} - \mathbf{k} \ast \mathbf{g}_i\|_2^2 + \lambda_i\|\mathbf{z}_i\|_1 + \frac{1}{2\zeta_i}\|\mathbf{g}_i - \mathbf{z}_i\|_2^2\right) + \frac{\epsilon}{2}\|\mathbf{k}\|_2^2, \quad \text{subject to } \|\mathbf{k}\|_1 = 1,\ \mathbf{k} \geq 0, \qquad (9)$$

over the variables $\{\mathbf{g}_i\}_{i=1}^C$, $\{\mathbf{z}_i\}_{i=1}^C$ and $\mathbf{k}$ sequentially. Here $\zeta_i$, $i = 1, \ldots, C$ are regularization coefficients. A noteworthy fact is that each individual minimization admits an analytical expression, which facilitates casting (9) into network layers. Specifically, in the $l$-th iteration ($l \geq 0$) the following updates are performed:

$$\mathbf{g}_i^{l+1} = \mathcal{F}^{-1}\left\{\frac{\zeta_i^l\,\widehat{\mathbf{k}^l}^{\,*} \odot \widehat{\mathbf{f}_i^l} \odot \widehat{\mathbf{y}} + \widehat{\mathbf{z}_i^l}}{\zeta_i^l\,\big|\widehat{\mathbf{k}^l}\big|^2 + 1}\right\} := \mathcal{M}_1\{\mathbf{f}^l \ast \mathbf{y}, \mathbf{z}^l; \zeta^l\}, \quad \forall i,$$

$$\mathbf{z}_i^{l+1} = \mathcal{S}_{\lambda_i^l\zeta_i^l}\{\mathbf{g}_i^{l+1}\} := \mathcal{M}_2\{\mathbf{g}^{l+1}; \beta^l\}, \quad \forall i, \qquad (10)$$

$$\mathbf{k}^{l+1} = \mathcal{N}_1\left(\left[\mathcal{F}^{-1}\left\{\frac{\sum_{i=1}^{C}\widehat{\mathbf{z}_i^{l+1}}^{\,*} \odot \widehat{\mathbf{f}_i^l} \odot \widehat{\mathbf{y}}}{\sum_{i=1}^{C}\big|\widehat{\mathbf{z}_i^{l+1}}\big|^2 + \epsilon}\right\}\right]_+\right) := \mathcal{M}_3\{\mathbf{f}^l \ast \mathbf{y}, \mathbf{z}^{l+1}\},$$

where $[\cdot]_+$ is the ReLU operator, $\widehat{\mathbf{x}}$ denotes the Discrete Fourier Transform (DFT) of $\mathbf{x}$, $\mathcal{F}^{-1}$ indicates the inverse DFT operator, $\odot$ refers to elementwise multiplication, $\mathcal{S}$ is the soft-thresholding operator defined elementwise in (4), and the operator $\mathcal{N}_1(\cdot)$ normalizes its operand into unit sum. Here $\zeta^l = \{\zeta_i^l\}_{i=1}^C$, $\beta^l = \{\lambda_i^l\zeta_i^l\}_{i=1}^C$, and $\mathbf{g}^l$, $\mathbf{f}^l \ast \mathbf{y}$, $\mathbf{z}^l$ refer to $\{\mathbf{g}_i^l\}_{i=1}^C$, $\{\mathbf{f}_i^l \ast \mathbf{y}\}_{i=1}^C$, $\{\mathbf{z}_i^l\}_{i=1}^C$ stacked together. Note that layer-specific parameters $\zeta^l$, $\beta^l$, $\mathbf{f}^l$ are used. As with most existing unrolling methods, only $L$ iterations are performed. The sharp image is retrieved from $\mathbf{g}^L$ and $\mathbf{k}^L$ by solving the following linear least-squares problem:

$$\mathbf{x} = \arg\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{y} - \mathbf{k} \ast \mathbf{x}\|_2^2 + \sum_{i=1}^{C}\frac{\eta_i}{2}\|\mathbf{f}_i^L \ast \mathbf{x} - \mathbf{g}_i^L\|_2^2 = \mathcal{F}^{-1}\left\{\frac{\widehat{\mathbf{k}}^{\,*} \odot \widehat{\mathbf{y}} + \sum_{i=1}^{C}\eta_i\,\widehat{\mathbf{f}_i^L}^{\,*} \odot \widehat{\mathbf{g}_i^L}}{\big|\widehat{\mathbf{k}}\big|^2 + \sum_{i=1}^{C}\eta_i\big|\widehat{\mathbf{f}_i^L}\big|^2}\right\} := \mathcal{M}_4\{\mathbf{y}, \mathbf{g}^L, \mathbf{k}^L; \eta, \mathbf{f}^L\}, \qquad (11)$$

where $\mathbf{f}^L = \{\mathbf{f}_i^L\}_{i=1}^C$ are the filter coefficients in the $L$-th layer, and $\eta = \{\eta_i\}_{i=1}^C$ are positive regularization coefficients. By unrolling (10) and (11) into a deep network, we get $L$ layers of $\mathbf{g}$, $\mathbf{z}$ and $\mathbf{k}$ updates, followed by one layer of image retrieval. The filter coefficients $\mathbf{f}_i^l$ and regularization parameters $\{\lambda_i^l, \zeta_i^l, \eta_i\}$ are learned by back-propagation. Note that the $\mathbf{f}_i^L$ are shared in both (10) and (11) and are updated jointly. The final network architecture is depicted in Fig. 8.

Similar to [28], the network is trained using synthetic samples, i.e., by convolving the sharp images to obtain blurred versions. The training loss function is the translation-invariant MSE loss, to compensate for the possible spatial shifts of the deblurred images and the blur kernel.


In MRI, a fundamental challenge is to recover a signal from a small number of its measurements, corresponding to reduced scanning time. To this end, Yang et al. [12] unroll the widely-known Alternating Direction Method of Multipliers (ADMM) algorithm, a popular optimization algorithm for solving CS and related sparsity-constrained estimation problems, into a deep network called ADMM-CSNet. The sparsity-inducing transformations and regularization coefficients are learned from real data to advance its limited adaptability and enhance the reconstruction performance. Compared with conventional iterative methods, ADMM-CSNet achieves the same reconstruction accuracy using 10% less sampled data and speeds up recovery by around 40 times. It exceeds state-of-the-art deep networks by around 3dB in PSNR under a 20% sampling rate. Refer to the box on ADMM-CSNet for further details.

ADMM-CSNet

Given random linear measurements $\mathbf{y} \in \mathbb{C}^m$ formed by $\mathbf{y} \approx \mathbf{\Phi}\mathbf{x}$, where $\mathbf{\Phi} \in \mathbb{C}^{m \times n}$ is a measurement matrix with $m < n$, Compressive Sensing (CS) aims at reconstructing the original signal $\mathbf{x} \in \mathbb{R}^n$ by exploiting its underlying sparse structure in a transform domain [24]. A generalized CS model can be formulated as the following optimization problem [12]:

$$\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{\Phi}\mathbf{x} - \mathbf{y}\|_2^2 + \sum_{i=1}^{C}\lambda_i\, g(\mathbf{D}_i\mathbf{x}), \qquad (12)$$

where the $\lambda_i$ are positive regularization coefficients, $g(\cdot)$ is a sparsity-inducing function, and $\{\mathbf{D}_i\}_{i=1}^C$ is a sequence of $C$ Toeplitz operators, which effectively perform linear filtering operations. Concretely, $\mathbf{D}_i$ can be taken as a wavelet transform and $g$ can be chosen as the $\ell_1$ norm. However, for better performance the method in [12] learns both of them from an unrolled network.

An efficient minimization algorithm for solving (12) is the Alternating Direction Method of Multipliers (ADMM) [40]. Problem (12) is first recast into a constrained minimization through variable splitting:

$$\min_{\mathbf{x}, \{\mathbf{z}_i\}_{i=1}^C} \frac{1}{2}\|\mathbf{\Phi}\mathbf{x} - \mathbf{y}\|_2^2 + \sum_{i=1}^{C}\lambda_i\, g(\mathbf{z}_i), \quad \text{subject to } \mathbf{z}_i = \mathbf{D}_i\mathbf{x},\ \forall i. \qquad (13)$$

The corresponding augmented Lagrangian is then formed as follows:

$$\mathcal{L}_\rho(\mathbf{x}, \mathbf{z}; \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{\Phi}\mathbf{x} - \mathbf{y}\|_2^2 + \sum_{i=1}^{C}\left[\lambda_i\, g(\mathbf{z}_i) + \frac{\rho_i}{2}\|\mathbf{D}_i\mathbf{x} - \mathbf{z}_i + \boldsymbol{\alpha}_i\|_2^2\right], \qquad (14)$$

where $\{\boldsymbol{\alpha}_i\}_{i=1}^C$ are dual variables and $\{\rho_i\}_{i=1}^C$ are penalty coefficients. ADMM then alternately minimizes (14) followed by a dual variable update, leading to the following iterations:

$$\mathbf{x}^l = \left(\mathbf{\Phi}^H\mathbf{\Phi} + \sum_{i=1}^{C}\rho_i\mathbf{D}_i^T\mathbf{D}_i\right)^{-1}\left[\mathbf{\Phi}^H\mathbf{y} + \sum_{i=1}^{C}\rho_i\mathbf{D}_i^T\left(\mathbf{z}_i^{l-1} - \boldsymbol{\alpha}_i^{l-1}\right)\right] := \mathcal{U}_1\{\mathbf{y}, \boldsymbol{\alpha}_i^{l-1}, \mathbf{z}_i^{l-1}; \rho_i, \mathbf{D}_i\},$$

$$\mathbf{z}_i^l = \mathcal{P}_g\left\{\mathbf{D}_i\mathbf{x}^l + \boldsymbol{\alpha}_i^{l-1}; \frac{\lambda_i}{\rho_i}\right\} := \mathcal{U}_2\{\boldsymbol{\alpha}_i^{l-1}, \mathbf{x}^l; \lambda_i, \rho_i, \mathbf{D}_i\}, \qquad (15)$$

$$\boldsymbol{\alpha}_i^l = \boldsymbol{\alpha}_i^{l-1} + \eta_i\left(\mathbf{D}_i\mathbf{x}^l - \mathbf{z}_i^l\right) := \mathcal{U}_3\{\boldsymbol{\alpha}_i^{l-1}, \mathbf{x}^l, \mathbf{z}_i^l; \eta_i, \mathbf{D}_i\}, \quad \forall i,$$

where the $\eta_i$ are constant parameters, and $\mathcal{P}_g\{\cdot; \lambda\}$ is the proximal mapping for $g$ with parameter $\lambda$. The unrolled network can thus be constructed by concatenating these operations and learning the parameters $\lambda_i, \rho_i, \eta_i, \mathbf{D}_i$ in each layer. Fig. 10 depicts the resulting unrolled network architecture. In [12] the authors discuss several implementation issues, including efficient matrix inversion and the back-propagation rules. The network is trained by minimizing a normalized version of the Root-Mean-Square-Error.
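A minimal NumPy sketch of one ADMM-CSNet stage, i.e., the three updates in (15), assuming a real-valued Φ and g = ℓ1 so that the proximal mapping P_g reduces to soft-thresholding. ADMM-CSNet itself learns the D_i, λ_i, ρ_i, η_i per layer and uses an efficient implementation of the matrix inverse; names and sizes here are illustrative.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def admm_csnet_stage(y, Phi, D_list, z_list, alpha_list, lam, rho, eta):
    # x-update (U1): regularized least-squares solve.
    A = Phi.T @ Phi + sum(r * D.T @ D for r, D in zip(rho, D_list))
    b = Phi.T @ y + sum(r * D.T @ (z - a)
                        for r, D, z, a in zip(rho, D_list, z_list, alpha_list))
    x = np.linalg.solve(A, b)
    # z-update (U2): proximal step; alpha-update (U3): dual ascent.
    z_new, alpha_new = [], []
    for D, a, lm, r, e in zip(D_list, alpha_list, lam, rho, eta):
        z = soft_threshold(D @ x + a, lm / r)
        z_new.append(z)
        alpha_new.append(a + e * (D @ x - z))
    return x, z_new, alpha_new
```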

Another important imaging modality is ultrasound, which has the advantage of being a radiation-free approach. When used for blood flow depiction, one of the challenges is the fact that the tissue reflections tend to be much stronger than those of the blood, leading to strong clutter resulting from the tissue. Thus, an important task is to separate the tissue from the blood. Various filtering methods have been used in this context, such as high pass filtering and filtering based on the singular value decomposition. Solomon et al. [30] suggest using a robust Principal Component Analysis (PCA) approach by modeling the received ultrasound movie as a low-rank plus sparse matrix, where the tissue is low rank and the blood vessels are sparse. They then unroll an ISTA approach to robust PCA into a deep network, called Convolutional rObust pRincipal cOmpoNent Analysis (CORONA). As the name suggests, they replace matrix multiplications with convolutional layers, effectively converting the network into a CNN-like architecture. Compared with state-of-the-art approaches, CORONA demonstrates vastly improved reconstruction quality and has far fewer parameters than the well-known ResNet [41].


[Fig. 8 pipeline: blurred image y → filtering with the per-layer filters f^l → alternating M_1 (g-update), M_2 (z-update) and M_3 (k-update) blocks over layers 1, . . . , L → M_4 image retrieval → estimated image x and estimated kernel k.]

Fig. 8. Diagram illustration of DUBLID [13]. The analytical operations M_1, M_2, M_3, M_4 correspond to casting the analytic expressions in (10) and (11) into the network. Trainable parameters are colored in blue.

(a) Groundtruth (b) Perrone et al. [37] (c) Nah et al. [38] (d) Tao et al. [39] (e) DUBLID

Fig. 9. Sample experimental results from [33] for visual comparison on blind image deblurring. Groundtruth images and kernels are included in (a). (b) A top-performing iterative algorithm and (c)(d) two state-of-the-art deep learning techniques are compared against (e) the unrolling method.


[Fig. 10 pipeline: measurements y → Stage l (U_1 → U_2 → U_3, producing x^l, z^l, α^l) → Stage l+1 → · · · → recovered image x.]

Fig. 10. Diagram representation of ADMM-CSNet [12]: each stage comprises a series of inter-related operations, whose analytic forms are given in (15). Note that the trainable parameters are colored in blue.

Fig. 11. Sample experimental results demonstrating recovery of Ultrasound Contrast Agents (UCAs) from cluttered Maximum Intensity Projection (MIP) images [30]. (a) MIP image of the input movie, composed from 50 frames of simulated UCAs cluttered by tissue. (b) Ground-truth UCA MIP image. (c) Recovered UCA MIP image via CORONA. (d) Ground-truth tissue MIP image. (e) Recovered tissue MIP image via CORONA. Color bar is in dB.

Refer to the box on CORONA for details.

C. Applications in Vision and Recognition

Computer vision is a broad and fast growing area that has achieved tremendous success in many interesting topics in recent years. A major driving force for its rapid progress is deep learning. For instance, thanks to the availability of large scale training samples, in image recognition tasks researchers have surpassed human-level performance over the ImageNet dataset by employing deep CNNs [2]. Nevertheless, most existing approaches to date are highly empirical, and lack of interpretability has become an increasingly serious issue. To overcome this drawback, researchers have recently been paying more attention to algorithm unrolling [27], [29].

One example is semantic image segmentation, which assigns class labels to each pixel in an image. Compared with traditional low-level image segmentation, it provides additional information about object categories, and thus creates semantically meaningful segmented objects. By performing pixel-level labeling, semantic segmentation can also be regarded as an extension of image recognition. Applications of semantic segmentation include autonomous driving, robot vision, and medical imaging.

Traditionally, the Conditional Random Field (CRF) was a popular approach. Recently, deep networks have become the primary tool. Early deep learning approaches are capable of recognizing objects at a high level; however, they are relatively less accurate in delineating objects than CRFs.

Zheng et al. [27] unroll the Mean-Field (MF) iterations of CRF into an RNN, and then concatenate the semantic segmentation network with this RNN to form a deep network. The concatenated network resembles conventional semantic segmentation followed by CRF-based post-processing, while end-to-end training can be performed over the whole network. Liu et al. [29] follow the same direction to construct their segmentation network, called the Deep Parsing Network. In their approach, they adopt a generalized pairwise energy and perform the MF iteration only once for the purpose of efficiency. Refer to the box on Unrolling CRF into RNN for further details.

D. Applications in Speech Processing

Speech processing is one of the fundamental applications of digital signal processing. Topics in speech processing include recognition, coding, synthesis, and more. Among all problems of interest, source separation stands out as a challenging yet intriguing one. Applications of source separation include speech enhancement and recognition.

For single channel speech separation, Non-negative Matrix Factorization (NMF) was a widely-applied technique. Recently, Hershey et al. [26] unrolled NMF into a deep network, dubbed deep NMF, as a concrete realization of their abstract unrolling framework. Detailed descriptions are in the box on Deep NMF. Deep NMF was evaluated on the task of speech enhancement in reverberated noisy mixtures, using a dataset collected from the Wall Street Journal.


Deep NMF was shown to outperform both a conventional deep neural network [26] and the iterative sparse NMF method [42].

Wang et al. [32] propose an end-to-end training approach for speech separation, by casting the commonly employed forward and inverse Short-Time Fourier Transform (STFT) operations into network layers and concatenating them with an iterative phase reconstruction algorithm, dubbed Multiple Input Spectrogram Inversion [43]. In doing so, the loss function acts on reconstructed signals rather than on their STFT magnitudes, and phase inconsistency can be reduced through training. The trained network exhibits more than 1dB higher Signal-to-Distortion Ratio over state-of-the-art techniques on public datasets.

Convolutional Robust Principal Component Analysis

In US imaging, a series of pulses are transmitted into the imaged medium, and their echoes are received in each transducer element. After beamforming and demodulation, a series of movie frames are acquired. Stacking them together as column vectors leads to a data matrix D ∈ C^{m×n}, which can be modeled as follows:

D = H_1 L + H_2 S + N,

where L comprises the tissue signals, S comprises the echoes returned from the blood signals, H_1, H_2 are measurement matrices, and N is the noise matrix. Due to its high spatial-temporal coherence, L is typically a low-rank matrix, while S is generally a sparse matrix since blood vessels usually sparsely populate the imaged medium. Based on these observations, the echoes S can be estimated through a transformed low-rank and sparse decomposition by solving the following optimization problem:

\min_{L,S}\ \tfrac{1}{2}\|D - (H_1 L + H_2 S)\|_F^2 + \lambda_1\|L\|_* + \lambda_2\|S\|_{1,2},    (16)

where ‖·‖_* is the nuclear norm of a matrix, which promotes low-rank solutions, and ‖·‖_{1,2} is the mixed ℓ_{1,2} norm, which enforces row sparsity. Problem (16) can be solved using a generalized version of ISTA in the matrix domain, by utilizing the proximal mappings corresponding to the nuclear norm and the mixed ℓ_{1,2} norm. In the l-th iteration it executes the following steps:

L^{l+1} = T_{\lambda_1/\mu}\left\{\left(I - \tfrac{1}{\mu}H_1^H H_1\right)L^l - H_1^H H_2 S^l + H_1^H D\right\},

S^{l+1} = S^{1,2}_{\lambda_2/\mu}\left\{\left(I - \tfrac{1}{\mu}H_2^H H_2\right)S^l - H_2^H H_1 L^l + H_2^H D\right\},

where T_λ{X} is the singular value thresholding operator that performs soft-thresholding over the singular values of X with threshold λ, S^{1,2}_λ performs row-wise soft-thresholding with parameter λ, and µ is the step size parameter for ISTA. Technically, T_λ and S^{1,2}_λ correspond to the proximal mappings of the nuclear norm and the mixed ℓ_{1,2} norm, respectively. Just like the migration from MLP to CNN, the matrix multiplications can be replaced by convolutions, which gives rise to the following iteration steps instead:

L^{l+1} = T_{\lambda_1^l}\{P_5^l * L^l + P_3^l * S^l + P_1^l * D\},    (17)

S^{l+1} = S^{1,2}_{\lambda_2^l}\{P_6^l * S^l + P_4^l * L^l + P_2^l * D\},    (18)

where * is the convolution operator. Here P_i^l, i = 1, ..., 6 are a series of convolution filters that are learned from the data in the l-th layer, and λ_1^l, λ_2^l are thresholding parameters for the l-th layer. By casting (17) and (18) into network layers, a deep network resembling a CNN is formed. The parameters P_i^l, i = 1, 2, ..., 6 and {λ_1^l, λ_2^l} are learned from training data.

To train the network, one can first obtain ground-truth L and S from D by executing ISTA-like algorithms up to convergence. Simulated samples can also be added to address the lack of training samples. MSE losses are imposed on L and S, respectively.
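To make the unrolled iteration concrete, the following is a minimal PyTorch-style sketch of one layer implementing (17) and (18): learned convolution filters P_1^l, ..., P_6^l, singular-value thresholding for the low-rank part, and row-wise soft-thresholding for the sparse part. The class and parameter names are illustrative assumptions; the data are treated as real-valued single-channel image stacks with thresholding applied per sample, whereas the actual CORONA network [30] operates on complex beamformed data.

```python
import torch
import torch.nn as nn


def svt(X, tau):
    # Singular value thresholding: soft-threshold the singular values of X.
    U, s, Vh = torch.linalg.svd(X, full_matrices=False)
    s = torch.clamp(s - tau, min=0.0)
    return U @ torch.diag(s) @ Vh


def row_soft_threshold(X, tau):
    # Proximal mapping of the mixed l1,2 norm: shrink each row by its l2 norm.
    norms = X.norm(dim=1, keepdim=True).clamp_min(1e-8)
    return X * torch.clamp(1.0 - tau / norms, min=0.0)


class UnrolledRPCALayer(nn.Module):
    """One unrolled iteration of (17)-(18) with learned filters and thresholds."""

    def __init__(self, channels=1, ksize=3):
        super().__init__()
        pad = ksize // 2
        # P1..P6: convolution filters learned for this particular layer.
        self.P = nn.ModuleList(
            [nn.Conv2d(channels, channels, ksize, padding=pad, bias=False) for _ in range(6)]
        )
        self.lam1 = nn.Parameter(torch.tensor(0.1))  # threshold for the low-rank part
        self.lam2 = nn.Parameter(torch.tensor(0.1))  # threshold for the sparse part

    def forward(self, L, S, D):
        # L, S, D: (batch, 1, m, n) stacks; thresholding acts on each (m, n) slice.
        P1, P2, P3, P4, P5, P6 = self.P
        L_pre = P5(L) + P3(S) + P1(D)
        S_pre = P6(S) + P4(L) + P2(D)
        L_next = torch.stack([svt(x[0], self.lam1) for x in L_pre]).unsqueeze(1)
        S_next = torch.stack([row_soft_threshold(x[0], self.lam2) for x in S_pre]).unsqueeze(1)
        return L_next, S_next
```

Stacking several such layers and training them with the MSE losses on L and S described above yields a CORONA-like unrolled network.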

E. Enhancing Efficiency Through Unrolling

In addition to interpretability and performance improvements, unrolling can provide significant advantages for practical deployment, including higher computational efficiency and a lower number of parameters, which in turn lead to reduced memory footprints and storage requirements. Table II summarizes selected results from recent unrolling papers to illustrate such benefits. For comparison, results for one iterative algorithm and one deep network are included, both selected from representative top-performing methods. Note further that, for any two methods compared in Table II, the run times are reported on consistent implementation platforms. More details can be found in the respective papers.

Compared to their iterative counterparts, unrolling often dramatically boosts the computational speed. For instance, it was reported in [11] that LISTA may be 20 times faster than ISTA after training; DUBLID [33] can be 1000 times faster than TV-based deblurring; ADMM-CSNet [12] can be about four times faster than the BM3D-AMP algorithm [44]; and CORONA [30] is over 50 times faster than the Fast-ISTA algorithm.

Typically, by embedding domain-specific structures into the network, unrolled networks require far fewer parameters than conventional networks that are less specific to particular applications. For instance, the number of parameters for DUBLID is more than 100 times lower than that of SRN [39],



[Fig. 12 diagram: an input image y passes through an FCN producing unary potentials φ, which feed the CRF-RNN; each stage performs Message Passing, Compatibility Transform, Unary Addition, and Normalization (producing Q1, Q2, ...), repeated over Stage 1, Stage 2, ... to yield the predicted label l.]

Fig. 12. Diagram representation of the CRF-RNN network [27]: an FCN is concatenated with an RNN, called CRF-RNN, to form a deep network. This RNN essentially performs MF iterations and acts like CRF-based post-processing. The concatenated network can be trained end-to-end to optimize its performance.

while CORONA [30] has an order of magnitude fewer parameters than ResNet [41].

Under circumstances where each iteration (layer) can be executed highly efficiently, the unrolled networks may even be more efficient than conventional networks. For instance, ADMM-CSNet [12] has been shown to be about twice as fast as ReconNet [45], while DUBLID [33] is almost two times faster than DeblurGAN [46].

Unrolling CRF into RNN

CRF is a fundamental model for labeling undirected graphical models. A special case of CRF, where only pairwise interactions of graph nodes are considered, is the Markov random field. Given a graph (V, E) and a predefined collection of labels L, it assigns a label l_p ∈ L to each node p by minimizing the following energy function:

E(\{l_p\}_{p \in V}) = \sum_{p \in V} \phi_p(l_p) + \sum_{(p,q) \in E} \psi_{p,q}(l_p, l_q),

where φ_p(·) and ψ_{p,q}(·) are commonly called the unary energy and the pairwise energy, respectively. Typically, φ_p models the preference of assigning p each label given the observed data, while ψ_{p,q} models the smoothness between p and q. In semantic segmentation, V comprises the image pixels, E is the set of pixel pairs, and L consists of object categories. In [27], the unary energy φ_p is chosen as the output of a semantic segmentation network, such as the well-known Fully Convolutional Network (FCN) [47], while the pairwise energy ψ_{p,q}(l_p, l_q) admits the following special form:

\psi_{p,q}(l_p, l_q) = \mu(l_p, l_q) \sum_{m=1}^{M} w_m G_m(f_p, f_q),

where {G_m}_{m=1}^{M} is a collection of Gaussian kernels and {w_m}_{m=1}^{M} are the corresponding weights. Here f_p and f_q are the feature vectors for pixels p and q, respectively, and µ(·, ·) models the label compatibility between pixel pairs. An efficient inference algorithm for energy minimization over fully-connected CRFs is the MF iteration [48], which executes the following steps iteratively:

(Message Passing):   Q_p^m(l) \leftarrow \sum_{q \neq p} G_m(f_p, f_q)\, Q_q(l),    (19)

(Compatibility Transform):   Q_p(l_p) \leftarrow \sum_{l \in L} \sum_{m} \mu_m(l_p, l)\, w_m\, Q_p^m(l),    (20)

(Unary Addition):   Q_p(l_p) \leftarrow \exp\{-\phi_p(l_p) - Q_p(l_p)\},

(Normalization):   Q_p(l_p) \leftarrow \frac{Q_p(l_p)}{\sum_{l \in L} Q_p(l)},

where Q_p(l_p) can be interpreted as the marginal probability of assigning p the label l_p. A noteworthy fact is that each update step resembles common neural network layers. For instance, message passing can be implemented by filtering through Gaussian kernels, which imitates passing through a convolutional layer. The compatibility transform can be implemented through 1 × 1 convolution, while the normalization can be considered as the popular softmax layer. These layers can thus be unrolled to form an RNN, dubbed CRF-RNN. By concatenating the FCN with it, a network which can be trained end-to-end is formed. A diagram illustration is presented in Fig. 12.
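As a rough illustration of how these steps map onto standard layers, the following is a minimal sketch of one mean-field update written with common PyTorch operations. The class, its parameter names, and the use of fixed average-pooling blurs in place of the feature-dependent Gaussian (bilateral) filtering of [27], [48] are simplifying assumptions, not the exact CRF-RNN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MeanFieldIteration(nn.Module):
    """One mean-field update expressed with standard network operations."""

    def __init__(self, num_labels, blur_sizes=(3, 7)):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(len(blur_sizes)))           # w_m
        self.compat = nn.Conv2d(num_labels, num_labels, 1, bias=False)     # label compatibility as 1x1 conv
        self.blur_sizes = blur_sizes                                       # stand-ins for the Gaussian kernels G_m

    def forward(self, Q, unary):
        # Q, unary: (batch, num_labels, H, W); unary holds phi_p(l).
        msgs = 0.0
        for w, k in zip(self.weights, self.blur_sizes):
            # Message passing: spatial filtering of the current marginals.
            msgs = msgs + w * F.avg_pool2d(Q, k, stride=1, padding=k // 2)
        pairwise = self.compat(msgs)          # compatibility transform
        logits = -unary - pairwise            # unary addition
        return F.softmax(logits, dim=1)       # normalization


# Unrolling a few iterations forms the CRF-RNN part of the network, e.g.:
#   Q = F.softmax(-unary, dim=1)
#   for _ in range(5):
#       Q = mf_layer(Q, unary)
```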



[Fig. 13 diagram: an MLP with layers x^0, x^1, ..., x^L and weights W^{l+1}, annotated with the algorithm — input x^0, output x^L; for l = 0, 1, ..., L−1: x^{l+1} ← σ(W^{l+1} x^l + b^{l+1}); σ is the activation function.]

Fig. 13. An MLP can be interpreted as executing an underlying iterative algorithm with finite iterations and layer-specific parameters.

Deep (Unrolled) NMF

Single channel source separation refers to the task of decoupling several source signals from their mixture. Suppose we collect a sequence of T mixture frames, where m_t ∈ R_+^F, t = 1, 2, ..., T is the t-th frame. Given a set of non-negative basis vectors {w_l ∈ R_+^F}_{l=1}^{L}, we can represent m_t (approximately) by

m_t \approx \sum_{l=1}^{L} w_l h_{lt},    (21)

where the h_{lt}'s are coefficients chosen to be non-negative. By stacking the m_t's column by column, we form a non-negative matrix M ∈ R_+^{F×T}, so that (21) can be expressed in matrix form:

M \approx WH,\quad W \geq 0,\ H \geq 0,    (22)

where W has w_l as its l-th column, ≥ 0 denotes elementwise non-negativity, and H = (h_{l,t}). To remove multiplicative ambiguity, it is commonly assumed that each column of W has unit ℓ_2 norm, i.e., the w_l's are unit vectors. The model (22) is commonly called Non-negative Matrix Factorization (NMF) [49] and has found wide applications in signal and image processing. In practice, the non-negativity constraints prevent mutual cancelling of basis vectors and thus encourage semantically meaningful decompositions, which turns out to be highly beneficial. Assuming the phases among different sources are approximately the same, the power or magnitude spectrogram of the mixture can be decomposed as a summation of those from each source. Therefore, after performing NMF, the sources can be separated by selecting the basis vectors corresponding to each individual source and recombining the source-specific basis vectors to recover the magnitude spectrograms. In practical implementations, a filtering process similar to classical Wiener filtering is typically performed for magnitude spectrogram recovery. To determine W and H from M, one may consider solving the following optimization problem [50]:

W, H = \arg\min_{W \geq 0,\, H \geq 0} D_\beta(M \mid WH) + \mu\|H\|_1,    (23)

where D_β is the β-divergence, which can be considered as a generalization of the well-known Kullback-Leibler divergence, and µ is a regularization parameter that controls the sparsity of the coefficient matrix H. By employing a majorization-minimization scheme, problem (23) can be solved by the following multiplicative updates:

H^l = H^{l-1} \odot \frac{W^T\left[M \odot (W H^{l-1})^{\beta-2}\right]}{W^T (W H^{l-1})^{\beta-1} + \mu},    (24)

W^l = W^{l-1} \odot \frac{\left[M \odot (W^{l-1} H^l)^{\beta-2}\right] (H^l)^T}{(W^{l-1} H^l)^{\beta-1} (H^l)^T},    (25)

Normalize W^l so that the columns of W^l have unit norm and scale H^l accordingly,    (26)

for l = 1, 2, .... In [26], a slightly different update scheme for W was employed to encourage discriminative power; we omit discussing it for brevity. A deep network can be formed by unfolding these iterative updates. In [26], the W^l's are untied from the update rule (25) and considered trainable parameters. In other words, only (24) and (26) are executed in each layer. Similar to (23), the β-divergence, with a different β value, was employed in the training loss function. A splitting scheme was also designed to preserve the non-negativity of the W^l's during training.
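The following is a minimal sketch of one such unrolled layer: the multiplicative H-update (24) is kept as the layer computation, while the per-layer basis W^l is a trainable parameter. The class and argument names are illustrative assumptions, and the discriminative W-update and the splitting scheme of [26] that maintains non-negativity of W^l during training are omitted.

```python
import torch
import torch.nn as nn


class DeepNMFLayer(nn.Module):
    """One unrolled multiplicative H-update (24) with an untied, trainable basis W."""

    def __init__(self, F_bins, L_basis, beta=1.0, mu=0.1):
        super().__init__()
        self.W = nn.Parameter(torch.rand(F_bins, L_basis))  # layer-specific basis W^l
        self.beta, self.mu, self.eps = beta, mu, 1e-8

    def forward(self, H, M):
        # H: (L_basis, T) current coefficients, M: (F_bins, T) mixture spectrogram.
        W = self.W.clamp_min(self.eps)            # crude non-negativity at inference time
        WH = (W @ H).clamp_min(self.eps)
        num = W.t() @ (M * WH.pow(self.beta - 2))
        den = W.t() @ WH.pow(self.beta - 1) + self.mu
        return H * num / den                       # stays non-negative whenever H, M >= 0


# Stacking a few layers (with, e.g., H0 = W0.t() @ M as initialization) gives a
# deep NMF network; training minimizes a beta-divergence between M and W @ H_L.
```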

Page 16: Algorithm Unrolling: Interpretable, Efficient Deep …1 Algorithm Unrolling: Interpretable, Efficient Deep Learning for Signal and Image Processing Vishal Monga, Senior Member, IEEE,

16

IV. CONNECTIONS TO SIGNAL PROCESSING METHODS

In addition to creating efficient and interpretable network architectures which achieve superior performance in practical applications, algorithm unrolling can provide valuable insights from a conceptual standpoint. As detailed in the previous section, solutions to real-world signal processing problems often exploit domain-specific prior knowledge. Inheriting this domain knowledge is of both conceptual and practical importance in deep learning research. To this end, algorithm unrolling can potentially serve as a powerful tool to help establish conceptual connections between prior-information guided analytical methods and modern neural networks.

In particular, algorithm unrolling may be utilized in the reverse direction: instead of unrolling a particular iterative algorithm into a network, we can interpret a conventional neural network as a certain iterative algorithm to be identified. Fig. 13 provides a visual illustration of applying this technique to an MLP. The same technique is applicable to other networks, such as CNNs or RNNs.

By interpreting popular network architectures as conventional iterative algorithms, better understanding of the network behavior and mechanism can be obtained. Furthermore, rigorous theoretical analysis of designed networks may be facilitated once an equivalence is established with a well-understood class of iterative algorithms. Finally, architectural enhancement and performance improvements of the neural networks may result from incorporating domain knowledge associated with iterative algorithms.

Neural Network Training Using Extended Kalman Filter

The Kalman filter is a fundamental technique in signal processing with a wide range of applications. It obtains the Minimum Mean-Square-Error (MMSE) estimate of the system state by recursively drawing observed samples and updating the estimate. The Extended Kalman Filter (EKF) extends it to the nonlinear case through iterative linearization. Previous studies [52] have revealed that the EKF can be employed to facilitate neural network training, by realizing that neural network training is essentially a parameter estimation problem. More specifically, the training samples may be treated as observations, and if the MSE loss is chosen then network training essentially performs MMSE estimation conditioned on the observations. Let {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} be a collection of training pairs. We view the training samples as sequentially observed data following a time order. At time step k, when feeding x_k into the neural network with parameters w_k, it performs a nonlinear mapping h_k(·; w_k) and outputs an estimate ŷ_k of y_k. This process can be formally described by the following nonlinear state-transition model:

w_{k+1} = w_k + \omega_k,    (27)
y_k = h_k(x_k; w_k) + \nu_k,    (28)

where ω_k and ν_k are zero-mean white Gaussian noises with covariances E(ω_k ω_l^T) = δ_{k,l} Q_k and E(ν_k ν_l^T) = δ_{k,l} R_k, respectively. Here E is the expectation operator and δ is the Kronecker delta function. In (27), the noise ω is added artificially to avoid numerical divergence and poor local minima [52]. For a visual depiction, refer to Fig. 14. The state-transition model (27) and (28) is a special case of the state-space model of the EKF, and thus we can apply the EKF technique to estimate the network parameters w_k sequentially. To begin with, at k = 0, ŵ_0 and P_0 are initialized to certain values. At time step k (k ≥ 0), the nonlinear function h_k is linearized as:

h_k(x_k; w_k) \approx h_k(x_k; \hat{w}_k) + H_k(w_k - \hat{w}_k),    (29)

where H_k = \partial h_k / \partial w_k \big|_{w_k = \hat{w}_k}. For a neural network, H_k is essentially the derivative of its output ŷ_k with respect to its parameters w_k and therefore can be computed via back-propagation. The following recursion is then executed:

K_k = P_k H_k \left(H_k^T P_k H_k + R_k\right)^{-1},

\hat{w}_{k+1} = \hat{w}_k + K_k (y_k - \hat{y}_k),    (30)

P_{k+1} = P_k - K_k H_k^T P_k + Q_k,

where K_k is commonly called the Kalman gain. For details on deriving the update rules (30), see [53, Chapter 1]. In summary, neural networks can be trained with the EKF by the following steps:

1) Initialize ŵ_0 and P_0;
2) For k = 0, 1, ...:
   a) Feed x_k into the network to obtain the output ŷ_k;
   b) Use back-propagation to compute H_k in (29);
   c) Apply the recursion in (30).

The matrix P_k is the approximate error covariance matrix, which models the correlations between network parameters and thus delivers second-order derivative information, effectively accelerating the training speed. For example, in [54] it was shown that training an MLP using the EKF requires orders of magnitude fewer epochs than standard back-propagation. In [53], some variants of the EKF training paradigm are discussed. The neural network represented by h_k can be a recurrent network and trained in a similar fashion; the noise covariance matrix R_k is scaled to play a role similar to a learning rate adjustment; and, to reduce computational complexity, a decoupling scheme is employed which divides the parameters w_k into mutually exclusive groups and turns P_k into a block-diagonal matrix.
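A minimal sketch of this procedure for a tiny scalar-output network is given below, following the recursion in (30). The function and variable names are illustrative assumptions; the Jacobian H_k is formed here by finite differences rather than back-propagation for brevity, and the decoupled variant of [52] is not implemented.

```python
import numpy as np


def ekf_train(h, w, x_seq, y_seq, P0=1.0, Q=1e-4, R=1e-2, fd_eps=1e-5):
    """EKF training of a scalar-output network h(x, w), following recursion (30)."""
    n = w.size
    P = P0 * np.eye(n)                       # approximate error covariance
    Qk, Rk = Q * np.eye(n), R * np.eye(1)
    for x, y in zip(x_seq, y_seq):
        y_hat = h(x, w)
        # H_k: derivative of the output with respect to the parameters (n x 1),
        # approximated by finite differences in this sketch.
        H = np.array([(h(x, w + fd_eps * e) - y_hat) / fd_eps
                      for e in np.eye(n)]).reshape(n, 1)
        K = P @ H @ np.linalg.inv(H.T @ P @ H + Rk)   # Kalman gain
        w = w + (K * (y - y_hat)).ravel()             # parameter update
        P = P - K @ H.T @ P + Qk                      # covariance update
    return w, P


# Example: fit a 1-hidden-unit tanh network to noisy samples of sin(2x).
def tiny_net(x, w):
    return w[2] * np.tanh(w[0] * x + w[1]) + w[3]


rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 200)
ys = np.sin(2 * xs) + 0.05 * rng.standard_normal(200)
w_hat, _ = ekf_train(tiny_net, rng.standard_normal(4), xs, ys)
```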



TABLE II
SELECTED RESULTS ON RUNNING TIME AND PARAMETER COUNT OF RECENT UNROLLING WORKS AND ALTERNATIVE METHODS.

                        Unrolled Deep Networks    Traditional Iterative Algorithms    Conventional Deep Networks
Reference               Yang et al. [12]          Metzler et al. [51]                 Kulkarni et al. [45]
Running Time (sec)      2.61                      12.59                               2.83
Parameter Count         7.8 × 10^4                —                                   3.2 × 10^5
Reference               Solomon et al. [30]       Beck et al. [25]                    He et al. [41]
Running Time (sec)      5.07                      15.33                               5.36
Parameter Count         1.8 × 10^3                —                                   8.3 × 10^3
Reference               Li et al. [33]            Perrone et al. [37]                 Kupyn et al. [46]
Running Time (sec)      1.47                      1462.90                             10.29
Parameter Count         2.3 × 10^4                —                                   1.2 × 10^7

In this section we will explore close connections between neural networks and typical families of signal processing algorithms that are clearly revealed by unrolling techniques. We also review selected theoretical advances that provide formal analysis and rigorous guarantees of unrolling approaches.

A. Connections to Sparse Coding

The earliest work establishing the connections between neural networks and sparse coding algorithms dates back to Gregor and LeCun [11], which we reviewed comprehensively in Section II-B. A closely related work in the dictionary learning literature is the task-driven dictionary learning algorithm proposed by Mairal et al. [55]. The idea is similar to unrolling: they view a sparse coding algorithm as a trainable system whose parameters are the dictionary coefficients. This viewpoint is equivalent to unrolling the sparse coding algorithm into a "network" of infinitely many layers, whose output is a limit point of the sparse coding algorithm. The whole system is trained end-to-end (task driven) using gradient descent, and an analytical formula for the gradient is derived.

Sprechmann et al. [56] propose a framework for training parsimonious models, which summarizes several interesting cases through an encoder-decoder network architecture. For example, a sparse coding algorithm, such as ISTA, can be viewed as an encoder as it maps the input signal into its sparse code. After obtaining the sparse code, the original signal is recovered using the sparse code and the dictionary. This procedure can be viewed as a decoder. By concatenating them together, a network is formed that enables unsupervised learning. They further extend the model to supervised and discriminative learning.

Dong et al. [35] observe that the forward pass of a CNN basically executes the same operations as sparse-coding based image super-resolution [36]. Specifically, the convolution operation performs patch extraction, and the ReLU operation mimics sparse coding. Nonlinear code mapping is performed in the intermediate layers. Finally, the reconstruction is obtained via the final convolution layer. To a certain extent, this connection explains why CNNs have had tremendous success in single image super-resolution.

Jin et al. [23] observe the architectural similarity between the popular U-net [4] and the unfolded ISTA network. As sparse coding techniques have demonstrated great success in many image reconstruction applications such as CT reconstruction, this connection helps explain why U-net is a powerful tool in these domains, although it was originally motivated in the context of semantic image segmentation.

B. Connections to Kalman Filtering

Another line of research focuses on accelerating neural network training by identifying its relationship with the Extended Kalman Filter (EKF). Singhal and Wu [54] demonstrated that neural network training can be regarded as a nonlinear dynamic system that may be solved by the EKF. Simulation studies show that the EKF converges much more rapidly than standard back-propagation. Puskorius and Feldkamp [52] propose a decoupled version of the EKF for speed-up, and apply this technique to the training of recurrent neural networks. More details on establishing the connection between neural network training and the EKF are in the box "Neural Network Training Using Extended Kalman Filter". For a comprehensive review of techniques employing Kalman filters for network training, refer to [53].

C. Connections to Differential Equations and Variational Methods

Differential equations and variational methods are widely applied in numerous signal and image processing problems. Many practical systems of differential equations require numerical methods for their solution, and various iterative algorithms have been developed. Theories around these techniques are extensive and well-grounded, and hence it is interesting to explore the connections between these techniques and modern deep learning methods.

In [14], Chen and Pock adopt the unrolling approach to improve the performance of Perona-Malik anisotropic diffusion [57], a well-known technique for image restoration and edge detection. After generalizing the nonlinear diffusion model to handle non-differentiability, they unroll the iterative discrete partial differential equation solver, and optimize the filter coefficients and regularization parameters through training. The trained system proves to be highly effective in various image reconstruction applications, such as image denoising, single image super-resolution, and JPEG deblocking.

Recently, Chen et al. [58] identified the residual layers inside the well-known ResNet [41] as one iteration of solving a discrete Ordinary Differential Equation (ODE) with the explicit Euler method. As the time step decreases and the number of layers increases, the neural network output approximates the solution of the initial value problem represented by the ODE. Based on this finding they replace the residual layers with an ODE solver, and analytically derive the associated back-propagation rules that enable supervised learning. In this way, they construct a network of "continuous" depth, and achieve higher parameter efficiency than conventional ResNet. The same technique can be applied to other deep learning techniques such as normalizing flows.
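This correspondence is easy to see in code: a residual block computes x_{l+1} = x_l + f(x_l), which is exactly one explicit Euler step of dx/dt = f(x) with unit step size. The sketch below makes the step size explicit; the module and its inner convolutional f are illustrative choices, not the specific architecture of [58].

```python
import torch
import torch.nn as nn


class ResidualEulerBlock(nn.Module):
    """A residual block read as one explicit Euler step of dx/dt = f(x):
    x_{l+1} = x_l + dt * f(x_l). With dt = 1 this is a standard ResNet block;
    shrinking dt while stacking more blocks approaches the continuous limit."""

    def __init__(self, channels, dt=1.0):
        super().__init__()
        self.dt = dt
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.dt * self.f(x)   # explicit Euler step
```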

D. Selected Theoretical Studies

Although LISTA successfully achieves higher efficiency than its iterative counterparts, it does not necessarily recover a more accurate sparse code than the iterative algorithms, and a thorough theoretical analysis of its convergence behavior is yet to be developed.

Xin et al. [59] study the unrolled Iterative Hard Thresholding (IHT) algorithm, which has been widely applied in ℓ_0 norm constrained estimation problems and resembles ISTA to a large extent. The unrolled network is capable of recovering sparse signals from dictionaries with coherent columns. Furthermore, they analyze the optimality criteria for the network to recover the sparse code, and verify that the network can achieve linear convergence rate under appropriate training.

In a similar fashion, Chen et al. [20] establish a linear convergence guarantee for the unrolled ISTA network. They also derive a weight coupling scheme similar to [59]. As a follow-up, Liu et al. [19] characterize optimal network parameters analytically by imposing mutual incoherence conditions on the network weights. Analytical derivation of the optimal parameters helps reduce the parameter dimensionality to a large extent. Furthermore, they demonstrate that a network with analytic parameters can be as effective as the network trained completely from data. For more details on the theoretical studies around LISTA, refer to the box "Convergence and Optimality Analysis of LISTA".

Papyan et al. [60] interpret CNNs as executing finite iterations of the Multi-Layer Convolutional Sparse Coding (ML-CSC) algorithm. In other words, a CNN can be viewed as an unrolled ML-CSC algorithm. In this interpretation, the convolution operations naturally emerge out of a convolutional sparse representation, while the commonly used soft-thresholding operation can be regarded as a symmetrized ReLU. They also analyze the ML-CSC problem and offer theoretical guarantees such as uniqueness of the multi-layer sparse representation, stability of the solutions under small perturbations, and effectiveness of the ML-CSC algorithm in terms of sparse recovery. In a recent follow-up work [61], they further propose dedicated iterative optimization algorithms for solving the ML-CSC problem, and demonstrate superior efficiency over conventional algorithms such as ADMM and FISTA for solving the multi-layer basis pursuit problem.

V. PERSPECTIVES AND RECENT TRENDS

A. Distilling the Power of Algorithm Unrolling

In recent years, algorithm unrolling has proved highly effective in achieving superior performance and higher efficiency in many practical domains. A question that naturally arises is: why is it so powerful?

Fig. 15 provides a high-level illustration of how algorithm unrolling can be advantageous compared with both traditional iterative algorithms and generic neural networks, from a functional approximation perspective. By parameter tuning and customizations, a traditional iterative algorithm spans a relatively small subset of the functions of interest, and thus has limited representation power. Consequently, it is capable of approximating a given target function reasonably well, while still leaving some gaps that undermine performance in practice. Nevertheless, iterative algorithms generalize relatively well in limited training scenarios. From a statistical learning perspective, iterative algorithms correspond to models of high bias but low variance.

On the other hand, a generic neural network is capable of approximating the target function more accurately thanks to its universal approximation capability. Nevertheless, as it typically consists of an enormous number of parameters, it constitutes a large subset in the function space. Therefore, when performing network training the search space becomes large and training is a major challenge. The high dimensionality of the parameters also requires an abundance of training samples, and generalization becomes an issue. Furthermore, network efficiency may also suffer as the network size increases. Generic neural networks are essentially models of high variance but low bias.

In contrast, the unrolled network, by expanding the capacity of iterative algorithms, can approximate the target function more accurately, while spanning a relatively small subset of the function space. The reduced size of the search space alleviates the burden of training and the requirement for large-scale training datasets. Since iterative algorithms are carefully developed based on domain knowledge and already provide reasonably accurate approximations of the target function, by extending them and training from real data unrolled networks can often obtain highly accurate approximations of the target functions. As an intermediate state between generic networks and iterative algorithms, unrolled networks typically have relatively low bias and variance simultaneously. Table III summarizes some features of iterative algorithms, generic networks, and unrolled networks.

Convergence and Optimality Analysis of LISTA

Although it is shown in [11] that LISTA empirically achieves higher efficiency than ISTA through training, several conceptual issues remain to be addressed. First, it is not clear whether LISTA exhibits superior performance over ISTA, even under particular scenarios; second, the convergence rate of LISTA is unknown; third, LISTA actually differs from ISTA by introducing artificial parameter substitutions; and finally, the optimal parameters are learned from data, and it is difficult to get a sense of what they look like. To address these open issues, several recent theoretical studies have been conducted. A common assumption is that there exists a sparse code x* ∈ R^m which approximately satisfies the linear model y ≈ Wx*, where W ∈ R^{n×m} and m > n. More specifically, it is commonly assumed that ‖x*‖_0 ≤ s for some positive integer s, where ‖·‖_0 counts the number of nonzero entries. Xin et al. [59] examine a closely related sparse coding problem:

\min_x \tfrac{1}{2}\|y - Wx\|_2^2 \quad \text{subject to} \quad \|x\|_0 \leq k,    (31)

where k is a pre-determined integer that controls the sparsity level of x. In [59] a network is constructed by unrolling the Iterative Hard-Thresholding (IHT) algorithm, which has a similar form to ISTA. At layer l, the following iteration is performed:

x^{l+1} = H_k\{W_t x^l + W_e y\},    (32)

where H_k is the hard-thresholding operator which keeps the k coefficients of largest magnitude and zeroes out the rest. Xin et al. [59] proved that, in order for the IHT-based network to recover x*, it must be the case that

W_t = I - \Gamma W,    (33)

for some matrix Γ, which implies that the implicit variable substitutions W_t = I − (1/µ)W^T W and W_e = (1/µ)W^T may not play as big a role as it seems, as long as the network is trained properly so that it acts as a generic sparse recovery algorithm. Furthermore, Xin et al. showed that, upon some modifications such as using layer-specific parameters, the learned network can recover the sparse code even when W admits correlated columns, a scenario known to be particularly challenging for traditional iterative sparse coding algorithms.

Chen et al. [20] perform a similar analysis on LISTA with layer-specific parameters, i.e., in layer l the parameters (W_t^l, W_e^l, λ^l) are used instead. Similar to Xin et al. [59], they also proved that, under certain mild assumptions, whenever LISTA recovers x*, the following weight coupling scheme must be satisfied asymptotically:

W_t^l - (I - W_e^l W) \to 0, \quad \text{as } l \to \infty,

which shows that the implicit variable substitutions may be inconsequential in an asymptotic sense. Therefore, they adopted the following coupled parameterization scheme:

W_t^l = I - W_e^l W,

and proved that the resulting network recovers x* at a linear rate if the parameters (W_e^k, λ^k)_{k=1}^{∞} are appropriately selected. They further integrate a support selection scheme into the network. The network thus has both weight coupling and support selection structures and is called LISTA-CPSS.

Liu et al. [19] extend Chen et al. [20]'s work by analytically characterizing the optimal weights W_e^k for LISTA-CPSS. They proved that, under certain regularity conditions, a linear convergence rate can be achieved if (W_e^k, λ^k)_k are chosen in a specific form. This implies that the network with analytic parameters can be asymptotically as efficient as the trained version. Although the analytic forms may be nontrivial to compute in practice, their analysis helps reduce the number of network parameters dramatically.

B. Trends: Expanding Application Landscape and Addressing Implementation Concerns

A continuing trend in recent years is to explore more general underlying iterative algorithms. Earlier unrolling approaches were centered around the ISTA algorithm [11], [23], [22], while recently other alternatives have been pursued, such as proximal splitting [56], ADMM [12], and half quadratic splitting [13], to name a few. Consequently, a growing number of unrolling approaches as well as novel unrolled network architectures appear in recent publications.

In addition to expanding the methodology, researchers are broadening the application scenarios of algorithm unrolling. For instance, in communications Samuel et al. [62] propose a deep network called DetNet, based on unrolling the projected gradient descent algorithm for least squares recovery. In multiple-input-multiple-output detection tasks, DetNet achieves similar performance to a detector based on semidefinite relaxation, while being more than 30 times faster. Furthermore, DetNet exhibits promising performance in handling ill-conditioned channels, and is more robust than the approximate message passing based detector as it does not require knowledge of the noise variance. More examples of emerging



TABLE III
FEATURE COMPARISONS OF ITERATIVE ALGORITHMS, GENERIC DEEP NETWORKS, AND UNROLLED NETWORKS.

Techniques               Performance    Efficiency    Parameter Dimensionality    Interpretability    Generalizability
Iterative Algorithms     Low            Low           Low                         High                High
Generic Deep Networks    High           High          High                        Low                 Low
Unrolled Networks        High           High          Middle                      High                Middle

[Fig. 14 diagram: at time step k, x_k is fed through the neural network h_k(·; w_k) to produce ŷ_k, which is compared with y_k under the MSE loss; the parameters then transition to time step k+1 after adding the noise ω_k, where h_{k+1}(·; w_{k+1}) processes x_{k+1}.]

Fig. 14. Visual illustration of the state-transition model for neural network training. The training data can be viewed as sequentially feeding through the neural network, and the network parameters can be viewed as system states. For more details, refer to the box "Neural Network Training Using Extended Kalman Filter".

[Fig. 15 diagram: in the space of functions, the span of an unrolled network sits between the span of an iterative algorithm and that of a generic neural network, relative to the target function.]

Fig. 15. A high-level unified interpretation of algorithm unrolling from a functional approximation perspective.

unrolling techniques in communications can be found in [63].

The unrolled network can share parameters across all the layers, or carry layer-specific parameters. In the former case, the networks are typically more parameter efficient. However, how to train the network effectively is a challenge because the networks essentially resemble RNNs and may similarly suffer from gradient explosion and vanishing problems. In the latter case, the networks slightly deviate from the original iterative algorithm and may not completely inherit its

[Fig. 16 diagram: an iterative algorithm with input x^0 and output x^L, where each iteration computes y^l ← f(x^l), z^l ← g(y^l), x^{l+1} ← h(z^l) for l = 1, 2, ..., L−1; one of the steps is replaced by a neural network.]

Fig. 16. An alternative approach to algorithm unrolling is to replace one step of the iterative algorithm with an intact conventional neural network.

theoretical benefits such as convergence guarantees. However, the networks can have enhanced representation power and adapt to real-world scenarios more accurately. The training may also be much easier compared with RNNs. In recent years, a growing number of unrolling techniques allow the parameters to vary from layer to layer.

An interesting concern relates to the deployment of neural networks on resource-constrained platforms, such as digital single-lens reflex cameras and mobile devices. The heavy storage demand renders many top-performing deep networks impractical, while straightforward network compression usually deteriorates performance significantly. Therefore, in addition to computational efficiency, researchers are nowadays paying increasing attention to parameter efficiency, which has drawn further research attention to algorithm unrolling.

Finally, there are other factors to be considered when constructing unrolled networks. In particular, many iterative algorithms, when unrolled straightforwardly, may introduce highly non-linear and/or non-smooth operations such as hard-thresholding. Therefore, it is usually desirable to design algorithms whose iteration procedures are either smooth or can be well approximated by smooth operators. Another aspect relates to the network depth. Although deeper networks offer higher representation power, they are generally harder to train in practice [41]. Indeed, techniques such as stacked pre-training have been frequently employed in existing algorithm unrolling approaches to overcome the training difficulty to some extent. Taking this into account, iterative algorithms with faster convergence rates and simpler iterative procedures are generally considered more often.



C. Alternative Approaches

Besides algorithm unrolling, there are other approaches for characterizing or enhancing the interpretability of deep networks. The initial motivation of neural networks was to model the behavior of the human brain. Traditionally, neural networks have often been interpreted from a neurobiological perspective. However, discrepancies between the actual human brain and artificial neural networks have been constantly observed. In recent years, there have been other interesting works on identifying and quantifying network interpretability by analyzing the correlations between neuron activations and human perception. One such example is the emerging technique called "network dissection" [64], which studies how the neurons capture semantic objects in the scene and how state-of-the-art networks internally represent high-level visual concepts. Specifically, Zhou et al. [64] analyze the neuron activations on pixel-level annotated datasets, and quantify the network interpretability by correlating the neuron activations with ground-truth annotations. Bau et al. [65] extend this technique to generative adversarial networks. These works complement algorithm unrolling by offering visual and biological interpretations of deep networks. However, they are mainly focused on characterizing the interpretability of existing networks and are less effective at connecting neural networks with traditional iterative algorithms and motivating novel network architectures.

Another closely related technique is to employ a conventional deep network as a drop-in replacement for certain procedures in an iterative algorithm. Fig. 16 provides a visual illustration of this technique. The universal approximation theorem [66] justifies the use of neural networks to approximate the algorithmic procedures, as long as they can be represented as continuous mappings. For instance, in [51] Metzler et al. observe that one step of ISTA may be treated as a denoising procedure, and hence can be replaced by a denoising CNN. The same approach applies to AMP, an extension of ISTA. In a similar fashion, in [67] Gupta et al. replace the projection procedure in projected gradient descent with a denoising CNN. Shlezinger et al. [68] replace the evaluation of the log-likelihood in the Viterbi algorithm with dedicated machine learning methods, including a deep neural network.
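The following is a minimal sketch of this idea in the spirit of [51], [67]: a proximal-gradient loop in which the shrinkage/projection step is replaced by a small denoising CNN. The toy network architecture, the callable forward operator A and its adjoint At, and the step size are illustrative assumptions, not the specific designs of those works.

```python
import torch
import torch.nn as nn


class SmallDenoiser(nn.Module):
    """Stand-in denoising CNN; in practice a trained denoiser (e.g., as in [51]) is used."""

    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x):
        return x - self.net(x)   # predict and subtract the noise


def pgd_with_denoiser(y, A, At, denoiser, step=1.0, num_iter=10):
    """Proximal-gradient iterations where the proximal/shrinkage step is replaced
    by a denoising network. A and At are callables implementing the forward
    operator and its adjoint on (batch, channel, H, W) tensors."""
    x = At(y)
    for _ in range(num_iter):
        x = x - step * At(A(x) - y)   # gradient step on the data-fidelity term
        x = denoiser(x)               # learned "proximal" step
    return x
```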

This technique has the advantage of inheriting knowledge about conventional deep networks, such as network architectures, training algorithms, initialization schemes, etc. In addition, in practice this technique can effectively compensate for the limitations of iterative algorithms. For instance, Shlezinger et al. [68] demonstrated that, by replacing part of the Viterbi algorithm with a neural network, full knowledge of the statistical relationship between channel input and output is no longer necessary. Therefore, the resulting algorithm achieves higher robustness and better performance under model imperfections. Nevertheless, the procedures themselves are still approximated abstractly via conventional neural networks.

VI. CONCLUSIONS

In this article we provided an extensive review of algorithm unrolling, starting with LISTA as a basic example. We then showcased practical applications of unrolling in various real-world signal and image processing problems. In many application domains, the unrolled interpretable deep networks offer state-of-the-art performance and achieve high computational efficiency. From a conceptual standpoint, algorithm unrolling also helps reveal the connections between deep learning and other important categories of approaches that are widely applied for solving signal and image processing problems.

Although algorithm unrolling is a promising technique to build efficient and interpretable neural networks, and has already achieved success in many domains, it is still evolving. We conclude this article by discussing limitations and open challenges related to algorithm unrolling, and suggest possible directions for future research.

Proper Training of the Unrolled Networks: Unrolling techniques provide a powerful, principled framework for constructing interpretable and efficient deep networks; nevertheless, the full potential of unrolled networks can be exploited only when they are trained appropriately. Compared to popular conventional networks (CNNs, auto-encoders), unrolled networks usually exhibit customized structures. Therefore, existing training schemes may not work well. In addition, the unrolled network sometimes shares parameters among different layers, and thus resembles an RNN, which is well known to be difficult to train [69]. Therefore, many existing works apply greedy layer-wise pre-training. The development of well-grounded end-to-end training schemes for unrolled networks continues to be a topic of great interest.

A topic of paramount importance is how to initialize the network. While there are well-studied methods for initializing conventional networks [70], [2], how to systematically transfer such knowledge to customized unrolled networks remains a challenge. In addition, how to prevent vanishing and exploding gradients during training is another important issue. Developing equivalents or counterparts of established practices such as Batch Normalization [71] and Residual Learning [41] for unrolled networks is a viable research direction.

Bridging the Gap between Theory and Practice: While substantial progress has been made towards understanding network behavior through unrolling, more work needs to be done to thoroughly understand its mechanism. Although the effectiveness of some networks on image reconstruction tasks has been explained to some extent by drawing parallels to sparse coding algorithms, it is still mysterious why state-of-the-art networks perform well on various recognition tasks. Furthermore, unfolding itself is not uniquely defined. For instance, there are multiple ways to choose the underlying iterative algorithms, to decide which parameters become trainable and which parameters to fix, and more. A formal study on how these choices affect convergence and generalizability can provide valuable insights and practical guidance.

Another interesting direction is to develop a theory that provides guidance for practical applications. For instance, it is interesting to perform analyses that guide practical network design choices, such as the dimensions of parameters, the network depth, etc. It is particularly interesting to identify factors that have a high impact on network performance.

Improving the Generalizability: One of the critical limitations of common deep networks is their lack of generalizability, i.e., severe performance degradation when operating on datasets significantly different from those used for training. Compared with neural networks, iterative algorithms usually generalize better, and it is interesting to explore how to maintain this property in unrolled networks. Preliminary investigations have experimentally shown improved generalization of unrolled networks in a few cases [33], but a formal theoretic understanding remains elusive and is highly desirable. From an impact standpoint, this line of research may provide newer additions to approaches for semi-supervised/unsupervised learning, and offer practical benefits when training data is limited or when working with resource-constrained platforms.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Adv. Neural Inform. Process. Syst., 2012, pp. 1097–1105.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in Proc. IEEE Int. Conf. Computer Vision, Dec. 2015, pp. 1026–1034.
[3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-scale Hierarchical Image Database," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
[4] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in Int. Conf. Medical Image Computing and Computer Assisted Intervention, 2015, pp. 234–241.
[5] N. Ibtehaz and M. S. Rahman, "MultiResUNet: Rethinking the U-Net Architecture for Multimodal Biomedical Image Segmentation," Neural Networks, 2019.
[6] G. Nishida, A. Bousseau, and D. G. Aliaga, "Procedural Modeling of a Building from a Single Image," Computer Graphics Forum (Eurographics), vol. 37, no. 2, 2018.
[7] M. Tofighi, T. Guo, J. K. P. Vanamala, and V. Monga, "Prior Information Guided Regularized Deep Learning for Cell Nucleus Detection," IEEE Trans. Med. Imag., vol. 38, no. 9, pp. 2047–2058, Sep. 2019.
[8] T. Guo, H. Seyed Mousavi, and V. Monga, "Adaptive Transform Domain Image Super-Resolution via Orthogonally Regularized Deep Networks," IEEE Trans. Image Process., vol. 28, no. 9, pp. 4685–4700, Sep. 2019.
[9] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang, "FSRNet: End-to-end Learning Face Super-Resolution with Facial Priors," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 2492–2501.
[10] A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos, "Using Deep Neural Networks for Inverse Problems in Imaging: Beyond Analytical Methods," IEEE Signal Process. Mag., vol. 35, no. 1, pp. 20–36, 2018.
[11] K. Gregor and Y. LeCun, "Learning Fast Approximations of Sparse Coding," in Proc. Int. Conf. Machine Learning, 2010.
[12] Y. Yang, J. Sun, H. Li, and Z. Xu, "ADMM-CSNet: A Deep Learning Approach for Image Compressive Sensing," IEEE Trans. Pattern Anal. Mach. Intell., to appear, 2019.
[13] Y. Li, M. Tofighi, V. Monga, and Y. C. Eldar, "An Algorithm Unrolling Approach to Deep Image Deblurring," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2019.
[14] Y. Chen and T. Pock, "Trainable Nonlinear Reaction Diffusion: A Flexible Framework for Fast and Effective Image Restoration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1256–1272, 2017.
[15] X. Glorot, A. Bordes, and Y. Bengio, "Deep Sparse Rectifier Neural Networks," in Proc. Int. Conf. Artificial Intelligence and Statistics, 2011, pp. 315–323.
[16] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient BackProp," in Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, pp. 9–48, Springer, Berlin, Heidelberg, 2012.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds., pp. 318–362, MIT Press, Cambridge, MA, USA, 1986.
[19] J. Liu, X. Chen, Z. Wang, and W. Yin, "ALISTA: Analytic Weights Are As Good As Learned Weights in LISTA," in Proc. Int. Conf. Learning Representations, 2019.
[20] X. Chen, J. Liu, Z. Wang, and W. Yin, "Theoretical Linear Convergence of Unfolded ISTA and Its Practical Weights and Thresholds," in Adv. Neural Inform. Process. Syst., 2018.
[21] Y. Li and S. Osher, "Coordinate Descent Optimization for L1 Minimization with Application to Compressed Sensing: A Greedy Algorithm," Inverse Problems & Imaging, vol. 3, p. 487, 2009.
[22] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, "Deep Networks for Image Super-Resolution with Sparse Prior," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 370–378.
[23] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, "Deep Convolutional Neural Network for Inverse Problems in Imaging," IEEE Trans. Image Process., vol. 26, no. 9, pp. 4509–4522, 2017.
[24] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications, Cambridge University Press, 2012.
[25] A. Beck and M. Teboulle, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems," SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183–202, 2009.
[26] J. R. Hershey, J. Le Roux, and F. Weninger, "Deep Unfolding: Model-based Inspiration of Novel Deep Architectures," arXiv preprint arXiv:1409.2574, 2014.
[27] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, "Conditional Random Fields as Recurrent Neural Networks," in Proc. Int. Conf. Computer Vision, Dec. 2015, pp. 1529–1537.
[28] C. J. Schuler, M. Hirsch, S. Harmeling, and B. Scholkopf, "Learning to Deblur," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 7, pp. 1439–1451, July 2016.
[29] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, "Deep Learning Markov Random Field for Semantic Segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 8, pp. 1814–1828, Aug. 2018.
[30] O. Solomon, R. Cohen, Y. Zhang, Y. Yang, Q. He, J. Luo, R. J. G. van Sloun, and Y. C. Eldar, "Deep Unfolded Robust PCA with Application to Clutter Suppression in Ultrasound," IEEE Trans. Med. Imag., to appear, 2019.
[31] Y. Ding, X. Xue, Z. Wang, Z. Jiang, X. Fan, and Z. Luo, "Domain Knowledge Driven Deep Unrolling for Rain Removal from Single Image," in Int. Conf. Digital Home, IEEE, 2018, pp. 14–19.
[32] Z. Q. Wang, J. L. Roux, D. Wang, and J. R. Hershey, "End-to-end Speech Separation with Unfolded Iterative Phase Reconstruction," in Proc. Interspeech, 2018.
[33] Y. Li, M. Tofighi, J. Geng, V. Monga, and Y. C. Eldar, "Efficient and Interpretable Deep Blind Image Deblurring via Algorithm Unrolling," arXiv preprint arXiv:1902.03493, 2019.
[34] R. Timofte, V. De Smet, and L. Van Gool, "A+: Adjusted Anchored Neighborhood Regression for Fast Super-Resolution," in Proc. Asian Conf. Computer Vision, Springer, 2014, pp. 111–126.
[35] C. Dong, C. C. Loy, K. He, and X. Tang, "Image Super-Resolution Using Deep Convolutional Networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016.
[36] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image Super-Resolution via Sparse Representation," IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
[37] D. Perrone and P. Favaro, "A Clearer Picture of Total Variation Blind Deconvolution," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 6, pp. 1041–1055, June 2016.
[38] S. Nah, T. H. Kim, and K. M. Lee, "Deep Multi-scale Convolutional Neural Network for Dynamic Scene Deblurring," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017, vol. 1, p. 3.
[39] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia, "Scale-Recurrent Network for Deep Image Deblurring," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 8174–8182.
[40] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016, pp. 770–778.
[42] J. Eggert and E. Korner, "Sparse Coding and NMF," in Proc. IEEE Int. Jt. Conf. Neural Networks, July 2004, vol. 4, pp. 2529–2533.
[43] D. Gunawan and D. Sen, "Iterative Phase Estimation for the Synthesis of Separated Sources from Single-Channel Mixtures," IEEE Signal Process. Lett., vol. 17, no. 5, pp. 421–424, May 2010.


[44] C. A. Metzler, A. Maleki, and R. G. Baraniuk, "From Denoising to Compressed Sensing," IEEE Trans. Inf. Theory, vol. 62, no. 9, pp. 5117–5144, 2016.
[45] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, "ReconNet: Non-iterative Reconstruction of Images from Compressively Sensed Measurements," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 449–458.
[46] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, "DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2018.
[47] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[48] P. Krahenbuhl and V. Koltun, "Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials," in Adv. Neural Inform. Process. Syst., 2011, pp. 109–117.
[49] D. D. Lee and H. S. Seung, "Learning the Parts of Objects by Non-negative Matrix Factorization," Nature, vol. 401, no. 6755, p. 788, 1999.
[50] C. Févotte, N. Bertin, and J. Durrieu, "Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, March 2009.
[51] C. Metzler, A. Mousavi, and R. Baraniuk, "Learned D-AMP: Principled Neural Network Based Compressive Image Recovery," in Adv. Neural Inform. Process. Syst., 2017, pp. 1772–1783.
[52] G. V. Puskorius and L. A. Feldkamp, "Neurocontrol of Nonlinear Dynamical Systems with Kalman Filter Trained Recurrent Networks," IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 279–297, March 1994.
[53] S. S. Haykin, Kalman Filtering and Neural Networks, Wiley, New York, 2001.
[54] S. Singhal and L. Wu, "Training Multilayer Perceptrons with the Extended Kalman Algorithm," in Adv. Neural Inform. Process. Syst., D. S. Touretzky, Ed., pp. 133–140, Morgan-Kaufmann, 1989.
[55] J. Mairal, F. Bach, and J. Ponce, "Task-Driven Dictionary Learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 791–804, April 2012.
[56] P. Sprechmann, A. M. Bronstein, and G. Sapiro, "Learning Efficient Sparse and Low Rank Models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1821–1833, Sept. 2015.
[57] P. Perona and J. Malik, "Scale-space and Edge Detection Using Anisotropic Diffusion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 7, pp. 629–639, July 1990.
[58] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, "Neural Ordinary Differential Equations," in Adv. Neural Inform. Process. Syst., 2018, pp. 6571–6583.
[59] B. Xin, Y. Wang, W. Gao, D. Wipf, and B. Wang, "Maximal Sparsity with Deep Networks?," in Adv. Neural Inform. Process. Syst., 2016, pp. 4340–4348.
[60] V. Papyan, Y. Romano, and M. Elad, "Convolutional Neural Networks Analyzed via Convolutional Sparse Coding," J. Mach. Learn. Res., vol. 18, pp. 1–52, July 2017.
[61] J. Sulam, A. Aberdam, A. Beck, and M. Elad, "On Multi-Layer Basis Pursuit, Efficient Algorithms and Convolutional Neural Networks," IEEE Trans. Pattern Anal. Mach. Intell., to appear, 2019.
[62] N. Samuel, T. Diskin, and A. Wiesel, "Deep MIMO Detection," in Proc. Int. Workshop on Signal Processing Advances in Wireless Communications, July 2017, pp. 1–5.
[63] A. Balatsoukas-Stimming and C. Studer, "Deep Unfolding for Communications Systems: A Survey and Some New Directions," arXiv preprint arXiv:1906.05774, 2019.
[64] B. Zhou, D. Bau, A. Oliva, and A. Torralba, "Interpreting Deep Visual Representations via Network Dissection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2131–2145, Sep. 2019.
[65] D. Bau, J. Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba, "GAN Dissection: Visualizing and Understanding Generative Adversarial Networks," in Proc. Int. Conf. Learning Representations, 2019.
[66] G. Cybenko, "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, vol. 2, no. 4, pp. 303–314, Dec. 1989.
[67] H. Gupta, K. H. Jin, H. Q. Nguyen, M. T. McCann, and M. Unser, "CNN-based Projected Gradient Descent for Consistent CT Image Reconstruction," IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1440–1453, 2018.
[68] N. Shlezinger, Y. C. Eldar, N. Farsad, and A. J. Goldsmith, "ViterbiNet: Symbol Detection Using a Deep Learning Based Viterbi Algorithm," in IEEE Int. Workshop on Signal Processing Advances in Wireless Communications, July 2019, pp. 1–5.
[69] R. Pascanu, T. Mikolov, and Y. Bengio, "On the Difficulty of Training Recurrent Neural Networks," in Proc. Int. Conf. Machine Learning, 2013, pp. 1310–1318.
[70] X. Glorot and Y. Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks," in Proc. Int. Conf. Artificial Intelligence and Statistics, Mar. 2010.
[71] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in Proc. Int. Conf. Machine Learning, 2015, pp. 448–456.