
3194 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 7, JULY 2016

Robust Single Image Super-Resolution via Deep Networks With Sparse Prior

Ding Liu, Student Member, IEEE, Zhaowen Wang, Member, IEEE, Bihan Wen, Student Member, IEEE, Jianchao Yang, Member, IEEE, Wei Han, and Thomas S. Huang, Fellow, IEEE

Abstract— Single image super-resolution (SR) is an ill-posed problem, which tries to recover a high-resolution image from its low-resolution observation. To regularize the solution of the problem, previous methods have focused on designing good priors for natural images, such as sparse representation, or directly learning the priors from a large data set with models, such as deep neural networks. In this paper, we argue that domain expertise from the conventional sparse coding model can be combined with the key ingredients of deep learning to achieve further improved results. We demonstrate that a sparse coding model particularly designed for SR can be incarnated as a neural network with the merit of end-to-end optimization over training data. The network has a cascaded structure, which boosts the SR performance for both fixed and incremental scaling factors. The proposed training and testing schemes can be extended for robust handling of images with additional degradation, such as noise and blurring. A subjective assessment is conducted and analyzed in order to thoroughly evaluate various SR techniques. Our proposed model is tested on a wide range of images, and it significantly outperforms the existing state-of-the-art methods for various scaling factors both quantitatively and perceptually.

Index Terms— Image super-resolution, deep neural networks, sparse coding.

Manuscript received February 16, 2016; revised April 30, 2016; accepted May 1, 2016. Date of publication May 6, 2016; date of current version May 23, 2016. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jie Liang.

D. Liu, W. Han, and T. S. Huang are with the Department of Electrical and Computer Engineering, Beckman Institute, University of Illinois at Urbana–Champaign, Urbana, IL 61801 USA (e-mail: [email protected]; [email protected]; [email protected]).

Z. Wang is with Adobe Systems Inc., San Jose, CA 95110 USA (e-mail: [email protected]).

B. Wen is with the Coordinated Science Laboratory, Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, IL 61801 USA (e-mail: [email protected]).

J. Yang is with Snapchat Inc., Venice, CA 90291 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2016.2564643

1057-7149 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Single image super-resolution is usually cast as an inverse problem of recovering the original high-resolution (HR) image from one low-resolution (LR) observation image. Since the known variables in LR images are greatly outnumbered by the unknowns in typically desired HR images, this problem is highly ill-posed and has limited the use of SR techniques in many practical applications [1], [2].

A large number of single image SR methods have been proposed, exploiting various priors of natural images to regularize the solution of SR. Analytical priors, such as bicubic interpolation, work well for smooth regions, while image models based on statistics of edges [3] and gradients [4] can recover sharper structures. In the patch-based SR methods, HR patch candidates are represented as the sparse linear combination of dictionary atoms trained from external databases [5]–[7], or recovered from similar examples in the LR image itself at different locations and across different scales [8], [9]. A regression model is built between LR and HR patches in [10] and [11]. A comprehensive review of more SR methods can be found in [12].

More recently, inspired by the great success achieved by deep learning [13] in other computer vision tasks, researchers have begun to use neural networks with deep architectures for image SR. Multiple layers of collaborative auto-encoders are stacked together in [14] and [15] for robust matching of self-similar patches. Deep convolutional neural networks (CNN) [16] and deconvolutional networks [17] are designed to directly learn the non-linear mapping from LR space to HR space in a way similar to coupled sparse coding [6]. As these deep networks allow end-to-end training of all the model components between LR input and HR output, significant improvements have been observed over their shallow counterparts.

The networks in [14] and [16] are built with generic architectures, which means all their knowledge about SR is learned from training data. On the other hand, people’s domain expertise for the SR problem, such as the natural image prior and the image degradation model, is largely ignored in deep learning based approaches. It is then worthwhile to investigate whether domain expertise can be used to design better deep model architectures, or whether deep learning can be leveraged to improve the quality of handcrafted models.

In this paper, we extend the conventional sparse coding model [5] using several key ideas from deep learning, and show that domain expertise is complementary to large learning capacity in further improving SR performance. First, based on the learned iterative shrinkage and thresholding algorithm (LISTA) [18], we implement a feed-forward neural network in which each layer strictly corresponds to one step in the processing flow of sparse coding based image SR. In this way, the sparse representation prior is effectively encoded in our network structure; at the same time, all the components of sparse coding can be trained jointly through back-propagation. This simple model, which is named sparse coding based network (SCN), achieves notable improvement over the generic CNN model [16] in terms of both recovery accuracy and human perception, and yet has a compact model size. Moreover, with the correct understanding of each layer’s physical meaning, we have a more principled way to initialize the parameters of SCN, which helps to improve optimization speed and quality.

A single network is only able to perform image SR by a particular scaling factor. In [16], different networks are trained for different scaling factors. In this paper, we propose a cascade of multiple SCNs to achieve SR for arbitrary factors. This approach, motivated by the self-similarity based SR approach [8], not only increases the scaling flexibility of our model, but also reduces artifacts for large scaling factors. Moreover, inspired by the multi-pass scheme of image denoising [19], we demonstrate that the SR results can be further enhanced by cascading multiple SCNs for SR of a fixed scaling factor. The cascade of SCNs (CSCN) can also benefit from the end-to-end training of deep network with a specially designed multi-scale cost function.

In practical SR scenarios, the real LR measurements usually suffer from various types of corruptions, such as noise and blurring. Sometimes the degradation process is even too complicated or unclear. We propose several schemes using our SCN to robustly handle such practical SR cases. When the degradation mechanism is unknown, we fine-tune the generic SCN with the requirement of only a small amount of real training data and manage to adapt our model to the new scenario. When the forward model for LR generation is clear, we propose an iterative SR scheme incorporating SCN with additional regularization based on priors from the degradation mechanism.

Subjective assessment is important to the SR technology because the commercial products equipped with such technology are usually evaluated subjectively by the end users. In order to thoroughly compare our model with other prevailing SR methods, we conduct a systematic subjective evaluation among these methods, in which the assessment results are statistically analyzed and one score is given for each method.

In short, the contributions of this paper include:

• combining the domain expertise of sparse coding and the merits of deep learning to achieve better SR performance with faster training and smaller model size;

• exploring network cascading for arbitrary scaling factors in order to further enhance SR performance;

• utilizing SCN to robustly handle the practical SR scenarios with non-ideal LR measurements;

• conducting a subjective evaluation on a number of recent state-of-the-art SR methods.

This paper is built upon our previous work in [20] and [21] with several additional contributions. First, we incorporate one more network cascading technique of SCN which further improves the SR performance in [21]. Second, a novel method of coping with the practical SR problems is presented which elaborates both of the training and testing schemes. Third, we introduce in detail the system of subjective assessment and its scoring mechanism. Finally, we provide a more comprehensive experiment section for qualitative and quantitative analysis, which includes extensive experimental results for the practical SR methods.

Fig. 1. A LISTA network [18] with 2 time-unfolded recurrent stages, whose output $\alpha$ is an approximation of the sparse code of input signal $y$. The linear weights $W$, $S$ and the shrinkage thresholds $\theta$ are learned from data.

II. RELATED WORK

A. Image SR Using Sparse Coding

The sparse representation based SR method [5] models the transform from each local patch $y \in \mathbb{R}^{m_y}$ in the bicubic-upscaled LR image to the corresponding patch $x \in \mathbb{R}^{m_x}$ in the HR image. The dimension $m_y$ is not necessarily the same as $m_x$ when image features other than raw pixels are used to represent patch $y$. It is assumed that the LR (HR) patch $y$ ($x$) can be represented with respect to an overcomplete dictionary $D_y$ ($D_x$) using some sparse linear coefficients $\alpha_y$ ($\alpha_x$) $\in \mathbb{R}^{n}$, which are known as the sparse code. Since the degradation process from $x$ to $y$ is nearly linear, the patch pair can share the same sparse code $\alpha_y = \alpha_x = \alpha$ if the dictionaries $D_y$ and $D_x$ are defined properly. Therefore, for an input LR patch $y$, the HR patch can be recovered as

$$x = D_x \alpha, \quad \text{s.t.} \quad \alpha = \arg\min_{z} \|y - D_y z\|_2^2 + \lambda \|z\|_1, \tag{1}$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm, which is convex and sparsity-inducing, and $\lambda$ is a regularization coefficient.
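The recovery in (1) can be sketched with plain ISTA, which is one standard solver for this kind of $\ell_1$-regularized least-squares problem; it is used here only for illustration and is not claimed to be the solver of [5]. The function names, the iteration count, and the step size $1/(2\sigma_{\max}^2)$ (with $\sigma_{\max}$ the largest singular value of $D_y$) are assumptions of this sketch.

```python
import numpy as np

def soft_threshold(a, theta):
    """Element-wise shrinkage: sign(a) * max(|a| - theta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

def sparse_objective(y, D_y, z, lam):
    """Objective of (1): ||y - D_y z||_2^2 + lam * ||z||_1."""
    return np.sum((y - D_y @ z) ** 2) + lam * np.sum(np.abs(z))

def ista_sparse_code(y, D_y, lam, n_iter=200):
    """Approximate the minimizer of (1) with plain ISTA (illustrative solver)."""
    # The smooth term ||y - D_y z||^2 has gradient 2 D_y^T (D_y z - y),
    # whose Lipschitz constant is 2 * sigma_max(D_y)^2.
    lipschitz = 2.0 * np.linalg.norm(D_y, 2) ** 2
    z = np.zeros(D_y.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D_y.T @ (D_y @ z - y)
        z = soft_threshold(z - grad / lipschitz, lam / lipschitz)
    return z

def recover_hr_patch(y, D_y, D_x, lam, n_iter=200):
    """Code the LR patch with D_y, then reconstruct the HR patch as D_x @ alpha."""
    alpha = ista_sparse_code(y, D_y, lam, n_iter)
    return D_x @ alpha, alpha
```

A larger `lam` drives more coefficients of `alpha` to exactly zero; the HR patch is then a combination of only a few columns of `D_x`.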

In order to learn the dictionary pair $(D_y, D_x)$, the goal is to minimize the recovery error of $x$ and $y$, and thus the loss function $L$ in [6] is defined as

$$L = \frac{1}{2}\left(\gamma \|x - D_x z\|_2^2 + (1 - \gamma)\|y - D_y z\|_2^2\right), \tag{2}$$

where $\gamma$ ($0 < \gamma \le 1$) balances the two reconstruction errors. Then the optimal dictionary pair $\{D_x^*, D_y^*\}$ can be found by minimizing the empirical expectation of (2) over all the training LR/HR pairs,

$$\begin{aligned}
\min_{D_x, D_y} \quad & \frac{1}{N}\sum_{i=1}^{N} L(D_x, D_y, x_i, y_i) \\
\text{s.t.} \quad & z_i = \arg\min_{\alpha} \|y_i - D_y \alpha\|_2^2 + \lambda \|\alpha\|_1, \quad i = 1, 2, \ldots, N, \\
& \|D_x(:, k)\|_2 \le 1, \quad \|D_y(:, k)\|_2 \le 1, \quad k = 1, 2, \ldots, K.
\end{aligned} \tag{3}$$

Since the objective function in (2) is highly nonconvex, the dictionaries $D_y$ and $D_x$ are usually learned alternately, updating one while keeping the other fixed [6].
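As a small illustration of the quantities in (2) and (3), the sketch below evaluates the coupled loss for one patch pair and projects dictionary columns back onto the unit-norm constraint. The function names are hypothetical, and the projection step is only one common way to enforce such a constraint; it is not taken from [6].

```python
import numpy as np

def coupled_loss(x, y, z, D_x, D_y, gamma):
    """Loss (2) for one HR/LR patch pair sharing the sparse code z."""
    hr_error = np.sum((x - D_x @ z) ** 2)
    lr_error = np.sum((y - D_y @ z) ** 2)
    return 0.5 * (gamma * hr_error + (1.0 - gamma) * lr_error)

def project_columns(D):
    """Rescale any column with l2 norm above 1 so that ||D(:, k)||_2 <= 1."""
    norms = np.linalg.norm(D, axis=0)
    # Columns already inside the unit ball are divided by 1 and left unchanged.
    return D / np.maximum(norms, 1.0)
```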

B. Network Implementation of Sparse Coding

There is an intimate connection between sparse coding and neural network, which has been well studied in [18] and [22]. A feed-forward neural network as illustrated in Fig. 1 is proposed in [18] to efficiently approximate the sparse code $\alpha$ of input signal $y$ as it would be obtained by solving (1) for a given dictionary $D_y$. The network has a finite number of recurrent stages, each of which updates the intermediate sparse code according to

$$z_{k+1} = h_\theta(W y + S z_k), \tag{4}$$

where $h_\theta$ is an element-wise shrinkage function defined as $[h_\theta(a)]_i = \mathrm{sign}(a_i)(|a_i| - \theta_i)_+$ with positive thresholds $\theta$.

Different from the iterative shrinkage and thresholding algorithm (ISTA) [23], [24] which finds an analytical relationship between network parameters (weights $W$, $S$ and thresholds $\theta$) and sparse coding parameters ($D_y$ and $\lambda$), the authors of [18] learn all the network parameters from training data using a back-propagation algorithm called learned ISTA (LISTA). In this way, a good approximation of the underlying sparse code can be obtained within a fixed number of recurrent stages.

Fig. 2. Top left: the proposed SCN model with a patch extraction layer $H$, a LISTA sub-network for sparse coding (with $k$ recurrent stages denoted by the dashed box), a HR patch recovery layer $D_x$, and a patch combination layer $G$. Top right: a neuron with an adjustable threshold decomposed into two linear scaling layers and a unit-threshold neuron. Bottom: the SCN re-organized with unit-threshold neurons and adjacent linear layers merged together in the gray boxes.
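The recurrence in (4) can be sketched as a short forward pass. The ISTA-derived initialization below follows from writing one ISTA step for the objective in (1) in the form of (4); it is a standard derivation rather than something stated in this paper, and in LISTA these values would only be a starting point before $W$, $S$ and $\theta$ are trained. Starting from $z_0 = 0$ and the function names are assumptions of this sketch.

```python
import numpy as np

def shrink(a, theta):
    """Element-wise shrinkage h_theta: sign(a) * max(|a| - theta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

def lista_forward(y, W, S, theta, num_updates):
    """Apply z <- h_theta(W y + S z) num_updates times, starting from z = 0."""
    z = np.zeros(S.shape[0])
    for _ in range(num_updates):
        z = shrink(W @ y + S @ z, theta)
    return z

def ista_initialization(D_y, lam):
    """Parameters that make (4) reproduce ISTA on ||y - D_y z||_2^2 + lam ||z||_1."""
    # Step size 1/L with L = 2 * sigma_max(D_y)^2 (Lipschitz constant of the gradient).
    lipschitz = 2.0 * np.linalg.norm(D_y, 2) ** 2
    n = D_y.shape[1]
    W = (2.0 / lipschitz) * D_y.T
    S = np.eye(n) - (2.0 / lipschitz) * (D_y.T @ D_y)
    theta = np.full(n, lam / lipschitz)
    return W, S, theta
```

With this initialization the network output after a given number of updates equals the ISTA iterate after the same number of steps; training then lets the network reach a comparable approximation in fewer stages.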

C. Generic Convolutional Neural Network for SR

As a successful example of deep learning for single image SR, Dong et al. [16] propose a fully convolutional neural network to directly learn the mapping from the input LR image to the output HR image. It is designed to utilize three convolutional layers to mimic the patch extraction and representation, non-linear mapping and reconstruction of the sparse representation based SR methods, respectively. Due to the end-to-end training strategy that jointly optimizes all the parameters and the large learning capacity of neural networks, this method notably outperforms its conventional shallow counterpart.

III. SPARSE CODING BASED NETWORK FOR IMAGE SR

A. Network Architecture

Given the fact that sparse coding can be effectively implemented with a LISTA network, it is straightforward to build a multi-layer neural network that mimics the processing flow of the sparse coding based SR method [5]. As in most patch-based SR methods, our sparse coding based network (SCN) takes the bicubic-upscaled LR image $I_y$ as input, and outputs the full HR image $I_x$. Fig. 2 shows the main network structure, and each of the layers is described in the following.

The input image $I_y$ first goes through a convolutional layer $H$ which extracts features for each LR patch. There are $m_y$ filters of spatial size $s_y \times s_y$ in this layer, so that our input patch size is $s_y \times s_y$ and its feature representation $y$ has $m_y$ dimensions.

Each LR patch $y$ is then fed into a LISTA network with a finite number of $k$ recurrent stages to obtain its sparse code $\alpha \in \mathbb{R}^{n}$. Each stage of LISTA consists of two linear layers parameterized by $W \in \mathbb{R}^{n \times m_y}$ and $S \in \mathbb{R}^{n \times n}$, and a nonlinear neuron layer with activation function $h_\theta$. The activation thresholds $\theta \in \mathbb{R}^{n}$ are also to be updated during training, which complicates the learning algorithm. To restrict all the tunable parameters in our linear layers, we do a simple trick to rewrite the activation function as

$$[h_\theta(a)]_i = \mathrm{sign}(a_i)\,\theta_i\,(|a_i|/\theta_i - 1)_+ = \theta_i\, h_1(a_i/\theta_i). \tag{5}$$

Eq. (5) indicates the original neuron with an adjustable threshold can be decomposed into two linear scaling layers and a unit-threshold neuron, as shown in the top-right of Fig. 2. The weights of the two scaling layers are diagonal matrices defined by $\theta$ and its element-wise reciprocal, respectively.
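The identity in Eq. (5) is easy to check numerically. The sketch below compares the adjustable-threshold neuron with its decomposition into a scaling by $1/\theta$, a unit-threshold neuron $h_1$, and a scaling by $\theta$; the function names are hypothetical and the thresholds are assumed strictly positive.

```python
import numpy as np

def shrink(a, theta):
    """Adjustable-threshold neuron: sign(a) * max(|a| - theta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

def shrink_decomposed(a, theta):
    """Same neuron as diag(theta) applied to h_1 applied to diag(1/theta) a."""
    scale_in = np.diag(1.0 / theta)   # first linear scaling layer
    scale_out = np.diag(theta)        # second linear scaling layer
    return scale_out @ shrink(scale_in @ a, 1.0)
```

Because both scaling layers are linear, they can be absorbed into the neighbouring weight matrices, which is what allows all trainable parameters to live in linear layers.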

The sparse code $\alpha$ is then multiplied with HR dictionary $D_x \in \mathbb{R}^{m_x \times n}$ in the next linear layer, reconstructing HR patch $x$ of size $s_x \times s_x = m_x$.

In the final layer $G$, all the recovered patches are put back to the corresponding positions in the HR image $I_x$. This is realized via a convolutional filter of $m_x$ channels with spatial size $s_g \times s_g$. The size $s_g$ is determined as the number of neighboring patches that overlap with the same pixel in each spatial direction. The filter will assign appropriate weights to the overlapped recoveries from different patches and take their weighted average as the final prediction in $I_x$.
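The role of the patch combination layer can be illustrated with a simplified version that averages overlapping patches with uniform weights. This is only a sketch: in the network described above the weights of $G$ are a convolutional filter, whereas the uniform averaging, the function names, and the explicit loops here are assumptions made for clarity.

```python
import numpy as np

def extract_patches(image, size, stride=1):
    """Collect flattened size x size patches and their top-left positions."""
    rows = range(0, image.shape[0] - size + 1, stride)
    cols = range(0, image.shape[1] - size + 1, stride)
    positions = [(r, c) for r in rows for c in cols]
    patches = np.stack([image[r:r + size, c:c + size].ravel() for r, c in positions])
    return patches, positions

def combine_patches(patches, positions, size, image_shape):
    """Put patches back in place and average the overlapping predictions."""
    accum = np.zeros(image_shape)
    count = np.zeros(image_shape)
    for patch, (r, c) in zip(patches, positions):
        accum[r:r + size, c:c + size] += patch.reshape(size, size)
        count[r:r + size, c:c + size] += 1.0
    # Pixels not covered by any patch keep the value 0 instead of dividing by 0.
    return accum / np.maximum(count, 1.0)
```

With stride 1 every pixel is covered, so combining unmodified patches returns the original image; in SR the patches would first be replaced by their recovered HR versions.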

As illustrated in the bottom of Fig. 2, after some simple reorganizations of the layer connections, the network described above has some adjacent linear layers which can be merged into a single layer. This helps to reduce the computation load as well as redundant parameters in the network. The layers $H$ and $G$ are not merged because we apply additional nonlinear normalization operations on patches $y$ and $x$, which will be detailed in Sec. VI-A.

Thus, there are 5 trainable layers in total in our network: 2 convolutional layers $H$ and $G$, and 3 linear layers shown as gray boxes in Fig. 2. The $k$ recurrent layers share the same weights and are therefore conceptually regarded as one. Note that all the linear layers are actually implemented as convolutional layers applied on each patch with filter spatial size of $1 \times 1$, a structure similar to the network in network [25]. Also note that all these layers have only weights but no biases (zero biases).

Mean square error (MSE) is employed as the cost function to train the network, and our optimization objective can be expressed as

$$\min_{\Theta} \sum_{i} \left\| SCN(I_y^{(i)}; \Theta) - I_x^{(i)} \right\|_2^2, \tag{6}$$

where $I_y^{(i)}$ and $I_x^{(i)}$ are the $i$-th pair of LR/HR training data, and $SCN(I_y; \Theta)$ denotes the HR image for $I_y$ predicted using the SCN model with parameter set $\Theta$. All the parameters are optimized through the standard back-propagation algorithm. Although it is possible to use other cost terms that are more correlated with human visual perception than MSE, our experimental results show that simply minimizing MSE leads to improvement in subjective quality.

B. Advantages Over Previous Models

The construction of our SCN follows exactly each step in the sparse coding based SR method [5]. If the network parameters are set according to the dictionaries learned in [5], it can reproduce almost the same results. However, after training, SCN learns a more complex regression function and can no longer be converted to an equivalent sparse coding model. The advantage of SCN comes from its ability to jointly optimize all the layer parameters from end to end, while in [5] some variables are manually designed and some are optimized individually by fixing all the others.

Technically, our network is also a CNN and it has layers similar to those of the CNN model proposed in [16] for patch extraction and reconstruction. The key difference is that we have a LISTA sub-network specifically designed to enforce the sparse representation prior, while in [16] a generic rectified linear unit (ReLU) [26] is used for nonlinear mapping. Since SCN is designed based on our domain knowledge in sparse coding, we are able to obtain a better interpretation of the filter responses and have a better way to initialize the filter parameters in training. We will see in the experiments that all these contribute to better SR results, faster training speed and smaller model size than a vanilla CNN.

C. Network Cascade

In this section, we investigate two different network cascade techniques in order to fully exploit our SCN model in SR applications.

1) Network Cascade for SR of a Fixed Scaling Factor: First, we observe that the SR results can be further improved by cascading multiple SCNs trained for the same objective in (6), which is inspired by the multi-pass scheme in [19]. The only difference for training these SCNs is to replace the bicubic interpolated input by its latest HR estimate, while the target output remains the same.

Fig. 3. SR results for the “Lena” image upscaled by 4 times. (a) → (b) → (d) represents the processing flow with a single SCN×4 model. (a) → (c) → (e) represents the processing flow with two cascaded SCN×2 models. PSNR is given in parentheses.

The first SCN acts as a function approximator that models the non-linear mapping from the bicubic upscaled image to the ground-truth image. The following SCN acts as another function approximator, with the starting point changed to a better estimate: the output of its previous SCN.

In other words, the cascade of SCNs as a whole can be considered as a new deeper network having more powerful learning capability, which is able to better approximate the mapping from the LR inputs to the HR counterparts, and these SCNs can be trained jointly to pursue even better SR performance.

2) Network Cascade for Scalable SR: Like most SR models learned from external training examples, the SCN discussed previously can only upscale images by a fixed factor. A separate model needs to be trained for each scaling factor to achieve the best performance, which limits the flexibility and scalability in practical use. One way to overcome this difficulty is to repeatedly enlarge the image by a fixed scale until the resulting HR image reaches a desired size. This practice is commonly adopted in the self-similarity based methods [8], [9], [14], but is not so popular in other cases for the fear of error accumulation during repetitive upscaling.
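The repeated-enlargement idea can be sketched as a small planning routine plus a driver loop. Everything here is illustrative: `upscale_stage` and `resize` are hypothetical callables standing in for one fixed-factor SR model and an image resampler, and resizing the overshoot back down to the exact target is an assumption of this sketch, not a step described in the passage above.

```python
def plan_cascade(target_factor, stage_factor=2.0):
    """Return (number of stages, residual resize factor) to reach target_factor."""
    if stage_factor <= 1.0:
        raise ValueError("stage_factor must be greater than 1")
    stages, total = 0, 1.0
    # Keep enlarging by the fixed stage factor until the target is reached.
    while total < target_factor - 1e-9:
        total *= stage_factor
        stages += 1
    # A residual below 1 means the cascade overshoots and must be shrunk back.
    return stages, target_factor / total

def cascade_upscale(image, target_factor, upscale_stage, resize, stage_factor=2.0):
    """Apply the fixed-factor model repeatedly, then correct any overshoot."""
    stages, residual = plan_cascade(target_factor, stage_factor)
    for _ in range(stages):
        image = upscale_stage(image)
    if abs(residual - 1.0) > 1e-9:
        image = resize(image, residual)
    return image
```

For example, a target factor of 4 with ×2 stages needs two stages and no residual resize, while a target of 3 also needs two stages followed by a resize by 0.75.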

In our case, however, it is observed that a cascade of SCNs trained for small scaling factors can generate even better SR results than a single SCN trained for a large scaling factor, especially when the target scaling factor is large (greater than 2). This is illustrated by the example in Fig. 3. Here an input image is magnified 4 times in two ways: with a single SCN×4 model through the processing flow (a) → (b) → (d); and with a cascade of two SCN×2 models through (a) → (c) → (e). It can be seen that the input to the second cascaded SCN×2 in (c) is already sharper and contains fewer artifacts than the bicubic×4 input to the single SCN×4 in (b), which naturally leads to the better final result in (e) than the one in (d).

Fig. 4. Training cascade of SCNs with multi-scale objectives.

To get a better understanding of the above observation, we can draw a loose analogy between the SR process and a communication system. Bicubic interpolation is like a noisy channel through which an image is “transmitted” from LR domain to HR domain. And our SCN model (or any SR algorithm) behaves as a receiver which recovers clean signals from noisy observations. A cascade of SCNs is then like a set of relay stations that enhance signal-to-noise ratio before the signal becomes too weak for further transmission. Therefore, cascading will work only when each SCN can restore enough useful information to compensate for the new artifacts it introduces as well as the magnified artifacts from previous stages.

3) Training Cascade of Networks: Taking into account the two aforementioned cascade techniques, we can consider the cascade of all SCNs as a deeper network (CSCN), in which the final output of the consecutive SCNs of the same ground truth is connected to the input of the next SCN with bicubic interpolation in between. To construct the cascade, besides stacking several SCNs trained individually with respect to (6), we can also optimize all of them jointly as shown in Fig. 4. Without loss of generality, we assume each stage in Sec. III-C2 has the same scaling factor $s$. Let $I_{j,k}$ ($j > 0$, $k > 0$) denote the output image of the $j$-th SCN in the $k$-th stage upscaled by a total of $s^k$ times. In the same stage, each output of SCNs is compared with the associated ground truth image $I_k$ according to the MSE cost, leading to a multi-scale objective function:

$$\min_{\{\Theta_{j,k}\}} \sum_{i} \sum_{j} \sum_{k} \left\| SCN(I_{j-1,k}^{(i)}; \Theta_{j,k}) - I_k^{(i)} \right\|_2^2, \tag{7}$$

where $i$ denotes the data index, and $j$, $k$ denote the SCN index. For simplicity of notation, $I_{0,k}$ specially denotes the bicubic interpolated image of the final output in the $(k-1)$-th stage upscaled by a total of $s^{k-1}$ times. This multi-scale objective function makes full use of the supervision information in all scales, sharing a similar idea as heterogeneous networks [27]. All the layer parameters $\{\Theta_{j,k}\}$ in (7) could be optimized from end to end by back-propagation. The SCNs that share the same training objective can be trained simultaneously, taking advantage of the merit of deep learning. For the SCNs with different training objectives, we use a greedy algorithm here to train them sequentially from the beginning of the cascade so that we do not need to care about the gradient of bicubic layers. Applying back-propagation through a bicubic layer or its trainable surrogate will be considered in future work.
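For a single training sample, the multi-scale objective in (7) reduces to a sum of squared errors over every SCN output in every stage. The sketch below evaluates that sum given the outputs; the nested-list layout (`outputs[k][j]` for the $(j+1)$-th SCN in the $(k+1)$-th stage) and the function name are assumptions chosen for illustration.

```python
import numpy as np

def multiscale_objective(outputs, ground_truths):
    """Sum of squared errors of every SCN output against its stage's ground truth.

    outputs[k] is the list of SCN output images in stage k (all at scale s^(k+1));
    ground_truths[k] is the ground truth image at that same scale.
    """
    total = 0.0
    for stage_outputs, target in zip(outputs, ground_truths):
        for estimate in stage_outputs:
            total += float(np.sum((estimate - target) ** 2))
    return total
```

Summing this quantity over all training samples gives the value minimized in (7); each stage contributes supervision at its own scale.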

IV. ROBUST SR FOR REAL SCENARIOS

Most recent SR works generate the LR images for both training and testing by downscaling HR images using bicubic interpolation [5], [28]. However, this assumption of the forward model may not always hold in practice. For example, the real LR measurements are usually blurred, or corrupted with noise. Sometimes, the LR generation mechanism may be complicated, or even unknown. We now investigate the practical SR problem, and propose two approaches to handle such non-ideal LR measurements, using the generic SCN. In the case that the underlying mechanism of the real LR generation is unclear or complicated, we propose a data-driven approach that fine-tunes the learned generic SCN with a limited number of real LR measurements as well as their corresponding HR counterparts. On the other hand, if the real training samples are unavailable but the LR generation mechanism is clear, we formulate this inverse problem as a regularized HR image reconstruction problem which can be solved using iterative methods. The proposed methods demonstrate the robustness of our SCN model to different SR scenarios. In the following, we elaborate the details of these two approaches, respectively.

A. Data-Driven SR by Fine-Tuning

Deep learning models can be efficiently transferred from one task to another by re-using the intermediate representation in the original neural network [29]. This method has proven successful on a number of high-level vision tasks, even if there is a limited amount of training data in the new task [30].

The success of super-resolution algorithms usually highly depends on the accuracy of the model of the imaging process. When the underlying mechanism of the generation of LR images is not clear, we can take advantage of the aforementioned merit of deep learning models by learning our model in a data-driven manner, to adapt it for a particular task. Specifically, we start training from the generic SCN model while using a very limited amount of training data from a new SR scenario, and manage to adapt it to the new SR scenario and obtain promising results. In this way, it is demonstrated that the SCN has the strong capability of learning complex mappings from the non-ideal LR measurements to their HR counterparts as well as the high flexibility of adapting to various SR tasks.

B. Iterative SR With Regularization

The second approach considers the case that the mechanism of generating the real LR images is relatively simple and clear, indicating the training data is always available if we synthesize LR images with the known degradation process. We propose an iterative SR scheme which incorporates the generic SCN model with additional regularization based on task-related priors (e.g. the known kernel for deblurring, or the data sparsity for denoising). In this section, we specifically discuss handling blurred and noisy LR measurements in detail as examples, though the iterative SR methods can be generalized to other practical imaging models.

1) Blurry Image Upscaling: The real LR images can be generated with various types of blurring. Directly applying the generic SCN model is obviously not optimal. Instead, with the known blurring kernel, we propose to estimate the regularized version of the HR image $\hat{I}_x$ based on the directly upscaled image $\tilde{I}_x$ by the learned SCN as follows:

$$\hat{I}_x = \arg\min_{I} \|I - \tilde{I}_x\|^2, \quad \text{s.t. } D \cdot B \cdot I = I_y^0 \tag{8}$$

where $I_y^0$ is the original blurred LR input, and the operators $B$ and $D$ are blurring and sub-sampling respectively. Similar to the previous work [5], we use back-projection to iteratively estimate the regularized HR input on which our model can perform better. Specifically, given the regularized estimate $\hat{I}_x^{i-1}$ at iteration $i-1$, we estimate a less blurred LR image $I_y^{i-1}$ by downsampling it using bicubic interpolation. The upscaled $\tilde{I}_x^i$ by the learned SCN serves as the regularizer for the $i$-th iteration as follows:

$$\hat{I}_x^i = \arg\min_{I} \|I - \tilde{I}_x^i\|_2^2 + \|D \cdot B \cdot I - I_y^0\|_2^2 \tag{9}$$

Here we use the penalty method to form an unconstrained problem. The upscaled HR image $\tilde{I}_x^i$ can be computed as $SCN(I_y^{i-1}, \Theta)$. The same process is repeated until convergence. We have applied the proposed iterative scheme to LR images generated from Gaussian blurring and sub-sampling as an example. The empirical performance is illustrated in Sec. VI.
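As a rough illustration of the penalized problem in (9), the following minimal sketch works on 1-D signals, where the blurring and sub-sampling operators can be written as explicit matrices and the quadratic objective has a closed-form minimizer. Everything here is an assumption made for illustration: the function names, the circulant blur, the linear-interpolation `toy_upscaler` standing in for the learned SCN, and plain subsampling standing in for bicubic downsampling. It is not the implementation used in the paper.

```python
import numpy as np

def blur_matrix(n, kernel):
    """Circulant matrix applying a 1-D blur kernel with wrap-around borders."""
    B = np.zeros((n, n))
    half = len(kernel) // 2
    for r in range(n):
        for t, w in enumerate(kernel):
            B[r, (r + t - half) % n] += w
    return B

def subsample_matrix(n, step):
    """Matrix that keeps every `step`-th sample of a length-n signal."""
    return np.eye(n)[::step]

def regularized_estimate(x_tilde, y0, A):
    """Closed-form minimizer of ||x - x_tilde||^2 + ||A x - y0||^2, as in (9)."""
    n = x_tilde.size
    return np.linalg.solve(np.eye(n) + A.T @ A, x_tilde + A.T @ y0)

def toy_upscaler(y, step):
    """Hypothetical stand-in for the SCN: linear interpolation to the HR grid."""
    n = y.size * step
    return np.interp(np.arange(n), np.arange(0, n, step), y)

def iterative_sr(y0, A, step, n_iter=5):
    """Alternate between upscaling and the regularized update of (9)."""
    y = y0
    for _ in range(n_iter):
        x_tilde = toy_upscaler(y, step)                # upscaled estimate
        x_hat = regularized_estimate(x_tilde, y0, A)   # penalized update
        y = x_hat[::step]  # plain subsampling stands in for bicubic downsampling
    return x_hat
```

With `A = subsample_matrix(n, step) @ blur_matrix(n, kernel)`, the returned estimate stays close to the upscaled signal while remaining consistent with the blurred observation `y0`. For 2-D images one would typically replace the matrix solve with an iterative solver.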

2) Noisy Image Upscaling: Noise is a ubiquitous cause of corruption in image acquisition. State-of-the-art image denoising methods usually adopt priors such as patch similarity [31], patch sparsity [19], [32], or both [33], as regularizer in image restoration. In this section, we propose a regularized noisy image upscaling scheme, for specifically handling noisy LR images, in order to obtain improved SR quality. Though any denoising algorithm can be used in our proposed scheme, here we apply spatial similarity combined with transform domain image patch group-sparsity as our regularizer [33], to form the regularized iterative SR problem as an example.

Similar to the method in Sec. IV-B1, we iteratively estimate the less noisy HR image from the denoised LR image. Given the denoised LR estimate $\hat{I}_y^{i-1}$ at iteration $i-1$, we directly upscale it, using the learned generic SCN, to obtain the HR image $I_x^{i-1}$. It is then downsampled using bicubic interpolation, to generate the LR image $\tilde{I}_y^i$, which is used in the fidelity term in the $i$-th iteration of LR image denoising. The same process is repeated until convergence. The iterative LR image denoising problem is formulated as follows:

$$\left\{\hat{I}_y^i, \{\hat{\alpha}_j\}\right\} = \arg\min_{I, \{\alpha_j\}} \|I - \tilde{I}_y^i\|_2^2 + \sum_{j=1}^{N} \left\{ \|W_{3D} G_j I - \alpha_j\|_2^2 + \tau \|\alpha_j\|_0 \right\} \tag{10}$$

where the operator $G_j$ generates the 3D vectorized tensor, which groups the $j$-th overlapping patch from the LR image $I$, together with the spatially similar patches within its neighborhood by block matching [33]. The codes $\{\alpha_j\}$ of the patch groups in the domain of the 3D sparsifying transform $W_{3D}$ are sparse, which is enforced by the $l_0$ norm penalty [34]. The weight $\tau$ controls the sparsity level, which normally depends on the remaining noise level in $\tilde{I}_y^i$ [34], [35].

In (10), we use the patch group sparsity as our denoising regularizer. The 3D sparsifying transform $W_{3D}$ can be one of the commonly used analytical transforms, such as the discrete cosine transform (DCT) or wavelets. The state-of-the-art BM3D denoising algorithm [33] is based on such an approach, but further improved by more sophisticated engineering stages. In order to achieve the best practical SR quality, we demonstrate the empirical performance comparison using BM3D as the regularizer in Sec. VI. Additionally, our proposed iterative method is a general practical SR framework, which is not dedicated to SCN. One can conveniently extend it to other SR methods, which generate $\tilde{I}_y^i$ in the $i$-th iteration. The performance comparison of these methods is illustrated in Sec. VI.
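To make the role of the weight τ in (10) concrete: when the image I is held fixed, the sub-problem in each code separates per coefficient, and its minimizer is a hard threshold of the transformed patch group. The sketch below illustrates only that standard fact; the names `hard_threshold` and `objective` are hypothetical, the input `z` plays the role of one transformed patch group, and BM3D itself involves further stages that are not reproduced here.

```python
import numpy as np

def hard_threshold(z, tau):
    """Minimize ||z - a||_2^2 + tau * ||a||_0 over a, entry by entry.

    Keeping an entry costs tau, zeroing it costs z_k^2,
    so an entry is kept only when z_k^2 > tau.
    """
    z = np.asarray(z, dtype=float)
    return np.where(z ** 2 > tau, z, 0.0)

def objective(z, a, tau):
    """Value of the per-group cost for a candidate code a."""
    return np.sum((z - a) ** 2) + tau * np.count_nonzero(a)

z = np.array([3.0, -0.5, 1.2, 0.1])   # illustrative transformed coefficients
codes = hard_threshold(z, tau=1.0)    # small coefficients are zeroed
```

A larger τ zeroes more coefficients, which corresponds to stronger denoising of the LR estimate.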

V. SUBJECTIVE EVALUATION PROTOCOL

Subjective perception is an important metric to evaluate SR techniques for commercial use, in addition to quantitative evaluation. In order to more thoroughly compare various SR methods and quantify the subjective perception, we utilize an online platform for subjective evaluation of SR results from several methods [36], including bicubic, SC [6], SE [9], self-example regression (SER) [37], CNN [16] and CSCN. Each participant is invited to conduct several pair-wise comparisons of SR results from different methods. The SR methods of the displayed SR images in each pair are randomly selected. Ground truth HR images are also included when they are available as references. For each pair, the participant needs to select the better one in terms of perceptual quality. A snapshot of our evaluation web page¹ is shown in Fig. 5.

Specifically, there are SR results over 6 images with different scaling factors: “kid”×4, “chip”×4, “statue”×4, “lion”×3, “temple”×3 and “train”×3. The images are shown in Fig. 6. All the visual comparison results are then summarized into a 7×7 winning matrix for 7 methods (including ground truth). A Bradley–Terry [38] model is calculated based on these results and the subjective score is estimated for each method according to this model. In the Bradley–Terry model, the probability that an object X is favored over Y is assumed to be

$$p(X \succ Y) = \frac{e^{s_X}}{e^{s_X} + e^{s_Y}} = \frac{1}{1 + e^{s_Y - s_X}}, \tag{11}$$

¹ www.ifp.illinois.edu/~wang308/survey


Fig. 5. The user interface of a web-based image quality evaluation, where two images are displayed side by side and local details can be magnified by moving the mouse over the corresponding region.

Fig. 6. The 6 images used in subjective evaluation.

where $s_X$ and $s_Y$ are the subjective scores for X and Y. The scores $s$ for all the objects can be jointly estimated by maximizing the log likelihood of the pairwise comparison observations:

$$\max_{s} \sum_{i,j} w_{ij} \log\left(\frac{1}{1 + e^{s_j - s_i}}\right), \tag{12}$$

where $w_{ij}$ is the $(i, j)$-th element in the winning matrix $W$, meaning the number of times when method $i$ is favored over method $j$. We use the Newton-Raphson method to solve Eq. (12) and set the score for the ground truth method as 1 to avoid the scale ambiguity.
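The estimation in (12) can be sketched in a few lines. The following is a minimal Newton-Raphson illustration under stated assumptions: the function name `fit_bradley_terry`, its arguments, and the toy winning matrix are hypothetical, and the score of one reference method is held fixed (here at 1) to remove the ambiguity in the scores. It assumes every method has won and lost at least once, since otherwise the maximizer is not finite and undamped Newton steps may diverge.

```python
import numpy as np

def fit_bradley_terry(W, ref=0, ref_score=1.0, n_iter=50, tol=1e-10):
    """Maximize sum_ij W[i, j] * log(1 / (1 + exp(s_j - s_i))) with s[ref] fixed."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    N = W + W.T                               # number of comparisons per pair
    free = [i for i in range(n) if i != ref]  # scores that are optimized
    s = np.full(n, float(ref_score))
    for _ in range(n_iter):
        diff = s[:, None] - s[None, :]
        P = 1.0 / (1.0 + np.exp(-diff))       # P[i, j] = p(i favored over j)
        grad = (W - N * P).sum(axis=1)        # gradient of the log likelihood
        H = N * P * (1.0 - P)                 # off-diagonal Hessian entries
        H = H - np.diag(H.sum(axis=1))        # diagonal makes each row sum to zero
        step = np.linalg.solve(H[np.ix_(free, free)], grad[free])
        s[free] -= step                       # Newton update on the free scores
        if np.max(np.abs(step)) < tol:
            break
    return s

# Toy winning matrix for three methods; index 0 plays the role of ground truth.
W_toy = np.array([[0, 4, 6],
                  [2, 0, 5],
                  [1, 3, 0]])
scores = fit_bradley_terry(W_toy, ref=0, ref_score=1.0)
```

At the solution, the expected number of wins of each method under (11) matches its observed number of wins, which is a convenient sanity check on the fit.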

The experiment results are detailed in Sec. VI.

VI. EXPERIMENTS

We evaluate and compare the performance of our models using the same data and protocols as in [28], which are commonly adopted in SR literature. All our models are learned from a training set with 91 images, and tested on Set5 [39], Set14 [40] and BSD100 [41] which contain 5, 14 and 100 images respectively. We have also trained on other different larger data sets, and observe marginal performance change (around 0.1dB). The original images are downsized by bicubic interpolation to generate LR-HR image pairs for both training and evaluation. The training data are augmented with translation, rotation and scaling.

A. Implementation Details

We determine the number of nodes in each layer of our SCN mainly according to the corresponding settings used in sparse coding [6]. Unless otherwise stated, we use input LR patch size $s_y=9$, LR feature dimension $m_y=100$, dictionary size $n=128$, output HR patch size $s_x=5$, and patch aggregation filter size $s_g=5$. All the convolution layers have a stride of 1. Each LR patch y is normalized by its mean and variance, and the same mean and variance are used to restore the final HR patch x. We crop 56×56 regions from each image to obtain fixed-sized input samples to the network, which produces outputs of size 44×44.
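The 56×56 to 44×44 reduction is consistent with two unpadded, stride-1 convolutions of sizes 9 and 5. The short check below makes that arithmetic explicit. It assumes, as an interpretation rather than a statement from the text, that only the 9×9 patch extraction layer and the 5×5 patch aggregation layer have spatial extent, while the remaining layers act independently at each position; the function name is hypothetical.

```python
def valid_output_size(input_size, filter_sizes, stride=1):
    """Spatial size after a chain of unpadded ('valid') convolutions."""
    size = input_size
    for k in filter_sizes:
        size = (size - k) // stride + 1
    return size

# 9x9 patch extraction followed by 5x5 patch aggregation, both with stride 1.
out = valid_output_size(56, [9, 5])   # 56 -> 48 -> 44
```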

To reduce the number of parameters, we implement the LR patch extraction layer H as the combination of two layers: the first layer has 4 trainable filters each of which is shifted to 25 fixed positions by the second layer. Similarly, the patch combination layer G is also split into a fixed layer which aligns pixels in overlapping patches and a trainable layer whose weights are used to combine overlapping pixels. In this way, the number of parameters in these two layers is reduced by more than an order of magnitude, and there is no observable loss in performance.

We employ a standard stochastic gradient descent algorithm to train our networks with a mini-batch size of 64. Based on the understanding of each layer’s role in sparse coding, we use Haar-like gradient filters to initialize layer H, and use uniform weights to initialize layer G. All the remaining three linear layers are related to the dictionary pair $(D_x, D_y)$ in sparse coding. To initialize them, we first randomly set $D_x$ and $D_y$ with Gaussian noise, and then find the corresponding layer weights as in ISTA [23]:

$$w_1 = C \cdot D_y^T, \quad w_2 = I - D_y^T D_y, \quad w_3 = (CL)^{-1} \cdot D_x \tag{13}$$

where $w_1$, $w_2$ and $w_3$ denote the weights of the three subsequent layers after layer H. $L$ is the upper bound on the largest eigenvalue of $D_y^T D_y$, and $C$ is the threshold value before normalization. We empirically set $L=C=5$.
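As a small illustration of (13), the sketch below builds the three weight matrices from randomly drawn dictionaries using the default sizes stated above. The function name, the unit-variance Gaussian draw, and the matrix orientation (dictionary atoms stored as columns) are assumptions made for this example; the text does not specify the scale of the Gaussian noise.

```python
import numpy as np

def init_linear_layers(D_x, D_y, C=5.0, L=5.0):
    """Weights of the three linear layers following layer H, as in (13)."""
    n = D_y.shape[1]                      # dictionary size
    w1 = C * D_y.T                        # maps LR features to codes
    w2 = np.eye(n) - D_y.T @ D_y          # recurrent code update
    w3 = D_x / (C * L)                    # maps codes to HR patches
    return w1, w2, w3

rng = np.random.default_rng(0)
m_y, n, s_x = 100, 128, 5                 # default sizes from the text
D_y = rng.standard_normal((m_y, n))       # assumed unit-variance Gaussian init
D_x = rng.standard_normal((s_x * s_x, n))
w1, w2, w3 = init_linear_layers(D_x, D_y)
```

With these sizes, `w1` is 128×100, `w2` is 128×128, and `w3` is 25×128.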

The proposed models are all trained using the CUDA ConvNet package [13] on a workstation with 12 Intel Xeon 2.67GHz CPUs and 1 GTX680 GPU. Training an SCN usually takes less than one day. Note that this package is customized for classification networks, and its efficiency can be further optimized for our SCN model.

In testing, to make the entire image covered by output samples, we crop input samples with overlap and extend the boundary of the original image by reflection. Note we shave the image border in the same way as [16] for objective evaluations to ensure fair comparison. Only the luminance channel is processed with our method, and bicubic interpolation is applied to the chrominance channels, as their high frequency components are less noticeable to human eyes. To achieve arbitrary scaling factors using CSCN, we upscale an image by ×2 times repeatedly until it is at least as large as the desired size. Then a bicubic interpolation is used to downscale it to the target resolution if necessary.
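The rule for arbitrary scaling factors can be written down directly. The helper below is a hypothetical sketch, not part of the described system: it returns how many ×2 stages are applied and the residual factor by which the result is then bicubically downscaled.

```python
def cascade_plan(target_scale):
    """Number of x2 stages and the residual downscale factor for a target scale."""
    if target_scale <= 0:
        raise ValueError("target_scale must be positive")
    stages, reached = 0, 1.0
    while reached < target_scale:     # upscale by x2 until large enough
        reached *= 2.0
        stages += 1
    return stages, target_scale / reached

plan_x3 = cascade_plan(3)    # two x2 stages, then downscale by 0.75
plan_x4 = cascade_plan(4)    # two x2 stages, no downscaling needed
```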

Fig. 7. The four learned filters in the first layer H.

Fig. 8. The PSNR change for ×2 SR on Set5 during training using different methods: SCN; SCN with random initialization; CNN. The horizontal dashed lines show the benchmarks of bicubic interpolation and sparse coding (SC).

When reporting our best results in Sec. VI-C, we also use the multi-view testing strategy commonly employed in image classification. For patch-based image SR, multi-view testing is implicitly used when predictions from multiple overlapping patches are averaged. Here, besides sampling overlapping patches, we also add more views by flipping and transposing the patch. Such a strategy is found to improve SR performance for general algorithms, purely at the cost of additional computation.
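A minimal sketch of multi-view testing by flipping and transposing is given below. It assumes eight views (every combination of transpose, vertical flip and horizontal flip), which the text does not state, and uses a hypothetical nearest-neighbor `nearest_x2` function as a stand-in for the trained network. Each view is upscaled, mapped back to the original orientation, and the results are averaged.

```python
import numpy as np
from itertools import product

def multi_view_upscale(img, upscale):
    """Average the outputs of `upscale` over flipped and transposed views."""
    outputs = []
    for tr, ud, lr in product((False, True), repeat=3):
        view = img.T if tr else img
        view = view[::-1] if ud else view
        view = view[:, ::-1] if lr else view
        out = upscale(view)
        # Undo the transforms in reverse order to realign with the input.
        out = out[:, ::-1] if lr else out
        out = out[::-1] if ud else out
        out = out.T if tr else out
        outputs.append(out)
    return np.mean(outputs, axis=0)

def nearest_x2(img):
    """Hypothetical stand-in for an SR model: x2 nearest-neighbor upscaling."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

image = np.arange(24, dtype=float).reshape(4, 6)
result = multi_view_upscale(image, nearest_x2)
```

For an upscaler that commutes with flips and transposes, such as the nearest-neighbor stand-in, all views agree and averaging changes nothing; a learned model generally lacks this symmetry, so averaging its views can alter the prediction.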

B. Algorithm Analysis

We first visualize the four filters learned in the first layer H in Fig. 7. The filter patterns do not change much from the initial first and second order gradient operators. Some additional small coefficients are introduced in a highly structured form that capture richer high frequency details.

The performance of several networks during training is measured on Set5 in Fig. 8. Our SCN improves significantly over sparse coding (SC) [6], as it leverages data more effectively with end-to-end training. The SCN initialized according to (13) can converge faster and better than the same model with random initialization, which indicates that the understanding of SCN based on sparse coding can help its optimization. We also train a CNN model [16] of the same size as SCN, but find its convergence speed much slower. It is reported in [16] that training a CNN takes $8\times10^8$ back-propagations (equivalent to $12.5\times10^6$ mini-batches here). To achieve the same performance as CNN, our SCN requires less than 1% of the back-propagations.

The network size of SCN is mainly determined by the dictionary size n. Besides the default value n=128, we have tried other sizes and plot their performance versus the number of network parameters in Fig. 9. The PSNR of SCN does not drop too much as n decreases from 128 to 64, but the model size and computation time can be reduced significantly, as shown in Table I. Fig. 9 also shows the performance of CNN with various sizes. Our smallest SCN can achieve higher PSNR than the largest model (CNN-L) in [42] while only using about 20% of the parameters.

Fig. 9. PSNR for ×2 SR on Set5 using SCN and CNN with various network sizes.

TABLE I: TIME CONSUMPTION FOR SCN TO UPSCALE THE “BABY” IMAGE FROM 256×256 TO 512×512 USING DIFFERENT DICTIONARY SIZE n

TABLE II: PSNR OF DIFFERENT NETWORK CASCADING SCHEMES ON Set5, EVALUATED FOR DIFFERENT SCALING FACTORS IN EACH COLUMN

TABLE III: EFFECT OF VARIOUS TRAINING SETS ON THE PSNR OF ×2 UPSCALING WITH SINGLE VIEW SCN

Different numbers of recurrent stages k have been tested for SCN, and we find increasing k from 1 to 3 only improves performance by less than 0.1dB. As a tradeoff between speed and accuracy, we use k=1 throughout the paper.

In Table II, different network structures with cascade for scalable SR in Sec. III-C2 (in each row) are compared at different scaling factors (in each column). SCN×a denotes the model trained with fixed scaling factor a without any cascade technique. For a fixed a, we use SCN×a as a basic module and apply it one or more times to super-resolve images for different upscaling factors, which is shown in each row of Table II. It is observed that SCN×2 can perform as well as the scale-specific model for small scaling factor (1.5), and much better for large scaling factors (3 and 4). Note that the cascade of SCN×1.5 does not lead to good results since artifacts quickly get amplified through many repetitive upscalings. Therefore, we use SCN×2 as the default building block for CSCN, and drop the notation ×2 when there is no ambiguity. The last row in Table II shows that a CSCN trained using the multi-scale objective in (7) can further improve the SR results for scaling factors 3 and 4, as the second SCN in the cascade is trained to be robust to the artifacts generated by the first one.

TABLE IV: PSNR (SSIM) COMPARISON ON THREE TEST DATA SETS AMONG DIFFERENT METHODS. RED INDICATES THE BEST AND BLUE INDICATES THE SECOND BEST PERFORMANCE. THE PERFORMANCE GAIN OF OUR BEST MODEL OVER ALL THE OTHERS’ BEST IS SHOWN IN THE LAST ROW

As shown in [42], the amount of training data plays an important role in the field of deep learning. In order to evaluate the effect of various amounts of data on training CSCN, we change the training set from a relatively small set of 91 images (Set91) [28] to two other sets: the 199 out of 200 training images² in the BSD500 dataset (BSD200) [41], and a subset of 7,500 images from the ILSVRC2013 dataset [44]. A model of exactly the same architecture without any cascade is trained on each data set, and another 100 images from the ILSVRC2013 dataset are included as an additional test set. From Table III, we can observe that the CSCN trained on BSD200 consistently outperforms its counterpart trained on Set91 by around 0.1dB on all test data. However, the performance of the model trained on ILSVRC2013 is only slightly different from the one trained on BSD200, which shows the saturation of the performance as the amount of training data increases. The inferior quality of images in ILSVRC2013 may be a hurdle to further improve the performance. Therefore, our method is robust to training data and can benefit marginally from a larger set of training images.

C. Comparison With State of the Arts

We compare the proposed CSCN with other recent SR methods on all the images in Set5, Set14 and BSD100 for different scaling factors. Table IV shows the PSNR and structural similarity (SSIM) [45] for adjusted anchored neighborhood regression (A+) [43], CNN [16], CNN trained with larger model size and much more data (CNN-L) [42], the proposed CSCN, and CSCN with our multi-view testing (CSCN-MV). We do not list other methods [6], [10], [28], [40], [46] whose performance is worse than A+ or CNN-L.

² Since one out of 200 training images coincides with one image in Set5, we exclude it from our training set.

It can be seen from Table IV that CSCN performs consistently better than all previous methods in both PSNR and SSIM, and with multi-view testing the results can be further improved. CNN-L improves over CNN by increasing model parameters and training data. However, it is still not as good as CSCN which is trained with a much smaller size and on a much smaller data set. Clearly, the better model structure of CSCN makes it less dependent on model capacity and training data in improving performance. Our models are generally more advantageous for large scaling factors due to the cascade structure. A larger performance gain is observed on Set5 than the other two test sets because Set5 has statistics more similar to those of the training set.

The visual qualities of the SR results generated by sparse coding (SC) [6], CNN and CSCN are compared in Fig. 10. Our approach produces image patterns with sharper boundaries and richer textures, and is free of the ringing artifacts observable in the other two methods.

Fig. 11 shows the SR results on the “chip” image compared among more methods including the self-example based method (SE) [9] and the deep network cascade (DNC) [14]. SE and DNC can generate very sharp edges on this image, but also introduce artifacts and blurs on corners and fine structures due to the lack of self-similar patches. On the contrary, the CSCN method recovers all the structures of the characters without any distortion.

D. Robustness to Real SR Scenarios

We evaluate the performance of the proposed practical SR methods in Sec. IV, by providing the empirical results of several experiments for the two aforementioned approaches.

1) Data-Driven SR by Fine-Tuning: The proposed method in Sec. IV-A is data-driven, and thus the generic SCN can be easily adapted for a particular task, with a small amount of training samples. We demonstrate the performance of this method in the application of enlarging low-DPI scanned document images with heavy noise. We first obtain several pairs of LR and HR images by scanning a document under two settings of 150DPI and 300DPI. Then we fine-tune our generic CSCN model using only one pair of scanned images for a few iterations. Fig. 13 illustrates the visualization of the upscaled image from the 150DPI scanned image. As shown by the SR results in Fig. 13, the CSCN before adaptation is very sensitive to LR measurement corruption, so the enlarged texts in (b) are much more corrupted than they are in the nearest neighbor upscaled image (a). However, the adapted CSCN model removes almost all the artifacts and can restore clear texts in (c), which is promising for practical applications such as quality enhancement of online scanned books and restoration of legacy documents.

Fig. 10. SR results given by SC [6] (first row), CNN [16] (second row) and our CSCN (third row). Images from left to right: the “monarch” image upscaled by ×3; the “zebra” image upscaled by ×3; the “comic” image upscaled by ×3.

2) Regularized Iterative SR: We now show experimental results of practical SR for blurred and noisy LR images, using the proposed regularized iterative methods in Sec. IV-B. We first compare the SR performance on blurry images using the proposed method in Sec. IV-B1 with several other recent methods [47]–[49], using the same test images and settings. All these methods are designed for blurry LR input, while our model is trained on sharp LR input. As shown in Table V, our model achieves much better results than the competitors. Note the speed of our model is also much faster than the conventional sparse coding based methods.

To test the performance of upscaling noisy LR images, we simulate additive Gaussian noise for the LR input images at 4 different noise levels (σ = 5, 10, 15, 20) as the noisy input images. We compare the practical SR results in Set5 obtained from the following algorithms: directly using SCN, our proposed iterative SCN method using BM3D as denoising regularizer (iterative BM3D-SCN), and fine-tuning SCN with additional noisy training pairs. Note that knowing the underlying corruption model of the real LR image (e.g., noise distribution or blurring kernel), one can always synthesize real training pairs for fine-tuning the generic SCN. In other words, once the iterative SR method is feasible, one can always apply our proposed data-driven method for SR alternatively. However, the other way around is not true. Therefore, the knowledge of the corruption model of real measurements can be considered as a stronger assumption, compared to providing real training image pairs. Correspondingly, the SR performances of these two methods are evaluated when both can be applied. We also provide the results of methods directly using another generic SR model: CNN-L [42], and the similar iterative SR method involving CNN-L (iterative BM3D-CNN-L).
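The noise simulation and the PSNR measure used in this comparison can be sketched as follows. The function names are hypothetical, and the example assumes an intensity range of [0, 255] with no clipping after noise is added, neither of which is stated in the text. Under these assumptions, the PSNR between a clean image and its noisy version is close to 20·log10(255/σ).

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal size."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def add_gaussian_noise(img, sigma, rng):
    """Additive zero-mean Gaussian noise with standard deviation sigma."""
    return img + rng.normal(0.0, sigma, size=img.shape)

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 255.0, size=(256, 256))   # synthetic stand-in image
noisy_psnr = {sigma: psnr(clean, add_gaussian_noise(clean, sigma, rng))
              for sigma in (5, 10, 15, 20)}
```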


Fig. 11. The “chip” image upscaled by ×4 times using different methods. (a) Bicubic. (b) SE [9]. (c) SC [6]. (d) DNC [14]. (e) CNN [16]. (f) CSCN.

Fig. 12. The “building” image corrupted by additive Gaussian noise of σ = 10 and then upscaled by ×2 times using different methods. (a) Direct SCN, PSNR=24.00dB. (b) Fine-tuning SCN, PSNR=27.54dB. (c) Iterative BM3D-SCN, PSNR=27.86dB.

The practical SR results are listed in Table VI. We observe improved PSNR using our proposed regularized iterative SR method over all noise levels. The proposed iterative BM3D-SCN achieves much higher PSNR than the method of directly using SCN. The performance gap (in terms of SR PSNR) between iterative BM3D-SCN and direct SCN becomes larger, as the noise level increases. A similar observation can be found in the result comparison of iterative BM3D-CNN-L and direct CNN-L. Compared to the method of fine-tuning SCN, the iterative BM3D-SCN method demonstrates better empirical performance, with 0.3 dB improvement on average. The iterative BM3D-CNN-L method provides comparable results, compared to the iterative BM3D-SCN method, which demonstrates that our proposed regularized iterative SCN scheme can be easily extended for other SR methods, and is able to effectively handle noisy LR measurements.

An example of upscaling noisy LR images using the aforementioned methods is demonstrated in Fig. 12. Both fine-tuning SCN and iterative BM3D-SCN are able to significantly suppress the additive noise, while many artifacts induced by noise are observed in the SR result of direct SCN. It is notable that the fine-tuning SCN method performs better in recovering the texture and the iterative BM3D-SCN method is preferable in smooth regions.

TABLE V: PSNR OF ×3 UPSCALING ON LR IMAGES WITH DIFFERENT BLURRING KERNELS

Fig. 13. Low-DPI scanned document upscaled by ×4 times using different methods. (a) Nearest neighbor. (b) CSCN. (c) Adapted CSCN.

TABLE VI: PSNR VALUES FOR ×2 UPSCALING NOISY LR IMAGES IN Set5 BY DIRECTLY USING SCN (DIRECT SCN), DIRECTLY USING CNN-L (DIRECT CNN-L), SCN AFTER FINE-TUNING ON NEW NOISY TRAINING DATA (FINE-TUNING SCN), THE ITERATIVE METHOD OF BM3D & SCN (ITERATIVE BM3D-SCN), AND THE ITERATIVE METHOD OF BM3D & CNN-L (ITERATIVE BM3D-CNN-L)

E. Subjective Evaluation

We have a total of 270 participants giving 720 pairwise comparisons over 6 images with different scaling factors, which are shown in Fig. 6. Not every participant completed all the comparisons but their partial responses are still useful.

Fig. 14. Subjective SR quality scores for different methods including bicubic, SC [6], SE [9], SER [37], CNN [16] and the proposed CSCN. The score for ground truth result is 1.

Fig. 14 shows the estimated scores for the 6 SR methods in our evaluation, with the score for the ground truth method normalized to 1. As expected, all the SR methods have much lower scores than ground truth, showing the great challenge in the SR problem. The bicubic interpolation is significantly worse than other SR methods. The proposed CSCN method outperforms other previous state-of-the-art methods by a large margin, demonstrating its superior visual quality. It should be noted that the visual difference between some image pairs is very subtle. Nevertheless, the human subjects are able to perceive such difference when seeing the two images side by side, and therefore make consistent ratings. The CNN model becomes less competitive in the subjective evaluation than it is in PSNR comparison. This indicates that the visually appealing image appearance produced by CSCN should be attributed to the regularization from sparse representation, which cannot be easily learned by merely minimizing reconstruction error as in CNN.

VII. CONCLUSIONS

We propose a new model for image SR by combining the strengths of sparse coding and deep networks, and make considerable improvement over existing deep and shallow SR models both quantitatively and qualitatively. Besides producing good SR results, the domain knowledge in the form of sparse coding can also benefit training speed and model compactness. Furthermore, we investigate the cascade of networks for both fixed and incremental scaling factors so as to enhance SR performance. In addition, the robustness to real SR scenarios is discussed for handling non-ideal LR measurements. More generally, our observation is in line with other recent extensions made to CNN with better domain knowledge for different tasks.

In future work, we will apply the SCN model to other problems where sparse coding can be useful. The interaction between deep networks for low-level and high-level vision tasks, such as [50], will also be explored.

REFERENCES[1] S. Baker and T. Kanade, “Limits on super-resolution and how to

break them,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9,pp. 1167–1183, Sep. 2002.

[2] Z. Lin and H.-Y. Shum, “Fundamental limits of reconstruction-basedsuperresolution algorithms under local translation,” IEEE Trans. PatternAnal. Mach. Intell., vol. 26, no. 1, pp. 83–97, Jan. 2004.

[3] R. Fattal, “Image upsampling via imposed edge statistics,” ACM Trans.Graph., vol. 26, no. 3, 2007, Art. no. 95.

[4] H. A. Aly and E. Dubois, “Image up-sampling using total-variationregularization with a new observation model,” IEEE Trans. ImageProcess., vol. 14, no. 10, pp. 1647–1659, Oct. 2005.

[5] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolutionvia sparse representation,” IEEE Trans. Image Process., vol. 19, no. 11,pp. 2861–2873, Nov. 2010.

[6] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang, “Coupled dictio-nary training for image super-resolution,” IEEE Trans. Image Process.,vol. 21, no. 8, pp. 3467–3478, Aug. 2012.

[7] X. Gao, K. Zhang, D. Tao, and X. Li, “Image super-resolution withsparse neighbor embedding,” IEEE Trans. Image Process., vol. 21, no. 7,pp. 3194–3205, Jul. 2012.

[8] D. Glasner, S. Bagon, and M. Irani, “Super-resolution from a singleimage,” in Proc. ICCV, Sep./Oct. 2009, pp. 349–356.

[9] G. Freedman and R. Fattal, “Image and video upscaling from local self-examples,” ACM Trans. Graph., vol. 30, no. 2, 2011, Art. no. 12.

[10] K. I. Kim and Y. Kwon, “Single-image super-resolution using sparseregression and natural image prior,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 32, no. 6, pp. 1127–1133, Jun. 2010.

[11] C. Deng, J. Xu, K. Zhang, D. Tao, X. Gao, and X. Li, “Similarityconstraints-based structured output regression machine: An approach toimage super-resolution,” IEEE Trans. Neural Netw. Learn. Syst., to bepublished.

[12] C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution:A benchmark,” in Proc. ECCV, 2014, pp. 372–386.

[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classifica-tion with deep convolutional neural networks,” in Proc. NIPS, 2012,pp. 1097–1105.

[14] Z. Cui, H. Chang, S. Shan, B. Zhong, and X. Chen, “Deep networkcascade for image super-resolution,” in Proc. ECCV, 2014, pp. 49–64.

[15] Z. Wang et al., “Self-tuned deep super resolution,” in Proc. IEEE Conf.CVPR Workshops, Jun. 2015, pp. 1–8.

[16] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutionalnetwork for image super-resolution,” in Proc. ECCV, 2014, pp. 184–199.

[17] C. Osendorfer, H. Soyer, and P. van der Smagt, “Image super-resolutionwith fast approximate convolutional sparse coding,” in Neural Informa-tion Processing. New York, NY, USA: Springer, 2014, pp. 250–257.

[18] K. Gregor and Y. LeCun, “Learning fast approximations of sparsecoding,” in Proc. ICML, 2010, pp. 399–406.

[19] B. Wen, S. Ravishankar, and Y. Bresler, “Structured overcomplete sparsi-fying transform learning with convergence guarantees and applications,”Int. J. Comput. Vis., vol. 114, no. 2, pp. 137–167, 2015.

[20] Z. Wang et al., Sparse Coding and Its Applications in Computer Vision. Singapore: World Scientific, 2015.

[21] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, “Deep networks for image super-resolution with sparse prior,” in Proc. ICCV, Dec. 2015, pp. 370–378.

[22] K. Kavukcuoglu, M. Ranzato, and Y. LeCun. (2010). “Fast inference in sparse coding algorithms with applications to object recognition.” [Online]. Available: http://arxiv.org/abs/1010.3467

[23] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, Nov. 2004.

[24] C. J. Rozell, D. H. Johnson, R. G. Baraniuk, and B. A. Olshausen, “Sparse coding via thresholding and local competition in neural circuits,” Neural Comput., vol. 20, no. 10, pp. 2526–2563, 2008.

[25] M. Lin, Q. Chen, and S. Yan. (2013). “Network in network.” [Online]. Available: http://arxiv.org/abs/1312.4400

[26] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 807–814.

[27] S. Chang, W. Han, J. Tang, G.-J. Qi, C. C. Aggarwal, and T. S. Huang, “Heterogeneous network embedding via deep architectures,” in Proc. ACM SIGKDD, 2015, pp. 119–128.

[28] R. Timofte, V. De Smet, and L. Van Gool, “Anchored neighborhood regression for fast example-based super-resolution,” in Proc. ICCV, Dec. 2013, pp. 1920–1927.

[29] Q. V. Le, “Building high-level features using large scale unsupervised learning,” in Proc. IEEE ICASSP, May 2013, pp. 8595–8598.

[30] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in Proc. CVPR, Jun. 2014, pp. 1717–1724.

[31] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in Proc. CVPR, Jun. 2005, pp. 60–65.

[32] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.

[33] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3D transform-domain collaborative filtering,” IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080–2095, Aug. 2007.

[34] B. Wen, S. Ravishankar, and Y. Bresler, “Video denoising by online 3D sparsifying transform learning,” in Proc. IEEE ICIP, Sep. 2015, pp. 118–122.

[35] S. Ravishankar, B. Wen, and Y. Bresler, “Online sparsifying transform learning—Part I: Algorithms,” IEEE J. Sel. Topics Signal Process., vol. 9, no. 4, pp. 625–636, Jun. 2015.

[36] Z. Wang, Y. Yang, Z. Wang, S. Chang, J. Yang, and T. S. Huang, “Learning super-resolution jointly from external and internal examples,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 4359–4371, Nov. 2015.

[37] J. Yang, Z. Lin, and S. Cohen, “Fast image super-resolution based on in-place example regression,” in Proc. CVPR, Jun. 2013, pp. 1059–1066.

[38] R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. The method of paired comparisons,” Biometrika, vol. 39, nos. 3–4, pp. 324–345, 1952.

[39] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in Proc. BMVC, 2012, pp. 135.1–135.10.

[40] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Curves and Surfaces. Heidelberg, Germany: Springer, 2012, pp. 711–730.

[41] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. ICCV, vol. 2, Jul. 2001, pp. 416–423.

[42] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016.

[43] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Proc. ACCV, 2014, pp. 111–126.

[44] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. CVPR, Jun. 2009, pp. 248–255.

[45] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.

[46] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proc. CVPR, Jun. 2015, pp. 5197–5206.

[47] W. Dong, L. Zhang, and G. Shi, “Centralized sparse representation for image restoration,” in Proc. ICCV, Nov. 2011, pp. 1259–1266.

[48] K. Zhang, X. Gao, D. Tao, and X. Li, “Single image super-resolution with non-local means and steering kernel regression,” IEEE Trans. Image Process., vol. 21, no. 11, pp. 4544–4556, Nov. 2012.

[49] X. Lu, H. Yuan, P. Yan, Y. Yuan, and X. Li, “Geometry constrained sparse coding for single image super-resolution,” in Proc. CVPR, Jun. 2012, pp. 1648–1655.

[50] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. S. Huang, “Studying very low resolution recognition using deep networks,” in Proc. CVPR, 2016, pp. 1–9.

Ding Liu (S’15) received the B.S. degree from the Chinese University of Hong Kong, Hong Kong, in 2012, and the M.S. degree from the University of Illinois at Urbana–Champaign, USA, in 2014, where he is currently pursuing the Ph.D. degree under the supervision of Prof. T. S. Huang. His research experience encompasses using deep learning to solve low-level vision problems, including image superresolution, image restoration, and image denoising. He has research interests in the broad areas of computer vision, image processing, and deep learning.

Zhaowen Wang (M’14) received the B.E. and M.S. degrees from Shanghai Jiao Tong University, China, in 2006 and 2009, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana–Champaign, in 2014. He is currently a Research Scientist with the Imagination Laboratory, Adobe Systems Inc. His research has been focused on understanding and enhancing images via machine learning algorithms, with a special interest in sparse coding and deep learning.

Bihan Wen received the B.Eng. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2012, and the M.S. degree in electrical and computer engineering from the University of Illinois at Urbana–Champaign, Urbana, IL, USA, in 2015, where he is currently pursuing the Ph.D. degree. His current research interests include signal and image processing, machine learning, sparse representation, and big data applications.

Jianchao Yang (M’14) received the B.E. degree from the University of Science and Technology of China, in 2006, and the Ph.D. degree from the Electrical and Computer Engineering Department, University of Illinois at Urbana–Champaign, in 2011, under the supervision of Prof. T. S. Huang. He is currently a Research Scientist with Snapchat Inc., Venice, CA. He has extensive experience in the following research areas: image categorization, object recognition and detection, and image retrieval; image and video superresolution, denoising, and deblurring; face recognition and soft biometrics; sparse coding and sparse representation; and unsupervised learning, supervised learning, and deep learning. His research interests are in the broad areas of computer vision, machine learning, and image processing.

Wei Han received the B.Eng. and M.S. degrees from the Department of Computer Science, Shanghai Jiao Tong University, China, in 2009 and 2012, respectively. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign. His research interests include computer vision with a focus on image object detection and recognition, and video event detection.

Thomas S. Huang (F’01) received the B.S. degree from National Taiwan University, Taipei, Taiwan, and the M.S. and D.Sc. degrees from the Massachusetts Institute of Technology (MIT), Cambridge, all in electrical engineering. He was a Faculty Member with the Department of Electrical Engineering, MIT, from 1963 to 1973, and a Faculty Member with the School of Electrical Engineering and the Director of its Laboratory for Information and Signal Processing with Purdue University from 1973 to 1980. In 1980, he joined the University of Illinois at Urbana–Champaign, where he is a William L. Everitt Distinguished Professor of Electrical and Computer Engineering and a Research Professor with the Coordinated Science Laboratory. At the Beckman Institute for Advanced Science and Technology, he is Co-Chair of the Institute’s major research theme, Human Computer Intelligent Interaction. His professional interests lie in the broad areas of information technology, especially the transmission and processing of multidimensional signals. He has published 21 books, and over 600 papers in network theory, digital filtering, image processing, and computer vision. He is a member of the National Academy of Engineering and the Academia Sinica, China, a Foreign Member of the Chinese Academies of Engineering and Sciences, and a fellow of the International Association for Pattern Recognition and the Optical Society of America. His many honors and awards include the Honda Lifetime Achievement Award, the IEEE Jack Kilby Signal Processing Medal, and the King-Sun Fu Prize of the International Association for Pattern Recognition.