
Bridging the Gap Between Computational Photography and Visual Recognition

Rosaura G. VidalMata*, Sreya Banerjee*, Brandon RichardWebster, Michael Albright, Pedro Davalos, Scott McCloskey, Ben Miller, Asong Tambo, Sushobhan Ghosh, Sudarshan Nagesh, Ye Yuan, Yueyu Hu, Junru Wu, Wenhan Yang, Xiaoshuai Zhang, Jiaying Liu, Zhangyang Wang, Hwann-Tzong Chen, Tzu-Wei Huang, Wen-Chi Chin, Yi-Chun Li, Mahmoud Lababidi, Charles Otto, and Walter J. Scheirer

Abstract—What is the current state-of-the-art for image restoration and enhancement applied to degraded images acquired under less than ideal circumstances? Can the application of such algorithms as a pre-processing step improve image interpretability for manual analysis or automatic visual recognition to classify scene content? While there have been important advances in the area of computational photography to restore or enhance the visual quality of an image, the capabilities of such techniques have not always translated in a useful way to visual recognition tasks. Consequently, there is a pressing need for the development of algorithms that are designed for the joint problem of improving visual appearance and recognition, which will be an enabling factor for the deployment of visual recognition tools in many real-world scenarios. To address this, we introduce the UG2 dataset as a large-scale benchmark composed of video imagery captured under challenging conditions, and two enhancement tasks designed to test algorithmic impact on visual quality and automatic object recognition. Furthermore, we propose a set of metrics to evaluate the joint improvement of such tasks as well as individual algorithmic advances, including a novel psychophysics-based evaluation regime for human assessment and a realistic set of quantitative measures for object recognition performance. We introduce six new algorithms for image restoration or enhancement, which were created as part of the IARPA sponsored UG2 Challenge workshop held at CVPR 2018. Under the proposed evaluation regime, we present an in-depth analysis of these algorithms and a host of deep learning-based and classic baseline approaches. From the observed results, it is evident that we are in the early days of building a bridge between computational photography and visual recognition, leaving many opportunities for innovation in this area.

Index Terms—Computational Photography, Object Recognition, Deconvolution, Super-Resolution, Deep Learning, Evaluation


1 INTRODUCTION

The advantages of collecting imagery from autonomous vehicle platforms such as small UAVs are clear. Man-portable systems can be launched from safe positions to penetrate difficult or dangerous terrain, acquiring hours of video without putting human lives at risk during search and rescue operations, disaster recovery, and other scenarios where some measure of danger has traditionally been a stumbling block. Similarly, cars equipped with vision systems promise to improve road safety by more reliably reacting to hazards and other road users compared to humans. However, what remains unclear is how to automate the interpretation of what are inherently degraded images collected in such applications — a necessary measure in the face of millions of frames from individual flights or road trips.

* denotes equal contribution

• Rosaura G. VidalMata, Sreya Banerjee, Brandon RichardWebster, and Walter J. Scheirer are with the University of Notre Dame, Notre Dame, IN, 46556. Corresponding Author's E-mail: [email protected]
• Michael Albright, Pedro Davalos, Scott McCloskey, Ben Miller, and Asong Tambo are with Honeywell ACST.
• Sushobhan Ghosh is with Northwestern University.
• Sudarshan Nagesh is with Zendar Co.
• Ye Yuan, Junru Wu, and Zhangyang Wang are with Texas A&M University.
• Yueyu Hu, Xiaoshuai Zhang, and Jiaying Liu are with Peking University.
• Wenhan Yang is with the National University of Singapore.
• Hwann-Tzong Chen, Tzu-Wei Huang, Wen-Chi Chin, and Yi-Chun Li are with the National Tsing Hua University.
• Mahmoud Lababidi is with Johns Hopkins University.
• Charles Otto is with Noblis.

A human-in-the-loop cannot manually sift through data of this scale for actionable information in real-time. Ideally, a computer vision system would be able to identify objects and events of interest or importance, surfacing valuable data out of a massive pool of largely uninteresting or irrelevant images, even when that data has been collected under less than ideal circumstances. To build such a system, one could turn to recent machine learning breakthroughs in visual recognition, which have been enabled by access to millions of training images from the Internet [1], [2]. However, such approaches cannot be used as off-the-shelf components to assemble the system we desire, because they do not take into account artifacts unique to the operation of the sensor and optics configuration on an acquisition platform, nor are they strongly invariant to changes in weather, season, and time of day.

Whereas deep learning-based recognition algorithms can perform on par with humans on good quality images [3], [4], their performance on distorted samples is degraded. It has been observed that the presence of imaging artifacts can severely impact the recognition accuracy of state-of-the-art approaches [5], [6], [7], [8], [9], [10], [11]. Having a real-world application such as a search and rescue drone or autonomous driving system fail in the presence of ambient perturbations such as rain, haze or even motion induced blur could have unfortunate aftereffects. Consequently, developing and evaluating algorithms that can improve the object classification of images captured under less than ideal circumstances is fundamental for the implementation of visual recognition models that need to be reliable.


And while one's first inclination would be to turn to the area of computational photography for algorithms that remove corruptions or gain resolution, one must ensure that they are compatible with the recognition process itself, and do not adversely affect the feature extraction or classification processes (Fig. 1) before incorporating them into a processing pipeline that corrects and subsequently classifies images.

The computer vision renaissance we are experiencing has yielded effective algorithms that can improve the visual appearance of an image [12], [13], [14], [15], [16], but many of their enhancing capabilities do not translate well to recognition tasks, as the training regime is often isolated from the visual recognition aspect of the pipeline. In fact, recent works [5], [17], [18], [19] have shown that approaches that obtain higher scores on classic quality estimation metrics (namely Peak Signal to Noise Ratio), and thus would be expected to produce high quality images, do not necessarily perform well at improving or even maintaining the original image classification performance. Taking this into consideration, we propose to bridge the gap between traditional image enhancement approaches and visual recognition tasks as a way to jointly increase the abilities of enhancement techniques for both scenarios.

In line with the above objective, in this work we introduce UG2: a large-scale video benchmark for assessing image restoration and enhancement for visual recognition. It consists of a publicly available dataset (http://www.ug2challenge.org) composed of videos captured from three difficult real-world scenarios: uncontrolled videos taken by UAVs and manned gliders, as well as controlled videos taken on the ground. Over 150,000 annotated frames for hundreds of ImageNet classes are available. From the base dataset, different enhancement tasks can be designed to evaluate improvement in visual quality and automatic object recognition, including supporting rules that can be followed to execute such evaluations in a precise and reproducible manner. This article describes the creation of the UG2 dataset as well as the advances in visual enhancement and recognition that have been possible as a result.

Specifically, we summarize the results of the IARPA sponsored UG2 Challenge workshop held at CVPR 2018. The challenge consisted of two specific tasks defined around the UG2 dataset: (1) image restoration and enhancement to improve image quality for manual inspection, and (2) image restoration and enhancement to improve the automatic classification of objects found within individual images. The UG2 dataset contains manually annotated video imagery (including object labels and bounding boxes) with an ample variety of imaging artifacts and optical aberrations (see Fig. 4 in Sec. 4 below); thus it allows for the development and quantitative evaluation of image enhancement algorithms. Participants in the challenge were able to use the provided imagery and as much out-of-dataset imagery as they liked for training and validation purposes. Enhancement algorithms were then submitted for evaluation and results were revealed at the end of the competition period.

The competition resulted in six new algorithms, designed by different teams, for image restoration and enhancement in challenging image acquisition circumstances.

Fig. 1. (Top) In principle, enhancement techniques like the Super-Resolution Convolutional Neural Network (SRCNN) [20] should improve visual recognition performance by creating higher quality inputs for recognition models. (Bottom) In practice, this is not always the case, especially when new artifacts are unintentionally introduced, such as in this application of Deep Video Deblurring [16].

These algorithms included strategies to dynamically estimate corruption and choose the appropriate response, the simultaneous targeting of multiple artifacts, the ability to leverage known image priors that match a candidate probe image, super-resolution techniques adapted from the area of remote sensing, and super-resolution via Generative Adversarial Networks. This was the largest concerted effort to-date to develop new approaches in computational photography supporting human preference and automatic recognition. We look at all of these algorithms in this article.

Having a good stable of existing and new restoration and enhancement algorithms is a nice start, but are any of them useful for the image analysis tasks at hand? Here we take a deeper look at the problem of scoring such algorithms. Specifically, the question of whether or not researchers have been doing the right thing when it comes to automated evaluation metrics for tasks like deconvolution, super-resolution and other forms of image artifact removal is explored. We suggest a visual psychophysics-inspired assessment regime, where human perception is the reference point, as an alternative to other forms of automatic and manual assessment that have been proposed in the literature. Using the methods and procedures of psychophysics that have been developed for the study of human vision in psychology, we can perform a more principled assessment of image improvement than just a simple A/B test, which is common in computer vision. We compare this human experiment with the recently introduced Learned Perceptual Image Patch Similarity (LPIPS) metric proposed by Zhang et al. [21]. Further, when it comes to assessing the impact of restoration and enhancement algorithms on visual recognition, we suggest that the recognition performance numbers are the only metric that one should consider. As we will see from the results, much more work is needed before practical applications can be supported.

In summary, the contributions of this article are:


• A new video benchmark dataset representing both ideal conditions and common aerial image artifacts, which we make available to facilitate new research and to simplify the reproducibility of experimentation.
• A set of protocols for the study of image enhancement and restoration for image quality improvement, as well as visual recognition. This includes a novel psychophysics-based evaluation regime for human assessment and a realistic set of quantitative measures for object recognition performance.
• An extensive evaluation of the influence of image aberrations and other problematic conditions on common object recognition models including VGG16 and VGG19 [22], InceptionV3 [23], and ResNet50 [24].
• The introduction of six new algorithms for image enhancement or restoration, which were created as part of the UG2 Challenge workshop held at CVPR 2018. These algorithms are pitted against eight different classical and deep learning-based baseline algorithms from the literature on the same benchmark data.
• A series of recommendations on specific aspects of the problem that the field should focus its attention on so that we have a better chance at enabling scene understanding under less than ideal image acquisition circumstances.

2 RELATED WORK

Datasets. The areas of image restoration and enhancement have a long history in computational photography, with associated benchmark datasets that are mainly used for the qualitative evaluation of image appearance. These include very small test image sets such as Set5 [13] and Set14 [12], the set of blurred images introduced by Levin et al. [25], and the DIVerse 2K resolution image dataset (DIV2K) [26] designed for super-resolution benchmarking. Datasets containing more diverse scene content have been proposed including Urban100 [15] for enhancement comparisons and LIVE1 [27] for image quality assessment. While not originally designed for computational photography, the Berkeley Segmentation Dataset has been used by itself [15] and in combination with LIVE1 [28] for enhancement work. The popularity of deep learning methods has increased demand for training and testing data, which Su et al. provide as video content for deblurring work [16]. Importantly, none of these datasets were designed to combine image restoration and enhancement with recognition for a unified benchmark.

Most similar to the dataset we employ in this paper are various large-scale video surveillance datasets, especially those which provide a "fixed" overhead view of urban scenes [29], [30], [31], [32]. However, these datasets are primarily meant for other research areas (e.g., event/action understanding, video summarization, face recognition) and are ill-suited for object recognition tasks, even if they share some common imaging artifacts that impair recognition as a whole.

With respect to data collected by aerial vehicles, the VIRAT Video Dataset [33] contains "realistic, natural and challenging (in terms of its resolution, background clutter, diversity in scenes)" imagery for event recognition, while the VisDrone2018 Dataset [34] is designed for object detection and tracking. Other datasets including aerial imagery are the UCF Aerial Action Data Set [35], UCF-ARG [36], UAV123 [37], and the multi-purpose dataset introduced by Yao et al. [38]. As with the computational photography datasets, none of these sets have protocols for image restoration and enhancement coupled with object recognition.

Visual Quality Enhancement. There is a wide variety of enhancement methods dealing with different kinds of artifacts, such as deblurring (where the objective is to recover a sharp version x′ of a blurry image y without knowledge of the blur parameters) [25], [39], [40], [41], [42], [43], [44], [45], denoising (where the goal is the restoration of an image x from a corrupted observation y = x + n, where n is assumed to be noise with variance σ²) [41], [42], [46], [47], compression artifact reduction (which focuses on removing blocking artifacts, ringing effects or other lossy compression-induced degradation) [48], [49], [50], [51], reflection removal [52], [53], and super-resolution (which attempts to estimate a high-resolution image from one or more low-resolution images) [20], [54], [55], [56], [57], [58], [59], [60], [61]. Other approaches designed to deal with atmospheric perturbations include dehazing (which attempts to recover the scene radiance J, the global atmospheric light A and the medium transmission t from a hazy image I(x) = J(x)t(x) + A(1 − t(x))) [62], [63], [64], [65], [66], and rain removal techniques [67], [68], [69], [70].

Most of these approaches are tailored to address a particular kind of visual aberration, and the presence of multiple problematic conditions in a single image might lead to the introduction of artifacts by the chosen enhancement technique. Recent work has explored the possibility of handling multiple degradation types [71], [72], [73], [74].

Visual Enhancement for Recognition. Intuitively, if an image has been corrupted, then employing restoration techniques should improve the performance of recognizing objects in the image. An early attempt at unifying a high-level task like object recognition with a low-level task like deblurring was performed by Zeiler et al. through deconvolutional networks [75], [76]. Similarly, Haris et al. [18] proposed an end-to-end super-resolution training procedure that incorporated detection loss as a training objective, obtaining superior object detection results compared to traditional super-resolution methods for a variety of conditions (including additional perturbations on the low resolution images such as the addition of Gaussian noise).

Sajjadi et al. [19] argue that the use of traditional metrics such as Peak Signal to Noise Ratio (PSNR), Structural Similarity Index (SSIM), or the Information Fidelity Criterion (IFC) might not reflect the performance of some models, and propose the use of object recognition performance as an evaluation metric. They observed that methods that produced images of higher perceptual quality obtained higher classification performance despite obtaining low PSNR scores. In agreement with this, Gondal et al. [77] observed the correlation of the perceptual quality of an image with its performance when processed by object recognition models. Similarly, Tahboub et al. [78] evaluate the impact of degradation caused by video compression on pedestrian detection. Other approaches have used visual recognition as a way to evaluate the performance of visual enhancement algorithms for tasks such as text deblurring [79], [80], image colorization [81], and single image super-resolution [82].


While the above approaches employ object recognition in addition to visual enhancement, there are approaches designed to overlook the visual appearance of the image and instead make use of enhancement techniques to exclusively improve the object recognition performance. Sharma et al. [83] make use of dynamic enhancement filters in an end-to-end processing and classification pipeline that incorporates two loss functions (enhancement and classification). The approach focuses on improving the performance of challenging high quality images. In contrast to this, Yim et al. [10] propose a classification architecture (comprised of a pre-processing module and a neural network model) to handle images degraded by noise. Li et al. [84] introduced a dehazing method that is concatenated with Faster R-CNN and jointly optimized as a unified pipeline. It outperforms traditional Faster R-CNN and other non-joint approaches.

Additional work has been undertaken in using visual enhancement techniques to improve high-level tasks such as face recognition [85], [86], [87], [88], [89], [90], [91], [92], [93], [94], [95], [96], [97] (through the incorporation of deblurring, super-resolution, and hallucination techniques) and person re-identification [98] algorithms for video surveillance data.

3 A NEW EVALUATION REGIME FOR IMAGE RESTORATION AND ENHANCEMENT

Our goal in this work is to provide insights on the impact of image enhancement algorithms in visual recognition tasks and find out which of them, in conjunction with the strongest features and supervised machine learning approaches, are promising candidates for different problem domains. To support this, we designed two evaluation tasks: (1) enhancement to facilitate manual inspection, where algorithms produce enhanced images to facilitate human assessment, and (2) enhancement to improve object recognition, where algorithms produce enhanced images to improve object classification by state-of-the-art neural networks.

3.1 Enhancement to Facilitate Manual Inspection

The first task is an evaluation of the qualitative enhancement of images. Through this task, we wish to answer two questions: Did the algorithm produce an enhancement that agrees with human perceptual judgment? And to what extent was the enhancement an improvement or deterioration? Widely used metrics such as SSIM [99] have attempted to answer these questions by estimating human perception, but have often failed to accurately imitate the nuances of human visual perception, and at times have caused algorithms to deteriorate perceptual quality [21].

While prior work has successfully adapted psychophysical methods from psychology as a means to directly study perceptual quality, these methods have been primarily posed as real-vs-fake tests [100]. With respect to qualitative enhancement, these real-vs-fake methods can only indicate if an enhancement has caused enough alteration to cause humans to cross the threshold of perception, and provide little help in answering the two questions we are interested in for this task. Zhang et al. [21] came close to answering these questions when they proposed LPIPS for evaluating perceptual similarity. However, this metric lacks the ability to measure whether the enhancement was an improvement or a deterioration (see analysis in Sec. 6).

Fig. 2. The visual enhancement task deployed on Amazon Mechanical Turk. An observer is presented with the original image x and the enhanced image y. The observer is then asked to select which label they perceive is most applicable. The selected label is converted to an integer value in [1, 5]. The final rating for the enhanced image is the mean score from approximately 20 observers. See Sect. 3.1 for further details.


In light of this, for our task we propose a new procedure, grounded in psychophysics, to evaluate the visual enhancement of images by answering both of our posed questions. The procedure for assessing image quality enhancement is a non-forced-choice procedure that allows us to take measurements of both the threshold of perceived change and the suprathresholds, which represent the degree of perceived change [101]. Specifically, we employ a bipolar labeled Likert Scale to estimate the amount of improvement or deterioration the observer perceives, once the threshold has been crossed. The complete procedure is as follows.

An observer is presented with an image x positioned on the left-hand side of a screen and the output of the enhancement algorithm y on the right. The observer is informed that x is the original image and y is the enhanced image. Below the image pair, five labels are provided and the observer is asked to select the label that most applies (see Fig. 2 for labels and layout). To capture as much of the underlying complexities in human judgment as possible, no criteria are provided for making a selection. To reduce any dependence on the subsystems in the visual cortex that specialize in memory, the pair of images is displayed until the observer selects their perceived label. An observer is given unlimited time, and informed that providing accurate labels is most important. For images larger than 480 × 480 pixels, the observer has the option to enlarge the images and examine them in finer detail.

The label that is selected by the observer is then converted to an assigned ordinal value 1 − 5, where [1, 3) and (3, 5] are the suprathreshold measurements of improvement and deterioration, respectively. A rating of 3 is the superficial threshold imposed on the observer, which indicates that the enhancement was imperceptible. In our proposed procedure, there is no notion of accuracy in the measurement of qualitative enhancement, as it is left entirely up to the observer's subjective discretion. However, when there are ≥ 2 sampled observers, the perception of quality to the average observer can be estimated to provide a reliable metric for the evaluation of qualitative enhancement. We verified this holds true even when x and y are swapped (i.e., responses are symmetric).


Fig. 3. Work flow for classification improvement: after enhancing the input frame with the candidate algorithm, the annotated objects in the frame are extracted and sent as input to each of the classification networks, which classify the image. Each algorithm's score is then calculated as specified in Sec. 3.2.2.

To perform a large scale evaluation, we used Amazon's Mechanical Turk (AMT) service, which is widely deployed for many related tasks in computer vision [1], [21], [102]. AMT allows a Requester (i.e., researcher) to hire Workers to perform a task for payment. Our task was for a Worker to participate in the rating procedure for 100 image pairs. An additional three sentinel image pairs were given to ensure that the Worker was actively participating in the rating procedure. Workers who failed to correctly respond to at least two of the three sentinel image pairs were not paid and their ratings were discarded. In total, we had over 600 Workers rating each image enhancement approximately 20 times. Out of that pool, ∼85.4% successfully classified a majority of the sentinel image pairs (the ratings provided by the remaining workers were discarded). See Sec. 6 for results and analysis.
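To make the aggregation step concrete, the sketch below shows how per-image mean opinion scores could be computed after discarding Workers who fail the sentinel check. This is an illustration rather than the authors' code, and the field names (worker_id, image_id, is_sentinel) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate_ratings(responses, sentinel_answers, min_correct=2):
    """responses: dicts with keys worker_id, image_id, rating (1-5), is_sentinel;
    sentinel_answers: image_id -> set of acceptable ratings for that sentinel."""
    correct = defaultdict(int)
    for r in responses:
        if r["is_sentinel"] and r["rating"] in sentinel_answers[r["image_id"]]:
            correct[r["worker_id"]] += 1
    # Keep only Workers who answered at least two of the three sentinels correctly.
    kept = {r["worker_id"] for r in responses if correct[r["worker_id"]] >= min_correct}

    per_image = defaultdict(list)
    for r in responses:
        if not r["is_sentinel"] and r["worker_id"] in kept:
            per_image[r["image_id"]].append(r["rating"])
    # Final rating = mean score over the (roughly 20) retained observers.
    return {img: mean(v) for img, v in per_image.items()}
```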

3.2 Enhancement to Improve Object Recognition

The second task is an evaluation of the performance improvement enhanced images lead to when used as input to state-of-the-art image classification networks. When considering a fixed dataset, the evaluation protocol allows for the use of some within-dataset training data (the data provided by the UG2 dataset, described below, for training contains frame-level annotations of the object classes of interest), and as much out-of-dataset data as needed for training and validation purposes. In order to establish good baselines for classification performance before and after the application of image enhancement and restoration algorithms, this task makes use of a selection of deep learning approaches to recognize annotated objects and then scores results based on the classification accuracy. The Keras [103] versions of the pre-trained networks VGG16 and VGG19 [22], InceptionV3 [23], and ResNet50 [24] are used for this purpose.

Each candidate algorithm is treated as an image pre-processing step to prepare sequestered test images to be submitted to all four networks. After pre-processing, the objects of interest are cropped out of the images based on verified ground-truth coordinates. The cropped images are then used as input to the networks. Algorithms are evaluated based on any improvement observed over the baseline classification result (i.e., the classification scores of the un-altered test images). The work flow of this evaluation pipeline is shown in Fig. 3. To avoid introducing further artifacts due to down-sampling, we require each algorithm to produce an output frame of the same size as its input.
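A minimal sketch of this workflow is shown below, assuming the enhanced frame is available as a PIL image and the ground-truth box is known. It uses the publicly available Keras pre-trained models named in the protocol, but it is an illustration rather than the official evaluation code.

```python
import numpy as np
from PIL import Image
from tensorflow.keras.applications import vgg16, vgg19, inception_v3, resnet50
from tensorflow.keras.applications.imagenet_utils import decode_predictions

MODELS = {
    "VGG16": (vgg16.VGG16(weights="imagenet"), vgg16.preprocess_input, (224, 224)),
    "VGG19": (vgg19.VGG19(weights="imagenet"), vgg19.preprocess_input, (224, 224)),
    "InceptionV3": (inception_v3.InceptionV3(weights="imagenet"),
                    inception_v3.preprocess_input, (299, 299)),
    "ResNet50": (resnet50.ResNet50(weights="imagenet"),
                 resnet50.preprocess_input, (224, 224)),
}

def top5_synsets(enhanced_frame, box):
    """enhanced_frame: PIL image output by the candidate algorithm;
    box: (left, upper, right, lower) ground-truth object coordinates."""
    crop = enhanced_frame.convert("RGB").crop(box)
    results = {}
    for name, (model, preprocess, size) in MODELS.items():
        x = np.asarray(crop.resize(size), dtype=np.float32)[None, ...]
        preds = model.predict(preprocess(x))
        # decode_predictions yields (synset_id, label, probability) triples.
        results[name] = [sid for sid, _, _ in decode_predictions(preds, top=5)[0]]
    return results
```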

3.2.1 Classification Metrics

The networks used for the classification task return a list of the ImageNet synsets (ImageNet provides images for "synsets" or "synonym sets" of words or phrases that describe a concept in WordNet [104]) along with the probability of the object belonging to each of the synset classes. However (as will be discussed in Sec. 4), in many cases it is impossible to provide an absolute labeling for the annotated objects. Consequently, most of the super-classes that can be considered are composed of more than one ImageNet synset. That is, each annotated image i has a single super-class label Li, which in turn is defined by a set of ImageNet synsets Li = {s1, ..., sn}.

To measure accuracy, we observe the number of correctly identified synsets in the top-five predictions made by each pre-trained network. A prediction is considered to be correct if its synset belongs to the set of synsets in the ground-truth super-class label. We use two metrics for this. The first measures the rate of achieving at least one correctly classified synset class (M1). In other words, for a super-class label Li = {s1, ..., sn}, a network is able to place one or more correctly classified synsets in the top-five predictions. The second measures the rate of placing all the correct synset classes in the super-class label synset set (M2). For example, for a super-class label Li = {s1, s2, s3}, a network is able to place three correct synsets in the top-five predictions.
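The following sketch restates M1 and M2 in code form. It is an illustration of the definitions above, not the official scoring implementation; super_class is the set of ImageNet synset ids defining Li, and top5 is a network's top-five predicted synsets.

```python
def m1_hit(top5, super_class):
    """M1: at least one top-5 prediction falls inside the super-class synset set."""
    return len(set(top5) & set(super_class)) >= 1

def m2_hit(top5, super_class):
    """M2: every synset defining the super-class appears among the top-5."""
    return set(super_class).issubset(set(top5))

def rate(hits):
    """Fraction of annotated objects for which the criterion was met."""
    return sum(hits) / max(len(hits), 1)
```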

3.2.2 Scoring

Each image enhancement or restoration algorithm's performance on the classification task is then calculated by applying one of the two metrics defined above for each of the four networks and each collection within the UG2 dataset. This results in 12 scores for each metric (i.e., M1 or M2 scores from VGG16, VGG19, Inception, and ResNet for the UAV, Glider, and Ground collections). For the image enhancement and restoration algorithms we consider in this article, each is ranked against all other algorithms based on these scores. A score for an enhancement algorithm is considered "valid" if it was higher than the score obtained by evaluating the classification performance of the un-altered images. In other words, we only consider a score valid if it improves upon the baseline classification task of classifying the original images. The number of valid scores in which an algorithm excels over the others being evaluated is then counted as its score in points for the task (for a maximum of 24 points, achievable if an algorithm obtained the highest improvement — compared to all other competitors — in all possible configurations).
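The point-scoring rule can be summarized as in the sketch below, where scores maps each algorithm to its value in every (metric, network, collection) cell and baseline holds the scores of the un-altered images. This is an illustrative restatement of the rule, and tie-breaking details are omitted.

```python
def challenge_points(scores, baseline):
    """scores: {algorithm: {cell: value}}; baseline: {cell: value}, where a cell
    is one (metric, network, collection) combination (24 cells in total)."""
    points = {alg: 0 for alg in scores}
    for cell in baseline:
        # A score is "valid" only if it improves on the un-altered baseline.
        valid = {alg: s[cell] for alg, s in scores.items() if s[cell] > baseline[cell]}
        if valid:
            points[max(valid, key=valid.get)] += 1   # best improvement wins the point
    return points
```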

4 THE UG2 DATASET

As a basis for the evaluation regime described above in Sec. 3, we collected a new dataset which we call UG2 (UAV, Glider, and Ground). It is available for download at the following site: http://www.ug2challenge.org/dataset18.html. The training and test datasets employed in the evaluation are composed of annotated frames from three different video collections (Fig. 4 presents example frames from each). The annotations provide bounding boxes establishing object regions and classes, which were manually annotated using the Vatic tool for video annotation [105]. For running classification experiments, the objects were cropped from the frames in a square region of at least 224 × 224 pixels (a common input size for many deep learning-based recognition models), using the annotations as a guide.


Fig. 4. Examples of images in the three UG2 collections: (a) UAV Collection, (b) Glider Collection, (c) Ground Collection.

Each annotation in the dataset indicates the position, scale, visibility, and super-class of an object in a video. The need for high-level classes (super-classes) arises from the challenge of performing fine-grained object recognition using aerial collections, which have a high variability in both object scale and rotation. These two factors make it difficult to differentiate some of the more fine-grained ImageNet categories. For example, while it may be easy to recognize a car from an aerial picture taken from hundreds (if not thousands) of feet above the ground, it might be impossible to determine whether that car is a taxi, a jeep or a sports car. Thus we defined super-classes that encompass multiple visually similar ImageNet synsets, as well as evaluation metrics that allow for a coarse-grained classification evaluation of such cases (see Sec. 3.2.1). The three different video collections consist of:

(1) UAV Video Collection: Composed of clips recorded from small UAVs in both rural and urban areas, the videos in this collection are open source content tagged with a Creative Commons license, obtained from YouTube. Because of the source, they have different video resolutions (from 600 × 400 to 3840 × 2026), objects of interest sizes (cropped objects with sizes ranging from 224 × 224 to 800 × 800), and frame rates (from 12 FPS to 59 FPS). This collection has distortions such as glare/lens flare, compression artifacts, occlusion, over/under exposure, camera shaking (present in some videos that use autopilot telemetry), sensor noise, motion blur, and fish eye lens distortion. Videos with problematic scene/weather conditions such as night/low light video, fog, cloudy conditions and occlusion due to snowfall are also included.

(2) Glider Video Collection: Consists of videos recorded by licensed pilots of fixed wing gliders in both rural and urban areas. The videos have frame rates ranging from 25 FPS to 50 FPS, objects of interest sizes ranging from 224 × 224 to 900 × 900, and different types of compression such as MTS, MP4 and MOV. The videos mostly present imagery taken from thousands of feet above ground, further increasing the difficulty of object recognition. Additionally, the scenes contain artifacts such as motion blur, camera shaking, noise, occlusion (which in some cases is pervasive throughout the videos, showcasing parts of the glider that partially occlude the objects of interest), glare/lens flare, over/under exposure, interlacing, and fish eye lens distortion. This collection also contains videos with problematic weather conditions such as fog, clouds and occlusion due to rain.

TABLE 1: Summary of the UG2 training dataset

Collection          UAV       Glider    Ground    Total
Total Videos        50        61        178       289
Total Frames        434,264   657,455   125,777   1,217,496
Annotated Videos    30        30        136       196
Extracted Objects   32,608    31,760    95,096    159,464
Super-Classes [1]   31        20        20        37

(3) Ground Video Collection: In order to provide some ground-truth with respect to problematic image conditions, this collection contains videos captured at ground level with intentionally induced artifacts. These videos capture static objects (e.g., flower pots, buildings) at a wide range of distances (30ft, 40ft, 50ft, 60ft, 70ft, 100ft, 150ft, and 200ft), and motion blur induced by an orbital shaker to generate horizontal movement at different rotations per minute (120rpm, 140rpm, 160rpm, and 180rpm). Additionally, this collection includes videos under different weather conditions (sun, clouds, rain, snow) that could affect object recognition. We used a Sony Bloggie hand-held camera (with 1280 × 720 resolution and a frame rate of 60 FPS) and a GoPro Hero 4 (with 1920 × 1080 resolution and a frame rate of 30 FPS), whose fisheye lens introduced further distortion. Furthermore, we provide an additional class of videos (resolution-chart) showcasing a 9 × 11 inch 9 × 9 checkerboard grid exhibiting all the aforementioned distances at all intervals of rotation. The motivation for including this additional class is to provide a reference for camera calibration used for ground data and to aid participants in finding the distortion measures of the cameras used.

Training Dataset. The training dataset is composed of 289 videos with 1,217,496 frames, representing 228 ImageNet [1] classes extracted from annotated frames from the three different video collections. These classes are further categorized into 37 super-classes encompassing visually similar ImageNet categories and two additional classes for pedestrian and resolution chart images. Furthermore, the dataset contains a subset of 159,464 object-level annotated images and the videos are tagged to indicate problematic conditions. Table 1 summarizes the training dataset.

Testing Dataset. The testing dataset is composed of 55 videos with frame-level annotations. Out of the annotated frames, 8,910 disjoint frames were selected among the three different video collections, from which we extracted 9,187 objects. These objects are further categorized into 42 super-classes encompassing visually similar ImageNet categories. While most of the super-classes in the testing dataset overlap with those in the training dataset, there are some classes unique to each. Table 2 summarizes the testing dataset.

5 NOVEL AND BASELINE ALGORITHMS

Six competitive teams participated in the 2018 UG2 Workshop held at CVPR, each submitting a novel approach for image restoration and enhancement meant to address the evaluation tasks we described in Sec. 3. In addition, we assessed eight different classical and deep learning-based baseline algorithms from the literature.


TABLE 2: Summary of the UG2 testing dataset

Collection          UAV      Glider   Ground   Total
Total Videos        19       15       21       55
Annotated Frames    2,814    2,909    3,187    8,910
Extracted Objects   3,000    3,000    3,187    9,187
Super-Classes [1]   28       17       20       42

5.1 Challenge Workshop Entries

The six participating teams were Honeywell ACST, Northwestern University, Texas A&M and Peking University, National Tsing Hua University, Johns Hopkins University, and Noblis. Each team had a unique take on the problem, with an approach designed for one or both of the evaluation tasks.

5.1.1 Camera and Conditions-Relevant Enhancements (CCRE)

Honeywell ACST's algorithmic pipeline was motivated by a desire to closely target image enhancements in order to avoid the counter-productive results that the UG2 dataset has highlighted [5]. Fig. 5 illustrates their approach. Of the wide range of image enhancement techniques, there is a smaller subset of enhancements which may be useful for a particular image. To find this subset, the CCRE pipeline considers the intersection of camera-relevant enhancements with conditions-relevant enhancements. Examples of camera-relevant enhancements include de-interlacing, rolling shutter removal (both depending on the sensor hardware), and de-vignetting (for fisheye lenses). Example conditions-relevant enhancements include de-hazing (when imaging distant objects outdoors) and raindrop removal. To choose among the enhancements relevant to various environmental conditions and the camera hardware, CCRE makes use of defect-specific detectors.

This approach, however, requires a measure of manual tuning. For the evaluation task targeting human vision-based image quality assessment, manual inspection revealed severe interlacing in the glider set. Thus a simple interlacing detector was designed to separate each frame into two fields (comprised of the even and odd image rows, respectively) and compute the horizontal shift needed to register the two. If that horizontal shift was greater than 0.16 pixels, then the image was deemed interlaced, and de-interlacing was performed by linearly interpolating the rows to restore the full resolution of one of the fields.
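The sketch below illustrates this kind of field-shift test. The registration step here uses phase correlation, which is one plausible way to estimate the horizontal shift; the original implementation may differ.

```python
import numpy as np
import cv2

def deinterlace_if_needed(gray, threshold=0.16):
    """gray: single-channel frame as a NumPy array."""
    even = gray[0::2, :].astype(np.float32)
    odd = gray[1::2, :].astype(np.float32)
    h = min(even.shape[0], odd.shape[0])
    # phaseCorrelate returns ((dx, dy), response); dx is the horizontal shift.
    (dx, _dy), _resp = cv2.phaseCorrelate(even[:h], odd[:h])
    if abs(dx) <= threshold:
        return gray                                   # deemed progressive
    # Keep one field and linearly interpolate the missing rows.
    restored = cv2.resize(even, (gray.shape[1], gray.shape[0]),
                          interpolation=cv2.INTER_LINEAR)
    return restored.astype(gray.dtype)
```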

For the evaluation task targeting automated object classification, de-interlacing is also performed with the expectation that the edge-type features learned by the VGG network will be impacted by jagged edges from interlacing artifacts. Beyond this, a camera and conditions assessment is partially automated using a file analysis heuristic to determine which of the collections a given video frame came from. While interlacing was the largest problem with the glider images, the ground and UAV collections were degraded by compression artifacts. Video frames from those collections were processed with the Fast Artifact Reduction CNN [51].

Fig. 5. The CCRE approach conditionally selects enhancements.

Fig. 6. Network architecture used for MA-CNN image enhancement.

5.1.2 Multiple Artifact Removal CNN (MA-CNN)

The Northwestern team focused their attention on three major causes of artifacts in an image: (1) motion blur, (2) defocus blur and (3) compression algorithms. They observed that, in general, traditional algorithms address inverse problems in imaging via a two-step procedure: first by applying proximal algorithms to enforce measurement constraints and then by applying natural image priors (sometimes denoisers) on the resulting output [106], [107]. Recent trends in inverse imaging algorithms have focused on developing a single algorithm or network to address multiple imaging artifacts [108]. These networks are alternately applied to denoise and deblur the image. Building on the above principle, the MA-CNN learning-based approach was developed to remove multiple artifacts in an image. A training dataset was created by introducing motion, de-focus and compression artifacts into 126,000 images from ImageNet. The motion blur was introduced by using a kernel with a fixed length l and random direction θ for each of the images in the training dataset. The defocus blur was introduced by using a Gaussian kernel with a fixed standard deviation σ. The parameters {l, σ} were tuned to create a perceptually improved result.
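The following sketch shows how such a degraded training set could be synthesized with OpenCV. The kernel construction and the specific values of l, σ, and compression quality are illustrative assumptions rather than the team's exact settings.

```python
import numpy as np
import cv2

def motion_blur(img, length=15, theta=None):
    """Blur with a line kernel of fixed length and random direction theta."""
    theta = np.random.uniform(0, np.pi) if theta is None else theta
    k = np.zeros((length, length), np.float32)
    k[length // 2, :] = 1.0                        # horizontal line kernel
    rot = cv2.getRotationMatrix2D((length / 2, length / 2), np.degrees(theta), 1.0)
    k = cv2.warpAffine(k, rot, (length, length))   # rotate to direction theta
    return cv2.filter2D(img, -1, k / k.sum())

def defocus_blur(img, sigma=2.0):
    return cv2.GaussianBlur(img, (0, 0), sigma)    # kernel size derived from sigma

def jpeg_compress(img, quality=30):
    ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```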

MA-CNN is a fully convolutional neural network architecture with residual skip connections to generate the enhanced image. The network architecture is shown in Fig. 6. In order to get better visual quality results, a perceptual loss function that uses the first four convolutional layers of a pre-trained VGG-16 network is incorporated.


By default, the output of the MA-CNN contains checkerboard artifacts. Since these checkerboard artifacts are periodic, they can be removed by suppressing the corresponding frequencies in the Fourier domain. Moreover, all images (of the same size) generated with the network have artifacts in a similar region in the Fourier domain. For images of differing sizes, the distance of the center of the artifact from the origin is proportional to the image size.
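A minimal sketch of this kind of Fourier-domain notch filtering is given below; the artifact locations (notch_centers) are assumed to be known for a given image size, as described above.

```python
import numpy as np

def suppress_checkerboard(channel, notch_centers, radius=3):
    """Zero out small neighborhoods around known artifact frequencies.
    notch_centers: (row, col) peaks in the centered spectrum (include mirrors)."""
    F = np.fft.fftshift(np.fft.fft2(channel.astype(np.float32)))
    yy, xx = np.mgrid[0:F.shape[0], 0:F.shape[1]]
    for cy, cx in notch_centers:
        F[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))
```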

5.1.3 Cascaded Degradation Removal Modules (CDRM)

The Texas A&M team observed that independently removing any single type of degradation could in fact undermine performance in the object recognition evaluation task, since other degradations were not simultaneously considered and those artifacts might be amplified during this process. Consequently, they proposed a pipeline that consists of sequentially cascaded degradation removal modules to improve recognition. Further, they observed that different collections within the UG2 dataset had different degradation characteristics. As such, they proposed to first identify the incoming images as belonging to one of the three collections, and then deploy a specific processing model for each collection. The resulting pipeline, an ensemble of three strategies, is depicted in Fig. 7. In their model they adopted six different enhancement modules.

(1) Histogram Equalization balances the distribution of pixel intensities and increases the global contrast of images. To do this, Contrast Limited Adaptive Histogram Equalization (CLAHE) is adopted [109]. The image is partitioned into regions and the histogram of the intensities in each is mapped to a more balanced distribution. As the method is applied at the region level, it is more robust to locally strong over-/under-exposures and can preserve edges better. (2) Given that removing blur effects is widely found to be helpful for fast-moving aerial cameras, and/or in low light filming conditions, DeblurGAN [110] is employed as an enhancement module in which, with adversarial training, the generator in the network is able to transform a blurred image to a visually sharper one. (3) Recurrent Residual Net for Super-Resolution was previously proposed in [111]. Due to the large distance between objects and aerial cameras, low resolution is a bottleneck for recognizing most objects from UAV photos. This model is a recurrent residual convolutional neural network consisting of six layers and skip-connections. (4) Deblocking Net [112] is an auto-encoder-based neural network with dilation convolutions to remove blocking effects in videos, which was fine-tuned using the VGG-19 perceptual loss function, after training using JPEG-compressed images. Since lossy video coding for on-board sensors introduced blocking effects in many frames, the adoption of the deblocking net was found to suppress visual artifacts. (5) RED-Net [113] is trained to restore multiple mixed degradations, including noise and low resolution together. Images with various noise levels and scale levels are used for training. The network can improve the overall quality of images. (6) HDR-Net [114] can further enhance the contrast of images to improve the quality for machine and human analysis. This network learns to produce a set of affine transformations in bilateral space to enhance the image while preserving sharp edges.
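As a small concrete example of module (1), the sketch below applies OpenCV's CLAHE to the luminance channel; the clip limit and tile size are illustrative, not the values used by the team.

```python
import cv2

def clahe_enhance(bgr, clip_limit=2.0, tiles=(8, 8)):
    """Apply CLAHE to the L channel only, leaving color unchanged."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2Lab)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tiles)
    l = clahe.apply(l)                              # region-wise equalization
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_Lab2BGR)
```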


Fig. 7. The CDRM enhancement pipeline. If the glider set is detected, no action is taken (recognition is deemed to be good enough by default).

5.1.4 Tone Mapping Deep Image Prior (TM-DIP)

The main idea of the National Tsing Hua University team's approach was to derive deep image priors for enhancing images that are captured from a specific scene with certain poor imaging conditions, such as the UG2 collections. They consider the setting in which the high-quality counterparts of the poor-quality input images are unavailable, and hence it is not possible to collect pairwise input/output data for end-to-end supervised training to learn how to recover the sharp images from blurry ones.

The method of deep image prior presented by Ulyanov et al. [115] can reconstruct images without using information from ground-truth sharp images. However, it usually takes several minutes to produce a prior image by training an individual network for each image. Thus a new method was designed to replace the per-image prior model of [115] by a generic prior network. This idea is feasible since images taken in the same setting, e.g., the UG2 videos, often share similar features. It is not necessary to have a distinct prior model for each image. One can learn a generic prior network that takes every image as the input and generates its corresponding prior as the output.

At training time, the method from [115] is used to generate image pairs {(I, V)} for training a generic prior network, where I is an original input image and V is its corresponding prior image. The generic prior network adopts an encoder-decoder architecture with skip connections as in [116]. At inference time, given a new image, its corresponding prior image is efficiently obtained from the learned generic prior network, with tone mapping then applied to enhance the details.

It was observed that the prior images obtained by the learned generic prior network usually preserve the significant structure of the input images but exhibit fewer details. This observation, therefore, led to a different line of thought on the image enhancement problem. By comparing the prior image with the original input image, details for enhancement may be extracted. Thus, the tone mapping technique presented in [117] was used to enhance the details:

Î = (I / V)^γ · V,    (1)

where I is the input image, V refers to the prior image, the ratio I/V can be considered as the details, and γ is a factor for adjusting the degree of detail-enhancement. With the tone-mapping function in Eq. (1), the local details are detached from the input image, and the factor γ is subsequently adjusted to obtain an enhanced image Î.
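A minimal sketch of applying Eq. (1) is shown below; for simplicity γ is treated as a scalar here, although the equation allows it to depend on the prior image V, and the output is clipped to an 8-bit range as an assumption.

```python
import numpy as np

def tone_map(I, V, gamma=1.5, eps=1e-6):
    """I: input image, V: prior image from the generic prior network (both uint8)."""
    I = I.astype(np.float32)
    V = np.clip(V.astype(np.float32), eps, None)   # avoid division by zero
    details = I / V                                 # ratio image = local details
    enhanced = (details ** gamma) * V               # Eq. (1) with scalar gamma
    return np.clip(enhanced, 0, 255).astype(np.uint8)
```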


TABLE 3: The SSR network layers

Filters   Kernel Shape   Activation
128       5×5            LeakyReLU
128       1×1            LeakyReLU
128       1×1            LeakyReLU
128       1×1            LeakyReLU
4         1×1            ReLU

5.1.5 Satellite Images Super-Resolution (SSR)

The team from Johns Hopkins University proposed a neural network-based approach to apply super-resolution on images. They trained their model on satellite imagery, which has an abundance of detailed features. Their network is fully convolutional, and takes as input an image of any resolution and outputs an image that is exactly double the original input in width and height.

The network is constrained to 32 × 32 pixel patches of the image with an "apron" of 2 pixels for an overlap. This results in a 64 × 64 output where the outer 4 pixels are ignored, as they are the apron — they mirror the edge to "pad" the image. These segments are then stitched together to form the final image. The network consists of five convolutional layers; see Table 3 for details.

Most of the network's layers contain 1 × 1 kernels, and hence are just convolutionalized fully connected layers. This network structure is appropriate for a super-resolution task because it can be equated to a regression problem where the input is a 25-dimensional (5 × 5) vector leading to a 4-dimensional (2 × 2) vector. The first convolutional layer is necessary to maintain the spatial relationships of the visual features through the 5 × 5 kernel.

The SpaceNet dataset [118] is used to train this network, and is derived from satellite-based images. Images were downsampled and paired with the originals. Training took place for 20 epochs using an L2 + L1 combined loss and the Adam optimizer in Keras/TensorFlow [103].
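A Keras sketch of the Table 3 architecture is given below. Interpreting the final 4-filter layer as a 2 × 2 sub-pixel (depth-to-space) output is an assumption made here to produce the doubled resolution, and the loss is simplified to plain L1 rather than the team's combined L2 + L1 objective.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_ssr(channels=1):
    inp = layers.Input(shape=(None, None, channels))       # fully convolutional
    x = layers.Conv2D(128, 5, padding="same")(inp)          # 5x5 layer from Table 3
    x = layers.LeakyReLU()(x)
    for _ in range(3):                                      # three 1x1 "FC" layers
        x = layers.Conv2D(128, 1, padding="same")(x)
        x = layers.LeakyReLU()(x)
    x = layers.Conv2D(4 * channels, 1, padding="same", activation="relu")(x)
    out = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)   # 2x upscaling
    return models.Model(inp, out)

model = build_ssr()
model.compile(optimizer="adam", loss="mae")   # stand-in for the combined L2 + L1 loss
```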

5.1.6 Style-Transfer Enhancement Using GANs (ST-GAN)

Noblis attempted a style-transfer approach for improving the quality of the UG2 imagery. Since the classification networks used in the UG2 evaluation protocol were all trained on ImageNet, a CycleGAN [119] variant was trained to translate between the UAV and drone collections and ImageNet, using LSGAN [120] losses. The architecture was based on the original CycleGAN paper, with modified generators adding skip connections between same spatial resolution convolutional layers on both sides of the residual blocks (in essence a U-Net [116] style network), which appeared to improve retention of details in the output images. The UG2-to-ImageNet generator was made to perform 4× upscaling (by adding two 0.5-strided convolutional layers after the first convolutional layer), while the ImageNet-to-UG2 generator performed the corresponding 4× downscaling (by adding two stride-2 convolutional layers after the first convolutional layer). The discriminators were left unmodified. Networks were trained using 128 × 128 patches selected from the UG2 images, and ImageNet images cropped and resized to 512 × 512.

UG2 patches were selected by randomly sampling regions around the ground-truth annotation bounding boxes, mainly to avoid accidentally sampling flat-colored patches.
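The patch selection can be pictured with a small sketch: given an annotation bounding box, choose a random 128×128 crop whose area overlaps the box. The exact sampling rule used by the team is not specified, so this is only one plausible reading.

```python
import random

def sample_patch_around_box(img_h, img_w, box, patch=128):
    """Randomly pick a patch-sized crop that overlaps a ground-truth box.

    box: (x, y, w, h) annotation rectangle. Returns (x0, y0, x1, y1).
    A sketch only; the team's actual sampling rule may differ."""
    x, y, w, h = box
    x0_lo, x0_hi = max(0, x - patch + 1), min(img_w - patch, x + w - 1)
    y0_lo, y0_hi = max(0, y - patch + 1), min(img_h - patch, y + h - 1)
    x0 = random.randint(x0_lo, max(x0_lo, x0_hi))
    y0 = random.randint(y0_lo, max(y0_lo, y0_hi))
    return x0, y0, x0 + patch, y0 + patch
```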

However, several problems were initially encountered when optimizing the network. Optimization would fail outright unless it employed some form of normalization, and the typical batch norm [121] was not possible due to small batch sizes. Instead, instance normalization [122] was initially used, which proved reasonably effective. However, even after avoiding immediate optimization failure, training a network with just the cycle-consistency and GAN losses can lead to mode failures, such as both generators performing color inversion within a few thousand iterations. Adding the identity mapping losses (i.e., loss terms for G(X) − X and F(Y) − Y) discussed in the original CycleGAN paper proved effective in avoiding these kinds of failures.
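The identity-mapping terms can be written down in a few lines. The sketch below follows the notation above for a standard same-resolution CycleGAN; with the scale-changing generators described here, the inputs would first need to be resized to match the generator outputs. The L1 penalty and the weight are conventional choices, not values reported by the team.

```python
import tensorflow as tf

def identity_mapping_loss(G, F, x_batch, y_batch, weight=5.0):
    """Identity terms added alongside the cycle-consistency and GAN losses:
    penalize G(X) - X and F(Y) - Y so the generators do not, e.g., invert
    colors. Assumes G and F preserve input resolution (sketch only)."""
    loss_g = tf.reduce_mean(tf.abs(G(x_batch) - x_batch))
    loss_f = tf.reduce_mean(tf.abs(F(y_batch) - y_batch))
    return weight * (loss_g + loss_f)
```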

Since the UG2 evaluation protocol specifies the enhancement of full video frames, either a larger input to the generator must be used (which seemed feasible considering the fully convolutional architecture), or the input image must be divided into tiles. In either case, instance norm led to poor results: on tiles there was a clear boundary between tiles due to different local statistics, and on full frames the output images had a skewed color distribution due to a difference in statistics over different sized inputs. To combat this, instance norm was replaced with an operation that performed normalization independently down the channels of each pixel. This stabilized convergence and did not cause problems when tiling out large images.
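A minimal sketch of that replacement operation, normalizing each pixel independently across its channels, is shown below (the epsilon term is our addition for numerical stability). Because the statistics are computed per pixel, processing a frame whole or in tiles yields the same normalization, which is consistent with the stabilization reported above.

```python
import tensorflow as tf

def pixelwise_channel_norm(x, eps=1e-5):
    """Normalize every pixel of x ([batch, height, width, channels])
    across its channel values, independently of all other pixels, so the
    statistics do not depend on tile size or image size."""
    mean, var = tf.nn.moments(x, axes=[-1], keepdims=True)
    return (x - mean) / tf.sqrt(var + eps)
```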

5.2 Baseline Algorithms

Here we describe a number of other algorithms we used as a pre-processing step to the recognition models. These serve as canonical references or baselines against which the algorithms in Sec. 5.1 were tested. We used both classical methods and state-of-the-art deep learning-based methods for image interpolation [123], super-resolution [20], [60], and deblurring [124], [125].

Classical Methods. For image enhancement, we used three different interpolation methods (bilinear, bicubic, and nearest neighbor) [123] and a single restoration algorithm (blind deconvolution [124]). The interpolation algorithms attempt to obtain a high-resolution image by up-sampling the source low-resolution image, approximating each output pixel's color and intensity values from the nearby pixels. Since they do not need any prior training, they can be directly applied to any image. Nearest neighbor interpolation assigns each output pixel the value of the nearest input pixel; bilinear interpolation computes a weighted average over the nearest 2×2 neighborhood of pixels, and bicubic interpolation over the nearest 4×4 neighborhood. Different from image enhancement, image restoration models the degradation, which is the product of motion or depth variation of the object or the camera. The blind deconvolution algorithm can be used effectively when no information about the degradation (blur and noise) is known [126]. The algorithm restores the image and the point-spread function (PSF) simultaneously.



We used Matlab's blind deconvolution algorithm, which deconvolves the image using the maximum likelihood algorithm, with a 3×3 array of 1s as the initial PSF.
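The interpolation baselines amount to a single resize call; a sketch with OpenCV is shown below (the 4× factor is illustrative, and the Matlab blind deconvolution step is not reproduced here).

```python
import cv2

def interpolation_baselines(img, scale=4):
    """Upsample an image with the three classical interpolation baselines."""
    h, w = img.shape[:2]
    size = (w * scale, h * scale)
    return {
        "nearest":  cv2.resize(img, size, interpolation=cv2.INTER_NEAREST),
        "bilinear": cv2.resize(img, size, interpolation=cv2.INTER_LINEAR),
        "bicubic":  cv2.resize(img, size, interpolation=cv2.INTER_CUBIC),
    }
```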

Deep Learning-Based Methods. With respect to state-of-the-art deep learning-based super-resolution algorithms, we tested the Super-Resolution Convolutional Neural Network (SRCNN) [20] and Very Deep Super Resolution (VDSR) [60]. The SRCNN method employs a feed-forward deep CNN to learn an end-to-end mapping between low-resolution and high-resolution images. The network was trained on 5 million "sub-images" generated from 395,909 images of the ILSVRC 2013 ImageNet detection training partition [1]. The VDSR algorithm [60] outperforms SRCNN by employing a deeper CNN inspired by the VGG architecture [22], and decreases training iterations and time by employing residual learning with a very high learning rate for faster convergence. Unlike SRCNN, the network is capable of handling different scale factors.
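As a reference point, the SRCNN baseline reduces to three convolutions applied to a bicubically upscaled input; the sketch below uses the canonical 9-1-5 configuration with 64 and 32 filters, which may differ in minor details from the exact model we evaluated.

```python
from tensorflow.keras import layers, models

def build_srcnn(channels=1):
    """Canonical SRCNN: patch extraction (9x9), non-linear mapping (1x1),
    and reconstruction (5x5), applied to a bicubically upscaled image."""
    inp = layers.Input(shape=(None, None, channels))
    x = layers.Conv2D(64, 9, padding='same', activation='relu')(inp)
    x = layers.Conv2D(32, 1, padding='same', activation='relu')(x)
    out = layers.Conv2D(channels, 5, padding='same')(x)
    return models.Model(inp, out)
```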

With respect to deep learning-based image restoration algorithms, we tested Deep Dynamic Scene Deblurring [125], which was designed to address camera shake blur. However, the results presented in [125] indicated that this method can obtain good results for other types of blur. The algorithm employs a CNN that was trained with video frames containing synthesized motion blur, such that it receives a stack of neighboring frames and returns a deblurred frame. The algorithm allows for three types of frame-to-frame alignment: no alignment, optical flow alignment, and homography alignment. For our experiments we used optical flow alignment, which was reported to have the best performance with this algorithm. We had originally evaluated an additional video deblurring algorithm proposed by Su et al. [16]. However, this algorithm employs information across multiple consecutive frames to perform its deblurring operation. Given that the training and testing partitions of the UG2 dataset consist of disjoint video frames, we omitted this method to provide a fair comparison.
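The frame-stacking input described above can be sketched as follows; the neighborhood radius is illustrative, and the optical flow alignment of the neighbors used in our experiments is omitted.

```python
import numpy as np

def stack_neighbor_frames(frames, center, radius=2):
    """Concatenate a center frame with its temporal neighbors along the
    channel axis to form the deblurring network input. frames is a list of
    H x W x C arrays; alignment of the neighbors is not shown."""
    lo, hi = max(0, center - radius), min(len(frames), center + radius + 1)
    return np.concatenate([frames[i] for i in range(lo, hi)], axis=-1)
```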

With respect to image enhancement specifically to improve classification, we tested the recently released algorithm by Sharma et al. [83]. This approach learns a dynamic image enhancement network with the overall goal of improving classification, but not necessarily human perception of the image. The proposed architecture enhances image features selectively, in such a way that the enhanced features provide a valuable improvement for a classification task. High-quality (i.e., free of visual artifacts) images are used to train a CNN to learn a configuration of enhancement filters that can be applied to an input image to yield an enhanced version that provides better classification performance.

6 RESULTS & ANALYSIS

In the following analysis, we review the results that came out of the UG2 Workshop held at CVPR 2018, and discuss additional results from the slate of baseline algorithms.

6.1 Enhancement to Facilitate Manual Inspection (UG2 Evaluation Task 1)

There is a recent trend to use deep features, as measured by the difference in activations from the higher convolutional layers of a pre-trained network for the original and reconstructed images, as a perceptual metric, the motivation being that deep features somewhat mimic human perception. Zhang et al. [21] evaluated the usability of these deep features as a measure of human perception. Their goal was to find a perceptual distance metric that resembles human judgment. The outcome of their work is the LPIPS metric, which measures the perceptual similarity between two images, ranging from 0, meaning exactly similar, to 1, which is equivalent to two completely dissimilar images. Here we compare this metric (i.e., the similarity between the original image and the output of the evaluated enhancement algorithms) directly to human perception, which we argue is the better reference point for such assessments.

We used the most current version (v0.1) of the LPIPS metric with a pre-trained, linearly calibrated AlexNet model. As can be observed in Fig. 8, the four novel algorithms that were submitted by participants for the first UG2 evaluation task have very heterogeneous effects on different images of the dataset, with LPIPS scores ranging all the way from 0 (no perceptual dissimilarity) to 0.4 (moderate dissimilarity). This effect is accentuated for the images in the UAV Collection (Fig. 8a), which yields more variance in LPIPS scores, whereas LPIPS scores for the remaining two collections (Fig. 8b, 8c) remained between 0 and 0.15 for most of these algorithms. We observed a similar effect for some of our baseline algorithms (Fig. 9a), particularly for Blind Deconvolution (BD), which sharpened images but also amplified artifacts that were already present.
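For reference, the measurements can be reproduced with the publicly released lpips package; the sketch below assumes inputs already converted to 3-channel tensors scaled to [-1, 1], which is the convention the package expects.

```python
import lpips
import torch

# Linearly calibrated AlexNet variant (LPIPS v0.1), as used in our evaluation.
loss_fn = lpips.LPIPS(net='alex')

def perceptual_distance(img0, img1):
    """img0, img1: float tensors of shape [1, 3, H, W] scaled to [-1, 1].
    Returns a scalar near 0 for perceptually similar image pairs."""
    with torch.no_grad():
        return loss_fn(img0, img1).item()
```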

However, we observed that for both the participant and baseline algorithms, the images human raters considered as having a high level of improvement tended to also have low LPIPS scores (usually between 0 and 0.15), while images with higher LPIPS scores tended to be rated negatively by human observers (see Figs. 8 and 9a). A similar behaviour was observed by Zhu et al. [127], who suggest that high LPIPS scores might indicate the presence of unnatural looking images. Contrasting the results of the two best participant and baseline algorithms (Fig. 9b), we observed that for the baseline algorithms and the CCRE approach the LPIPS and human rating distributions were more tightly grouped than for MA-CNN. Nevertheless, their changes to images tended to be considered small by both human raters and the LPIPS metric, with a user rating close to 3.0 (no change) and LPIPS scores between 0 and 0.1. In contrast, changes induced by the MA-CNN method reached extremes of 0.36 and 2.35 for LPIPS score and human rating respectively, flagging the presence of very noticeable, but in some cases detrimental, changes. This is a further constraint of the LPIPS metric.

It is important to note that while we calculated the mean user rating for all the workshop participant submissions and baseline algorithms, it was not possible to obtain LPIPS scores for any of the super-resolution approaches. This would have required us to down-sample the enhanced images to the same size as the original images, which would have negated the improvement of such methods.

Focusing just on the image improvement / deterioration as perceived by human raters, we can turn to Fig. 10a for the performance of all algorithms, including the super-resolution approaches, submitted by each team. It is important to note that while most of the algorithms tended to improve the visual quality of the images they were presented with, a large fraction of the images they enhanced tended to have an average score between 3 and 3.25. This means that the human raters were not able to find any significant difference between the enhanced and original image pairs.
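When reading the figures that follow, recall the direction of the rating scale: 3 means no perceived change, lower means improvement, and higher means deterioration. The helper below encodes the qualitative bins used in the discussion; the exact cut points are our reading of the text.

```python
def rating_bucket(mean_rating):
    """Map a mean human rating to the qualitative bins discussed below
    (approximate cut points; 3.0 = no perceived change)."""
    if mean_rating < 2.75:
        return "clear improvement"
    if mean_rating < 3.0:
        return "subtle improvement"
    if mean_rating <= 3.25:
        return "no significant change"
    return "perceived deterioration"
```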



Fig. 8. Distribution of LPIPS similarity between original and enhanced image pairs and the human perceived improvement / deterioration for each of the collections: (a) UAV Collection, (b) Glider Collection, (c) Ground Collection. Images human raters considered as having a high level of improvement tended to also have low LPIPS scores, while images with higher LPIPS scores tended to be rated negatively by human observers.


The best performing algorithm submitted for this task, CCRE, was able to improve the visual quality of 126 images, even though the algorithm did not appear to perform any significant changes to most of the images (144 images had a rating between 3 and 3.25). The enhancement applied was considered a subtle improvement in most scenarios: 114 images had a score between 2.75 and 3, with the remaining 12 having a higher improvement score between 2.5 and 2.75. However, only 10% of the modified images were considered to degrade the image quality, and even then they had a rating between 3.25 and 3.5, which means that the degradation was very small.

Fig. 9. Distribution of LPIPS similarity between original and enhanced image pairs of the test dataset and the human perceived improvement / deterioration for all of the collections within the UG2 dataset: (a) Baseline Algs., (b) Top Participant vs. Baseline Algs.

TABLE 4
Classification scores for the un-altered images of the testing dataset. The evaluated algorithms were expected to improve these scores.

Network        UAV      Glider   Ground
VGG16-M1       24.33%   48.40%   47.35%
VGG16-M2        5.17%    7.10%   38.78%
VGG19-M1       24.90%   47.93%   54.50%
VGG19-M2        5.47%    6.50%   45.21%
Inception-M1   31.00%   55.87%   55.26%
Inception-M2    7.23%    6.17%   49.67%
ResNet-M1      28.77%   51.33%   65.23%
ResNet-M2       5.87%    6.63%   52.06%

As mentioned previously, the visual changes generated by the runner-up enhancement algorithm, MA-CNN, were more explicit than those produced by CCRE. While the number of images that were considered to be improved was smaller (59 improved images), 13 of them were in the range of 2.25 to 2.75, indicating a good measure of higher visual quality. Nevertheless, the sharp changes introduced by this algorithm also seemed to increase the perceived degradation on a larger portion of the images, with almost 25% having a score between 3.5 and 3.75 (thus indicating a significant deterioration of the image quality).

With respect to the baseline algorithms, we observed a less dramatic perception of quality degradation. Given that most of the algorithms tested were focused on enhancing the image resolution (by performing image interpolation or super-resolution), they were more prone to perform very subtle changes to the structure of the image. This is reflected in Fig. 10b. Fig. 10c shows a side-by-side comparison of the two best baselines (VDSR and Nearest Neighbor interpolation) and the two best performing participant submissions.

6.2 Evaluation of Object Recognition Performance (UG2 Evaluation Task 2)

For this evaluation task the participants were expected to provide enhancement techniques catering to machine vision rather than human perception. Table 4 and Fig. 11 depict the baseline classification results at rank 5 for the UG2 training and testing datasets, without any restoration or enhancement algorithm applied.



Fig. 10. Comparison of perceived visual improvement for all collections after applying enhancement algorithms: (a) Participant Algs., (b) Baseline Algs., (c) Top Participant vs. Top Baseline Algs.

Given the very poor quality of its videos, the UAV Collection proved to be the most challenging for all networks in terms of object classification, leading to the lowest classification performance of the three collections. While the Glider Collection shares similar problematic conditions with the UAV Collection, its images lead to a higher classification rate than those in the UAV Collection in terms of identifying at least one correctly classified synset class (metric M1). This improvement might be caused by the limited degree of movement of the gliders, which ensures that the displacement between frames is kept more stable over time, as well as by a higher recording quality (the camera weight limitations present in small UAVs are no longer a limiting factor for this collection's videos). The controlled Ground Collection yielded the highest classification rates, which, in an absolute sense, are still low (the highest being 65.23% for metric M1 and 52.06% for metric M2 on the testing dataset; see Table 4). The participants of the UG2 workshop were expected to develop algorithms to improve upon these baseline scores.

Fig. 11. Classification rates at rank 5 for the original, un-processed frames for each collection in the training and testing datasets.

While there are some correlations between the improvement of visual quality as perceived by humans and high-level tasks such as object classification or face recognition performed by networks, research by Sharma et al. [83] suggests that image enhancement focused on improving image characteristics valuable for object recognition can lead to an increase in classification performance. Thus we tried such a technique as an initial experiment and compared it to the baseline results. Fig. 12 shows the performance of the five enhancement filters proposed by Sharma et al. on our dataset. It is important to note that these were the un-altered filters of Sharma et al., trained making use of good quality images, because that work was focused on improving the classification performance of images with few existing perturbations. As such, the effect on improving highly corrupted images is much different from that obtained on a standard dataset of images crawled from the web. The results in Fig. 12 establish that even existing deep learning networks designed for this task cannot achieve good classification rates for UG2 due to the domain shift in training.

As can be observed in Table 5 and Fig. 13, the four novel algorithms submitted by participants for the second UG2 evaluation task excelled in the processing of certain collections while falling short in others. Most of the submitted algorithms were able to improve the classification performance of the images in the Ground Collection, but they struggled in improving the classification for the aerial collections, whose scenes tend to have a higher degree of variability than those present in the Ground Collection. Only the CCRE algorithm was able to improve the performance of one of the metrics for the UAV Collection (the M2 metric for the ResNet network, with a 0.20% improvement over the baseline). The MA-CNN algorithm was able to improve two of the Glider Collection metrics (the M2 metric for the VGG16 and VGG19 networks, with 2.23% and 3.10% improvement respectively over the baselines). For the most part, algorithms tended to improve the metrics for the Ground Collection, with the highest classification improvement provided by the CDRM method, at 5.30% and 5.21% over the baselines for the Inception M1 and M2 metrics.

Further along these lines, while both metrics saw some improvement, the M2 metric benefited the most from these enhancement algorithms. This behavior is more pronounced when examined in the context of classic vs. state-of-the-art algorithms (Fig. 14).



Fig. 12. Comparison of classification rates at rank 5 for each collection after applying the classification-driven image enhancement algorithms of Sharma et al. [83] (bilateral filtering, guided filters, histogram equalization, image sharpening, weighted least squares): (a) UAV Collection, (b) Glider Collection, (c) Ground Collection. Markers in red indicate results on original images.

TABLE 5
Highest classification rates for the evaluated participant enhancement algorithms that improved the classification performance of the un-altered test images. Improvement was not achieved in all cases.

Metric         UAV            Glider           Ground
VGG16-M1       –              –                –
VGG16-M2       –              MA-CNN (9.33%)   TM-DIP (42.42%)
VGG19-M1       –              –                CCRE (54.53%)
VGG19-M2       –              MA-CNN (9.60%)   CDRM (48.16%)
Inception-M1   –              –                CDRM (60.56%)
Inception-M2   –              –                CDRM (54.88%)
ResNet-M1      –              –                CCRE (65.45%)
ResNet-M2      CCRE (6.07%)   –                CCRE (53.78%)

While the baseline enhancement algorithms had moderate improvements in both metrics, the participant algorithms seemed to favor the M2 metric over the M1 metric. For example, while the highest improvement (on the Ground Collection) for the VGG19 network in M1 was 0.03%, the improvement for the same network in M2 was 2.95%. This leads us to conclude that the effect these algorithms have on automatic object recognition is to increase within-class classification certainty. In other words, they make an object belonging to a super-class L_i = {s_1, ..., s_n} a better representation of that class's features, such that the networks are able to detect more members of the class in their top-5 predictions.
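To make the within-class-certainty interpretation concrete, the sketch below shows one plausible reading of the two rank-5 measures: M1 counts a frame as correct if any synset of the annotated super-class appears in the top-5 predictions, while M2 additionally rewards how many of the top-5 predictions fall inside the super-class. The exact metric definitions are given earlier in the paper; this is only an illustration.

```python
def rank5_scores(top5_synsets, superclass_synsets):
    """Illustrative rank-5 scores for one frame.

    top5_synsets: the five predicted synsets.
    superclass_synsets: the set {s_1, ..., s_n} of the annotated super-class L_i.
    Returns (m1, m2) under the simplified reading described above."""
    hits = sum(1 for s in top5_synsets if s in superclass_synsets)
    m1 = 1.0 if hits > 0 else 0.0
    m2 = hits / len(top5_synsets)
    return m1, m2
```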

7 DISCUSSION

The area of computational photography has a long-standing history of generating algorithms to improve image quality. However, the metrics those algorithms tend to optimize do not correspond to human perception, and, as we found out, this makes them unreliable for the task of object recognition. Given this, how do we design an enhancement and recognition pipeline that simultaneously corrects imaging artifacts and recognizes objects of interest?

Skeptics may argue that re-training or fine-tuning existing recognition networks can solve the problem. As promising as this may sound, training a network involves a large cost in terms of time and resources, and if we were to re-train these networks for each new task or dataset, we would be ignoring other paths to solving this problem. The UG2 dataset paves the way for this new research through the introduction of a benchmark that is well matched to the problem area. The challenge is to jointly optimize two seemingly divergent but related tasks: (1) image enhancement and restoration and (2) object recognition. The first iteration of the challenge workshop making use of this dataset saw participation from teams around the globe and introduced six unique algorithms to bridge the gap between computational photography and recognition, which we have described in this article. As noted by some participants and in accordance with our initial results, the problem is still not solved: improving image quality using existing techniques does not necessarily solve the recognition problem.

The results of our experiments led to some surprises. Even though the restoration and enhancement algorithms tended to improve the classification results for the diverse imagery included in our dataset, no approach was able to uniformly improve the results for all of the candidate networks. Moreover, in some cases, performance degraded after image pre-processing, particularly for frames with higher amounts of image aberrations. This highlights the often single-focus nature of image enhancement algorithms, which tend to specialize in removing a specific kind of artifact from an otherwise good quality image, which might not always be the case. Some of the algorithmic advancements (e.g., MA-CNN and CDRM) developed as a product of this challenge seek to address the problem by incorporating techniques such as deblurring, denoising, deblocking, and super-resolution into a single pre-processing pipeline.



Fig. 13. Comparison of classification rates at rank 5 for each collection after applying the four algorithms submitted by teams for this task (CCRE, CDRM, MA-CNN, TM-DIP): (a) UAV Collection, (b) Glider Collection, (c) Ground Collection.

Fig. 14. Performance comparison between the two best performing participant algorithms (CCRE, CDRM) and the two best baseline enhancement algorithms (SRCNN, Bilinear): (a) UAV Collection, (b) Glider Collection, (c) Ground Collection.

Such a practice pointed out that while the individual implementation of some of these techniques might be detrimental to the visual quality or to the visual recognition task, when applied in conjunction with other enhancement techniques their effect turned out to be beneficial for both of these objectives.

We also found that image quality is a subjective assessment best left to humans, who, as a result of evolution, are physiologically tuned to notice variations and artifacts in images. Based on this observation, we developed a psychophysics-based evaluation regime for human assessment and a realistic set of quantitative measures for object recognition performance. The code for conducting such studies will be made publicly available following the publication of this article.

Inspired by the success of the UG2 challenge workshop held at CVPR 2018, we intend to hold subsequent iterations with associated competitions based on the UG2 dataset. These workshops will be similar in spirit to the PASCAL VOC and ImageNet workshops that have been held over the years and will feature new tasks, extending the reach of UG2 beyond the realm of image quality assessment and object classification.

ACKNOWLEDGMENTS

Funding for this work was provided under IARPA contract #2016-16070500002, and NSF DGE #1313583. This workshop is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA). The views and conclusions contained herein are those of the organizers and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Hardware support was generously provided by the NVIDIA Corporation, and made available by the National Science Foundation (NSF) through grant #CNS-1629914.



We thank Drs. Adam Czajka and Christopher Boehnen for serving as impartial judges for the challenge tracks to determine the winners, Mr. Vivek Sharma for executing his code on our data and providing us with the results, Kelly Malecki for her tireless effort in annotating the test dataset for the UAV and Glider collections, and Sandipan Banerjee for assistance with data collection.

REFERENCES

[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNetlarge scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp.211–252, 2015.

[2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol.521, no. 7553, p. 436, 2015.

[3] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into recti-fiers: Surpassing human-level performance on imagenet classifi-cation,” in IEEE ICCV, 2015.

[4] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Clos-ing the gap to human-level performance in face verification,” inIEEE CVPR, 2014.

[5] R. G. Vidal, S. Banerjee, K. Grm, V. Struc, and W. J. Scheirer,“UG2: A video benchmark for assessing the impact of imagerestoration and enhancement on automatic visual recognition,”in IEEE WACV, 2018.

[6] S. Dodge and L. Karam, “Understanding how image qualityaffects deep neural networks,” in QoMEX, 2016.

[7] G. B. P. da Costa, W. A. Contato, T. S. Nazaré, J. do E. S.Batista Neto, and M. Ponti, “An empirical study on the effectsof different types of noise in image classification tasks,” CoRR,vol. abs/1609.02781, 2016.

[8] S. Dodge and L. Karam, “A study and comparison of human anddeep learning recognition performance under visual distortions,”in ICCCN, 2017.

[9] H. Hosseini, B. Xiao, and R. Poovendran, “Google’s cloud visionapi is not robust to noise,” in IEEE ICMLA, 2017.

[10] J. Yim and K. Sohn, “Enhancing the performance of convolu-tional neural networks on quality degraded datasets,” CoRR, vol.abs/1710.06805, 2017.

[11] B. RichardWebster, S. E. Anthony, and W. J. Scheirer, “Psyphy: Apsychophysics driven evaluation framework for visual recogni-tion,” IEEE T-PAMI, 2018, to Appear.

[12] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International Conference onCurves and Surfaces, 2010.

[13] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel,“Low-complexity single-image super-resolution based on non-negative neighbor embedding,” in BMVC, 2012.

[14] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchoredneighborhood regression for fast super-resolution,” in ACCV,2014.

[15] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in IEEE CVPR,2015.

[16] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, andO. Wang, “Deep video deblurring,” CoRR, vol. abs/1611.08387,2016. [Online]. Available: http://arxiv.org/abs/1611.08387

[17] S. Diamond, V. Sitzmann, S. P. Boyd, G. Wetzstein, and F. Heide,“Dirty pixels: Optimizing image classification architectures forraw sensor data,” CoRR, vol. abs/1701.06487, 2017.

[18] M. Haris, G. Shakhnarovich, and N. Ukita, “Task-driven superresolution: Object detection in low-resolution images,” CoRR, vol.abs/1803.11316, 2018.

[19] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch, “Enhancenet: Singleimage super-resolution through automated texture synthesis,”CoRR, vol. abs/1612.07919, 2016.

[20] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deepconvolutional network for image super-resolution,” in ECCV,2014.

[21] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang,“The unreasonable effectiveness of deep features as a perceptualmetric,” in IEEE CVPR, 2018.

[22] K. Simonyan and A. Zisserman, “Very deep convolutionalnetworks for large-scale image recognition,” CoRR, vol.abs/1409.1556, 2014.

[23] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,“Rethinking the inception architecture for computer vision,” inIEEE CVPR, 2016.

[24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning forimage recognition,” in IEEE CVPR, 2016.

[25] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, “Understand-ing and evaluating blind deconvolution algorithms,” in ComputerVision and Pattern Recognition (CVPR), 2009., June 2009.

[26] E. Agustsson and R. Timofte, “Ntire 2017 challenge on singleimage super-resolution: Dataset and study,” in IEEE CVPR Work-shops, 2017.

[27] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluationof recent full reference image quality assessment algorithms,”IEEE T-IP, vol. 15, no. 11, pp. 3440–3451, 2006.

[28] C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in ECCV, 2014.

[29] R. B. Fisher, “The PETS04 surveillance ground-truth data sets,”in IEEE PETS Workshop, 2004.

[30] M. Grgic, K. Delac, and S. Grgic, “SCface–surveillance camerasface database,” Multimedia Tools and Applications, vol. 51, no. 3,pp. 863–879, 2011.

[31] J. Shao, C. C. Loy, and X. Wang, “Scene-independent groupprofiling in crowd,” in IEEE CVPR, 2014.

[32] X. Zhu, C. C. Loy, and S. Gong, “Video synopsis by heteroge-neous multi-source correlation,” in IEEE ICCV, 2013.

[33] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. C. Chen, J. T. Lee,S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis, E. Swears,X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash,D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, and M. Desai, “A large-scale benchmark dataset forevent recognition in surveillance video,” in IEEE CVPR, 2011.

[34] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, “Vision meetsdrones: A challenge,” arXiv preprint arXiv:1804.07437, 2018.

[35] “UCF Aerial Action data set,” http://crcv.ucf.edu/data/UCF_Aerial_Action.php.

[36] "UCF-ARG data set," http://crcv.ucf.edu/data/UCF-ARG.php.
[37] M. Mueller, N. Smith, and B. Ghanem, "A benchmark and simulator for UAV tracking," in ECCV, 2016.
[38] B. Yao, X. Yang, and S.-C. Zhu, "Introduction to a large-scale general purpose ground truth database: Methodology, annotation tool and benchmarks," in Intl. Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2007.

[39] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, “Efficientmarginal likelihood optimization in blind deconvolution,” inIEEE CVPR, 2011.

[40] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, “Image anddepth from a conventional camera with a coded aperture,” ACMTransactions on Graphics (TOG), vol. 26, no. 3, 2007.

[41] N. Joshi, C. L. Zitnick, R. Szeliski, and D. J. Kriegman, “Imagedeblurring and denoising using color priors,” in IEEE CVPR,2009.

[42] A. Levin and B. Nadler, “Natural image denoising: Optimalityand inherent bounds,” in IEEE CVPR, 2011.

[43] N. M. Law, C. D. Mackay, and J. E. Baldwin, “Lucky imaging:high angular resolution imaging in the visible from the ground,”Astronomy & Astrophysics, vol. 446, no. 2, pp. 739–745, 2006.

[44] Y. Matsushita, E. Ofek, W. Ge, X. Tang, and H.-Y. Shum, “Full-frame video stabilization with motion inpainting,” IEEE T-PAMI,vol. 28, no. 7, pp. 1150–1163, 2006.

[45] S. Cho and S. Lee, “Fast motion deblurring,” ACM Transactionson Graphics (TOG), vol. 28, no. 5, 2009.

[46] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image de-noising by sparse 3-D transform-domain collaborative filtering,”IEEE T-IP, vol. 16, no. 8, pp. 2080–2095, Aug 2007.

[47] A. Buades, B. Coll, and J. Morel, “Image denoising methods. anew nonlocal principle,” SIAM Rev., vol. 52, no. 1, pp. 113–147,2010.

[48] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz,“Adaptive deblocking filter,” IEEE TCSVT, vol. 13, no. 7, pp. 614–619, July 2003.

[49] H. C. Reeve and J. S. Lim, "Reduction of blocking effects in image coding," Optical Engineering, vol. 23, no. 1, 1984.

[50] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive DCT for high-quality denoising and deblocking of



grayscale and color images,” IEEE T-IP, vol. 16, no. 5, pp. 1395–1411, May 2007.

[51] C. Dong, Y. Deng, C. C. Loy, and X. Tang, “Compression artifactsreduction by a deep convolutional network,” in IEEE ICCV, 2015.

[52] S. Lin and H.-Y. Shum, “Separation of diffuse and specularreflection in color images,” in IEEE CVPR, 2001.

[53] Y. Shih, D. Krishnan, F. Durand, and W. T. Freeman, “Reflectionremoval using ghosting cues,” in IEEE CVPR, 2015.

[54] J. Yang, J. Wright, T. Huang, and Y. Ma, “Image super-resolutionas sparse representation of raw image patches,” in IEEE CVPR,2008.

[55] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Transactions on ImageProcessing, vol. 19, no. 11, pp. 2861–2873, Nov 2010.

[56] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-basedsuper-resolution,” IEEE CG&A, vol. 22, no. 2, pp. 56–65, 2002.

[57] N. Efrat, D. Glasner, A. Apartsin, B. Nadler, and A. Levin,“Accurate blur models vs. image priors in single image super-resolution,” in IEEE ICCV, 2013.

[58] G. Freedman and R. Fattal, “Image and video upscaling fromlocal self-examples,” ACM Transactions on Graphics (TOG), vol. 30,no. 2, 2011.

[59] R. Timofte, V. De Smet, and L. Van Gool, “Anchored neigh-borhood regression for fast example-based super-resolution,” inIEEE ICCV, 2013.

[60] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolutionusing very deep convolutional networks,” in IEEE CVPR, 2016.

[61] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep lapla-cian pyramid networks for fast and accurate superresolution,” inIEEE CVPR, 2017.

[62] R. Fattal, “Single image dehazing,” in ACM SIGGRAPH 2008Papers, ser. SIGGRAPH ’08. New York, NY, USA: ACM, 2008,pp. 72:1–72:9.

[63] Y. Y. Schechner, S. G. Narasimhan, and S. K. Nayar, “Instantdehazing of images using polarization,” in IEEE CVPR, 2001.

[64] K. He, J. Sun, and X. Tang, “Single image haze removal usingdark channel prior,” IEEE Transactions on Pattern Analysis andMachine Intelligence, vol. 33, no. 12, pp. 2341–2353, Dec 2011.

[65] R. T. Tan, “Visibility in bad weather from a single image,” in IEEECVPR, 2008.

[66] J. P. Tarel and N. Hautiere, “Fast visibility restoration from asingle color or gray level image,” in IEEE ICCV, 2009.

[67] X. Zhang, H. Li, Y. Qi, W. K. Leow, and T. K. Ng, “Rain removalin video by combining temporal and chromatic properties,” inIEEE International Conference on Multimedia and Expo, 2006.

[68] P. C. Barnum, S. Narasimhan, and T. Kanade, “Analysis of rainand snow in frequency space,” IJCV, vol. 86, no. 2, p. 256, Jan2009.

[69] J. Liu, W. Yang, S. Yang, and Z. Guo, "Erase or fill? Deep joint recurrent rain removal and reconstruction in videos," in IEEE CVPR, 2018.

[70] J. Chen, C. Tan, J. Hou, L. Chau, and H. Li, “Robust videocontent alignment and compensation for rain removal in a CNNframework,” CoRR, vol. abs/1803.10433, 2018.

[71] J. Guo and H. Chao, “One-to-many network for visually pleasingcompression artifacts reduction,” in IEEE CVPR, 2017.

[72] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyonda Gaussian denoiser: Residual learning of deep CNN for imagedenoising,” IEEE T-IP, vol. 26, no. 7, pp. 3142–3155, July 2017.

[73] Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memorynetwork for image restoration,” in IEEE ICCV, 2017.

[74] K. Yu, C. Dong, L. Lin, and C. C. Loy, “Crafting a toolchain forimage restoration by deep reinforcement learning,” CoRR, vol.abs/1804.03312, 2018.

[75] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Decon-volutional networks,” in IEEE CVPR, 2010.

[76] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolu-tional networks for mid and high level feature learning,” in IEEEICCV, 2011.

[77] M. Waleed Gondal, B. Schölkopf, and M. Hirsch, “The Unrea-sonable Effectiveness of Texture Transfer for Single Image Super-resolution,” ArXiv e-prints, Jul. 2018.

[78] K. Tahboub, D. Güera, A. R. Reibman, and E. J. Delp, "Quality-adaptive deep learning for pedestrian detection," in IEEE ICIP, 2017.

[79] M. Hradiš, J. Kotera, P. Zemcík, and F. Šroubek, “Convolutionalneural networks for direct text deblurring,” in BMVC, 2015.

[80] L. Xiao, J. Wang, W. Heidrich, and M. Hirsch, “Learning high-order filters for efficient blind deconvolution of document pho-tographs,” in ECCV, 2016.

[81] R. Zhang, P. Isola, and A. A. Efros, “Colorful imagecolorization,” CoRR, vol. abs/1603.08511, 2016. [Online].Available: http://arxiv.org/abs/1603.08511

[82] V. P. Namboodiri, V. D. Smet, and L. V. Gool, “Systematic evalu-ation of super-resolution using classification,” in VCIP, 2011.

[83] V. Sharma, A. Diba, D. Neven, M. S. Brown, L. V. Gool, andR. Stiefelhagen, “Classification driven dynamic image enhance-ment,” CoRR, vol. abs/1710.07558, 2017.

[84] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, “AOD-Net: All-in-onedehazing network,” in IEEE ICCV, 2017.

[85] Y. Yao, B. R. Abidi, N. D. Kalka, N. A. Schmid, and M. A. Abidi,“Improving long range and high magnification face recognition:Database acquisition, evaluation, and enhancement,” CVIU, vol.111, no. 2, pp. 111–125, 2008.

[86] M. Nishiyama, H. Takeshima, J. Shotton, T. Kozakaya, andO. Yamaguchi, “Facial deblur inference to improve recognitionof blurred faces,” in IEEE CVPR, 2009.

[87] H. Zhang, J. Yang, Y. Zhang, N. M. Nasrabadi, and T. S. Huang,“Close the loop: Joint blind image restoration and recognitionwith sparse representation prior,” in IEEE ICCV, 2011.

[88] C. Fookes, F. Lin, V. Chandran, and S. Sridharan, “Evaluation ofimage resolution and super-resolution on face recognition perfor-mance,” Journal of Visual Communication and Image Representation,vol. 23, no. 1, pp. 75 – 93, 2012.

[89] F. W. Wheeler, X. Liu, and P. H. Tu, “Multi-frame super-resolutionfor face recognition,” in IEEE BTAS, 2007.

[90] J. Wu, S. Ding, W. Xu, and H. Chao, “Deep joint face hallucinationand recognition,” CoRR, vol. abs/1611.08091, 2016.

[91] F. Lin, J. Cook, V. Chandran, and S. Sridharan, “Face recognitionfrom super-resolved images,” in Proceedings of the Eighth Interna-tional Symposium on Signal Processing and Its Applications, 2005.

[92] F. Lin, C. Fookes, V. Chandran, and S. Sridharan, “Super-resolvedfaces for improved face recognition from surveillance video,”Advances in Biometrics, 2007.

[93] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar, “Simulta-neous super-resolution and feature extraction for recognition oflow-resolution faces,” in IEEE CVPR, 2008.

[94] J. Yu, B. Bhanu, and N. Thakoor, “Face recognition in video withclosed-loop super-resolution,” in IEEE CVPRW, 2011.

[95] H. Huang and H. He, “Super-resolution method for face recogni-tion using nonlinear mappings on coherent features,” IEEE T-NN,vol. 22, no. 1, pp. 121–130, 2011.

[96] T. Uiboupin, P. Rasti, G. Anbarjafari, and H. Demirel, “Facial im-age super resolution using sparse representation for improvingface recognition in surveillance monitoring,” in SIU, 2016.

[97] P. Rasti, T. Uiboupin, S. Escalera, and G. Anbarjafari, “Convo-lutional neural network super resolution for face recognition insurveillance monitoring,” in International Conference on ArticulatedMotion and Deformable Objects, 2016.

[98] X.-Y. Jing, X. Zhu, F. Wu, X. You, Q. Liu, D. Yue, R. Hu,and B. Xu, “Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning,” in IEEECVPR, 2015.

[99] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang,“The unreasonable effectiveness of deep features as a perceptualmetric,” arXiv preprint, 2018.

[100] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” inIEEE CVPR, 2018.

[101] F. Kingdom and N. Prins, Psychophysics: a Practical Introduction.Academic Press, 2016.

[102] M. J. C. Crump, J. V. McDonnell, and T. M. Gureckis, “EvaluatingAmazon’s Mechanical Turk as a tool for experimental behavioralresearch,” in PloS one, 2013.

[103] F. Chollet et al., “Keras,” https://github.com/fchollet/keras,2015.

[104] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge,MA: MIT Press, 1998.

[105] C. Vondrick, D. Patterson, and D. Ramanan, “Efficiently scalingup crowdsourced video annotation,” IJCV, vol. 101, no. 1, pp.184–204, Jan. 2013.

[106] F. Heide, M. Steinberger, Y.-T. Tsai, M. Rouf, D. Pajak, D. Reddy,O. Gallo, J. Liu, W. Heidrich, K. Egiazarian, J. Kautz, and K. Pulli,“Flexisp: A flexible camera image processing framework,” ACM



Trans. Graph., vol. 33, no. 6, pp. 231:1–231:13, Nov. 2014. [Online].Available: http://doi.acm.org/10.1145/2661229.2661260

[107] C. J. Pellizzari, R. Trahan, H. Zhou, S. Williams, S. E. Williams,B. Nemati, M. Shao, and C. A. Bouman, “Optically coherentimage formation and denoising using a plug and play inversionframework,” Applied optics, vol. 56, no. 16, pp. 4735–4744, 2017.

[108] J. R. Chang, C.-L. Li, B. Poczos, and B. V. Kumar, "One network to solve them all: Solving linear inverse problems using deep projection models," in IEEE ICCV, 2017, pp. 5889–5898.

[109] K. Zuiderveld, “Contrast limited adaptive histogram equaliza-tion,” Graphics Gems, pp. 474–485, 1994.

[110] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas,“Deblurgan: Blind motion deblurring using conditional adver-sarial networks,” CoRR, vol. abs/1711.07064, 2017.

[111] W. Yang, J. Feng, J. Yang, F. Zhao, J. Liu, Z. Guo, and S. Yan,“Deep edge guided recurrent residual learning for image super-resolution,” IEEE T-IP, vol. 26, no. 12, pp. 5895–5907, Dec 2017.

[112] X. Mao, C. Shen, and Y. Yang, “Image restoration usingconvolutional auto-encoders with symmetric skip connections,”CoRR, vol. abs/1606.08921, 2016. [Online]. Available: http://arxiv.org/abs/1606.08921

[113] X.-J. Mao, C. Shen, and Y.-B. Yang, “Image restoration using verydeep convolutional encoder-decoder networks with symmetricskip connections,” in NIPS, 2016.

[114] M. Gharbi, J. Chen, J. T. Barron, S. W. Hasinoff, and F. Durand,“Deep bilateral learning for real-time image enhancement,” ACMTransactions on Graphics (TOG), vol. 36, no. 4, p. 118, 2017.

[115] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, “Deep image prior,”CoRR, vol. abs/1711.10925, 2017.

[116] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutionalnetworks for biomedical image segmentation,” in MICCAI, 2015.

[117] H. Chen, T. Liu, and T. Chang, “Tone reproduction: A perspectivefrom luminance-driven perceptual grouping,” in IEEE CVPR,2005.

[118] “Spacenet,” Apr 2018. [Online]. Available: https://spacenetchallenge.github.io/

[119] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,”in IEEE ICCV, 2017.

[120] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, “Leastsquares generative adversarial networks,” in IEEE ICCV, 2017.

[121] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deepnetwork training by reducing internal covariate shift,” in ICML,2015.

[122] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, “Instance normal-ization: The missing ingredient for fast stylization,” CoRR, vol.abs/1607.08022, 2016.

[123] R. Keys, “Cubic convolution interpolation for digital imageprocessing,” IEEE Transactions on Acoustics, Speech, and SignalProcessing, vol. 29, no. 6, pp. 1153–1160, Dec 1981.

[124] J. Pan, D. Sun, H.-P. Pfister, and M.-H. Yang, “Blind imagedeblurring using dark channel prior,” in IEEE CVPR, 2016.

[125] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolu-tional neural network for dynamic scene deblurring,” CoRR, vol.abs/1612.02177, 2016.

[126] D. Kundur and D. Hatzinakos, “Blind image deconvolution,”IEEE Signal Processing Magazine, vol. 13, no. 3, pp. 43–64, 1996.

[127] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang,and E. Shechtman, “Toward multimodal image-to-image transla-tion,” in NIPS, 2017.

Rosaura G. VidalMata is a Ph.D. student in the Computer Science and Engineering Department at the University of Notre Dame. She received the B.S. degree in Computer Science at the Tecnologico de Monterrey (ITESM) in 2015, where she graduated with an Honorable Mention for Excellence. Her research interests include computer vision, machine learning, and biometrics.

Sreya Banerjee is a Ph.D. student in the Com-puter Science and Engineering department atthe University of Notre Dame. She receivedthe B.S. degree in Information Technology fromWest Bengal University of Technology, Kolkata,India in 2010. Her research interests includecomputer vision, machine learning, computa-tional neuroscience.

Brandon RichardWebster is a National Sci-ence Foundation (NSF) Graduate Research Fel-low and Ph.D. student in the Department ofComputer Science and Engineering at the Uni-versity of Notre Dame. He received the B.S. de-gree in computer science from Bethel University,Minnesota, USA, in 2015. His research interestsinclude computer vision, machine learning, andvisual psychophysics.

Michael Albright is a Data Scientist at Hon-eywell ACST, which he joined in 2015. He re-ceived his B.S. degree in Physics from St. CloudState University in 2009 and his Ph.D. in Theo-retical Physics from the University of Minnesotain 2015. His current research interests includecomputer vision, deep learning, and data sci-ence applied to the internet of things.

Pedro Davalos is a Principal Data Scientist at Honeywell ACST, which he joined in 2010. He obtained the B.S. degree in Computer Engineering and the M.S. degree in Computer Science, both from Texas A&M University. His research interests include machine learning and computer vision.

Scott McCloskey is a Technical Fellow at Honeywell ACST, which he joined in 2007. He obtained his Ph.D. from McGill University, an MS from the Rochester Institute of Technology, and a BS from the University of Wisconsin-Madison. His research interests include computer vision, computational photography, and industrial applications of imaging and recognition.

Ben Miller is a Software Engineer at Honeywell ACST, which he joined in 2005. He received his MS in Computer Science from the University of Minnesota in 2006. His research interests include cybersecurity, distributed computing, machine learning frameworks, and applications in computer vision.

Asongu Tambo is a Research Scientist at Honeywell ACST, which he joined in 2017. He obtained his Ph.D. in Electrical Engineering from the University of California at Riverside in 2016. Dr. Tambo's research interests include image processing, tracking, learning, and the application of computer vision techniques to new frontiers such as biology.

Sushobhan Ghosh is a Ph.D. student in theComputer Science Department working in theComputational Photography Lab at Northwest-ern University, Evanston. He completed his un-dergrad at Indian Institute of Technology Delhi(IIT Delhi), India in 2016. Sushobhan is inter-ested in deep learning/machine learning appli-cations in computer vision and computationalphotography problems.



Sudarshan Nagesh is a Research Engineer working at Zendar Co., Berkeley. He completed his undergraduate degree in electrical engineering at the National Institute of Technology Karnataka (NITK Surathkal), India in 2012. He received an MSc. Engg degree from the electrical engineering department at the Indian Institute of Science, Bangalore in 2015. In December 2017 he completed his MS in computational imaging from Rice University.

Ye Yuan is a Ph.D. student in Computer Scienceat the Texas A&M University. She obtained theB.S. degree at the University of Science andTechnology of China (USTC) in 2015. During2014-2015, she was a visiting student internand completed her undergraduate thesis at RiceUniversity. Ye Yuan focuses her research on ma-chine learning, deep learning, computer visiontechniques and applications especially in personre-identification and healthcare.

Yueyu Hu received the B.S. degree in computerscience from Peking University, Beijing, China,in 2018, where he is currently pursuing the mas-ter’s degree with the Institute of Computer Sci-ence and Technology. His current research in-terests include video and image compression,understanding and machine intelligence.

Junru Wu received the BE degree in Electrical Engineering from Tongji University in 2016. From 2016 to 2017, he was a research assistant at ShanghaiTech University. He is currently a Ph.D. student in the Visual Informatics Group at Texas A&M University. His research interests lie in the broad areas of deep learning and computer vision, specifically including image restoration, saliency detection, and deep network compression.

Wenhan Yang is a postdoctoral researcher atLearning and Vision Lab, Department of Electri-cal and Computer Engineering, National Univer-sity of Singapore. He received the Ph.D. degree(Hons.) in computer science from Peking Univer-sity, Beijing, China 2018. His current researchinterests include image/video processing, com-pression, and computer vision.

Xiaoshuai Zhang is an undergraduate studentat Peking University, where he is majoring inMachine Intelligence and Computer Science. Heis currently working as research intern at theInstitute of Computer Science and Technology,Peking University. His current research interestsinclude image processing, computer vision, ma-chine learning and security.

Jiaying Liu is currently an Associate Profes-sor with the Institute of Computer Science andTechnology, Peking University. She received thePh.D. degree (Hons.) in computer science fromPeking University, Beijing, China 2010. She hasauthored over 100 technical articles in refereedjournals and proceedings and holds 28 patents.Her research interests include image/video pro-cessing, compression, and computer vision.

Zhangyang Wang is an Assistant Professor of Computer Science and Engineering at Texas A&M University. During 2012-2016, he was a Ph.D. student in the Electrical and Computer Engineering Department at the University of Illinois at Urbana-Champaign. Prior to that, he obtained the B.E. degree at the University of Science and Technology of China (USTC) in 2012.

Hwann-Tzong Chen joined the Department of Computer Science, National Tsing Hua University, Taiwan, as a faculty member in 2006. He is currently an associate professor at the department. He received his PhD degree in computer science from National Taiwan University. Dr. Chen's research interests include computer vision, image processing, and pattern recognition.

Tzu-Wei Huang is a PhD student and an ad-junct lecturer in National Tsing-Hua University.He loves open source projects and contributescode to the community. His research interestsinclude computer vision and deep learning.

Wen-Chi Chin is a graduate student at National Tsing Hua University in Taiwan, where she majored in Computer Science. Her focus is on the field of computer vision, including video stabilization, SLAM and vehicle localization, and unsupervised depth and ego-motion estimation. She also participated in a project of the Industrial Technology Research Institute, where the team developed a Scrabble-Playing Robot with ResNet-34 and won a CES 2018 Innovation Award.

Yi-Chun Li is a MS student in National TsingHua University. Her research interests includecomputer vision, deep learning and deep rein-forcement learning. Her focus is on topics suchas video stabilization, object detection, and plan-ning network with deep reinforcement learningfor PCB auto-routing.

Mahmoud Lababidi (PhD Physics George Ma-son University, BS Computer Engineering Uni-versity of Florida) has studied Topological In-sulators (such as Graphene) along with theirinterplay with superconductors during his PhD.He navigated to Machine Learning by way of theRemote Sensing world by applying and modify-ing neural network based computer vision tech-niques on satellite based images.

Charles Otto received the B.S. degree in Com-puter Science at Michigan State University in2008, and the Ph.D. degree in Computer Sci-ence at Michigan State University in 2016. Hecurrently works at Noblis, in Reston Virginia. Hisresearch interests include face recognition, com-puter vision, and pattern recognition.

Walter J. Scheirer received the M.S. degreein computer science from Lehigh University, in2006, and the Ph.D. degree in engineering fromthe University of Colorado, Colorado Springs,CO, USA, in 2009. He is an Assistant Professorwith the Department of Computer Science andEngineering, University of Notre Dame. Prior tojoining the University of Notre Dame, he was aPostdoctoral Fellow with Harvard University from2012 to 2015, and the Director of Research andDevelopment with Securics, Inc., from 2007 to

2012. His research interests include computer vision, machine learning, biometrics, and digital humanities.