
A Fast, Scalable, and Reliable Deghosting Method for Extreme Exposure Fusion

K. Ram Prabhakar*, Rajat Arora*, Adhitya Swaminathan, Kunal Pratap Singh, and R. Venkatesh Babu
Video Analytics Lab, Department of Computational and Data Sciences,

Indian Institute of Science, Bengaluru, Karnataka-560012, INDIA

HDR fusion of extreme exposure images with complex camera and object motion is a challenging task. Existing patch-based optimization techniques generate noisy and/or blurry results with undesirable artifacts for difficult scenarios. Additionally, they are computationally intensive and have high execution times. Recently proposed CNN-based methods offer fast alternatives, but still fail to generate artifact-free results for extreme exposure images. Furthermore, they do not scale to an arbitrary number of input images. To address these issues, we propose a simple, yet effective CNN-based multi-exposure image fusion method that produces artifact-free HDR images. Our method is fast, and scales to an arbitrary number of input images. Additionally, we prepare a large dataset of 582 varying exposure images with corresponding deghosted HDR images to train our model. We test the efficacy of our algorithm on publicly available datasets, and achieve significant improvements over existing state-of-the-art methods. Through experimental results, we demonstrate that our method produces artifact-free results, and offers a speed-up of around 54× over existing state-of-the-art HDR fusion methods.

Index Terms—Exposure Fusion, High Dynamic Range Imaging, Deghosting.

I. INTRODUCTION

ADEQUATE lighting is important for commercial cameras to capture visually appealing photographs. However, it is unlikely that such a condition is encountered in common scenes. Often, the dynamic range of the scene far exceeds the hardware limit of standard digital camera sensors. In such scenarios, the resulting photos will contain saturated regions, which are either too dark or too bright to visually comprehend. Keeping in mind the constraints posed by camera hardware, an algorithmic solution is to merge multiple Low Dynamic Range (LDR) images with varying exposure, also known as an exposure stack, into a single High Dynamic Range (HDR) image. This helps bring out details in both dark and bright saturated regions.

Multi-Exposure Fusion (MEF) methods merge well-exposed regions from numerous LDR input images to produce a single visually appealing LDR result that appears to possess higher dynamic range. This is a simple problem for perfectly static sequences ([4]–[6]). However, in most practical situations, a certain amount of camera and object motion is inevitable, leading to ghost-like artifacts in the final fused result.

Various techniques have been proposed in the literature to address the ghosting problem in complex HDR fusion¹. While camera motion can be addressed using global alignment methods ([9], [10]), deghosting of moving objects is much harder to tackle. Rejection-based deghosting methods ([11]–[13]) fail to reconstruct HDR image content for moving objects with complex motion. Non-rigid registration techniques [14]–[16] introduce visible artifacts in the case of complex deformable motion. A paradigm shift in deghosting was offered by patch-based synthesis methods such as Hu et al. [17] and Sen et al. [18]. They offer impressive, high-quality results even for moving and occluded pixels. However, these methods

* Equal contribution.
¹ For a more detailed review of the literature, please refer to [7], [8].

are computationally heavy and still produce noisy or blurry results with artifacts for scenes with very complex and extreme dynamic range. Gallo et al. [19] offer a locally non-rigid registration method that strikes a balance between execution time and accuracy.

Learning-based methods: Kalantari and Ramamoorthi [1] proposed the first deep learning based HDR deghosting algorithm. In their method, the input images are aligned using a traditional optical flow technique. The distortions introduced due to optical flow warping are corrected using a simple CNN. However, their method still generates artifacts for extreme dynamic range scenes (see Fig. 1). Recently, Wu et al. [2] proposed treating HDR fusion as an image translation problem, and solving it using a CNN-based architecture. While they do not explicitly perform foreground alignment (like [1], [17], [18]), they showed that their network generates accurate HDR images irrespective of large foreground motion. However, this method is highly dependent on the structure of the reference image. Thus, if certain regions are saturated in the reference image, it fails to accurately reconstruct them in the final result (see Fig. 5). A major bottleneck with learning-based approaches is scalability to an arbitrary number of images. Extending both methods to an arbitrary number of images requires training a new model, which can be computationally expensive.

Limitations: The deghosting algorithms mentioned above face a number of major challenges: (1) handling extreme exposure images, (2) complex scene motion, (3) scalability to an arbitrary number of images, and (4) speed. In order to address these challenges, we propose a simple, yet effective CNN-based MEF algorithm. Our proposed network, diagrammatically represented in Fig. 2, consists of multiple components. We start by registering the input images using a powerful optical flow network and a refinement network (Section II-B). The registered images are then passed to a feature encoder module to extract exposure-invariant features.


(a) Input (b) Kalantari and Ramamoorthi [1] (c) Wu et al. [2] (d) Proposed

Fig. 1: (b) The result of Kalantari and Ramamoorthi [1] shows structural artifacts in the region outside the window (shown in the zoomed blue inset box). (c) Wu et al. [2] shows hallucinated regions below the table (shown in the zoomed green inset box). (d) In contrast, our method produces artifact-free, pleasing results without any hallucinations. Input images are taken from the Cambridge dataset [3]. Best viewed zoomed in on an electronic monitor.

We achieve scalability by training a single fully convolutional CNN feature encoder to extract features, irrespective of exposure bias. The common feature extractor ensures that our model remains independent of the number of input images. Finally, the extracted features are merged using a decoder CNN to generate artifact-free fused images (Section II-C).

In summary, our contributions are as follows:
• We present a simple, fast and scalable CNN model for MEF deghosting that can handle even extreme exposure differences in the input exposure stack.
• We provide a large varying exposure image dataset of 582 sequences with moving objects and corresponding ground-truth fused HDR images.
• We conduct extensive quantitative and qualitative evaluation of our model against state-of-the-art deghosting algorithms with publicly available datasets.

II. PROPOSED METHOD

A. Overall approach

In this section, we describe our CNN-based MEF deghosting method. The block diagram of our proposed approach is shown in Fig. 2. Our model consists of three major modules: an image alignment module, a feature extraction module, and an image reconstruction module. Given an exposure stack with N varying exposure images, one of them is selected as the reference image. In our implementation, we choose the middle image in the stack as the reference image during training (the under-exposed image, in the case of image pairs). During testing, however, any image in the stack can be used as the reference image; we select the image with the least amount of saturated pixels as the reference. The remaining N−1 images in the stack are aligned to the reference image using a CNN-based optical flow network. We use PWC-Net [20], a lightweight pyramidal optical flow estimation network, for this purpose. The aligned images may contain artifacts in regions with non-rigid motion and occluded pixels. To correct this, we refine the warped output using our proposed refinement network (see Fig. 3). From the N aligned and refined images, we extract N convolutional feature maps using a feature encoder. These features are merged using a feature aggregation operation to generate the features corresponding

to the fused result. Finally, the fused image is reconstructed from the fused feature map using a decoder.

Since all the stages in our model are independent of the number of input images, we explain the specifics of our method in further sections using two images: the reference image (I_1) and the source image (I_2), with the aim of aligning the source image to the reference image and fusing them. Extending our model to an arbitrary number of images N requires N−1 optical flow estimation operations, N feature extraction operations, and one image reconstruction operation. Unlike [1] and [2], our method does not require retraining for a different number of inputs. We now examine each of our modules in detail.
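To make the data flow concrete, the following is a minimal sketch of the pipeline under the simplifications above. The module names (flow_net, refine_net, encoder, decoder) are placeholders for N1, N2 and the fusion encoder/decoder, not the authors' code; exposure matching and the decoder skip connections are omitted for brevity.

```python
import torch

def fuse_stack(images, ref_idx, flow_net, refine_net, encoder, decoder):
    """Sketch only: N-1 alignment passes, N shared-encoder passes,
    one aggregation, and one decoding pass, for any stack size N."""
    ref = images[ref_idx]
    aligned = []
    for i, img in enumerate(images):
        if i == ref_idx:
            aligned.append(img)                            # reference needs no alignment
        else:
            flow, warped = flow_net(ref, img)              # alignment (N1); exposure matching omitted
            aligned.append(refine_net(warped, ref, flow))  # artifact correction (N2)
    feats = torch.stack([encoder(x) for x in aligned])     # same encoder weights for every image
    fused = torch.cat([feats.mean(dim=0),
                       feats.max(dim=0).values], dim=0)    # Mean + Max aggregation
    return decoder(fused)                                  # single reconstruction pass
```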

B. Image alignment

Optical flow network N_1: The objective of the alignment phase is to warp the source image I_2 to the reference image I_1. The input images I_1 and I_2 are assumed to be in linear RAW format. If not, we linearize them with the help of the Camera Response Function [21]. We use PWC-Net [20] for computing optical flow because of its high accuracy and low execution time. However, any optical flow estimation method can be utilized in this step. Since optical flow computation requires brightness constancy for better performance, we raise the exposure of the darker image to match that of the brighter image using the formula:

$$I'_1 = \left(I_1^{\gamma} \times \frac{t_2}{t_1}\right)^{\frac{1}{\gamma}} \qquad (1)$$

where t_1 and t_2 are the exposure times of I_1 and I_2, and γ = 2.2. The flow F obtained from PWC-Net (N_1) is used to warp I_2 and generate I_2^w:

$$[F,\, I_2^w] = N_1(I'_1, I_2) \qquad (2)$$
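As a minimal illustration of Eqn. (1), the sketch below adjusts the darker image's exposure before flow estimation; the function name and the clipping to [0, 1] are our assumptions, and the image is taken to be a float array in [0, 1].

```python
import numpy as np

def match_exposure(i1, t1, t2, gamma=2.2):
    """Eqn. (1): boost the darker image I1 (exposure time t1) to the exposure
    of the brighter image (exposure time t2) via the gamma domain."""
    boosted = (i1 ** gamma) * (t2 / t1)                 # scale the linearized intensities
    return np.clip(boosted, 0.0, 1.0) ** (1.0 / gamma)  # back to display domain, clipped
```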

Refinement network N_2: I_2^w is prone to artifacts due to errors in optical flow estimation and occlusion. To correct these, we utilize a U-Net [22] style network to refine I_2^w and estimate Ĩ_2, an approximation of I_1 at the exposure value of I_2. N_2 takes three inputs: I_2^w, I'_1 (obtained from Eqn. 1), and the flow F. We stack the inputs in


Fig. 2: Overview of our approach. N varying exposure input images are registered to the reference image (shown in the middle) using the optical flow and refinement networks. The registered images are passed to a feature extraction network. We aggregate the N features to obtain the fused image feature. Finally, a feature decoder is used to reconstruct the final pseudo-HDR image.

Fig. 3: We illustrate the working of our method with a varying exposure image pair (I_1, I_2) as input. First, we compute the flow between the exposure-modified reference image I'_1 and I_2. The warped image I_2^w may contain artifacts that are corrected by the refinement network (N_2) with the help of the optical flow (F) and I'_1. Finally, the artifact-free warped image Ĩ_2 is fused with I_1 by the fusion net (N_3) to generate the pseudo-HDR result. The overall approach consists of two stages of training. In the first stage, the fusion network (N_3) is trained separately with static input images. In the second stage, N_3 is frozen and the loss between the predicted fused image and the ground truth image (Loss_2) is used to train the refinement network along with Loss_1. Note that the weights of the optical flow network (N_1) are frozen throughout the training process.

the depth dimension to obtain an eight-channel input. N_2 is then trained to predict a single-channel weight map:

$$W = N_2(I_2^w, I'_1, F)$$

Finally, Ĩ_2 is obtained using the formula:

$$\tilde{I}_2 = (1 - W) \times I_2^w + W \times I'_1 \qquad (3)$$

N_2 is then trained with an L2 loss computed between the predicted image Ĩ_2 and the ground truth I_2^{gt}:

$$Loss_1 = \|\tilde{I}_2 - I_2^{gt}\|_2 \qquad (4)$$

The weight map primarily needs to emphasize details from I_2 that may be lost during the estimation of I'_1. Training the refinement network with just the above-mentioned L2 loss causes the network to get stuck in a local minimum, outputting uniform values of one in W and simply returning I'_1. This occurs because, for most pixels, I'_1 is very close to the ground truth. To counter this, we regularize the loss function using the L2 norm of the weight map:

$$Loss_1 = \|\tilde{I}_2 - I_2^{gt}\|_2 + \alpha_1 \times \|W\|_2 \qquad (5)$$

The use of such a regularizer can give rise to ghosting artifacts. To minimize this, the weight of the regularizer is decreased after 15 epochs. A sample result from the refinement stage can be seen in Fig. 4. Note that Loss_1 is used to train only N_2, while the weights of N_1 are frozen.
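A minimal sketch of Eqns. (3) and (5) is given below, assuming PyTorch tensors of shape (B, C, H, W) for the images and (B, 1, H, W) for the weight map; how the L2 norms are normalized (per pixel or per image) is not stated in the text, so the plain 2-norm used here is an assumption.

```python
import torch

def refine(weight_map, i2_warped, i1_exp):
    """Eqn. (3): blend the warped source I2^w with the exposure-adjusted
    reference I1' using the single-channel weight map W predicted by N2."""
    return (1.0 - weight_map) * i2_warped + weight_map * i1_exp

def refinement_loss(i2_refined, i2_gt, weight_map, alpha1=1e-3):
    """Eqn. (5): L2 reconstruction term plus an L2 penalty on W that
    discourages the trivial W = 1 solution."""
    return torch.norm(i2_refined - i2_gt, p=2) + alpha1 * torch.norm(weight_map, p=2)
```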

The refinement network is trained with the Adam optimizer for 45 epochs, starting with a learning rate of 1e-4 and α_1 of 1e-3 for the first 15 epochs. After that, α_1 is set to 2.5e-5, and is decreased by 0.25e-5 every 5 epochs till the 35th epoch, after which these parameters are kept unchanged for the final 10 epochs. The learning rate is decreased by 0.25e-4 every 5 epochs till the 25th epoch, and is then kept unchanged.


(a) EV−2 image (b) EV0 image (c) Estimated weight map W

Fig. 4: (c) shows the weight map output by the refinement network for a sample input image set from the Kalantari and Ramamoorthi [1] dataset. Notice that in the saturated regions of the EV0 image, the network assigns very little weight to the EV0 image. At the same time, in the occluded regions, it assigns a high weight to the reference EV0 image. For the remaining regions, it assigns values between 0 and 1 based on structural consistency and saturation levels.

C. Image fusion

Since both the optical flow network (N_1) and the refinement network (N_2) work on image pairs, they can be extended to align any number of input images. The primary challenge is to extend the fusion network to operate on an arbitrary number of images. Kalantari and Ramamoorthi [1] and Wu et al. [2] stack input images in the depth dimension. This means that their models have to be re-trained in order to fuse a different number of input images. In contrast, we follow the traditional image fusion strategy: extract individual image features, merge these features based on certain heuristics, and finally reconstruct the fused image from the merged features. To our knowledge, our method is the first CNN-based fusion approach that can fuse an arbitrary number of images.

The input to our fusion module is the set of N aligned images. Within the fusion module (Fig. 2), we utilize an encoder E to obtain exposure-invariant features from all N images, merge these N features using a feature aggregation operation, and finally use a decoder D to reconstruct the final fused image. Since the feature extractor E is unchanged for all input images, our method can be used to merge any number of images.

We follow a standard encoder-decoder architecture for image fusion, with the exception that the encoder features from all the images are aggregated before being passed to the decoder. The encoder consists of 4 convolutional layers with 8, 16, 32 and 64 filters respectively, followed by a strided convolution that downsamples the feature maps by a factor of 2. The decoder consists of four transpose-convolutional layers that increase the spatial resolution of the feature maps. We use the LeakyReLU activation function everywhere with α = 0.2. We investigate three methods of feature aggregation: 1) Mean: the mean of the N feature maps, 2) Max: the max of the N feature maps, and 3) Mean + Max: both the mean and the max of the N feature maps are concatenated, as sketched below. From the results reported in Table II, we observe that the Mean + Max fusion strategy scales well for multiple images. Thus, we choose Mean + Max as the preferred feature aggregation operator. The aggregated features from the corresponding encoding block are also fed to the decoder through skip connections.
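The following is a minimal sketch of the Mean + Max aggregation operator, assuming the shared encoder's outputs for the N aligned images are stacked into a single tensor; the function name is ours.

```python
import torch

def aggregate_mean_max(feats):
    """feats: (N, C, H, W) encoder features for N images. The reduction runs
    over the image axis, so the output is (2C, H, W) regardless of N."""
    mean_feat = feats.mean(dim=0)
    max_feat = feats.max(dim=0).values
    return torch.cat([mean_feat, max_feat], dim=0)
```

Because the aggregated map always has 2C channels, the decoder never needs to know how many images were fused, which is what allows a model trained on three images to be reused for other stack sizes.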

The fusion network is trained using an L2 loss, an MS-SSIM loss [23], and a gradient-magnitude loss between the ground truth H^{gt} and the predicted pseudo-HDR image H:

$$Loss_2 = \beta_1 \|H - H^{gt}\|_2 + \beta_2 \left[1 - \mathrm{MSSSIM}(H, H^{gt})\right] + \beta_3 \|\nabla_m H - \nabla_m H^{gt}\|_2 \qquad (6)$$

where ∇_m H denotes the gradient magnitude of image H. Empirically, we found that setting [β_1, β_2, β_3] to [0.3, 1, 6] gave better results. We train our fusion network with 3 images of varying exposure, but it is extensible to an arbitrary number of images. The fusion network is trained with the Adam optimizer over static images for 45 epochs. The learning rate is set to 1e-4 for the first 15 epochs, after which it is halved every 10 epochs.
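The following is a minimal sketch of Eqn. (6); the finite-difference gradient magnitude and the use of the third-party pytorch_msssim package for MS-SSIM are our assumptions, not details given in the paper.

```python
import torch
from pytorch_msssim import ms_ssim  # assumed third-party MS-SSIM implementation

def gradient_magnitude(x):
    """One way to realize the gradient-magnitude operator via finite differences."""
    dx = x[..., :, 1:] - x[..., :, :-1]
    dy = x[..., 1:, :] - x[..., :-1, :]
    return torch.sqrt(dx[..., 1:, :] ** 2 + dy[..., :, 1:] ** 2 + 1e-8)

def fusion_loss(h, h_gt, betas=(0.3, 1.0, 6.0)):
    """Eqn. (6): weighted sum of L2, (1 - MS-SSIM), and gradient-magnitude L2 terms."""
    b1, b2, b3 = betas
    return (b1 * torch.norm(h - h_gt, p=2)
            + b2 * (1.0 - ms_ssim(h, h_gt, data_range=1.0))
            + b3 * torch.norm(gradient_magnitude(h) - gradient_magnitude(h_gt), p=2))
```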

III. EVALUATION AND RESULTS

A. Dataset

Considering the scarcity of large datasets for MEF in the literature, we create a new dataset that could prove to be beneficial for the research community. We follow the capture process used by Kalantari and Ramamoorthi [1] to capture two sets of images for the same scene: a reference set and a ghosted set. In the reference set, the subject was asked to remain still during the capture process. In the ghosted set, the subject was asked to perform some action. Each set consists of three to seven images with exposure values between −3EV and +3EV (inclusive). All the images were captured in the camera's native RAW format, and later converted to TIFF using the dcraw software. The captured images have different resolutions, ranging from 1 to 4 Megapixels. We captured around 700 sequences with real-life object motion in various scenarios. We used three different capture devices to collect these sequences: a Canon EOS 600D, a Canon EOS 500D, and an LG V30 smartphone. All the images were captured by mounting the device on a static tripod. In order to avoid camera motion due to manual shutter release, we used a wireless remote trigger. Out of the 700 captured sequences, we removed 118 sequences with background motion or object motion. Finally, we split the 582 sequences into random, distinct sets:


TABLE I: Quantitative comparison with state-of-the-art methods. Here, we compare the proposed method with Hu et al. [17] and Ma et al. [5] using the full-reference IQA metrics PSNR and SSIM. We report results for two- and three-image fusion.

                             2 images            3 images
Method                      PSNR    SSIM        PSNR    SSIM
Hu et al. [17]              27.04   0.925       31.31   0.9417
Ma et al. [5]               25.67   0.913       26.91   0.9276
Proposed (with flow)        29.61   0.952       30.52   0.9496
Proposed (without flow)     28.12   0.938       28.47   0.9465

466 for training and 116 for testing. The ground truth pseudo-HDR images were generated from the static reference set using the algorithm of Mertens et al. [4]. Additionally, we also test our model on the Sen [18], Karaduzovic [3] and Tursun [8] datasets.

B. Quantitative evaluation

1) Comparison with SoTA methods: We perform quantitative comparisons of our method with two other well-known state-of-the-art methods: 1) Hu et al. [17] and 2) Ma et al. [5]. We compare against these methods using two full-reference image quality assessment metrics, PSNR and SSIM [23], in Table I. The results for these methods are generated with the default parameters specified by the authors. We compare the results in two categories: image-pair fusion and triple-image fusion. From Table I, we observe that Hu et al. [17] performs better in terms of PSNR than our proposed method for triple-image fusion. However, our method achieves a higher SSIM score in the same category. For extreme image-pair fusion, our method outperforms both Hu et al. [17] and Ma et al. [5] in terms of PSNR and SSIM. We did not compare with Sen et al. [18], Kalantari and Ramamoorthi [1], and Wu et al. [2], as they generate HDR images directly. Since the ground-truth fusion strategy for HDR imaging differs from that of MEF (triangular weighting for HDR and Mertens et al. [4] for MEF), it might be inaccurate to compare MEF and HDR methods quantitatively.

In Table III, we benchmark our method for HDR image deghosting against Sen et al. [18], Kalantari and Ramamoorthi [1], and Wu et al. [2] on the test set of the UCSD dataset. The PSNR score for our model is slightly lower as a consequence of performing mean and max aggregation. Thus, to demonstrate the effectiveness of our refinement net (N_2), we modified our fusion net (N_3) to reflect the same setting as [2]: the weights of the fusion model (N_3) are not shared among input images. We train this model for a fixed number of inputs (three images in the UCSD dataset) by using a separate encoder for each input and concatenating the feature maps instead of aggregating them with max/mean, and obtain better PSNR values than [2] with a much shallower network. This modified model is referred to as untied weights in Table III, while the original model is referred to as tied weights.

TABLE II: Fusion scalability analysis: we report the no-reference IQA metric MEF-SSIM for three fusion strategies, trained to fuse three input images only and tested to fuse an arbitrary number of input images without retraining. See Section IV-2 for more details.

MEF-SSIM               2        3        5        7
Mean                 0.9768   0.9809   0.9704   0.9554
Max                  0.9787   0.9810   0.9732   0.9572
Mean + Max           0.9774   0.9823   0.9757   0.9644
Mertens et al. [4]   0.9826   0.9844   0.9787   0.9701

TABLE III: Quantitative comparison with state-of-the-art HDR deghosting algorithms on the UCSD dataset.

Method                               PSNR (dB)   HDR-VDP-2
Hu et al. [17]                         35.61       60.12
Sen et al. [18]                        40.43       62.11
Kalantari and Ramamoorthi [1]          42.70       66.64
Wu et al. [2]                          41.65       67.96
Proposed method (tied weights)         40.47       66.80
Proposed method (untied weights)       41.80       66.52

C. Qualitative evaluation

We perform qualitative comparisons with Kalantari and Ramamoorthi [1] and Wu et al. [2] in Fig. 1 and Fig. 5. As seen in the highlighted area, the result generated by Kalantari and Ramamoorthi [1] still has artifacts due to optical flow misalignment. Additionally, their result tends to contain small speckles of artifacts with erroneous colors, as shown in Fig. 5. One reason for such artifacts could be that Kalantari and Ramamoorthi [1] use a single CNN to perform both deghosting and exposure fusion. This may cause the network to confuse moving objects with highly saturated regions. In contrast, we perform both operations separately with different CNNs. Thus, our network is able to distinguish between moving objects and saturated pixels. In Fig. 1 and 5, we can observe ghosting artifacts in the results generated by Wu et al. [2] due to object motion.

In Fig. 6, we compare our results with Hu et al. [17], Sen et al. [18] and Ma et al. [5] for extreme image pair fusion with images from Tursun's dataset [8]. The results are generated with the under-exposed image as the reference, and are displayed after tonemapping using Adobe Photoshop. It can be observed that Hu et al. [17], Sen et al. [18] and Ma et al. [5] hallucinate details in regions of heavy saturation. The artifacts are generated due to incorrect pixel correspondence during the alignment phase. Comparatively, though the aligned images contain artifacts, our refinement network successfully identifies them as incorrect matches, thus avoiding such artifacts in the final result.

D. Running time

In Table IV, we compare the execution times of five state-of-the-art methods for triple-image fusion at 1260×800 resolution. Compared to image synthesis approaches such as Hu et al. [17] and Sen et al. [18], our method achieves a speed-up of up to 54×. It should be noted that for Kalantari and Ramamoorthi [1], their non-deep optical flow alignment


(a) Input (b) Kalantari and Ramamoorthi [1] (c) Wu et al. [2] (d) Proposed

Fig. 5: Fused results by Kalantari and Ramamoorthi [1] and Wu et al. [2] for test images from our dataset. Results are displayed after tonemapping with Adobe Photoshop. Ghosting artifacts introduced by Wu et al. [2] can be clearly observed on the subject's face. Kalantari and Ramamoorthi [1] introduces artifacts near the contour of the tree in the top-right region.

(a) Hu et al. [17] (b) Sen et al. [18] (c) Ma et al. [5] (d) Proposed method

Fig. 6: Image pair fusion on the METU dataset [3].

method takes about 58 seconds to align three images. We plot PSNR against execution time in Fig. 9 for triple-image fusion.

IV. DISCUSSION

1) With and Without Optical Flow: As shown in Table I, we performed experiments to deduce the efficacy of appending the optical flow as an input to the refinement network. The results indicate that the network performs better when the flow is present as an auxiliary input. We hypothesize that the flow information helps the network identify pixels with occlusion and non-smooth flow regions.

2) Scalability Analysis:

In Table II, we report the no-reference image quality metric MEF-SSIM for three feature-aggregation techniques, trained to fuse three input images only and tested to fuse an arbitrary number of input images without retraining. We trained three models using the Mean, Max and Mean + Max feature merging techniques with three input images (see Section II-C for details). We tested these models on sequences with 2, 5, and 7 input images. The number of streams in the feature extractor is replicated N times for N input images. For the purpose of comparison, we have included the scores for the ground truth images as well. The Mean + Max fusion strategy scales best for a higher number of input images. It should be


TABLE IV: Execution time comparison between different methods for an image resolution of 1260 × 800.

Algorithm                              Time (s)   Speed-up factor
Hu et al. [17]                          330.65         56×
Sen et al. [18]                         318.57         54×
Ma et al. [5]                            12.37          2×
Kalantari and Ramamoorthi [1] (GPU)      61.53        192×
Wu et al. [2] (GPU)                       0.29        0.9×
Proposed (GPU)                            0.32          -
Proposed (CPU)                            6.66          -

(a) Input (b) Ma et al. [5] (c) Proposed

Fig. 7: Comparison with Ma et al. [5] for an input sequence from Sen's dataset [18]. Ma et al. [5] introduces structural artifacts in the facial region. Best viewed in color when zoomed in on an electronic display.

(a) EV−2 (b) EV0

(c) Result with EV0 as reference (d) Result with EV−2 as reference

Fig. 8: Choosing a reference image with saturated pixels leads to artifacts, as highlighted in the red box of (c). To overcome this problem, one can choose the image with the least number of saturated pixels as the reference image, as shown in (d).

noted that since the network was trained with Mertens et al. [4] fused images, it cannot outperform them.

3) Choice of Reference Image: An important parameter for the alignment stage is the choice of the reference image. Choosing an over-exposed reference image can lead to incorrect reconstruction if the over-exposed pixels in the source image are occluded, since the network has no way of guessing the correct values for clipped pixels. On the other hand, choosing the most under-exposed image can lead to noise when it is mapped to the higher exposure image. Thus, it is desirable that the reference image is not saturated.


Fig. 9: PSNR vs. execution time scatter plot for triple-image fusion.

We find that choosing the image with the least number of saturated pixels in the HSV color space provides a balanced middle ground; more complex techniques such as color quantization and histogram analysis could potentially produce better results.
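A minimal sketch of this heuristic is given below; the clipping thresholds and the function name are our assumptions, and the paper does not specify exactly how saturated pixels are counted.

```python
import cv2
import numpy as np

def pick_reference(images, low=0.02, high=0.98):
    """Return the index of the image with the fewest near-clipped pixels,
    judged on the HSV value channel."""
    counts = []
    for img in images:                          # img: uint8 BGR array (OpenCV convention)
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        v = hsv[..., 2].astype(np.float32) / 255.0
        counts.append(np.count_nonzero((v < low) | (v > high)))
    return int(np.argmin(counts))
```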

V. CONCLUSION

We have proposed a fast, scalable, CNN-based image deghosting method that generates visually pleasing results even for extreme exposure inputs. In our method, we compensate for background and foreground motion using optical flow. The artifacts introduced by optical flow are corrected using a refinement network. Finally, the aligned images are fused using a novel scalable exposure fusion CNN module. The proposed method can fuse an arbitrary number of images without the need for re-training. It shows state-of-the-art performance with low computation time, achieving a speed-up of around 54× compared to existing methods.

The current model is trained with ground truth generated by Mertens et al. [4]. An interesting avenue of research would be to train the model with an NR-IQA metric such as MEF-SSIM. Another potentially useful idea would be to extend the current model to generate HDR videos. To do so, one has to take temporal structural and color coherency into account. Models such as 3D-CNNs can be helpful in such a scenario.

ACKNOWLEDGMENT

The authors would like to thank Nima Kalantari for his helpful suggestions and clarifications.

REFERENCES

[1] N. K. Kalantari and R. Ramamoorthi, “Deep high dynamic range imaging of dynamic scenes,” ACM Transactions on Graphics (Proceedings of SIGGRAPH 2017), vol. 36, no. 4, 2017.


[2] S. Wu, J. Xu, Y.-W. Tai, and C.-K. Tang, “Deep high dynamic range imaging with large foreground motions,” in European Conference on Computer Vision, 2018, pp. 120–135.
[3] K. Karaduzovic-Hadziabdic, T. J. Hasic, and R. K. Mantiuk, “Multi-exposure image stacks for testing HDR deghosting methods,” 2017.
[4] T. Mertens, J. Kautz, and F. Van Reeth, “Exposure fusion,” in 15th Pacific Conference on Computer Graphics and Applications. IEEE, 2007, pp. 382–390.
[5] K. Ma, H. Li, H. Yong, Z. Wang, D. Meng, and L. Zhang, “Robust multi-exposure image fusion: a structural patch decomposition approach,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2519–2532, 2017.
[6] K. R. Prabhakar, V. S. Srikar, and R. V. Babu, “DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4724–4732.
[7] P. Sen, “Overview of state-of-the-art algorithms for stack-based high-dynamic range (HDR) imaging,” Electronic Imaging, vol. 2018, no. 5, pp. 1–8, 2018.
[8] O. T. Tursun, A. O. Akyuz, A. Erdem, and E. Erdem, “The state of the art in HDR deghosting: a survey and evaluation,” in Computer Graphics Forum, vol. 34, no. 2. Wiley Online Library, 2015, pp. 683–707.
[9] G. Ward, “Fast, robust image registration for compositing high dynamic range photographs from hand-held exposures,” Journal of Graphics Tools, vol. 8, no. 2, pp. 17–30, 2003.
[10] G. Tzimiropoulos, V. Argyriou, S. Zafeiriou, and T. Stathaki, “Robust FFT-based scale-invariant image registration with image gradients,” IEEE TPAMI, vol. 32, no. 10, pp. 1899–1906, 2010.
[11] T. Grosch, “Fast and robust high dynamic range image generation with camera and object movement,” Vision, Modeling and Visualization, RWTH Aachen, pp. 277–284, 2006.
[12] O. Gallo, N. Gelfand, W.-C. Chen, M. Tico, and K. Pulli, “Artifact-free high dynamic range imaging,” in ICCP. IEEE, 2009, pp. 1–7.
[13] E. Reinhard, W. Heidrich, P. Debevec, S. Pattanaik, G. Ward, and K. Myszkowski, High Dynamic Range Imaging: Acquisition, Display, and Image-Based Lighting. Morgan Kaufmann, 2010.
[14] L. Bogoni, “Extending dynamic range of monochrome and color images through fusion,” in 15th International Conference on Pattern Recognition, vol. 3. IEEE, 2000, pp. 7–12.
[15] S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, “High dynamic range video,” in ACM Transactions on Graphics (TOG), vol. 22, no. 3. ACM, 2003, pp. 319–325.
[16] H. Zimmer, A. Bruhn, and J. Weickert, “Freehand HDR imaging of moving scenes with simultaneous resolution enhancement,” in Computer Graphics Forum, vol. 30, no. 2. Wiley Online Library, 2011, pp. 405–414.
[17] J. Hu, O. Gallo, K. Pulli, and X. Sun, “HDR deghosting: How to deal with saturation?” in IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[18] P. Sen, N. K. Kalantari, M. Yaesoubi, S. Darabi, D. B. Goldman, and E. Shechtman, “Robust patch-based HDR reconstruction of dynamic scenes,” ACM Trans. Graph., vol. 31, no. 6, p. 203, 2012.
[19] O. Gallo, A. Troccoli, J. Hu, K. Pulli, and J. Kautz, “Locally non-rigid registration for mobile HDR photography,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 49–56.
[20] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[21] M. D. Grossberg and S. K. Nayar, “Determining the camera response from images: What is knowable?” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 11, pp. 1455–1467, 2003.
[22] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[23] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.